Reply to Add FFT in Metal Performance Shaders
Edit: I found out I wasn't using the FFT2 method correctly. What was actually slow in my test was passing the full tensor to the FFT2 function as a single (N1 x N3) x N2 matrix, which made both the LOGN0 and LOGN1 parameters too large. Calling FFT2 N3 times, once for each matrix of the tensor, is faster. However, I'm still looking for a way to make it even faster if possible.
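To make the per-matrix batching concrete, here is a minimal sketch in plain Python rather than vDSP. It assumes the tensor is stored as N3 contiguous row-major N1 x N2 matrices, and it uses a naive 2D DFT as a stand-in for the real FFT2 call; the function names `dft2` and `batched_fft2` are hypothetical, not part of any library.

```python
import cmath

def dft2(m):
    """Naive 2D DFT of a matrix given as a list of rows (stand-in for an FFT2 call)."""
    rows, cols = len(m), len(m[0])
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]

def batched_fft2(flat, n1, n2, n3):
    """Transform each of the n3 slices separately instead of one big matrix.

    Assumption: `flat` holds n3 contiguous row-major n1 x n2 matrices.
    """
    size = n1 * n2
    out = []
    for k in range(n3):
        base = k * size  # offset of slice k, like advancing a pointer in C
        slice_k = [flat[base + r * n2: base + (r + 1) * n2] for r in range(n1)]
        out.append(dft2(slice_k))
    return out

# Small example: 3 slices of 2 x 4 values.
flat = [complex(i) for i in range(24)]
result = batched_fft2(flat, 2, 4, 3)
print(len(result), len(result[0]), len(result[0][0]))
```

Each slice only ever sees transform lengths N1 and N2, which is the point of batching: the log2 sizes passed to the transform stay small no matter how many matrices the tensor contains.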
Jan ’22
Reply to Add FFT in Metal Performance Shaders
I had to perform 2D FFTs on very large tensors, so I did some research on vDSP's FFT routines and found out the following. There's no batch function for the 2D FFT (something like fftm for 1D), so to apply fft2 to my whole tensor I had to put it in a loop and batch it manually, moving my pointers across the tensor for each call of the function, which was obviously pretty slow. I couldn't make just one call of fft2 on the whole tensor because the log2N parameter would be too big. I found a trick: do an fftm, transpose the tensor so that the columns become rows and are therefore contiguous in memory, then do another fftm. This was the fastest way I could find, even though the transpose operation costs some time too. Basically, I followed all the tips I found in the documentation for getting the best performance out of vDSP: using a stride of 1 as much as possible (that's why I transposed my tensor between the two fftm calls) and allocating 16-byte-aligned memory with posix_memalign. However, I am a beginner developer and I could definitely have missed something that made my vDSP FFT too slow. The fact is that my current results don't match the performance of some GPU frameworks, like cuFFT in CUDA. That is why I thought a highly optimized Metal FFT might exist in MPS.
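The fftm, transpose, fftm trick can be sketched as follows. This is plain Python, not vDSP: `fft_rows` is a hypothetical stand-in for a batched 1D FFT over contiguous rows, and the sketch assumes both dimensions are powers of two. It adds a final transpose to restore the original layout and checks the result against a direct 2D DFT.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_rows(m):
    """Stand-in for a batched 1D FFT: transform every (contiguous) row."""
    return [fft(row) for row in m]

def transpose(m):
    """Turn columns into rows so the next pass also runs with stride 1."""
    return [list(col) for col in zip(*m)]

def fft2_via_transpose(m):
    """Row FFTs, transpose, row FFTs again, then transpose back."""
    step1 = fft_rows(m)
    step2 = fft_rows(transpose(step1))
    return transpose(step2)

def dft2_reference(m):
    """Direct 2D DFT, used only to check the result."""
    rows, cols = len(m), len(m[0])
    return [[sum(m[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]

# Small example: a 4 x 8 complex matrix.
m = [[complex(r * 8 + c, r - c) for c in range(8)] for r in range(4)]
fast = fft2_via_transpose(m)
ref = dft2_reference(m)
max_err = max(abs(fast[u][v] - ref[u][v]) for u in range(4) for v in range(8))
print("max abs difference vs direct 2D DFT:", max_err)
```

The two passes are both row-wise, so each 1D transform reads contiguous data; the cost moved into the transposes is the trade-off described above.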
Jan ’22