Add FFT in Metal Performance Shaders

I am working on the implementation of a highly demanding signal processing algorithm, and I am not able to reach an acceptable execution time with vDSP's routines.

I am now looking at Metal and learning how to use it. It seems like Metal Performance Shaders as well as MPS Graph could replace almost all of my vDSP calls, but not the Fast Fourier Transform (which is the most time-consuming part of the algorithm).

I was wondering if there's a way for FFT methods to be included in MPS, because it could be insanely fast if optimized for the unified architecture of the M1.

Thanks!

Thanks for the suggestion. Can you create a request using Feedback Assistant? (The more of these we get the higher priority we'll give it).

Repost the number here and I can route it to the MPS team ASAP.

It has the number FB9791504.

Thank you for taking this request into consideration.

I'm thinking of adding 1D, 2D, and 3D FFT transforms to an open-source project. They'll either end up in a Metal backend for Swift for TensorFlow, or in a related project. I am wondering whether the MPS team could use my open-source work to save time for themselves. Right now, the MPS team could postpone making the FFT shaders, using their time for another project. When I have open-sourced my implementation, they could use it as a reference, jump-starting their efforts and saving time.

From my experience with bug FB9653639, the Metal team is very slow to implement changes. In addition, they may need to rigorously test the shaders for bugs, which are very frequent and difficult to solve in GPGPU contexts. @CaptainHarlock my open-source effort might solve your needs before FFT shaders are added to MPS. We could discuss this more off of developer forums if it's time-sensitive - my GitHub account is "philipturner".

Another thing I'd like to see in MPS is support for encoding into indirect compute commands. I recently thought of plans for how to add a Metal backend to DL4S, a deep learning framework for Swift. This requires commands to be dispatched semi-eagerly, where you can't pre-compile them into graphs like with MPSGraph. Being able to utilize indirect command buffers in a JIT compiler like XLA (tensorflow.org/xla) would provide opportunities to reduce encoding overhead.

This isn't encouraged by Apple, but I found a way to load the raw MPS shaders by peering into a private Metallib directory accessible from public APIs. I'll go into as little detail as possible for obvious reasons, but it was possible to create compute pipeline states from MPS shaders. If I had studied them longer, I could have made an indirect command buffer workflow using them. However, there are numerous details about MPS's internals that I don't know, so I might accidentally do something unsafe. The reason I'm saying this is because it proves the MPS team can theoretically pull this off - they just need to expose a safe public API for it. There is also a precedent for unique features geared toward rare performance use cases - MTLCommandQueue.makeCommandBufferWithUnretainedReferences().

I ended up scrapping plans for ICBs because I would need entirely custom shaders to securely execute GPU work, and Apple's MMX shaders far outperformed mine. With that restriction gone, I readily changed my plans to use MPS. For more context on how this played out, you can check out some of the closed issues under the DL4S Evolution repository. I later shifted my efforts to Swift for TensorFlow, so that repo shouldn't experience major updates in the future.

I'm debating whether I should jump-start MetalFFT now, while I wait for the S4TF project to gain momentum in the Swift community (also to help out @CaptainHarlock). I would structure its API similarly to MPS, but you need to input either a MTLComputeCommandEncoder or a MTLIndirectComputeCommand instead of kernel.encode(commandBuffer:, ...). Perhaps the completion of MetalFFT will help the MPS team better understand my suggestion about ICBs. To the Graphics and Games Engineer responding to this post - could you route the info about MetalFFT and ICBs to the MPS team?

I just published over a week's worth of work on implementing Fast Fourier Transforms in Metal: MetalFFT. @CaptainHarlock worked with me throughout the process, and this thread is effectively resolved.

I have one more request for the MPS team, which is listed in my repository's README. I have no way of knowing whether any Apple engineers review a specific issue in the Feedback Assistant, and I especially do not want this one to be ignored. Graphics and Games Engineer, please relay this development directly to the MPS team (sorry for this being the third time you are asked that on this thread). I would like them to carry on my work and integrate it into Metal Performance Shaders, but we must establish communication first.

Curious as to why vDSP's routines weren't acceptable. Was the CPU version of the algorithm too slow? On Apple Silicon machines, do the vDSP routines automatically use the GPU, or is that only possible through the Metal API?

I had to perform 2D FFTs on very large tensors. So I did some research on vDSP's FFT routines and found out the following:

There's no batch function for the 2D FFT (something like fftm for 1D), so to run fft2 over my whole tensor I had to put it in a loop and batch it manually, moving my pointers across the tensor for each call, which was obviously pretty slow. I couldn't make a single fft2 call over the whole tensor because the log2N parameter would be too big. I found a trick: do an fftm, transpose the tensor so that the columns become rows and are therefore contiguous in memory, then do another fftm. This was the fastest way I could find, even though the transpose costs some time too.
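For readers who want to see why the "fftm, transpose, fftm" trick gives a valid 2D transform, here is a minimal, language-neutral sketch in plain Python. It is not vDSP code: the function names (`dft_1d`, `dft_rows`, `dft_2d_row_column`, `dft_2d_direct`) are made up for illustration, a naive O(n^2) DFT stands in for a real FFT, and a final transpose is added (an assumption, not something described above) so the output comes back in the original row/column layout.

```python
import cmath

def dft_1d(row):
    """Naive O(n^2) DFT of one row; stands in for a real 1D FFT routine."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]

def dft_rows(matrix):
    """Transform every row independently (the role a batched 1D FFT plays)."""
    return [dft_1d(row) for row in matrix]

def transpose(matrix):
    """Swap rows and columns so former columns become contiguous rows."""
    return [list(col) for col in zip(*matrix)]

def dft_2d_row_column(matrix):
    """2D DFT as: row transforms, transpose, row transforms, transpose back."""
    after_rows = dft_rows(matrix)
    after_cols = dft_rows(transpose(after_rows))
    return transpose(after_cols)

def dft_2d_direct(matrix):
    """Reference 2D DFT straight from the definition, used only for checking."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[r][c]
                * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                for r in range(rows)
                for c in range(cols)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]

def max_abs_diff(a, b):
    """Largest element-wise difference between two equally shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical 2 x 4 input; both approaches should agree up to rounding error.
sample = [[1, 2, 3, 4], [5, 6, 7, 8]]
difference = max_abs_diff(dft_2d_row_column(sample), dft_2d_direct(sample))
```

The sketch relies on the 2D DFT being separable: transforming along one axis and then the other is equivalent to the full double sum, which is why two batched 1D passes with a transpose in between can replace a dedicated 2D routine.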

Basically I followed all the tips I found in the documentation to get the best performance out of vDSP: using a stride of 1 as much as possible (that's why I transposed my tensor between the two fftm calls) and allocating 16-byte-aligned memory with posix_memalign.

However, I am a beginner developer and could definitely have missed something that made my vDSP FFT too slow, but the fact is that my current results didn't match the performance of some other GPU frameworks, like cuFFT in CUDA. That is why I thought a highly optimized Metal FFT might exist in MPS.

Thank you

Hi,

Please could you tell us the FFT sizes that you're using that are running slowly under vDSP?

Thanks!

Edit: I found out I wasn't using the FFT2 method correctly. What was actually slow in my test was passing the full tensor as a single (N1 x N3) x N2 matrix to the FFT2 function, which made both the LOGN0 and LOGN1 parameters too large.

Batching the FFT2 N3 times, once per matrix of the tensor, is faster. However, I'm still looking for a way to make it even faster if possible.
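To make the batching idea concrete, here is a rough sketch in plain Python rather than vDSP. It assumes the tensor is stored as one flat buffer holding N3 contiguous N1 x N2 matrices (the actual memory layout isn't stated above), and the names `dft2_slice` and `batched_dft2` are hypothetical. A naive DFT stands in for the real FFT2 call; the point is only that each transform stays N1 x N2 in size while an offset advances through the buffer.

```python
import cmath

def dft2_slice(buf, offset, rows, cols):
    """Naive 2D DFT of the rows x cols matrix that starts at buf[offset]."""
    out = []
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = -2 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += buf[offset + r * cols + c] * cmath.exp(1j * angle)
            out.append(acc)
    return out

def batched_dft2(buf, n1, n2, n3):
    """Run one n1 x n2 transform per slice, n3 times, advancing the offset."""
    if len(buf) != n1 * n2 * n3:
        raise ValueError("buffer length must equal n1 * n2 * n3")
    result = []
    for s in range(n3):
        # Assumed layout: slice s occupies buf[s*n1*n2 : (s+1)*n1*n2].
        result.extend(dft2_slice(buf, s * n1 * n2, n1, n2))
    return result

# Hypothetical tensor: two 2 x 2 matrices stored back to back.
tensor = [1, 2, 3, 4, 5, 6, 7, 8]
spectrum = batched_dft2(tensor, 2, 2, 2)
```

With this structure the size of each individual transform is independent of N3, so the per-call size parameters stay small no matter how many matrices the tensor contains.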

Hello,

I can add a Metal backend for VkFFT (https://github.com/DTolm/VkFFT), as it is already abstracted to support code generation for different APIs. However, I don't currently have an Apple machine to test this on. There are reports from people who have run the VkFFT Vulkan backend through MoltenVK, and OpenCL seems to still be supported by Apple as well.

If Apple engineers are interested in this, feel free to contact me.

Best regards, Dmitrii
