Keep the GPU busy
The GPU clock slows way down when the GPU is asleep, and it takes a long time to come back up again. If you submit a lot of short jobs with small breaks in between (just the time for the CPU to get the results back, look at them, and queue up the next job is enough to cause problems), the GPU will go to sleep and take a very long time to wake. In the lab we have measured a 2-4x performance loss from this even on extremely large machine learning workloads, and those are enormous; your workload is not going to fare any better. You need to push work to the GPU in such a way that when one command buffer completes, the next one is already fully queued up and ready to go, so the GPU can move seamlessly from one to the next. Small ad hoc performance experiments invariably get this wrong: the GPU cycles down, takes a long time to spin back up (on top of the time just to wake it), and all you measure is overhead.
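For example, here is one way to keep the next command buffer queued behind the one in flight. This is a minimal Swift sketch, not production code: `device`, `pipeline`, and `workItemCount` are assumed to exist elsewhere, and a real app would use completion handlers or a semaphore rather than waitUntilCompleted.

```swift
import Metal

// Sketch: double-buffered submission. `device`, `pipeline`, and
// `workItemCount` are assumed to be set up elsewhere.
let queue = device.makeCommandQueue()!

var inFlight: MTLCommandBuffer? = nil
for _ in 0..<workItemCount {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    // ... bind buffers/textures and dispatch threadgroups here ...
    encoder.endEncoding()
    commandBuffer.commit()            // hand it to the GPU right away

    // Wait on the *previous* buffer only after the next one is queued,
    // so the GPU always has work lined up when it finishes.
    inFlight?.waitUntilCompleted()
    inFlight = commandBuffer
}
inFlight?.waitUntilCompleted()
```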
Use MTLHeaps
It can very easily take longer for the CPU to allocate and wire down memory than it takes the GPU to run its full workload using that memory. While developing Metal Performance Shaders, we found that even hand-tuned kernels would still run slower than the CPU if we did not keep memory allocation under control. This is why MPS goes to a lot of effort to provide temporary MPSImages and buffers. Temporary images and buffers are backed by a MTLHeap. The MPS heap is used to recycle memory over the course of the command buffer, and can also migrate from command buffer to command buffer if the time interval between them is short enough. Even if you don't use MPSKernels, you can take advantage of this heap in your own program by using MPSTemporaryImages and buffers.
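As a sketch of what that looks like (assuming a `commandBuffer` already exists; the image dimensions are placeholders):

```swift
import Metal
import MetalPerformanceShaders

// Sketch: an intermediate image whose storage comes from the MPS heap
// attached to the command buffer. `commandBuffer` is assumed to exist;
// the dimensions are placeholders.
let descriptor = MPSImageDescriptor(channelFormat: .float16,
                                    width: 224,
                                    height: 224,
                                    featureChannels: 64)
let intermediate = MPSTemporaryImage(commandBuffer: commandBuffer,
                                     imageDescriptor: descriptor)

// Declare how many kernels will read this image. When the count drops
// to zero, the backing store goes back to the heap for reuse later in
// the same command buffer.
intermediate.readCount = 1

// ... encode a kernel writing `intermediate`, then one reading it ...
```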
Why is a heap critical? Would you write an ordinary CPU-based application by setting up all of your storage needs up front at compile time as global arrays? No. Of course you wouldn't! Not only is it a major hassle to anticipate everything that might ever happen, you would also waste a great deal of memory statically allocating for the worst case, and more memory still by failing to analyze your application's workflows well enough to find ways to reuse and alias memory to keep the overall allocation size down. That reuse is also good for the caches. For a complex enough program, your memory needs may be impossible to determine in advance, or so large that the program will be jetsammed for consuming too much. Consider: why is so much energy devoted online to memory-safe languages, as if nothing could otherwise be done about the heap? You could statically allocate everything up front and never leak memory again; this has always been possible in C. Well, the reason is that the heap is in fact AWESOME, and it is inconceivable not to use it. The question is really just how to use it safely. <Insert unending religious argument here>

So it should not surprise any GPU programmer that statically allocating writable MTLResources up front is a bad idea. Just because it is easy doesn't mean it is a good idea. Your application should use MTLHeaps to allocate and deallocate MTLResources over the course of a command buffer, or multiple command buffers, as appropriate. That way, memory can be reused and the cost of allocating piles of memory per command buffer is eliminated. Only then can the GPU shine.
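A minimal sketch of heap-based allocation, assuming a `device` exists and with placeholder sizes:

```swift
import Metal

// Sketch: sub-allocating resources from a MTLHeap. `device` is assumed
// to exist; the sizes are placeholders.
let heapDescriptor = MTLHeapDescriptor()
heapDescriptor.size = 64 * 1024 * 1024
heapDescriptor.storageMode = .private
let heap = device.makeHeap(descriptor: heapDescriptor)!

// Allocation from a heap is cheap compared to a fresh device allocation.
let scratch = heap.makeBuffer(length: 4 * 1024 * 1024,
                              options: .storageModePrivate)!

// ... encode work that uses `scratch` ...

// Once the encoded work no longer needs it, mark it aliasable so later
// heap allocations can reuse the same memory.
scratch.makeAliasable()
```

The makeAliasable() call is what turns the heap into a recycler rather than just a fast allocator: it hands the memory back for reuse without returning it to the system.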
For MPS, which can't know the full nature of its workload in advance (complicated by the fact that a MTLHeap is not dynamically resizable), this meant solving the problem at two levels. For simple usage, a largish heap is speculatively allocated ahead of time, much as malloc grabs large chunks of memory as needed and then suballocates from them for smaller malloc calls. We attached it to the MTLCommandBuffer, which provides a nice linear timeline for memory usage so that mere code can reason about when each bit of memory is used and for how long, as long as no kernels are running concurrently. (This can be problematic when both render and compute encoders are running, unfortunately.) It also provides a well-defined time, command buffer completion, at which we can safely tear the whole thing down and return the memory to the system. For more complicated workloads like MPSNNGraph, the entire workload is introspected ahead of time, a high-water mark is determined, and only then is the heap allocated; if the estimate proves incorrect, more heaps are allocated as needed to back additional MTLResources. This is possible because MPSTemporaryImages and buffers do not allocate their backing store at creation: they defer it to first use, and of course retire their exclusive right to the backing store when their readCount reaches 0. A MPSTemporaryImage does know, however, how big its allocation will be before this occurs, so it is possible to traverse the entire graph making all the MPS resources, determine how big they are, make a MTLHeap to hold them, and only then allocate the underlying MTLResource objects just in time for encoding. I have long felt that MTLCommandBuffer should have a feature that does just this! But until it does, this is your job.
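To make the two-pass idea concrete, here is a hedged sketch (not MPS's actual code) of sizing a heap from a list of planned texture descriptors. `plannedDescriptors` is a hypothetical list you would build by walking your graph, and this naive version assumes all resources are live at once:

```swift
import Metal

// Hedged sketch of the two-pass idea above, not MPS's actual code.
// Pass 1: compute a high-water mark from the planned texture descriptors
// (naively assuming they are all live simultaneously). Pass 2: make one
// heap of that size; textures are then created from it just in time.
// `plannedDescriptors` is a hypothetical list built by walking your graph.
func makeHeap(for plannedDescriptors: [MTLTextureDescriptor],
              on device: MTLDevice) -> MTLHeap? {
    var highWaterMark = 0
    for descriptor in plannedDescriptors {
        let sizeAndAlign = device.heapTextureSizeAndAlign(descriptor: descriptor)
        // Round the running total up to this resource's alignment.
        let alignedOffset = (highWaterMark + sizeAndAlign.align - 1)
                            / sizeAndAlign.align * sizeAndAlign.align
        highWaterMark = alignedOffset + sizeAndAlign.size
    }
    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.size = highWaterMark
    heapDescriptor.storageMode = .private
    return device.makeHeap(descriptor: heapDescriptor)
}
```

A real implementation would track lifetimes and let non-overlapping resources alias one another, which is exactly the reuse the readCount mechanism enables.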
Compile offline
Your CPU code is compiled offline, long before the user sees it. Compilation can take quite a while, and it is certainly not something you'd want to attempt every time your app launches. So don't do it for the GPU either. Just as on the CPU, jitting from source to ready-to-run code at the moment you need it can easily take more time than running the code itself. To avoid this problem, compile your kernels to a .metallib ahead of time and load them as needed. If you think your code would benefit from jitting, for example to remove expensive but unused special cases, then use Metal function constants to turn that behavior on and off. This lets you skip the expensive front end of the compiler, which is most of the cost, and enjoy the benefit of jitting the code without paying for jitting it from source.
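A minimal Swift sketch of specializing a precompiled function, assuming your .metallib declares a function constant at index 0 and a kernel named `myKernel` (both hypothetical names):

```swift
import Metal

// Sketch: specialize a precompiled kernel with a function constant.
// The shader side is assumed to declare:
//     constant bool useSpecialCase [[function_constant(0)]];
// `myKernel` is a hypothetical function name.
func makeSpecializedPipeline(on device: MTLDevice) throws -> MTLComputePipelineState {
    let library = try device.makeDefaultLibrary(bundle: .main)

    var useSpecialCase = false          // compile the special case out
    let constants = MTLFunctionConstantValues()
    constants.setConstantValue(&useSpecialCase, type: .bool, index: 0)

    // Only the compiler back end runs here; the front end already ran
    // when the .metallib was built.
    let function = try library.makeFunction(name: "myKernel",
                                            constantValues: constants)
    return try device.makeComputePipelineState(function: function)
}
```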
Get these overheads out of the way, and we can begin to have a discussion about how to write a fast kernel.