Metal Compute Never beats 2.5ms?

I'm trying to figure out if Metal compute is a viable solution for an application. I'm finding that even on an iPhone 6s Plus with an A9, an empty compute encoder executed in a loop never beats 2.5 ms per iteration. The simplest test I could concoct was:


- (void)runtest {
    // Create and immediately commit a command buffer containing an empty compute pass.
    id <MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
    id <MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder endEncoding];

    [commandBuffer addCompletedHandler:^(id <MTLCommandBuffer> _Nonnull buffer) {
        // Count completed iterations, then immediately submit the next one.
        runcounter++;
        NSLog(@"count: %d", runcounter);
        [self runtest];
    }];

    [commandBuffer commit];
}


The loop was run for 10 seconds and I did a simple division of runs/seconds. No other work was being done by the app (no display loop, etc.). That works out to about 2.5 ms between iterations.


For comparison, something like a NEON sum of 1024 numbers averaged 0.04 ms and, of course, executed immediately.
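For reference, a NEON sum of 1024 values looks roughly like this (an illustrative sketch, not the exact code I ran; it assumes float data and an AArch64 device for the horizontal add):

#include <arm_neon.h>

// Sum 1024 floats four lanes at a time; 'data' must hold at least 1024 elements.
static float neon_sum_1024(const float *data) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < 1024; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(data + i));  // load 4 floats, accumulate
    }
    return vaddvq_f32(acc);  // horizontal add across the 4 lanes (AArch64)
}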


I realize this doesn't mean it wastes 2.5 ms of resources and could just be scheduling, but for apps with very low latency requirements (camera processing) it does mean that NEON can process immediately while Metal cannot. Can someone confirm this finding or correct the test? Thanks.

"NEON can process immediately but Metal can not"

Exactly. NEON is just SIMD registers and instructions on the CPU, so NEON code executes immediately. Metal requires communication from the CPU to the GPU and back, so it will always carry a certain penalty. If you're looking for immediate results, you're better off with NEON, or even plain C/ObjC/Swift.


Having said that, summing 1024 numbers is lightweight compared to actual camera image processing. Camera images these days are millions of pixels, and then your costs may skyrocket. At that point GPU processing might make sense.


Sometimes you can amortize the cost of the CPU/GPU round trip, for example by processing, say, 16 frames at once. Then the GPU solution could come out faster.
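A rough sketch of that idea, encoding 16 frames into a single command buffer so the fixed submit/complete cost is paid once ('pipeline', 'frameBuffers', 'grid' and 'tg' are placeholders for your own setup):

id <MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
id <MTLComputeCommandEncoder> encoder = [commandBuffer computeCommandEncoder];
[encoder setComputePipelineState:pipeline];
for (NSUInteger i = 0; i < 16; i++) {
    // One dispatch per frame, all inside the same command buffer.
    [encoder setBuffer:frameBuffers[i] offset:0 atIndex:0];
    [encoder dispatchThreadgroups:grid threadsPerThreadgroup:tg];
}
[encoder endEncoding];
[commandBuffer commit];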

regards

Michal

Thanks. The sum of 1024 numbers was just a trivial example comparing doing nothing (Metal compute) at 2.5 ms to doing something (NEON) at essentially no cost. It looks like I'll settle on NEON again until the latency of the GPU is resolved. I know there will always be some latency given that it is part of the system, but 2.5 ms on an A9 is about 4 million instruction cycles (single core) that could have happened by the time I'd get a response from compute.


Again, I would love to hear from the graphics architecture team if this is incorrect.

The useful window for the GPU appears to be gated from both ends. On one side, there is substantial time spent copying all your data into wired-down memory in MTLResources, encoding your workload into a command buffer, JITing code, waiting for the GPU to become available (possibly including waking up the GPU), actually running the code, copying the data back to system memory (if discrete), and then waiting for your CPU thread to be swapped in to receive the data. 2.5 ms does not at all sound unreasonable for all of that to occur on older devices. On the other end, there is a watchdog timer running that will kill your GPU workload if it runs for more than a few seconds. This is there to keep UI frame rates up: the GPU doesn't preemptively multitask well, so if your job runs a long time, the user interface may not be able to refresh and the machine appears to freeze. Ideally, your workload should be done in the time it takes to refresh the screen, less than 1/60th of a second.

So the useful workload is one that is large enough not to suffer unduly from 2.5 ms of overhead for involving the GPU, and one that does not run so long as to trigger the watchdog timer or damage UI interactivity. This is a fairly narrow window, and you would not be blamed if you naively came to believe that GPU compute is doomed.

Importantly, however, nearly all of the GPU overhead described above occurs because of poor program design! The REAL lesson here is that you should pick one device, CPU or GPU, and stick with it. Don't bounce data back and forth all the time. (There is a similar lesson to be learned by people learning to use the CPU vector units, though of course it manifests on much smaller time scales.) Which should you choose? Start with where your data is and where it will be consumed. If the data starts and ends on the CPU, then except for the largest workloads, stick with the CPU. Look to see what easy wins you can get from Accelerate.framework and GCD. Maybe even try your hand at writing some vector code. If that still isn't enough, or you need the CPU for other things, then you might need to go to the GPU. If the data starts and ends life on the GPU, then obviously use the GPU for everything if you can. To be clear, the GPU is not good at inherently serial workloads like Huffman decoding, so some things that are not parallelizable just don't belong there.
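For example, the 1024-element sum from earlier is a one-liner with Accelerate (a minimal sketch; 'data' is assumed to be a float array of at least 1024 elements):

#import <Accelerate/Accelerate.h>

float sum = 0.0f;
vDSP_sve(data, 1, &sum, 1024);  // vectorized sum of 1024 floats, stride 1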

If you do decide to use the GPU, you have to understand that you are working on a high-throughput machine with consequently less tolerance for overhead from other factors, so additional work will be needed on your end to make sure these do not become a problem. Many of these things could also pose problems on the CPU, but the CPU working environment is structured to make them really hard to do, so you don't run into them and they aren't a problem. Metal makes all of these things possible, and more often than not does not go out of its way to make practices deemed harmful difficult. Any one of these has the potential to cause large (factor of 2-10) losses in performance, and alone can make the GPU run slower than the CPU. In combination, well… it doesn't look good.

I'll detail the essential GPU survival strategies below:
Keep the GPU busy
The GPU clock slows way down when it is asleep and takes a long time to come back up again. If you are submitting a lot of short jobs with small breaks in between (just the time to get the job back on the CPU, look at it, and queue up the next one is enough to cause problems), then it will go to sleep and take a very long time to come back. We have measured 2-4x performance loss due to this in the lab, even on extremely large machine learning workloads. Those are enormous; your workload is not going to be any better off. You need to push work to the GPU in such a way that when one command buffer completes, the next one is already fully queued up and ready to go, so the GPU can move from one to the next without skipping a beat. Small ad hoc perf experiments invariably get this wrong: the GPU cycles down, takes a long time to spin back up again (not to mention the time to just wake it up), and all you measure is overhead.
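A minimal sketch of that submission pattern, keeping up to three command buffers in flight with a semaphore (-encodeWorkInto: and the 'running' flag are assumed stand-ins for your own encoding and loop control, not real API):

static const NSUInteger kMaxBuffersInFlight = 3;

- (void)pumpGPU {
    dispatch_semaphore_t inFlight = dispatch_semaphore_create(kMaxBuffersInFlight);
    while (self.running) {
        dispatch_semaphore_wait(inFlight, DISPATCH_TIME_FOREVER);
        id <MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
        [self encodeWorkInto:commandBuffer];  // assumed helper that encodes real work
        [commandBuffer addCompletedHandler:^(id <MTLCommandBuffer> buffer) {
            dispatch_semaphore_signal(inFlight);  // free a slot for a future buffer
        }];
        // By the time this buffer completes, up to two more are already committed,
        // so the GPU moves straight on without idling.
        [commandBuffer commit];
    }
}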

Use MTLHeaps
It can very easily take longer for the CPU to allocate and wire down memory than it takes the GPU to run its full workload using that memory. While developing Metal Performance Shaders, we found that even hand-tuned kernels would still run slower than the CPU if we did not keep memory allocation under control. This is why MPS goes through a lot of effort to provide temporary MPSImages and buffers. Temporary images and buffers are backed by a MTLHeap. The MPS heap is used to recycle memory over the course of the command buffer, and can also migrate from command buffer to command buffer if the time interval is short enough. Even if you don't use MPSKernels, you can take advantage of it in your program by making use of MPSTemporaryImages and buffers.

Why is a heap critical? Would you write an ordinary CPU-based application by setting up all of your storage needs up front, at compile time, as global arrays? No, of course you wouldn't! Not only is it a major hassle to anticipate everything that might ever happen, you would probably also waste tons of memory statically allocating for the worst case, and more memory still by failing to do enough analysis of your application workflows to find ways to reuse and alias memory whenever possible to keep the overall allocation size down. That reuse is also beneficial for the caches. For a complex enough program, it is quite possible your memory needs are indeterminable, or so large that the program will be jetsammed for consuming too much.

Consider: why is so much energy devoted online to memory-safe languages, as if nothing could otherwise be done about the heap? You could statically allocate everything up front and thereby never leak any memory again; this has always been possible in C. Well, the reason is that the heap is in fact AWESOME, and it is inconceivable not to use it. The question is really just how to use it safely. <Insert unending religious argument here>

So it should not be a surprise to any GPU programmer that statically allocating writable MTLResources up front is a bad idea. Just because it is easy doesn't mean it is a good idea. Your application should use MTLHeaps to allocate and deallocate MTLResources over the course of the command buffer, or multiple command buffers, as appropriate. In this way memory can be reused, and the cost of allocating tons of memory per command buffer is eliminated. Only then can the GPU shine.
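An illustrative sketch of that pattern (the size, storage mode, and 'device' variable are placeholders, not recommendations):

MTLHeapDescriptor *heapDesc = [MTLHeapDescriptor new];
heapDesc.size = 16 * 1024 * 1024;             // budget for the whole pass (placeholder)
heapDesc.storageMode = MTLStorageModePrivate;
id <MTLHeap> heap = [device newHeapWithDescriptor:heapDesc];

// Sub-allocate from the heap instead of creating fresh MTLBuffers each time.
id <MTLBuffer> scratch = [heap newBufferWithLength:1024 * 1024
                                           options:MTLResourceStorageModePrivate];
// ... encode work that uses 'scratch' ...
[scratch makeAliasable];  // lets later heap allocations reuse this memory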

For MPS, which can't know the full nature of its workload in advance (complicated by the fact that the MTLHeap is not dynamically resizable), this meant solving the problem at two levels. For simple usage, a largish heap is speculatively allocated ahead of time, in a fashion similar to how malloc allocates large chunks of memory as needed and then sub-allocates from there for smaller malloc calls. We attached it to the MTLCommandBuffer, which provides a nice linear timeline for memory usage, so that mere code can reason about when each bit is used and for how long, as long as no kernels are running concurrently. (This can be problematic when both render and compute encoders are running, unfortunately.) It also provides a time, command buffer completion, when we can safely tear down the whole thing and return the memory to the system.

For more complicated workloads like MPSNNGraph, the entire workload is introspected ahead of time, a high-water mark is determined, and only then is the heap allocated; if the estimate proves incorrect, more heaps are allocated as needed to back additional MTLResources. This is possible because MPSTemporaryImages and buffers do not allocate their backing store at creation; they defer it to first use, and of course retire their exclusive right to the backing store when the readCount reaches 0. The MPSTemporaryImage does know, however, how big its allocation will be before this occurs, so it is possible to traverse the entire graph, making all MPS resources, then determine how big they are, then make a MTLHeap to hold them, and only then allocate all the underlying MTLResource objects just in time for encoding. I have long felt the MTLCommandBuffer should have a feature that does just this! But until it does, this is your job.
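From the client side, the temporary image pattern looks roughly like this (descriptor values, 'someKernel', 'nextKernel', 'src' and 'dst' are placeholders; both kernels are assumed to be MPSUnaryImageKernel subclasses):

#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

MPSImageDescriptor *desc =
    [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
                                                   width:width
                                                  height:height
                                         featureChannels:4];
MPSTemporaryImage *tmp =
    [MPSTemporaryImage temporaryImageWithCommandBuffer:commandBuffer imageDescriptor:desc];
tmp.readCount = 1;  // exactly one kernel will read it before the storage is recycled

// The backing store is allocated from the MPS heap at first use, not at creation.
[someKernel encodeToCommandBuffer:commandBuffer sourceImage:src destinationImage:tmp];
[nextKernel encodeToCommandBuffer:commandBuffer sourceImage:tmp destinationImage:dst];
// After nextKernel reads tmp, its readCount hits 0 and the memory can be reused.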

Compile offline
Your CPU code is compiled offline, long before the user sees it. This can take quite a while, and it is certainly not something you'd want to attempt every time your app is launched. So don't do it on the GPU either. Just as on the CPU, JITing from source to ready-to-run code at the time you need it can easily take more time than it takes to run the code. To avoid this problem, compile your kernels to a .metallib ahead of time and load them as needed. If you think your code would benefit from JITing to remove expensive but unused special cases, for example, then make use of Metal function constants to turn that behavior on and off. This lets you avoid the expensive front end of the compiler, which is most of the cost, and enjoy the benefit of jitted code without paying for jitting the code from source.
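A sketch of that flow ('myKernel' and constant index 0 are placeholders; the kernel is assumed to declare a bool with [[function_constant(0)]] in its Metal source):

NSError *error = nil;
id <MTLLibrary> library = [device newDefaultLibrary];  // precompiled .metallib from the app bundle

// Specialize the precompiled function without re-running the compiler front end.
MTLFunctionConstantValues *constants = [MTLFunctionConstantValues new];
BOOL useFastPath = YES;
[constants setConstantValue:&useFastPath type:MTLDataTypeBool atIndex:0];

id <MTLFunction> fn = [library newFunctionWithName:@"myKernel"
                                    constantValues:constants
                                             error:&error];
id <MTLComputePipelineState> pipeline =
    [device newComputePipelineStateWithFunction:fn error:&error];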

Get these overheads out of the way, and we can begin to have a discussion about how to write a fast kernel.
