OS choosing performance state poorly for GPU use case

I am building a macOS desktop app (https://anukari.com) that uses Metal compute for real-time audio/DSP processing, because the problem is highly parallelizable and too computationally expensive for the CPU.

However, it seems that because of the way I am using the GPU, the OS never increases the power/performance state, even when my app is fully compute-limited. Because this is a real-time audio synthesis application, not being able to take advantage of the full clock speeds the GPU is capable of is a huge problem: the app can't keep up with real-time.

I discovered this issue while profiling the app using Instruments' Metal tracing (and Game tracing) modes. In the profiling configuration, under "Metal Application" there is a drop-down to select the "Performance State." If I run the application under Instruments with Performance State set to Maximum, it runs amazingly well and all my problems go away.

For comparison, when I run the app on its own, outside of Instruments, the expensive GPU computation it's doing takes around 2x as long to complete, meaning that the app performs half as well.

I've done a ton of work to micro-optimize my Metal compute code, based on every scrap of information from the WWDC videos, etc. The problem I'm running into is that the more efficient I make my code, the less it seems to signal to the OS that I want high GPU clock speeds!

I think part of why the OS is confused is that in most use cases, my computation can be done using only a small number of Metal threadgroups. I'm guessing that the OS heuristics see that only a small fraction of the GPU is saturated and fail to scale up the power/clock state.

I'm not sure what to do here; I'm in a bit of a bind. One possibility is to intentionally schedule busy work -- spinning threadgroups just to waste energy and signal to the OS that I need higher clock speeds. This is obviously a really bad idea, but it might work.

Is there any other (better) way for my app to signal to the OS that it is doing real-time latency-sensitive computation on the GPU and needs the clock speeds to be scaled up?

Note that Game Mode is not really an option, as my app also runs as an AU plugin inside hosts like GarageBand, so it can't be made fullscreen, etc.

Answered by DTS Engineer in 818333022

Hello and thank you both for raising these concerns.

It seems we have an opportunity to improve the API and its performance, so we'd like to better understand your needs.

Please create an enhancement request using Feedback Assistant, including the details above along with instructions on how to run the workload described in your blog.

Until a better solution exists, we recommend the threading techniques in Understanding Audio Workgroups to improve your app's performance, rather than a "waste-makes-haste" strategy on the GPU, which is indeed wasteful and not future-proof.
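
For reference, joining a workgroup from an auxiliary real-time thread with the C API in <os/workgroup.h> looks roughly like this (a minimal sketch; the function name is illustrative, and obtaining the os_workgroup_t from the host or from Core Audio is assumed and not shown):

```cpp
#include <os/workgroup.h>

// Sketch only: join the audio workgroup provided by the host / Core Audio,
// do one render cycle's worth of CPU work, then leave the workgroup so the
// scheduler can account for this thread as part of the audio deadline.
void do_render_work_in_workgroup(os_workgroup_t wg)
{
    os_workgroup_join_token_s token;
    if (os_workgroup_join(wg, &token) != 0) {
        return; // e.g. the workgroup has been canceled
    }

    // ... per-render-cycle CPU work goes here ...

    os_workgroup_leave(wg, &token);
}
```

The key point is that any audio-critical CPU threads should join the same workgroup as the host's I/O thread, so the system knows they share the same real-time deadline.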

How much work do you queue in advance? Can you have at least 30ms of GPU work queued at any given time for instance?

Thanks for your reply!

For real-time audio, latency from user input to audio changes is quite important, and when running as an AU plugin inside GarageBand for example, processing is done in ~3ms chunks to minimize this latency. So I can't trigger useful work more than that far in advance.

It's an interesting and useful question, though; there may be ways to restructure the work so that I queue 30ms worth of kernels in advance, with each kernel busy-waiting on a spinlock until its chunk of work actually arrives. This would have the added advantage of eliminating the latency overhead of scheduling a new kernel for each 3ms chunk of work.
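
To make that concrete, a pre-queued kernel along those lines might look something like this (just a sketch of the idea, not code from Anukari; names and buffer layout are illustrative, and making freshly written input visible to an already-running kernel needs care beyond what this shows):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: this kernel is enqueued well ahead of time and spins on a
// per-chunk "ready" flag that the CPU sets once the next ~3ms of input is
// available, then does the actual DSP work for that chunk.
kernel void process_chunk_when_ready(device atomic_uint* ready_flag [[buffer(0)]],
                                     device const float* input      [[buffer(1)]],
                                     device float*       output     [[buffer(2)]],
                                     uint gid [[thread_position_in_grid]])
{
    // Busy-wait until the host publishes this chunk's input.
    while (atomic_load_explicit(ready_flag, memory_order_relaxed) == 0) { }

    // Placeholder for the real per-sample work.
    output[gid] = input[gid] * 0.5f;
}
```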

Do you know something specific about this 30ms number? Is Apple looking at something like the GPU's load average for power throttling?

I've done a ton of experimenting with this, and it appears to me from the outside that the heuristic macOS uses to decide whether the GPU needs to be clocked up is something like "is any GPU command buffer fully saturated with work." It does not seem to matter what percentage of the GPU's full parallelism is being used -- if there's computation that is only as wide as a single warp, but that warp is kept saturated, the GPU will clock up.

In general, this means that any computational process that hands work back and forth between non-overlapping phases on the CPU and GPU is unlikely to be clocked up appropriately, because while the CPU is doing work the GPU is idle (and vice versa), which this heuristic reads as a sign that a higher clock rate is not needed.

Admittedly this is an odd situation, but in the realm of Audio Unit plugins it is actually the default if you are trying to use the GPU for audio computation: you need to compute small chunks of audio as quickly as possible and hand them off to the host application (GarageBand, etc.) for processing that is typically done on the CPU.

The workaround for this is horrible, but extremely effective: simply spin a GPU threadgroup warp (the minimum unit of power wastage) in a busy loop 100% of the time that the plugin is running, to signal to the OS that it needs to clock up the GPU.

I implemented this, and it works perfectly, albeit wastefully. I describe the performance gains here: https://anukari.com/blog/devlog/waste-makes-haste
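
For anyone curious, the keep-alive kernel amounts to roughly this (a simplified sketch, not the exact code from the blog post; the name is illustrative, and in practice the spin likely needs to be time-bounded and re-enqueued so a single command buffer doesn't run afoul of the GPU watchdog):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: dispatched as a single small threadgroup (one simdgroup's worth of
// threads), this kernel just burns cycles until the host sets stop_flag,
// which keeps the OS's "is a command buffer saturated" heuristic satisfied.
kernel void gpu_keepalive(device atomic_uint* stop_flag [[buffer(0)]])
{
    while (atomic_load_explicit(stop_flag, memory_order_relaxed) == 0) {
        // Spin. One warp/simdgroup is the minimum unit of wasted power;
        // anything wider only burns more energy without clocking up further.
    }
}
```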

I tried many other approaches, including simply keeping a deeper queue of the "real" work I am doing on the GPU. But that queue had to be blocked using MTLSharedEvents when there was no work to do, which defeated the benefit of having a deep queue: the load average was still not high enough for the OS to clock the GPU up.

My suggestion to Apple would be to allow apps to signal that they are GPU latency-sensitive and need higher clocks to meet user needs. This would be less wasteful than spinning a GPU core, and also would allow the OS to prompt the user for permission, etc.

Thanks for the reply! I will happily file feedback with these details.
