Execution time profiling of Metal compute kernels.

In our use case, a background Mac app (running on an M1 Mac as a command-line app) receives data from a companion iOS app over a WebSocket connection (Apple's Swift API on the client side, Vapor 4 on the server side) and performs computations using the Metal compute APIs and our custom kernels. To optimize these compute kernels, we are looking for a way to profile their execution time, i.e. how much combined GPU time (compute plus memory accesses) each kernel instance takes. Our primary interest is not the time spent waiting in the scheduling queues before execution begins, though that information would be useful as an extra.

We are not sure whether Instruments in Xcode will help in this scenario, which is partly on iOS, partly inside a third-party WebSocket API, and partly a background (command-line) Mac app. Also, does Metal frame capture depend on the presence of the Metal graphics APIs, and therefore not work for background apps? Can we get the desired information from GPU counter sample buffers, or are we looking in the wrong places? Any help with measuring Metal compute kernel execution times in the context of a Mac background app would be highly appreciated.

For profiling your GPU pipeline, you have Metal System Trace in Instruments: https://developer.apple.com/documentation/metal/performance_tuning/using_metal_system_trace_in_instruments_to_profile_your_app

For profiling the shaders themselves, along with metrics about what is limiting their speed, you'll want to use GPU frame capture in Xcode: https://developer.apple.com/documentation/metal/debugging_tools

Note that GPU frame capture can be triggered manually from Xcode when your app displays frames, but in your case you can also use MTLCaptureManager in your code to start and stop the capture around your compute workload, so there is no need to have a graphics pipeline to use these tools.
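
A minimal sketch of what programmatic capture can look like, assuming you already have an MTLCommandQueue driving the compute work and some helper (here called runComputePass, a stand-in for your own encoding code) that encodes and commits the command buffers; the capture is written to a .gputrace file you can open in Xcode afterwards:

import Metal
import Foundation

// Sketch only: `runComputePass` stands in for however you encode and commit
// your compute command buffers.
func profileComputePass(commandQueue: MTLCommandQueue,
                        runComputePass: (MTLCommandQueue) -> Void) throws {
    let captureManager = MTLCaptureManager.shared()

    // Capture everything submitted to this command queue and save it
    // as a GPU trace document for later inspection in Xcode.
    let descriptor = MTLCaptureDescriptor()
    descriptor.captureObject = commandQueue
    descriptor.destination = .gpuTraceDocument
    descriptor.outputURL = URL(fileURLWithPath: "/tmp/compute.gputrace")

    try captureManager.startCapture(with: descriptor)

    // Run the compute workload you want profiled.
    runComputePass(commandQueue)

    captureManager.stopCapture()
}

One caveat: programmatic capture from an app launched outside of Xcode typically has to be enabled first (for example via the MetalCaptureEnabled Info.plist key or the METAL_CAPTURE_ENABLED=1 environment variable; check the MTLCaptureManager documentation for the exact mechanism), otherwise startCapture will throw.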
