Metal with M1 Ultra and parallel execution of kernels (MTLCommandQueue/MTLCommandBuffer/MTLComputeCommandEncoder)

Hello All,

I have CUDA code in which I create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. I then rewrote the code for Metal and tried to parallelize the task in the same way. But I ran into a problem: for some reason, all the compute kernels are always executed sequentially.

I tried creating several MTLCommandBuffers in one MTLCommandQueue. I also tried creating several MTLCommandQueues with multiple MTLCommandBuffers, and I tried submitting from several CPU threads. The result is always the same: in the profiler, the command buffers always execute one after another. Screenshots from the CUDA and Metal profilers are below.

[Screenshot: CUDA Profiler]

[Screenshot: Metal Profiler]

[Screenshot: Metal Profiles]

I even created a simple kernel that sums a few numbers and ran it with dispatchThreads((1,1,1),(1,1,1)), and I still cannot get these kernels to run in parallel.

Can anyone help me? Is there a solution, or is this specific to how Metal works on M1?
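
For readers who want to reproduce this, here is a minimal Swift sketch of that kind of test. It is not my actual code: the kernel name `sum_kernel`, the function `runIndependentKernels`, and the choice of one MTLCommandBuffer per kernel on a single MTLCommandQueue are assumptions made for illustration. It also assumes a GPU that supports dispatchThreads (non-uniform threadgroup sizes), which M1 does.

```swift
import Foundation
import Metal

enum SketchError: Error { case noDevice, noFunction, noBuffer, noEncoder }

// Hypothetical trivial kernel: each thread sums 1...100 into its own output slot.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void sum_kernel(device float *out [[buffer(0)]],
                       uint tid [[thread_position_in_grid]]) {
    float acc = 0.0f;
    for (uint i = 1; i <= 100; ++i) { acc += float(i); }
    out[tid] = acc;
}
"""

// Encodes `count` independent dispatches, one per command buffer, on one queue.
func runIndependentKernels(count: Int) throws -> [Float] {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else { throw SketchError.noDevice }

    let library = try device.makeLibrary(source: kernelSource, options: nil)
    guard let function = library.makeFunction(name: "sum_kernel") else {
        throw SketchError.noFunction
    }
    let pipeline = try device.makeComputePipelineState(function: function)

    var outputs: [MTLBuffer] = []
    var commandBuffers: [MTLCommandBuffer] = []
    let one = MTLSize(width: 1, height: 1, depth: 1)

    for _ in 0..<count {
        // Each kernel writes to its own buffer, so the dispatches are independent.
        guard let out = device.makeBuffer(length: MemoryLayout<Float>.stride,
                                          options: .storageModeShared) else {
            throw SketchError.noBuffer
        }
        guard let commandBuffer = queue.makeCommandBuffer(),
              let encoder = commandBuffer.makeComputeCommandEncoder() else {
            throw SketchError.noEncoder
        }
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(out, offset: 0, index: 0)
        encoder.dispatchThreads(one, threadsPerThreadgroup: one)
        encoder.endEncoding()
        commandBuffer.commit()

        outputs.append(out)
        commandBuffers.append(commandBuffer)
    }

    // Wait for all submitted work before reading results back on the CPU.
    commandBuffers.forEach { $0.waitUntilCompleted() }
    return outputs.map { $0.contents().load(as: Float.self) }
}
```

Calling `try runIndependentKernels(count: 4)` and capturing the run in the Metal profiler is how one could check whether the command buffers overlap on the Compute timeline. The sketch only sets up the work and does not itself make anything run in parallel.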

Answered by abdyla_v in 738564022

Do your compute commands depend on each other? What if you set the untracked MTLHazardTrackingMode for the resources bound to your compute commands? It will probably give incorrect output, but it would show whether they get executed in parallel.

Accepted Answer
  1. The compute commands do not depend on each other.
  2. About MTLHazardTrackingMode: yes. By default I used MTLResourceStorageModeShared; after changing it to MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked, I did see parallelism in the profiler (see picture).

But the performance has not changed, and the total time with one MTLCommandBuffer is better than when I create two or more. Also, as you can see from the figure, no matter what kind of parallelism I create, the profiler shows only two parallel streams (on the Compute timeline).
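
The resource-option change described above can be sketched in Swift roughly as follows. This is an illustration under assumptions, not the code from my project: the kernel `busy_sum`, the helpers `gpuIntervalsWithUntrackedBuffers` and `overlappingPairs`, and the idea of comparing each command buffer's `gpuStartTime` and `gpuEndTime` instead of reading the profiler are all hypothetical. It assumes macOS 10.15 or later, where those timing properties and `.hazardTrackingModeUntracked` are available. Whether the intervals actually overlap is up to the driver and the GPU.

```swift
import Foundation
import Metal

enum OverlapSketchError: Error { case setupFailed }

// Hypothetical kernel that loops `iterations` times so each dispatch takes measurable time.
let busyKernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void busy_sum(device float *out [[buffer(0)]],
                     constant uint &iterations [[buffer(1)]],
                     uint tid [[thread_position_in_grid]]) {
    float acc = 0.0f;
    for (uint i = 0; i < iterations; ++i) { acc += float(i & 7u); }
    out[tid] = acc;
}
"""

// Counts pairs of (start, end) intervals that strictly overlap in time.
func overlappingPairs(_ intervals: [(Double, Double)]) -> Int {
    var pairs = 0
    for i in 0..<intervals.count {
        for j in (i + 1)..<intervals.count
        where intervals[i].0 < intervals[j].1 && intervals[j].0 < intervals[i].1 {
            pairs += 1
        }
    }
    return pairs
}

// Runs `count` independent dispatches on untracked shared buffers and
// returns the GPU start/end time of each command buffer.
func gpuIntervalsWithUntrackedBuffers(count: Int, iterations: UInt32) throws -> [(Double, Double)] {
    guard let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue() else { throw OverlapSketchError.setupFailed }
    let library = try device.makeLibrary(source: busyKernelSource, options: nil)
    guard let function = library.makeFunction(name: "busy_sum") else {
        throw OverlapSketchError.setupFailed
    }
    let pipeline = try device.makeComputePipelineState(function: function)

    // Shared storage plus untracked hazard mode: Metal no longer inserts
    // dependencies for these buffers, so correctness is the caller's job.
    let options: MTLResourceOptions = [.storageModeShared, .hazardTrackingModeUntracked]
    let one = MTLSize(width: 1, height: 1, depth: 1)
    var loopCount = iterations
    var outputs: [MTLBuffer] = []
    var commandBuffers: [MTLCommandBuffer] = []

    for _ in 0..<count {
        guard let out = device.makeBuffer(length: MemoryLayout<Float>.stride, options: options),
              let commandBuffer = queue.makeCommandBuffer(),
              let encoder = commandBuffer.makeComputeCommandEncoder() else {
            throw OverlapSketchError.setupFailed
        }
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(out, offset: 0, index: 0)
        encoder.setBytes(&loopCount, length: MemoryLayout<UInt32>.stride, index: 1)
        encoder.dispatchThreads(one, threadsPerThreadgroup: one)
        encoder.endEncoding()
        commandBuffer.commit()
        outputs.append(out)
        commandBuffers.append(commandBuffer)
    }

    commandBuffers.forEach { $0.waitUntilCompleted() }
    return commandBuffers.map { ($0.gpuStartTime, $0.gpuEndTime) }
}
```

For example, `overlappingPairs(try gpuIntervalsWithUntrackedBuffers(count: 4, iterations: 1_000_000))` gives a rough count of command buffers whose GPU execution windows overlapped. Swapping `options` back to `.storageModeShared` alone would give the tracked baseline to compare against.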
