Hello All,
I have CUDA code in which I create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. I then rewrote the code for Metal and tried to parallelize the task in the same way. However, I ran into a problem: for some reason, all the compute kernels are always executed sequentially.
I tried creating several MTLCommandBuffers in one MTLCommandQueue. I also tried creating several MTLCommandQueues with additional MTLCommandBuffers, and submitting from several CPU threads. The result is always the same: in the profiler, the command buffers always execute one after another. Screenshots from the CUDA and Metal profilers are below.
CUDA Profiler
Metal Profiler
Metal Profiler
I even created a simple kernel that sums a few numbers and ran it with dispatchThreads((1,1,1),(1,1,1)), and I still cannot get these kernels to run in parallel.
Can anyone help me? Is there a solution, or is this specific to how Metal works on M1?
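For reference, here is a minimal Swift sketch of the kind of setup described above: several MTLCommandQueues, each with one MTLCommandBuffer that dispatches a trivial kernel with a 1x1x1 grid. It is an illustration only, not the original code. The kernel name `sum_kernel`, its body, and the queue count of 4 are assumptions made for the example. Comparing each buffer's `gpuStartTime` and `gpuEndTime` is one way to check for overlap without the profiler.

```swift
import Metal

// Hypothetical trivial kernel (assumption): one thread sums a few numbers.
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void sum_kernel(device float *out [[buffer(0)]],
                       uint tid [[thread_position_in_grid]]) {
    float acc = 0.0f;
    for (uint i = 0; i < 1000; ++i) { acc += float(i); }
    out[tid] = acc;
}
"""

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}
let library = try device.makeLibrary(source: source, options: nil)
guard let function = library.makeFunction(name: "sum_kernel") else {
    fatalError("Kernel function not found")
}
let pipeline = try device.makeComputePipelineState(function: function)

let queueCount = 4  // assumption: number of independent queues to try
let oneThread = MTLSize(width: 1, height: 1, depth: 1)

var queues: [MTLCommandQueue] = []        // kept alive until the work finishes
var outputs: [MTLBuffer] = []             // one independent output per dispatch
var commandBuffers: [MTLCommandBuffer] = []

for _ in 0..<queueCount {
    guard let queue = device.makeCommandQueue(),
          let output = device.makeBuffer(length: MemoryLayout<Float>.stride,
                                         options: .storageModeShared),
          let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else {
        fatalError("Failed to create Metal objects")
    }
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(output, offset: 0, index: 0)
    // Same launch shape as in the post: a 1x1x1 grid with a 1x1x1 threadgroup.
    encoder.dispatchThreads(oneThread, threadsPerThreadgroup: oneThread)
    encoder.endEncoding()

    queues.append(queue)
    outputs.append(output)
    commandBuffers.append(commandBuffer)
}

// Commit everything first, then wait, so the GPU is free to overlap the work.
commandBuffers.forEach { $0.commit() }
commandBuffers.forEach { $0.waitUntilCompleted() }

// Overlapping [start, end] intervals would indicate concurrent execution.
for (index, commandBuffer) in commandBuffers.enumerated() {
    print("buffer \(index): start \(commandBuffer.gpuStartTime), end \(commandBuffer.gpuEndTime)")
}
```

Each dispatch writes to its own buffer, so the command buffers have no data dependencies on each other.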
- The computations do not depend on each other.
- About MTLHazardTrackingMode: yes, by default I used MTLResourceStorageModeShared. I changed it to MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked and did see parallelism in the profiler (see picture).
But the performance has not changed, and the total time with one MTLCommandBuffer is still better than when I create two or more. Also, as you can see from the figure, no matter how much parallelism I create, the profiler only ever shows two parallel streams (Compute timeline).
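The resource option change mentioned above can be sketched in Swift as follows. This is a minimal illustration, not the original code: the buffer size is an assumption, and the Swift option names `.storageModeShared` and `.hazardTrackingModeUntracked` are the counterparts of the Objective-C constants quoted in the post.

```swift
import Metal

guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

let length = 1024 * MemoryLayout<Float>.stride  // assumption: arbitrary size

// Before: shared storage with Metal's default hazard tracking.
guard let trackedBuffer = device.makeBuffer(length: length,
                                            options: .storageModeShared) else {
    fatalError("Failed to allocate tracked buffer")
}

// After: shared storage with hazard tracking turned off for this resource.
guard let untrackedBuffer = device.makeBuffer(
    length: length,
    options: [.storageModeShared, .hazardTrackingModeUntracked]) else {
    fatalError("Failed to allocate untracked buffer")
}

// The mode can be inspected on the resource itself.
print("untracked buffer reports untracked:",
      untrackedBuffer.hazardTrackingMode == .untracked)
print("buffer lengths:", trackedBuffer.length, untrackedBuffer.length)
```

With untracked resources, Metal no longer orders accesses to them automatically, so this only fits cases like the one here, where the dispatches are independent of each other.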