Hello All,
I have CUDA code in which I create several CUDA streams, run my kernels in parallel, and get a performance boost for my task.
Next, I rewrote the code for Metal and tried to parallelize the task in the same way. However, I ran into a problem: for some reason, all the kernels are always executed sequentially on the Compute timeline.
I tried creating several MTLCommandBuffer objects in one MTLCommandQueue. I also tried several MTLCommandQueue objects, each with several MTLCommandBuffer objects, and I tried submitting from several CPU threads. The result is always the same: in the profiler, the command buffers always execute one after another.
Screenshots from profilers for CUDA and Metal are below.
CUDA Profiler
Metal Profiler
Metal Profiler
I even created a simple kernel that sums a few numbers and ran it with dispatchThreads((1,1,1),(1,1,1)), and I still cannot get these kernels to run in parallel.
Can anyone help me? Is there a solution, or is this specific to how Metal works on M1?
Hello All,
I have CUDA code in which I create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. Next, I rewrote the code for Metal and tried to parallelize the task in the same way.
CUDA Streams
Metal device: Mac Studio with M1 Ultra (the code is written with Metal-cpp).
I create several MTLCommandBuffer objects in one MTLCommandQueue, or several MTLCommandQueue objects with several MTLCommandBuffer objects each.
Regarding Metal resources, I tried two options:
Buffers (MTLBuffer) were created with the option MTLResourceStorageModeShared. In the profiler, all command buffers execute sequentially on the Compute timeline.
Buffers (MTLBuffer) were created with the option "MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked". In the profiler, I do see parallelism, but the number of parallel tracks on the Compute timeline is never more than 2 (see pictures), which is also strange.
The compute commands do not depend on each other.
METAL Compute timeline
About performance:
[1] In the first variant, the performance is the same regardless of the number of MTLCommandQueue and MTLCommandBuffer objects.
[2] In the second variant, the performance with one MTLCommandBuffer is better than with 2 or more.
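For reference, the overlap can also be checked numerically instead of reading it off the profiler timeline. Below is a minimal C++ sketch, assuming the GPU start and end timestamps of each command buffer have been collected after completion (for example from the GPUStartTime and GPUEndTime properties of MTLCommandBuffer). GpuInterval, maxConcurrency, and overlapFactor are hypothetical names used for illustration only; they are not part of Metal.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One GPU execution interval in seconds. Hypothetical type for this sketch;
// the values are assumed to come from per-command-buffer GPU timestamps.
struct GpuInterval {
    double start;
    double end;
};

// Largest number of intervals that are in flight at the same instant.
std::size_t maxConcurrency(const std::vector<GpuInterval>& intervals) {
    // Turn each interval into a +1 event at its start and a -1 event at its end.
    std::vector<std::pair<double, int>> events;
    events.reserve(intervals.size() * 2);
    for (const GpuInterval& iv : intervals) {
        events.emplace_back(iv.start, +1);
        events.emplace_back(iv.end, -1);
    }
    // Pairs sort by time first; at equal times -1 comes before +1,
    // so back-to-back command buffers are not counted as overlapping.
    std::sort(events.begin(), events.end());

    int current = 0;
    int best = 0;
    for (const auto& ev : events) {
        current += ev.second;
        best = std::max(best, current);
    }
    return static_cast<std::size_t>(best);
}

// Sum of the individual durations divided by the time span they cover.
// A value of at most 1.0 means no overlap; larger values mean real overlap.
double overlapFactor(const std::vector<GpuInterval>& intervals) {
    if (intervals.empty()) return 0.0;
    double busy = 0.0;
    double first = intervals.front().start;
    double last = intervals.front().end;
    for (const GpuInterval& iv : intervals) {
        busy += iv.end - iv.start;
        first = std::min(first, iv.start);
        last = std::max(last, iv.end);
    }
    const double span = last - first;
    return span > 0.0 ? busy / span : 0.0;
}
```

With timestamps like the ones in variant [1], maxConcurrency would be expected to return 1 and overlapFactor to stay at or below 1.0; for variant [2], a maxConcurrency of 2 would match what the Compute timeline shows.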
Question: why is this happening? How can I parallelize the compute kernels to get a performance increase?
Additional information:
Also, the CUDA code has been ported to OpenCL, and it parallelizes perfectly on Windows (NVIDIA/AMD/Intel) when several OpenCL queues are running. The same code on M1 Ultra performs the same with one OpenCL queue as with many. Metal, in turn, is faster than OpenCL, so I am focusing on Metal and trying to make the kernels run in parallel there.