Hello All,
I have CUDA code in which I create several CUDA streams, run my kernels in parallel, and get a performance boost for my task. I then rewrote the code for Metal and tried to parallelize the task in the same way.
[Image: CUDA Streams]
Metal device: Mac Studio with M1 Ultra (the code is written with metal-cpp).
I create several MTLCommandBuffer objects in one MTLCommandQueue, or several MTLCommandQueue objects with multiple MTLCommandBuffer objects.
Regarding Metal resources, there are two options:
- The buffers (MTLBuffer) are created with the option MTLResourceStorageModeShared. In the profiler, all command buffers are executed sequentially on the Compute timeline.
- The buffers (MTLBuffer) are created with the options "MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked". In the profiler, I do see parallelism, but the Compute timeline never shows more than 2 parallel tracks (see pictures), which is also odd.
The compute commands do not depend on each other.
[Image: Metal Compute timeline]
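For reference, the second variant described above (shared, untracked buffers with independent compute work spread over several queues) might look roughly like the following with metal-cpp. This is a minimal sketch and not the actual project code: `splitWork` and `submitIndependentWork` are hypothetical helpers, and it assumes one command buffer per queue, one `float` buffer per command buffer, and a kernel that takes a single buffer at index 0. The Metal part is compiled only where the metal-cpp headers are found, and `NS_PRIVATE_IMPLEMENTATION` / `MTL_PRIVATE_IMPLEMENTATION` are assumed to be defined in one other translation unit.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split `total` work items into `parts` contiguous
// [begin, end) ranges, one range per command buffer.
std::vector<std::pair<std::size_t, std::size_t>> splitWork(std::size_t total, std::size_t parts)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (parts == 0) return ranges;
    const std::size_t base = total / parts;
    const std::size_t extra = total % parts;  // the first `extra` ranges get one more item
    std::size_t begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t length = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + length);
        begin += length;
    }
    return ranges;
}

#if __has_include(<Metal/Metal.hpp>)
#include <Foundation/Foundation.hpp>
#include <Metal/Metal.hpp>

// Hypothetical helper: one queue, one untracked shared buffer and one command
// buffer per chunk of work; everything is committed before anything is waited on.
void submitIndependentWork(MTL::Device* device, MTL::ComputePipelineState* pipeline,
                           std::size_t totalItems, std::size_t queueCount)
{
    assert(device != nullptr && pipeline != nullptr);
    NS::AutoreleasePool* pool = NS::AutoreleasePool::alloc()->init();

    // Shared storage, with Metal's automatic hazard tracking switched off.
    const MTL::ResourceOptions options =
        MTL::ResourceStorageModeShared | MTL::ResourceHazardTrackingModeUntracked;

    std::vector<MTL::CommandQueue*> queues;
    std::vector<MTL::Buffer*> buffers;
    std::vector<MTL::CommandBuffer*> commandBuffers;

    for (const auto& range : splitWork(totalItems, queueCount)) {
        const std::size_t count = range.second - range.first;
        if (count == 0) continue;

        MTL::CommandQueue* queue = device->newCommandQueue();
        MTL::Buffer* buffer = device->newBuffer(count * sizeof(float), options);
        MTL::CommandBuffer* commandBuffer = queue->commandBuffer();  // autoreleased
        MTL::ComputeCommandEncoder* encoder = commandBuffer->computeCommandEncoder();

        encoder->setComputePipelineState(pipeline);
        encoder->setBuffer(buffer, 0, 0);  // assumed kernel argument: buffer index 0
        NS::UInteger groupSize = pipeline->maxTotalThreadsPerThreadgroup();
        if (groupSize > count) groupSize = count;
        encoder->dispatchThreads(MTL::Size(count, 1, 1), MTL::Size(groupSize, 1, 1));
        encoder->endEncoding();
        commandBuffer->commit();  // submit without waiting

        queues.push_back(queue);
        buffers.push_back(buffer);
        commandBuffers.push_back(commandBuffer);
    }

    // Wait only after all command buffers have been committed.
    for (MTL::CommandBuffer* commandBuffer : commandBuffers) commandBuffer->waitUntilCompleted();

    for (MTL::Buffer* buffer : buffers) buffer->release();
    for (MTL::CommandQueue* queue : queues) queue->release();
    pool->release();
}
#endif
```

The first variant differs only in dropping `MTL::ResourceHazardTrackingModeUntracked` from `options`; the single-queue case would create one `MTL::CommandQueue` outside the loop and request every command buffer from it.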
About performance:
[1] In the first variant, the performance is the same regardless of the number of MTLCommandQueue and MTLCommandBuffer objects.
[2] In the second variant, the performance with one MTLCommandBuffer is better than with 2 or more.
Question: why is this happening? How can I parallelize the compute kernels to get a performance increase?
Additional information: the CUDA code has also been ported to OpenCL, and it parallelizes perfectly on Windows (NVIDIA/AMD/Intel) when several OpenCL queues are running. The same code running on the M1 Ultra performs the same with 1 or with many OpenCL queues. Metal, in turn, is faster than OpenCL, so I am trying to figure out Metal specifically and make the kernels run in parallel on Metal.