Post

Replies

Boosts

Views

Activity

Metal with M1 Ultra and parallel execution of kernels (MTLCommandQueue/MTLCommandBuffer/MTLComputeCommandEncoder)
Hello All, I have code on CUDA, and I can create several CUDA streams and run my kernels in parallel and get a performance boost for my task. Next, I rewrote the code for Metal and try to parallelize the task in the same way. But I ran into a problem, for some reason all the kernels on Compute are always executed sequentially. I tried to create several MTLCommandBuffer in 1 MTLCommandQueue. Also created several MTLCommandQueue with more MTLCommandBuffer. Or I used several CPU threads. But the result is always the same. In the profiler, I always observe that CommandBuffer works in order. Screenshots from profilers for CUDA and Metal are below. CUDA Profiler Metal Profiler Metal Profiles I even created a simple kernel that does the sum of some numbers, I run this kernel with dispatchThreads((1,1,1),(1,1,1)) parameters, and I also cannot get these kernels to work in parallel. Anyone can help me? Is there a solution or is this the specifics of Metal on M1 work?
2
0
1.2k
Dec ’22
Metal and low performance with parallel execution of kernels (MTLComputeCommandEncoder)
Hello All, I have code on CUDA, and I can create several CUDA streams and run my kernels in parallel and get a performance boost for my task. Next, I rewrote the code for Metal and try to parallelize the task in the same way. CUDA Streams Metal device: Mac Studio with M1 Ultra. (write the code on Metal-cpp) I creating several MTLCommandBuffer in 1 MTLCommandQueue or several MTLCommandQueue with more MTLCommandBuffer. Regarding Metal resources, there are two options: Buffers (MTLBuffer) was created with an option MTLResourceStorageModeShared. In the profiler, all Command buffers are performed sequentially on the timeline of Compute. Buffers (MTLBuffer) was created with an option "MTLResourceStorageModeShared | MTLResourceHazardTrackingModeUntracked". In the profiler, I really saw the parallelism. But the maximum number of threads in the Compute timeline is always no more than 2 (see pictures). Also weird. Computing commands do not depend on each other. METAL Compute timeline About performance: [1] In the first variant, the performance is the same for different amounts of MTLCommandQueue and MTLCommandBuffer. [2] In the second variant, the performance for one MTLCommandBuffer is greater than for 2 or more. Question: why is this happening? How to parallelize the work of the compute kernels to get an increase performance? Addition information: Also, the CUDA code is rewritten in OpenCL, and it is perfectly parallelized in Windows(NVIDIA/AMD/Intel) if several OpenCL queues are running. The same code running on M1 Ultra works the same way with 1 or with many OpenCL queues. In turn, Metal is faster than OpenCL, so I am trying to figure out exactly Metal, and make the kernels work in parallel on Metal.
4
0
2.2k
Dec ’22