I have a project that solves the viscoelastic equation for sound transmission in biological media https://github.com/ProteusMRIgHIFU/BabelViscoFDTD. This code supports CUDA, OpenCL, Metal, and OpenMP backends. We have done a lot of fine-tuning for each backend to get the best performance possible for each platform. Details of the numerical simulation and hardware used are detailed in the link above. Here you can see a summary of the results:
First of all, the M1 Max is a knockout to both AMD and Nvidia, but only if using OpenCL. Worth noting, the OpenMP performance of the M1 Max is also more than excellent. It is simply mindblowing the M1 Max is neck to neck to an Nvidia RTX A6000 that cost more than the Macbook Pro that was used for the test. Metal results, on the other hand, are a bit inconsistent. Metal shows excellent results on AMD W6800 Pro (the best computing time of all tested GPUs), but not so much with a Vega 56 or the M1 Max. For all Metal-capable processors, we used the first formula recommended at https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes.
Further tests trying different domain sizes showed that the M1 Max with OpenCL can get even better results than the A6000, but Metal remains lagging by a lot.
Is there something else for the M1 Max with Metal that I could be missing or worth exploring? I want to be sure our applications are future-proof, given it was even surprising OpenCL is still alive in Monterey, but we know it is supposed to be discontinued at some point.
Post
Replies
Boosts
Views
Activity
Hi,
Related to another post I made trying to explore better ways to increase the efficiency of Metal compute kernels for very large calculations, we gave a shot to use indirect compute command buffers. When we implemented the app mentioned in post with indirect buffers, we always got our final result memory buffers just with 0s. Then we implemented a much simpler code where we just do a simple operation adding two buffers to try to understand where there could be the issue. In this operation, we use the following kernel as a proof of concept, which just does an accumulation task in a cubic 3D array:
#define size (*size_pr)
kernel void metalswift_add(const device float *Buffer1 [[ buffer(0) ]],
const device float *Buffer2[[ buffer(1) ]],
device float *OutputBuffer[[ buffer(2) ]],
const device int *size_pr[[ buffer(3) ]],
uint3 gid[[thread_position_in_grid]]) {
int i = gid.x;
int j = gid.y;
int k = gid.z;
if (i < size && j < size && k < size)
{
int index=i*size*size+j*size+k;
OutputBuffer[index]+= Buffer1[index] + Buffer2[index];
}
}
The indirect command buffer works as expected for memory buffers with a modest size; for example, matrices smaller than 8 x 8 x 8 = 512 float entries. But then, when the buffer becomes a bit bigger (for example, anything equal to or larger than 16 x 16 x 16 = 4096 float entries), then the indirect command buffer seems to stop working as our result buffer just has 0s. When running using a standard command pipeline, we obtain the desired results. Below there is the Swift implementation of the indirect command buffer (the standard pipeline is not shown for post space limitations):
...
@_cdecl("metalswift_add_indirect")
public func metalswift_add_indirect(array1: UnsafeMutablePointer<Float>,array2: UnsafeMutablePointer<Float>, ldim: Int) -> UnsafeMutablePointer<Float> {
print("USING INDIRECT COMMAND BUFFER")
var device : MTLDevice!
device = MTLCreateSystemDefaultDevice()!
let defaultLibrary = try! device.makeLibrary(source: computeKernel, options: nil)
let kernel_function = defaultLibrary.makeFunction(name: "kernel_operation")!
let descriptor = MTLComputePipelineDescriptor()
descriptor.computeFunction = kernel_function
descriptor.supportIndirectCommandBuffers = true
let computePipelineState = try! device.makeComputePipelineState(descriptor: descriptor, options: .init(), reflection: nil)
let Ref1 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array1)
let Ref2 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array2)
var size = ldim
let SizeBuffer : UnsafeMutablePointer = UnsafeMutablePointer(&size)
let ll = MemoryLayout<Float>.stride * ldim*ldim*ldim
let Buffer1:MTLBuffer! = device.makeBuffer(bytes:Ref1, length: ll, options: MTLResourceOptions.storageModeShared)
let Buffer2:MTLBuffer! = device.makeBuffer(bytes:Ref2, length: ll, options: MTLResourceOptions.storageModeShared)
let OutputBuffer:MTLBuffer! = device.makeBuffer(length: ll, options: MTLResourceOptions.storageModeShared)
let Size:MTLBuffer! = device.makeBuffer(bytes: SizeBuffer, length: MemoryLayout<Int>.size, options: MTLResourceOptions.storageModeShared)
let icbDescriptor:MTLIndirectCommandBufferDescriptor = MTLIndirectCommandBufferDescriptor()
icbDescriptor.commandTypes.insert(MTLIndirectCommandType.concurrentDispatchThreads)
icbDescriptor.inheritBuffers = false
icbDescriptor.inheritPipelineState = false
icbDescriptor.maxKernelBufferBindCount = 4
let indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor, maxCommandCount: 1)!
let icbCommand = indirectCommandBuffer.indirectComputeCommandAt(0)
icbCommand.setComputePipelineState(computePipelineState)
icbCommand.setKernelBuffer(Buffer1, offset: 0, at: 0)
icbCommand.setKernelBuffer(Buffer2, offset: 0, at: 1)
icbCommand.setKernelBuffer(OutputBuffer, offset: 0, at: 2)
icbCommand.setKernelBuffer(Size, offset: 0, at: 3)
let w = computePipelineState.threadExecutionWidth
let h = Int(computePipelineState.maxTotalThreadsPerThreadgroup / w)
let z = 1
icbCommand.concurrentDispatchThreads(MTLSize(width:ldim, height: ldim, depth: ldim), threadsPerThreadgroup:MTLSize(width:w, height: h, depth: z))
let commandQueue = device.makeCommandQueue()!
for _ in 0..<10 {
let commandBuffer = commandQueue.makeCommandBuffer()!
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
computeCommandEncoder.executeCommandsInBuffer(indirectCommandBuffer, range:0..<1)
computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
}
print("Last entry of buffer (it should not be 0)",OutputBuffer!.contents().assumingMemoryBound(to: Float.self)[ldim*ldim*ldim-1])
return(OutputBuffer!.contents().assumingMemoryBound(to: Float.self))
}
@_cdecl("metalswift_add_standard")
public func metalswift_add_standard(array1: UnsafeMutablePointer<Float>,array2: UnsafeMutablePointer<Float>, ldim: Int) -> UnsafeMutablePointer<Float> {
//Rest of code not shown for space constrains in post
We run this Swift code as part of a C Python extension, with a Python code such as:
import numpy as np
import myModule # this wraps the Swift library
Side3DArray=12
a = np.arange(Side3DArray**3, dtype=np.single).reshape(Side3DArray,Side3DArray,Side3DArray)
res=myModule.addition_command_buffer(a*2, a)
print(res.flatten())
res2=myModule.addition_standard(a*2, a)
print(res2.flatten())
print(np.all((10*(a*2+a) )==res))
print(np.all((10*(a*2+a) )==res2))
When we run with Side3DArray=12, the last test result returns True for both indirect command buffer and the standard pipeline :
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
True
True
but when running with Side3DArray=16 the indirect buffer approach returns 0s in the output:
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 0.0
[0. 0. 0. ... 0. 0. 0.]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 122850.0
[0.0000e+00 3.0000e+01 6.0000e+01 ... 1.2279e+05 1.2282e+05 1.2285e+05]
False
True
Tests with MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 did not indicate any issue.
Then we'd like to know if:
Is something else we are missing for a correct indirect buffer command execution?
Is there some sort of limitation with indirect buffer commands that would prevent using more demanding compute kernels?
Thanks for andy advice,
Sam
PS: Forgot to mention, we tried to capture the instrument trace, but XCode crashes when trying opening the trace
Hi,
I have an issue that I have experienced with either a newly released metalcompute library for Python (https://github.com/baldand/py-metal-compute) but also in my own C-extensions of Metal through Swift for Python. It seems that the allocated buffers do not get
released. Because of space constraints in these forums, you can take look at a working example at https://github.com/baldand/py-metal-compute/issues/19. My own implementations are relatively simple: A Swift function declared with a @_cdecl is imported in Python via ctypes.CDLL. This Swift function creates MTLBuffers, copy C arrays as input, executes the Metal kernel and copies results in C arrays. The MTLbuffers only live inside the Swift function, with no need to be managed by Python GC. When running and the Swift function is called N-times (as in thousands of times), the allocated memory visible in ActivityMontior continues to grow until running out of memory and a MTLBuffers cannot be created anymore.
I wonder if the Swift deallocator is having some sort of blockage while being called as a C extension in Python and if there is a way to force the deallocation once the call to Metal compute has been completed.
Any help would be highly appreciated, as I plan to use Metal for very long calculations, calling thousands of times a kernel during the life of a single process.
Cheers
Sam