Hi,
Related to another post I made trying to explore better ways to increase the efficiency of Metal compute kernels for very large calculations, we gave a shot to use indirect compute command buffers. When we implemented the app mentioned in post with indirect buffers, we always got our final result memory buffers just with 0s. Then we implemented a much simpler code where we just do a simple operation adding two buffers to try to understand where there could be the issue. In this operation, we use the following kernel as a proof of concept, which just does an accumulation task in a cubic 3D array:
#define size (*size_pr)
kernel void metalswift_add(const device float *Buffer1 [[ buffer(0) ]],
const device float *Buffer2[[ buffer(1) ]],
device float *OutputBuffer[[ buffer(2) ]],
const device int *size_pr[[ buffer(3) ]],
uint3 gid[[thread_position_in_grid]]) {
int i = gid.x;
int j = gid.y;
int k = gid.z;
if (i < size && j < size && k < size)
{
int index=i*size*size+j*size+k;
OutputBuffer[index]+= Buffer1[index] + Buffer2[index];
}
}
The indirect command buffer works as expected for memory buffers with a modest size; for example, matrices smaller than 8 x 8 x 8 = 512 float entries. But then, when the buffer becomes a bit bigger (for example, anything equal to or larger than 16 x 16 x 16 = 4096 float entries), then the indirect command buffer seems to stop working as our result buffer just has 0s. When running using a standard command pipeline, we obtain the desired results. Below there is the Swift implementation of the indirect command buffer (the standard pipeline is not shown for post space limitations):
...
@_cdecl("metalswift_add_indirect")
public func metalswift_add_indirect(array1: UnsafeMutablePointer<Float>,array2: UnsafeMutablePointer<Float>, ldim: Int) -> UnsafeMutablePointer<Float> {
print("USING INDIRECT COMMAND BUFFER")
var device : MTLDevice!
device = MTLCreateSystemDefaultDevice()!
let defaultLibrary = try! device.makeLibrary(source: computeKernel, options: nil)
let kernel_function = defaultLibrary.makeFunction(name: "kernel_operation")!
let descriptor = MTLComputePipelineDescriptor()
descriptor.computeFunction = kernel_function
descriptor.supportIndirectCommandBuffers = true
let computePipelineState = try! device.makeComputePipelineState(descriptor: descriptor, options: .init(), reflection: nil)
let Ref1 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array1)
let Ref2 : UnsafeMutablePointer<Float> = UnsafeMutablePointer(array2)
var size = ldim
let SizeBuffer : UnsafeMutablePointer = UnsafeMutablePointer(&size)
let ll = MemoryLayout<Float>.stride * ldim*ldim*ldim
let Buffer1:MTLBuffer! = device.makeBuffer(bytes:Ref1, length: ll, options: MTLResourceOptions.storageModeShared)
let Buffer2:MTLBuffer! = device.makeBuffer(bytes:Ref2, length: ll, options: MTLResourceOptions.storageModeShared)
let OutputBuffer:MTLBuffer! = device.makeBuffer(length: ll, options: MTLResourceOptions.storageModeShared)
let Size:MTLBuffer! = device.makeBuffer(bytes: SizeBuffer, length: MemoryLayout<Int>.size, options: MTLResourceOptions.storageModeShared)
let icbDescriptor:MTLIndirectCommandBufferDescriptor = MTLIndirectCommandBufferDescriptor()
icbDescriptor.commandTypes.insert(MTLIndirectCommandType.concurrentDispatchThreads)
icbDescriptor.inheritBuffers = false
icbDescriptor.inheritPipelineState = false
icbDescriptor.maxKernelBufferBindCount = 4
let indirectCommandBuffer = device.makeIndirectCommandBuffer(descriptor: icbDescriptor, maxCommandCount: 1)!
let icbCommand = indirectCommandBuffer.indirectComputeCommandAt(0)
icbCommand.setComputePipelineState(computePipelineState)
icbCommand.setKernelBuffer(Buffer1, offset: 0, at: 0)
icbCommand.setKernelBuffer(Buffer2, offset: 0, at: 1)
icbCommand.setKernelBuffer(OutputBuffer, offset: 0, at: 2)
icbCommand.setKernelBuffer(Size, offset: 0, at: 3)
let w = computePipelineState.threadExecutionWidth
let h = Int(computePipelineState.maxTotalThreadsPerThreadgroup / w)
let z = 1
icbCommand.concurrentDispatchThreads(MTLSize(width:ldim, height: ldim, depth: ldim), threadsPerThreadgroup:MTLSize(width:w, height: h, depth: z))
let commandQueue = device.makeCommandQueue()!
for _ in 0..<10 {
let commandBuffer = commandQueue.makeCommandBuffer()!
let computeCommandEncoder = commandBuffer.makeComputeCommandEncoder()!
computeCommandEncoder.executeCommandsInBuffer(indirectCommandBuffer, range:0..<1)
computeCommandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
}
print("Last entry of buffer (it should not be 0)",OutputBuffer!.contents().assumingMemoryBound(to: Float.self)[ldim*ldim*ldim-1])
return(OutputBuffer!.contents().assumingMemoryBound(to: Float.self))
}
@_cdecl("metalswift_add_standard")
public func metalswift_add_standard(array1: UnsafeMutablePointer<Float>,array2: UnsafeMutablePointer<Float>, ldim: Int) -> UnsafeMutablePointer<Float> {
//Rest of code not shown for space constrains in post
We run this Swift code as part of a C Python extension, with a Python code such as:
import numpy as np
import myModule # this wraps the Swift library
Side3DArray=12
a = np.arange(Side3DArray**3, dtype=np.single).reshape(Side3DArray,Side3DArray,Side3DArray)
res=myModule.addition_command_buffer(a*2, a)
print(res.flatten())
res2=myModule.addition_standard(a*2, a)
print(res2.flatten())
print(np.all((10*(a*2+a) )==res))
print(np.all((10*(a*2+a) )==res2))
When we run with Side3DArray=12
, the last test result returns True for both indirect command buffer and the standard pipeline :
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 51810.0
[0.000e+00 3.000e+01 6.000e+01 ... 5.175e+04 5.178e+04 5.181e+04]
True
True
but when running with Side3DArray=16
the indirect buffer approach returns 0s in the output:
USING INDIRECT COMMAND BUFFER
Last entry of buffer (it should not be 0) 0.0
[0. 0. 0. ... 0. 0. 0.]
USING STANDARD COMMAND
Last entry of buffer (it should not be 0) 122850.0
[0.0000e+00 3.0000e+01 6.0000e+01 ... 1.2279e+05 1.2282e+05 1.2285e+05]
False
True
Tests with MTL_DEBUG_LAYER=1 and MTL_SHADER_VALIDATION=1 did not indicate any issue.
Then we'd like to know if:
- Is something else we are missing for a correct indirect buffer command execution?
- Is there some sort of limitation with indirect buffer commands that would prevent using more demanding compute kernels?
Thanks for andy advice,
Sam PS: Forgot to mention, we tried to capture the instrument trace, but XCode crashes when trying opening the trace