I'm implementing a bitonic sort in Metal with a Swift app. This requires hundreds of kernel dispatch calls, one for each swap stage. Each dispatch touches the whole array, but the work required of the GPU per dispatch is small. I haven't been able to get this to run fast enough in Swift, and it seems to be due to a high overhead for each dispatch command. I rewrote the test program in Objective-C with a super-simple kernel function, and it runs about 25x faster than the Swift version!
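Kernel function:
#include <metal_stdlib>
using namespace metal;

// Writes `value` into the first N bytes of `array`, one byte per thread.
kernel void fill(device uint8_t *array [[buffer(0)]],
                 const device uint32_t &N [[buffer(1)]],
                 const device uint8_t &value [[buffer(2)]],
                 uint i [[thread_position_in_grid]])
{
    // Threads at or beyond N do nothing.
    if (i < N) {
        array[i] = value;
    }
}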
The Swift code is:
import Foundation
import Metal

func fill(pso:MTLComputePipelineState, buffer:MTLBuffer, N: Int, passes: Int) {
    guard let commandBuffer = commandQueue.makeCommandBuffer() else { return }
    let gridSize = MTLSizeMake(N, 1, 1)
    var threadGroupSize = pso.maxTotalThreadsPerThreadgroup
    if (threadGroupSize > N) {
        threadGroupSize = N;
    }
    let threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);

    // One encoder and one dispatch per pass, all in a single command buffer.
    for pass in 0..<passes {
        guard let computeEncoder = commandBuffer.makeComputeCommandEncoder() else { return }
        var value:UInt8 = UInt8(pass);
        var NN:UInt32 = UInt32(N);
        computeEncoder.setComputePipelineState(pso)
        computeEncoder.setBuffer(buffer, offset: 0, index: 0)
        computeEncoder.setBytes(&NN, length: MemoryLayout<UInt32>.size, index: 1)
        computeEncoder.setBytes(&value, length: MemoryLayout<UInt8>.size, index: 2)
        computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
        computeEncoder.endEncoding()
    }
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let commandQueue = device.makeCommandQueue()!
let funcFill = library.makeFunction(name: "fill")!
let pso = try? device.makeComputePipelineState(function: funcFill)

var N = 16384
let passes = 100
let buffer = device.makeBuffer(length:N, options: [.storageModePrivate])!

// Time ten runs of `passes` dispatches each.
for _ in 1...10 {
    let startTime = DispatchTime.now()
    fill(pso:pso!, buffer:buffer, N:N, passes:passes)
    let endTime = DispatchTime.now()
    let elapsedTime = endTime.uptimeNanoseconds - startTime.uptimeNanoseconds
    print("Elapsed time:", Float(elapsedTime)/1_000_000, "ms");
}
And the Objective-C code (which should be almost identical) is:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

void fill(id<MTLCommandQueue> commandQueue,
          id<MTLComputePipelineState> funcPSO,
          id<MTLBuffer> A,
          uint32_t N,
          int passes) {
    id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
    MTLSize gridSize = MTLSizeMake(N, 1, 1);
    NSUInteger threadGroupSize = funcPSO.maxTotalThreadsPerThreadgroup;
    if (threadGroupSize > N) {
        threadGroupSize = N;
    }
    MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);

    // One encoder and one dispatch per pass, all in a single command buffer.
    for(uint8_t pass=0; pass<passes; pass++)
    {
        id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
        [computeEncoder setComputePipelineState:funcPSO];
        [computeEncoder setBuffer:A offset:0 atIndex:0];
        [computeEncoder setBytes:&N length:sizeof(uint32_t) atIndex:1];
        [computeEncoder setBytes:&pass length:sizeof(uint8_t) atIndex:2];
        [computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
        [computeEncoder endEncoding];
    }
    [commandBuffer commit];
    [commandBuffer waitUntilCompleted];
}

int main() {
    NSError *error;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> library = [device newDefaultLibrary];
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
    id<MTLFunction> funcFill = [library newFunctionWithName:@"fill"];
    id<MTLComputePipelineState> pso = [device newComputePipelineStateWithFunction:funcFill error:&error];

    // Prepare data
    int N = 16384;
    int passes = 100;
    id<MTLBuffer> bufferA = [device newBufferWithLength:N options:MTLResourceStorageModePrivate];

    // Time ten runs of `passes` dispatches each.
    for(int it=1; it<=10; it++)
    {
        CFTimeInterval startTime = CFAbsoluteTimeGetCurrent();
        fill(commandQueue, pso, bufferA, N, passes);
        CFTimeInterval duration = CFAbsoluteTimeGetCurrent() - startTime;
        NSLog(@"Elapsed time: %.1f ms", 1000*duration);
    }
}
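One difference between the two listings may be worth checking before comparing timings. The Swift code passes gridSize to dispatchThreadgroups, which takes a count of threadgroups. The Objective-C code passes the same value to dispatchThreads, which takes a count of threads. Below is a minimal sketch of the resulting thread counts in plain Swift, with no Metal calls. It assumes maxTotalThreadsPerThreadgroup is 1024, a value that is not reported anywhere above and depends on the device and pipeline. The constant names are made up for illustration.

```swift
// Thread-count arithmetic for the two dispatch calls in the listings above.
// ASSUMPTION: maxTotalThreadsPerThreadgroup is 1024; the real value depends
// on the device and the pipeline state.
let n = 16384
let assumedMaxThreadsPerThreadgroup = 1024  // hypothetical, for illustration only
let threadsPerThreadgroup = min(assumedMaxThreadsPerThreadgroup, n)

// dispatchThreadgroups takes a number of threadgroups, so passing N
// launches N threadgroups of threadsPerThreadgroup threads each.
let threadsViaDispatchThreadgroups = n * threadsPerThreadgroup

// dispatchThreads takes a number of threads, so passing N launches N threads.
let threadsViaDispatchThreads = n

// Number of threadgroups that covers N threads, rounded up.
let threadgroupsNeeded = (n + threadsPerThreadgroup - 1) / threadsPerThreadgroup

print("dispatchThreadgroups launches", threadsViaDispatchThreadgroups, "threads")
print("dispatchThreads launches", threadsViaDispatchThreads, "threads")
print("threadgroups needed to cover N:", threadgroupsNeeded)
```

Under that assumption, the Swift listing schedules far more threads per pass than the Objective-C one. Most of them fail the i < N check in the kernel and write nothing, but they still have to be launched. Whether this accounts for the whole gap is something I have not verified.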
The Swift output is:
Elapsed time: 89.35556 ms
Elapsed time: 63.243744 ms
Elapsed time: 62.39568 ms
Elapsed time: 62.183224 ms
Elapsed time: 63.741913 ms
Elapsed time: 63.59463 ms
Elapsed time: 62.378654 ms
Elapsed time: 61.746098 ms
Elapsed time: 61.530384 ms
Elapsed time: 60.88774 ms
The Objective-C output is:
2024-04-18 19:27:45.704 compute_test[3489:92754] Elapsed time: 3.6 ms
2024-04-18 19:27:45.706 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.709 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.712 compute_test[3489:92754] Elapsed time: 2.6 ms
2024-04-18 19:27:45.714 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.717 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.720 compute_test[3489:92754] Elapsed time: 2.8 ms
2024-04-18 19:27:45.723 compute_test[3489:92754] Elapsed time: 2.7 ms
2024-04-18 19:27:45.726 compute_test[3489:92754] Elapsed time: 2.5 ms
2024-04-18 19:27:45.728 compute_test[3489:92754] Elapsed time: 2.5 ms
I compile the Swift code in Release mode, optimised for speed.
I can't believe there should be such a difference between the two languages here, so what could be different, and what might I be doing wrong?
Thanks,
Adrian