Post

Replies

Boosts

Views

Activity

High Kernel Dispatch Overhead for Metal for Swift
I'm implementing a bitonic sort in Metal with a Swift app. This requires 100's kernel dispatch calls for each of the swap stages which touch the whole array, the work required by the GPU is small. I haven't been able to get this to run fast enough in Swift and it seems its due to a high overhead for each dispatchThread command. I rewrote the test program in Objective C with a super-simple kernel function and its runs 25x faster from Objective C! Kernel function kernel void fill(device uint8_t *array [[buffer(0)]], const device uint32_t &N [[buffer(1)]], const device uint8_t &value [[buffer(2)]], uint i [[thread_position_in_grid]]) { if (i < N) { array[i] = value; } } The Swift code is: func fill(pso:MTLComputePipelineState, buffer:MTLBuffer, N: Int, passes: Int) { guard let commandBuffer = commandQueue.makeCommandBuffer() else { return } let gridSize = MTLSizeMake(N, 1, 1) var threadGroupSize = pso.maxTotalThreadsPerThreadgroup if (threadGroupSize > N) { threadGroupSize = N; } let threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1); for pass in 0..<passes { guard let computeEncoder = commandBuffer.makeComputeCommandEncoder() else { return } var value:UInt8 = UInt8(pass); var NN:UInt32 = UInt32(N); computeEncoder.setComputePipelineState(pso) computeEncoder.setBuffer(buffer, offset: 0, index: 0) computeEncoder.setBytes(&NN, length: MemoryLayout<UInt32>.size, index: 1) computeEncoder.setBytes(&value, length: MemoryLayout<UInt8>.size, index: 2) computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize) computeEncoder.endEncoding() } commandBuffer.commit() commandBuffer.waitUntilCompleted() } let device = MTLCreateSystemDefaultDevice()! let library = device.makeDefaultLibrary()! let commandQueue = device.makeCommandQueue()! let funcFill = library.makeFunction(name: "fill")! let pso = try? device.makeComputePipelineState(function: funcFill) var N = 16384 let passes = 100 let buffer = device.makeBuffer(length:N, options: [.storageModePrivate])! for _ in 1...10 { let startTime = DispatchTime.now() fill(pso:pso!, buffer:buffer, N:N, passes:passes) let endTime = DispatchTime.now() let elapsedTime = endTime.uptimeNanoseconds - startTime.uptimeNanoseconds print("Elapsed time:", Float(elapsedTime)/1_000_000, "ms"); } and the Objective C code (which should be almost identical) is void fill(id<MTLCommandQueue> commandQueue, id<MTLComputePipelineState> funcPSO, id<MTLBuffer> A, uint32_t N, int passes) { id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer]; MTLSize gridSize = MTLSizeMake(N, 1, 1); NSUInteger threadGroupSize = funcPSO.maxTotalThreadsPerThreadgroup; if (threadGroupSize > N) { threadGroupSize = N; } MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1); for(uint8_t pass=0; pass<passes; pass++) { id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder]; [computeEncoder setComputePipelineState:funcPSO]; [computeEncoder setBuffer:A offset:0 atIndex:0]; [computeEncoder setBytes:&N length:sizeof(uint32_t) atIndex:1]; [computeEncoder setBytes:&pass length:sizeof(uint8_t) atIndex:2]; [computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize]; [computeEncoder endEncoding]; } [commandBuffer commit]; [commandBuffer waitUntilCompleted]; } int main() { NSError *error; id<MTLDevice> device = MTLCreateSystemDefaultDevice(); id<MTLLibrary> library = [device newDefaultLibrary]; id<MTLCommandQueue> commandQueue = [device newCommandQueue]; id<MTLFunction> funcFill = [library newFunctionWithName:@"fill"]; id<MTLComputePipelineState> pso = [device newComputePipelineStateWithFunction:funcFill error:&error]; // Prepare data int N = 16384; int passes = 100; id<MTLBuffer> bufferA = [device newBufferWithLength:N options:MTLResourceStorageModePrivate]; for(int it=1; it<=10; it++) { CFTimeInterval startTime = CFAbsoluteTimeGetCurrent(); fill(commandQueue, pso, bufferA, N, passes); CFTimeInterval duration = CFAbsoluteTimeGetCurrent() - startTime; NSLog(@"Elapsed time: %.1f ms", 1000*duration); } } The Swift output is: Elapsed time: 89.35556 ms Elapsed time: 63.243744 ms Elapsed time: 62.39568 ms Elapsed time: 62.183224 ms Elapsed time: 63.741913 ms Elapsed time: 63.59463 ms Elapsed time: 62.378654 ms Elapsed time: 61.746098 ms Elapsed time: 61.530384 ms Elapsed time: 60.88774 ms The objective C output is 2024-04-18 19:27:45.704 compute_test[3489:92754] Elapsed time: 3.6 ms 2024-04-18 19:27:45.706 compute_test[3489:92754] Elapsed time: 2.6 ms 2024-04-18 19:27:45.709 compute_test[3489:92754] Elapsed time: 2.6 ms 2024-04-18 19:27:45.712 compute_test[3489:92754] Elapsed time: 2.6 ms 2024-04-18 19:27:45.714 compute_test[3489:92754] Elapsed time: 2.7 ms 2024-04-18 19:27:45.717 compute_test[3489:92754] Elapsed time: 2.8 ms 2024-04-18 19:27:45.720 compute_test[3489:92754] Elapsed time: 2.8 ms 2024-04-18 19:27:45.723 compute_test[3489:92754] Elapsed time: 2.7 ms 2024-04-18 19:27:45.726 compute_test[3489:92754] Elapsed time: 2.5 ms 2024-04-18 19:27:45.728 compute_test[3489:92754] Elapsed time: 2.5 ms I compile the Swift code for Release, optimised for speed. I can't believe there should be a difference here, so what could be different, and what might I be doing wrong? thanks Adrian
3
0
662
Apr ’24