Since this project is a work in progress, the image pipeline currently runs exactly once: there is no loop. During that single run, the MTLCaptureManager captures the execution of the command buffer so it can be analyzed. Within the image processing pipeline, this is the only spot where GPU-CPU synchronization occurs via the shared event. The shared event, like the other resources in the pipeline, is created before the command buffer, and all resources are tracked by Metal (hazardTrackingMode = .tracked), though I hope to switch to heaps later for efficiency.
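For reference, a minimal sketch of that capture setup (runImagePipeline is a hypothetical stand-in for the single pipeline execution shown below; this is illustrative, not the project's exact code):

let captureDescriptor = MTLCaptureDescriptor()
captureDescriptor.captureObject = commandQueue
try MTLCaptureManager.shared().startCapture(with: captureDescriptor)
runImagePipeline() // hypothetical: the single execution outlined below
MTLCaptureManager.shared().stopCapture()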
Here is a brief overview of how the code is organized:
preloadResources()

// 1. Let Core Image render the CGImage into the Metal texture
let commandBufferDescriptor = MTLCommandBufferDescriptor()
commandBufferDescriptor.errorOptions = .encoderExecutionStatus // capture encoder errors
let ciCommandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!
let ciSourceImage = CIImage(cgImage: sourceImage)
ciContext.render(ciSourceImage,
                 to: sourceImageTexture,
                 commandBuffer: ciCommandBuffer,
                 bounds: sourceImageTexture.bounds2D, // bounds2D: custom convenience property
                 colorSpace: CGColorSpaceCreateDeviceRGB())
ciCommandBuffer.commit()

// 2. Do the rest of the image processing
let commandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!
try imageProcessorA.encode(commandBuffer: commandBuffer,
                           sourceTexture: sourceImageTexture,
                           destinationTexture: sourceImageIntermediateTexture)
try imageProcessorA.encode(commandBuffer: commandBuffer,
                           sourceTexture: sourceImageIntermediateTexture,
                           destinationTexture: destinationImageTexture)
commandBuffer.commit()
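Because the descriptor enables encoderExecutionStatus, the command buffer's error can be inspected once execution finishes. A minimal sketch of what that check might look like (not the project's actual code):

commandBuffer.waitUntilCompleted()
if let error = commandBuffer.error as NSError?,
   let infos = error.userInfo[MTLCommandBufferEncoderInfoErrorKey] as? [MTLCommandBufferEncoderInfo] {
    for info in infos {
        // errorState reports whether each encoder completed, faulted, etc.
        print("encoder \(info.label): \(info.errorState)")
    }
}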
imageProcessorA contains kernelA and kernelB and performs the synchronization as described above.
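For context, a rough sketch of what that encode method might look like, using the encodeCPUExecution helper shown further down. kernelAPipeline, kernelBPipeline, intermediateTexture, sharedEvent, listener, and the dispatch sizes are illustrative names, not the project's actual implementation:

func encode(commandBuffer: MTLCommandBuffer,
            sourceTexture: MTLTexture,
            destinationTexture: MTLTexture) throws {
    // First pass: kernelA writes an intermediate result.
    let encoderA = commandBuffer.makeComputeCommandEncoder()!
    encoderA.setComputePipelineState(kernelAPipeline)
    encoderA.setTexture(sourceTexture, index: 0)
    encoderA.setTexture(intermediateTexture, index: 1)
    encoderA.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadsPerGroup)
    encoderA.endEncoding()

    // The GPU signals, the CPU block runs, and the GPU waits before kernelB.
    commandBuffer.encodeCPUExecution(for: sharedEvent, listener: listener) {
        // CPU-side computation between the two kernels
    }

    // Second pass: kernelB consumes the CPU result.
    let encoderB = commandBuffer.makeComputeCommandEncoder()!
    encoderB.setComputePipelineState(kernelBPipeline)
    encoderB.setTexture(intermediateTexture, index: 0)
    encoderB.setTexture(destinationTexture, index: 1)
    encoderB.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadsPerGroup)
    encoderB.endEncoding()
}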
I suppose I could schedule a technical review session with an engineer to provide more details about the project if more context is needed to resolve the problem.
extension MTLCommandBuffer {
    // Interleaves a block of CPU work between GPU passes encoded on this buffer.
    func encodeCPUExecution(for sharedEvent: MTLSharedEvent, listener: MTLSharedEventListener, work: @escaping () -> Void) {
        let value = sharedEvent.signaledValue
        // When the GPU signals value + 1, run the CPU work, then release the GPU.
        sharedEvent.notify(listener, atValue: value + 1) { event, _ in
            work()
            event.signaledValue = value + 2
        }
        encodeSignalEvent(sharedEvent, value: value + 1)  // GPU -> CPU: input is ready
        encodeWaitForEvent(sharedEvent, value: value + 2) // GPU stalls until the CPU finishes
    }
}
This is the code for encodeCPUExecution; my mistake for not making that clear enough. In fact, the GPU does wait on value + 2 as you described, yet the behavior still occurs. The issue is that the computation is well suited to CPU execution (it can take advantage of dynamic programming to run in O(n) time) and poorly suited to GPU execution, though I suppose a single GPU thread could write out the result much as the CPU does (which is probably even more performant).
I would still like to figure out why this behavior exists in the first place, even if the computation is pushed to a single thread on the GPU.
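For reference, a minimal sketch of that single-GPU-thread alternative, assuming a hypothetical sequentialKernelPipeline compiled from a kernel that performs the whole sequential pass in one thread (resultBuffer is likewise illustrative):

let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(sequentialKernelPipeline) // hypothetical pipeline state
encoder.setTexture(sourceImageIntermediateTexture, index: 0)
encoder.setBuffer(resultBuffer, offset: 0, index: 0) // hypothetical result buffer
// Dispatch a 1x1x1 grid: the entire sequential computation runs on one GPU
// thread, avoiding the shared-event round trip to the CPU entirely.
encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: 1, height: 1, depth: 1))
encoder.endEncoding()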
I found the tech talk "Discover advances in A15 Bionic", which describes one use case for quadgroups and quadgroup functions around the 21:00 mark, where they're used to reduce texture reads. If anyone has any other use cases, let us know.
Two reasons:
1. Sheer curiosity.
2. I was worried that I could run out of tile memory if ray_data payloads were stored there. I was hoping to implement something like what Rich Forster mentions in the associated talk "Get to know Metal function pointers" (the relevant section starts around 18:00). There, he talks about divergence caused by different threads invoking different functions. The solution (described around 19:00) is to use threadgroup memory to pass around the relevant data, which I thought could be constrained by the size of the payload.
Of course, I suppose this would have been mentioned somewhere if it were something to worry about, but it's interesting nonetheless.
P.S. I haven't written the ray tracing kernel yet, nor the intersection/visible function(s), but it was something I considered while designing my program.
I see. For some reason I had incorrectly assumed that the value would be reset after the buffer finished executing (so that it was monotonically increasing only within the "scope" of a single buffer encoding). This works as expected.
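To spell out the point (a minimal sketch, assuming a device):

let event = device.makeSharedEvent()! // signaledValue == 0 at creation
// After the first encodeCPUExecution round trip completes, signaledValue == 2.
// Completing the command buffer does NOT reset it, so the next encode reads the
// current value (2) and signals 3 / waits on 4. The value is monotonic for the
// lifetime of the event, not per command buffer.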
So unfortunately I will not be able to submit a debug request since I updated to Big Sur before seeing this post. However, I am happy to say that the debugger is working flawlessly in Xcode 12.3 with Big Sur 11.1.
macOS 10.15 is the deployment target
My sessions also crash occasionally. Hopefully the next version fixes this; the shader debugger is a critical tool, and it's nearly impossible to debug shaders without it.