Unexpected behavior for shared MTLBuffer during CPU work

I have an image processing pipeline in which the GPU processes a texture and writes its result into a shared buffer (i.e. storageMode = .shared), the CPU then performs some work using that buffer, and afterwards the CPU writes its own result at a different offset into the same shared MTLBuffer object. The buffer is laid out as follows:

uint | uint | .... | uint | float

offsets (contiguous): 0 | ...

where the floating point slot is written into by the CPU and later used by the GPU in subsequent compute passes.

I haven't been able to explain or find documentation for the following strange behavior. The compute pipeline using the buffer above (call it bufferA) is as follows (force unwraps kept for brevity):

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!

let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)

// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationBuffer: bufferA)

commandBuffer.encodeCPUExecution(for: sharedEvent, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}

kernelB.setTargetBuffer(bufferA, offset: Self.targetBufferOffset)

kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)

Note that commandBuffer.encodeCPUExecution is simply a convenience function built around the shared event (encodeSignalEvent and encodeWaitForEvent) that signals on event.signaledValue + 1 and waits on event.signaledValue + 2.
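For completeness, unsafelyWrite is essentially a thin wrapper around MTLBuffer.contents(); something like this (a sketch, the real helper may differ slightly):

import Metal

extension MTLBuffer {
    // Copy a trivial (POD) value into the buffer's contents at the given byte offset.
    func unsafelyWrite<T>(_ value: inout T, offset: Int) {
        precondition(offset + MemoryLayout<T>.size <= length)
        withUnsafePointer(to: &value) { source in
            contents().advanced(by: offset).copyMemory(from: source, byteCount: MemoryLayout<T>.size)
        }
    }
}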

In the example above, kernelB does not see the writes made during the CPU execution. It can, however, see the values written into the buffer by kernelA.

The strange part: if the CPU writes to that same location in the buffer before the GPU schedules this work (e.g. during encoding, or at any point before the GPU starts executing, rather than in the middle of GPU execution), kernelB does see the CPU's writes.

This is odd behavior that, to me, suggests undefined behavior somewhere. If the buffer were .managed I could understand it, since changes on each side must be made explicit; but with a .shared buffer this seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA).
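For comparison, with a .managed buffer (macOS only) the hand-offs would have to be made explicit on both sides, roughly like this (placeholder names):

// CPU -> GPU: after a CPU write, flush the modified range.
let offset = Self.targetBufferOffset
managedBuffer.contents().advanced(by: offset).storeBytes(of: Float(0.0), as: Float.self)
managedBuffer.didModifyRange(offset..<(offset + MemoryLayout<Float>.size))

// GPU -> CPU: blit-synchronize before reading GPU-written contents on the CPU.
let blitEncoder = commandBuffer.makeBlitCommandEncoder()!
blitEncoder.synchronize(resource: managedBuffer)
blitEncoder.endEncoding()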

What explains this strange behavior with Metal?

Note: This behavior occurs on an M1 Mac running Mac Catalyst and on an iPad Pro (5th generation) running iOS 15.3.

The shared event's notification handler does not block kernelB from running while the CPU-side handler executes; it only synchronizes the two batches of GPU work, and the handler is just an observer of the completion. If both the kernelA work and the kernelB work are in committed command buffers, kernelA will signal the CPU on completion, but kernelB will also be allowed to proceed immediately. The CPU write will only be "racily" visible to kernelB.
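In other words, with a pattern like the following (placeholder names), nothing makes kernelB wait on a value that only the CPU signals, so the notify block may run before, during, or after kernelB:

// kernelA's work is already encoded; the command buffer then signals the event...
commandBuffer.encodeSignalEvent(sharedEvent, value: 1)

// ...which merely notifies the CPU. Nothing below waits on a value that only
// the CPU sets, so kernelB can be scheduled before this block has run.
sharedEvent.notify(sharedEventListener, atValue: 1) { _, _ in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: targetBufferOffset)
}

kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
commandBuffer.commit()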

If you need to make a CPU write that's visible to kernelB, you'll need to set the signaledValue to the waited-upon value from your CPU handler for kernelA. Alternatively, you could schedule the kernelB work from your handler. If you know what value you need to write, you could do it as part of kernelA's work and avoid the bounce to the CPU altogether.

In general terms, you should be able to have kernelB wait for a higher value that is only signaled from the CPU callback. For instance, if kernelB waits for +2, you could have the CPU handler notified at +1 and signal +2 from there.
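For the "schedule the kernelB work from your handler" alternative, the shape would be roughly the following (placeholder names, assuming kernelB's encoding can be deferred until the CPU work is done):

sharedEvent.notify(sharedEventListener, atValue: waitValue) { _, _ in
    // Do the CPU work against the shared buffer first...
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: targetBufferOffset)

    // ...then encode and commit kernelB, so its reads are ordered after the CPU write.
    let followUpCommandBuffer = commandQueue.makeCommandBuffer()!
    kernelB.encode(commandBuffer: followUpCommandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
    followUpCommandBuffer.commit()
}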

extension MTLCommandBuffer {

    func encodeCPUExecution(for sharedEvent: MTLSharedEvent, listener: MTLSharedEventListener, work: @escaping () -> Void) {
        let value = sharedEvent.signaledValue

        sharedEvent.notify(listener, atValue: value + 1) { event, _ in
            work()
            event.signaledValue = value + 2
        }

        encodeSignalEvent(sharedEvent, value: value + 1)
        encodeWaitForEvent(sharedEvent, value: value + 2)
    }
}

This is the code for encodeCPUExecution; my mistake for not making that clear enough. In fact the GPU does wait on value + 2 as you described, yet the behavior still occurs. The issue is that the computation is well suited to CPU execution (it can take advantage of dynamic programming to run in O(n) time) and poorly suited to GPU execution, though I suppose a single GPU thread could write the result out in much the same way the CPU does (which is probably even more performant).

I would still like to figure out why this behavior exists in the first place, even if the computation is pushed to a single thread on the GPU.

Please check that you're not signaling value + 2 elsewhere in your codebase. You could register another CPU block at value + 2 and see when it gets signaled with respect to the command buffer's lifecycle (i.e. if it fires before commit, then there's definitely a problem).
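One quick way to check is to register a purely diagnostic block at value + 2 and log when it fires relative to commit, for example:

let diagnosticValue = sharedEvent.signaledValue + 2
sharedEvent.notify(sharedEventListener, atValue: diagnosticValue) { event, value in
    // If this logs before the command buffer is committed, something else in
    // the codebase is signaling this value.
    print("shared event reached \(value); signaledValue is now \(event.signaledValue)")
}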

We suggest keeping track of the counter values yourself rather than reading the signaled value from the event. If you're pipelining your code properly, reading signaledValue may be racy, so we recommend maintaining your own counter to generate the values that are signaled and waited upon.
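A counter-based variant of the extension above might look like this (a sketch; how the counter is stored and shared is up to you):

import Metal

final class SharedEventCounter {
    // Not thread-safe; confine use to the thread or queue that does the encoding.
    private var current: UInt64 = 0
    func next() -> UInt64 {
        current += 1
        return current
    }
}

extension MTLCommandBuffer {
    func encodeCPUExecution(for sharedEvent: MTLSharedEvent,
                            listener: MTLSharedEventListener,
                            counter: SharedEventCounter,
                            work: @escaping () -> Void) {
        let signalValue = counter.next()   // GPU -> CPU hand-off
        let resumeValue = counter.next()   // CPU -> GPU hand-off

        sharedEvent.notify(listener, atValue: signalValue) { event, _ in
            work()
            event.signaledValue = resumeValue
        }

        encodeSignalEvent(sharedEvent, value: signalValue)
        encodeWaitForEvent(sharedEvent, value: resumeValue)
    }
}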

Currently, since this project is a work in progress, the image pipeline executes exactly once; there is no loop. During that single execution, the MTLCaptureManager captures the command buffer and the capture is analyzed. Within the image processing pipeline, this is the only spot where GPU-CPU synchronization with the shared event occurs. The shared event, like the other resources in the pipeline, is created before the command buffer. All of the resources used in the pipeline are tracked by Metal (hazardTrackingMode = .tracked), though I hope to change this in the future and use heaps for more efficiency.

Here is a brief overview of how the code is organized:

preloadResources()

// 1. Let Core Image render the CGImage into the Metal texture
let commandBufferDescriptor = ... // enable encoderExecutionStatus to capture errors
let ciCommandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!
let ciSourceImage = CIImage(cgImage: sourceImage)

ciContext.render(ciSourceImage,
                 to: sourceImageTexture,
                 commandBuffer: ciCommandBuffer,
                 bounds: sourceImageTexture.bounds2D,
                 colorSpace: CGColorSpaceCreateDeviceRGB())
ciCommandBuffer.commit()

// 2. Do the rest of the image processing
let commandBuffer = commandQueue.makeCommandBuffer(descriptor: commandBufferDescriptor)!

try imageProcessorA.encode(commandBuffer: commandBuffer,
                           sourceTexture: sourceImageTexture,
                           destinationTexture: sourceImageIntermediateTexture)

try imageProcessorA.encode(commandBuffer: commandBuffer,
                           sourceTexture: sourceImageIntermediateTexture,
                           destinationTexture: destinationImageTexture)

commandBuffer.commit()

imageProcessorA contains kernelA and kernelB and performs the synchronization as described above.

I suppose I could schedule a technical review session with an engineer to provide more details of the project if more context is needed to resolve the problem.
