Coherency, synchronization, scheduling with Metal Command Buffers?

The Metal documentation makes it pretty clear that command buffers are executed serially, in the order they are enqueued (...and that committing without a prior call to enqueue implies the enqueue call).


What is less clear to me, is if and how the Metal API allows one to establish dependencies between one command buffer and the next, for cases where the output of one command buffer is needed by the next one in the pipeline. Suppose that you have a simple sequence of render passes, implemented through separate command buffers issued on the same command queue:


Pass 1: Render "something" to a texture.

Pass 2: Use the texture rendered in pass 1 to do something else.


Which would – at a very high level – translate into two serial calls to each command buffer:


[commandBuffer1 commit];

[commandBuffer2 commit];


So where is the problem? I have run into situations where the commands issued by commandBuffer2 seemingly start executing before the commands in the previous buffer have finished. It is an issue of timing: sometimes the texture produced by commandBuffer1 is ready by the time commandBuffer2 needs it, sometimes it isn't, and you get garbage or black frames.


One solution would be to insert a call to -(void)waitUntilCompleted, like so:


[commandBuffer1 commit];

[commandBuffer1 waitUntilCompleted];

[commandBuffer2 commit];


But that is a terrible use of the CPU: it blocks the current thread until the GPU is done executing the first command buffer. Yes, potentially a solution exists through this other method:


- (void)addCompletedHandler:(MTLCommandBufferHandler)block;


...but it doesn't really fit this type of problem. The code doesn't need to delay committing the 2nd buffer. The code simply needs to ensure that – on the GPU side – the 2nd command buffer doesn't execute until the previous one is done. Is this implied/guaranteed at all by the Metal API? Does serial execution of command buffers also imply that a command buffer’s status is – by API contract – guaranteed to be completed when the second buffer begins execution?
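For completeness, here is a rough sketch of what the completed-handler workaround would look like (my own arrangement, not something the API prescribes). It avoids blocking the calling thread, but it still serializes the two buffers through the CPU rather than on the GPU, which is exactly what I want to avoid:

```objc
// CPU-side gating sketch: defer the second commit until the first
// buffer's completed handler fires. Non-blocking, but round-trips
// through the CPU instead of expressing the dependency to the GPU.
[commandBuffer1 addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // cb.status should be MTLCommandBufferStatusCompleted here.
    [commandBuffer2 commit];
}];
[commandBuffer1 commit];
```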


And if not, is there a way to declare that the execution of one command buffer relies on the completion of a previous command buffer? I can make a comparison with a conceptually similar design: NSOperationQueue. If we treat each NSOperation as a "unit of work" (the same way a Metal command buffer is), you get the following:


[operationQueue addOperation:operation1];

[operationQueue addOperation:operation2];


NSOperationQueue – depending on its configuration – will execute operation1 and operation2 serially or concurrently ...but, should you need to establish a dependency, it is super-simple:


[operation2 addDependency:operation1];

[operationQueue addOperation:operation1];

[operationQueue addOperation:operation2];


Would there be anything similar for the Metal execution model? Is it even required, or – as asked earlier – are command buffers belonging to the same queue serially executed with no overlap in their execution?


Thanks!

Replies

Did you try

textureBarrier?


But I definitely agree with you that Metal is quite limited in this regard. I would like some more explicit control options (or even manual responsibility for managing fences and resource residency, à la Vulkan).

I admit that I tried textureBarrier before the official El Capitan release, but I could not see any of the desired effects (so maybe the picture has changed since). The explanation for my failures might be that textureBarrier is a method specific to MTLRenderCommandEncoder, so it will only work in the specific (and, yes, very common) case described by its documentation: render to a texture, then use that texture as input to a fragment shader in a subsequent pass.
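To be concrete about what textureBarrier covers: it is called between draws inside a single render command encoder, so that later draws can sample a texture the earlier draws rendered into. A sketch (pass descriptor and draw parameters are placeholders):

```objc
// Sketch: textureBarrier only orders work WITHIN one render encoder.
id<MTLRenderCommandEncoder> encoder =
    [commandBuffer renderCommandEncoderWithDescriptor:passDescriptor];

// First draw renders into the attached texture...
[encoder drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:0 vertexCount:6];

// ...the barrier makes those writes visible to subsequent draws...
[encoder textureBarrier];

// ...which may now sample the render target.
[encoder drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:0 vertexCount:6];
[encoder endEncoding];
```

Nothing in this shape helps across encoders of different types, let alone across command buffers, which is the case I am asking about.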

What if the texture is generated by a compute shader via Core Image (MTLComputeCommandEncoder) and then fed as a fragment shader input to a MTLRenderCommandEncoder? Or the opposite: texture generated via MTLRenderCommandEncoder that is required by a MTLComputeCommandEncoder? Or two separate Core Image passes, where the first pass generates the output CIImage backed by a MTLTexture that is subsequently fed as an input CIImage to another Core Image filter chain?

I hope to be wrong, but as far as I can see there is no "encoder-agnostic" texture barrier. (Or maybe the current texture barrier can be misused for this purpose?)

I am updating this for the benefit of others. Apparently there are no APIs in Metal (other than textureBarrier) that allow you to create encoder-agnostic barriers, or to create dependencies between the results of one command buffer and the next.


The above holds true as of OS X 10.11.1. It is now up to us to file enhancement requests to have something of the sort implemented as part of the Metal framework. Let’s get busy then :-)

FxFactory, I'm sorry for the late reply, but what you suggested in your original thread should work: anything executed on a MTLCommandQueue should run in order, regardless of whether the work is render vs. compute or otherwise.


That you are observing otherwise is most likely a bug in the underlying implementation. Do you know if this only happens when you switch from a compute to a render pass, or on some other transition?


Can you file a bug report with anything attached that would help us reproduce the issue (your binary is best), along with a system configuration?

You should start with Radar # 23556515. It includes an app with source code, allowing you to see the problem I keep running into. There is also a screencast to show the problem as it happens on the two systems I tested it on. I suspect it is a timing issue, so you may not get the problem immediately.

The original content of this thread has also been turned into a feature request, Radar # 23537810.

Thank you!

I can't see the Radar(s), but I do want to up-vote this as an issue.


For many situations, it's critical to know that:

  1. The queue is ensuring the seriality of the command buffers that have been submitted to it.
  2. Once the last command buffer has invoked its callback, all previously submitted command buffers on the same queue have completed as well, which is important for things like coordinating resource synchronization, dispatching, and presentation of the drawable.

Because of the uncertainty, we have ended up creating a semi-elaborate scheme of tracking multiple command buffer completion callbacks to determine ordering and finalization activities.
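For anyone curious, the tracking scheme boils down to something like the following sketch (names and structure are mine, simplified from what we actually do). Each committed buffer gets a monotonically increasing ID, and a serial dispatch queue owns the bookkeeping so the handlers never race:

```objc
// Sketch: track completion of command buffers by ID on a serial queue.
// All names here are placeholders for illustration only.
static uint64_t nextID = 0;
static uint64_t lastCompletedID = 0;
static dispatch_queue_t trackingQueue;

trackingQueue = dispatch_queue_create("com.example.cb-tracking",
                                      DISPATCH_QUEUE_SERIAL);

uint64_t thisID = ++nextID;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    dispatch_async(trackingQueue, ^{
        if (thisID > lastCompletedID) {
            lastCompletedID = thisID;
        }
        // ...run any finalization work that was waiting on thisID...
    });
}];
[commandBuffer commit];
```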


Thanks...


...Bill

In theory, as far as my improved (but still poor) understanding of Metal internals goes, both conditions 1. and 2. are already satisfied by the current incarnation of the Metal framework. Would I move on that assumption for my own code? Probably not. I think your prudence is well justified.


The problems I encountered – which prompted me to write this post and file bugs – are still there. Maybe the issues have nothing to do with synchronization, seriality of buffers and coherency, maybe they do. I filed another bug with sample code, and put my project on hold until I hear back.


Metal still seems to me a far, far more elegant abstraction than OpenGL/OpenCL ever were. I hope it will mature to perfection, rather than join the multitude of graphics APIs that have been announced with great fanfare, only to die from indifference.

Hi


I believe I have a similar issue here. I am porting an OpenGL application to Metal. The application has a certain "update cycle", which boils down to the following operations:

1. Upload some data via a vertex buffer.

2. Create a compute command encoder and call two kernels using data from the aforementioned buffer. These update "data textures" with data (a scatter operation, basically).

3. Create (sequentially) several render command encoders (around five of them) and perform some rendering to textures – one render-to-texture per encoder, because of render target switching. The image in the last render target texture is corrupted unless I...


...perform the above steps in a separate command buffer and wait for said command buffer's completion immediately after submitting it.


Alas, this kills performance. The application in question easily runs at 60Hz in GL on my laptop. The port is not even complete, and the framerate drops to around 40Hz when I use separate command buffers as described above. I'll gladly do anything I can to help fix this issue. The source code is not mine to give away, but I could prepare binaries with Metal debug labels and such, to easily reproduce the problem.


Best regards

Michal


PS. There are no dependencies between the render passes in step 3 other than via the depth buffer contents, and there is no direct reading from render targets involved. So I believe the "clash" is not between rendering passes within one command buffer (steps 1-3) but between command buffers! I tried adding textureBarriers here and there; that changed nothing. Only waiting for the command buffer to complete will do :-(

Is it exactly 40fps?

Nope, more like "averages at 40". But depending on what my application (a kind of painting program) does, I have to perform from zero (when idle) up to several "update cycles". Unfortunately, I can't batch them into one big update – they're a kind of "flush" operation performed when certain resources get exhausted.

For the record, the problem described above turned out to be my own fault and not a Metal bug. Sorry, guys.

The problem was that I update most textures with render-to-texture or a compute kernel, but the simplest one was updated via replaceRegion:. It turns out that replaceRegion: may collide with what was previously encoded/enqueued (and rightly so, since it works outside the command queue). What I did was changing texture contents right in the middle of the compute operations... not the best idea in the world :-)


Waiting for command buffer completion "solved" the problem because then replaceRegion: had nothing to interfere with (the compute operations being finished by command buffer completion). Stupid of me, I know.
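In case it helps others who hit the same trap: one way to keep such uploads ordered with the rest of the command stream is to stage the bytes in a shared MTLBuffer and encode the copy with a blit encoder, instead of calling replaceRegion: from the CPU. A sketch (buffer, texture, and size names are placeholders):

```objc
// Sketch: upload texture data through the command queue so it is
// serialized with previously encoded work, rather than mutating the
// texture immediately from the CPU via -replaceRegion:.
memcpy(stagingBuffer.contents, pixelData, bytesPerRow * height);

id<MTLBlitCommandEncoder> blit = [commandBuffer blitCommandEncoder];
[blit copyFromBuffer:stagingBuffer
        sourceOffset:0
   sourceBytesPerRow:bytesPerRow
 sourceBytesPerImage:bytesPerRow * height
          sourceSize:MTLSizeMake(width, height, 1)
           toTexture:texture
    destinationSlice:0
    destinationLevel:0
   destinationOrigin:MTLOriginMake(0, 0, 0)];
[blit endEncoding];
```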

Have you resolved your problem yet?


I've had a similar problem lately, and it seems that in Metal, compute and vertex/fragment stages can run simultaneously. To create dependencies between individual command encoders, try using MTLFence.
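A rough sketch of what that looks like, assuming a compute pass that writes a texture which a later render pass samples (encoder setup and dispatch details omitted):

```objc
// Sketch: MTLFence ordering a compute encoder's writes before a render
// encoder's reads, within the same command buffer.
id<MTLFence> fence = [device newFence];

id<MTLComputeCommandEncoder> compute = [commandBuffer computeCommandEncoder];
// ...dispatch kernels that write the texture...
[compute updateFence:fence];   // signal once the compute work is done
[compute endEncoding];

id<MTLRenderCommandEncoder> render =
    [commandBuffer renderCommandEncoderWithDescriptor:passDescriptor];
[render waitForFence:fence beforeStages:MTLRenderStageFragment];
// ...draws whose fragment shaders sample the texture...
[render endEncoding];
```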

I don't want to open a new thread because I think this fits perfectly into this one, even though some years have passed already.


I am developing a Metal macOS app on a 2018 MacBook Pro. Metal System Trace shows me five different channels for my second GPU. Maybe someone has an idea of how the GPU execution model in Metal works in detail.


For example I have one MTLCommandBuffer with three different encoders.


a) a render encoder which renders a texture

b) this texture is the input for the next compute encoder, which writes its results into a MTLBuffer

c) a blit encoder uses this buffer for synchronization with the CPU memory


In one WWDC 2019 video it was shown how the GPU channels work and that the dependencies will result in the correct execution order.

But what I see in Instruments is that sometimes the execution order is exactly as I encoded the commands, but sometimes not. Instead of a-b-c there is also b-a-c. And then the result is completely wrong.


As I saw, textureBarrier() is now deprecated, but I am also not sure whether it would be the correct way of synchronizing. I still want parallelism across the GPU channels, but outputs from one encoded command should be usable as inputs to the next compute or render encoder, in the order I encoded them into the MTLCommandBuffer.

As I understand the scheduling of Metal command buffers, separating the individual tasks into several command buffers will not solve this problem: the command buffers will be executed in the order they are enqueued, but not synchronized with respect to their outputs (correct me if I am wrong).


So how can I enqueue commands into one or several command buffers and be sure their execution order in time is exactly the one I can see in the Xcode dependency graph?

You can use `encodeSignalEvent:value:` and `encodeWaitForEvent:value:` to synchronize command buffers (since iOS 12 and macOS 10.14).
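Applied to the original two-pass example from this thread, a sketch would look like this (event creation and pass encoding abbreviated):

```objc
// Sketch: ordering two command buffers with an MTLEvent.
// Buffer 2's encoded work will not start until buffer 1 signals value 1.
id<MTLEvent> event = [device newEvent];

// Buffer 1: produce the texture, then signal.
// ...encode pass 1 (render "something" to a texture)...
[commandBuffer1 encodeSignalEvent:event value:1];
[commandBuffer1 commit];

// Buffer 2: wait before consuming. The wait must be encoded before
// the dependent passes, while no encoder is active on the buffer.
[commandBuffer2 encodeWaitForEvent:event value:1];
// ...encode pass 2 (use the texture rendered in pass 1)...
[commandBuffer2 commit];
```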


BTW can you link the WWDC video you mentioned that talks about GPU channels? Or do you remember the session name/number?

Thanks.