MTLFence detailed behavior?

Hello everyone


The documentation on fences in Metal is very limited and we are having difficulty figuring out how to use them correctly in more complicated setups. We use heaps, so fences are mandatory according to the documentation. Here are some questions for which we currently don't have answers; we would be interested to hear other opinions.


A) Does the exact location of updateFence/waitForFence in the encoder matter?


The documentation states:


"Fences are evaluated at command encoder boundaries. Waits occur at the beginning of an encoder and updates occur at the end of the encoder."


This could be read as 'place them wherever you like (except for the one rule stated in the docs), we will make sure the evaluation happens at the boundaries'. After having had a look at the disassembly of these functions I have some doubts that things work that way in practice. These functions go immediately down to the driver which then immediately inserts a token into the current command stream. This doesn't prove anything but it is suggestive.


B) Do fences work across command buffers in the same queue?


Apparently yes: there are samples using fences with multiple command buffers. On the other hand, I have found differing opinions in forum threads.


There is certainly some complication, since waits apparently only consider those updates which were encoded prior to the wait. With multiple command buffers this scheme only makes sense if we think of the command buffers as being chronologically ordered, which is indeed the case when they are part of the same queue, the order being given by the order of commits. If all this holds, then the conclusion would be that fences only work (if at all) across two command buffers in one direction, namely with updates in the first committed command buffer and the corresponding waits in the second committed command buffer. Does all this make sense?


C) What exactly can potentially execute in parallel, making fences a necessity?


On our iMac Pro we have tried to force a situation where parallel execution leads to corrupted results due to missing fences. Our tests involved render encoders and blit encoders, but no matter what we did, the later encoder was always executed strictly after the former, even if we artificially made the first encoder last much longer than the second (e.g. by adding a large loop to the shader). If all encoders always ran strictly after one another, then fences would be superfluous. But since the fence API exists for all three encoder types, there must be situations where multiple encoders can run in parallel. We first thought that different encoder types could potentially run in parallel, but for render/blit encoders this doesn't seem to happen on our iMac Pro. Has anyone observed such parallel execution on Macs? On iOS we know that the vertex and fragment shaders of two consecutive encoders can run in parallel, but we don't know about blit and compute encoders.


One reason for this question is obviously that we would like to exploit the possibility of parallel execution across encoders, e.g. by allowing texture uploads (blits) and render passes to work in parallel. We know that Unreal Engine has a sophisticated fencing setup exactly to allow such asynchronous blits, therefore we would be interested to know which hardware/drivers support this kind of async blit. It's also important for testing purposes. Our code without any fences has run perfectly glitch-free on multiple Macs so far, but if there is any hardware with more parallel execution, things can change quickly and we will need to test the code on such hardware.


Replies

Plus another one:


D) How does one correctly use the beforeStages/afterStages parameter?


Apparently these parameters are only relevant for iOS, at least that's what the documentation suggests. On iOS the vertex and fragment shaders can run concurrently, and I have been able to verify that using the profiler. When it comes to fencing, things become somewhat unclear. Let us assume we simply want trivial fencing, so every encoder syncs with the previous and the next one, while still allowing for concurrent execution of vertex and fragment shaders. That means that the vertex shader of encoder 2 needs to wait until the vertex shader of encoder 1 has finished, and the same again for the fragment shader. But to achieve this we would need two fence pairs: one with MTLRenderStageVertex for both update and wait, and the other with MTLRenderStageFragment for both update and wait. That's four fence encodings to be added to every encoder. Is that really necessary? Of course, when resources are tracked exactly, fences could be placed only where needed, but implementing such resource tracking is a complicated task, so there should be an easy way to achieve trivial fencing without introducing too much overhead.


On Mac, things are unclear as well. Unreal Engine uses MTLRenderStageVertex|MTLRenderStageFragment. But isn't that just equivalent to using MTLRenderStageFragment for the update and MTLRenderStageVertex for the wait? What is officially recommended for the Mac?

Fences aren't required simply because you use a heap, but you do need to use them if you're creating, aliasing, or destroying resources in a heap. (So if you have resources in a heap and just keep them there, you don't need to use a fence.)


A) No, it doesn't matter where you put the fence in the encoder. The update takes effect at the end of the encoder and the wait at the beginning (at least when you're talking about fences with blit and compute encoders; see the answer to D for fences with rendering).
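For example, the basic pattern for ordering a blit and a compute pass in one command buffer looks like this (a minimal Swift sketch; `pipeline`, `srcTexture`, and `dstTexture` are placeholder names, not from any sample):

```swift
import Metal

// Sketch: a fence serializing a blit encoder and a compute encoder in one
// command buffer. The call site inside the encoder doesn't matter: the
// update is applied at the end of its encoder and the wait at the
// beginning of its encoder.
func encodeBlitThenCompute(device: MTLDevice,
                           commandBuffer: MTLCommandBuffer,
                           pipeline: MTLComputePipelineState,
                           srcTexture: MTLTexture,
                           dstTexture: MTLTexture) {
    guard let fence = device.makeFence() else { return }

    // Encoder 1: blit into the (heap-allocated, untracked) texture.
    if let blit = commandBuffer.makeBlitCommandEncoder() {
        blit.copy(from: srcTexture, sourceSlice: 0, sourceLevel: 0,
                  sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
                  sourceSize: MTLSize(width: srcTexture.width,
                                      height: srcTexture.height, depth: 1),
                  to: dstTexture, destinationSlice: 0, destinationLevel: 0,
                  destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0))
        blit.updateFence(fence)     // evaluated at the end of this encoder
        blit.endEncoding()
    }

    // Encoder 2: compute pass reading what the blit wrote.
    if let compute = commandBuffer.makeComputeCommandEncoder() {
        compute.waitForFence(fence) // evaluated at the start of this encoder
        compute.setComputePipelineState(pipeline)
        compute.setTexture(dstTexture, index: 0)
        compute.dispatchThreadgroups(
            MTLSize(width: 8, height: 8, depth: 1),
            threadsPerThreadgroup: MTLSize(width: 8, height: 8, depth: 1))
        compute.endEncoding()
    }
}
```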


B) Fences do not work across command buffers. Our samples don't use a fence across multiple command buffers. However, you can use an MTLEvent in the same way as a fence, and it will work across command buffer boundaries.
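The MTLEvent version looks like this (a Swift sketch for macOS 10.14 / iOS 12; `queue` and the encoder contents are placeholders):

```swift
import Metal

// Sketch: using an MTLEvent instead of a fence to order two command
// buffers. Signal/wait are encoded at command buffer granularity, so this
// works across buffers and even across queues.
func encodeWithEvent(device: MTLDevice, queue: MTLCommandQueue) {
    guard let event = device.makeEvent(),
          let producer = queue.makeCommandBuffer(),
          let consumer = queue.makeCommandBuffer() else { return }

    // Command buffer 1: encode the writes, then signal the event.
    // ... encode blit/compute work into `producer` here ...
    producer.encodeSignalEvent(event, value: 1)
    producer.commit()

    // Command buffer 2: none of its commands run until the event reaches
    // the given value; then encode the reads.
    consumer.encodeWaitForEvent(event, value: 1)
    // ... encode render work into `consumer` here ...
    consumer.commit()
}
```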


C) By default, all reads and writes (and rendering) to a resource between different encoders are treated as serial. If you write to a resource in one encoder and read from it in the next, you don't need to use a fence or an event. However, if you use MTLResourceHazardTrackingModeUntracked or you allocate the resource from a heap, the GPU can perform operations on that resource out of order. Of course, the GPU may not actually perform them out of order, which may have been the case you saw with the iMac Pro. Specifically, the AMD Vega GPU in the iMac Pro is good at performing compute kernels in parallel, so the driver is more likely to schedule those in parallel if it can. Some of the blit commands are actually implemented using the AMD Vega's compute pipeline, so it's possible that your blits could occur out of order.


D) You're correct that beforeStages/afterStages is primarily used with the iOS GPUs. Due to the TBDR architecture of these GPUs, the vertex shader may be executed on all vertices in an encoder before any fragment shader from that encoder begins execution. So, if only your vertex shader writes something, then you can indicate that the fence should be updated after MTLRenderStageVertex. If you had a compute kernel that read from a resource written by a vertex shader, it could execute before or in parallel with the fragment shader paired with that vertex shader.


In the example you give, it really just depends on whether you're writing/rendering to a resource in the vertex or fragment shader (or both). If it's just the vertex shader, then you only need to update after MTLRenderStageVertex. If it's the fragment shader (or both), then you need to update with MTLRenderStageFragment. If you read from the resource in a vertex shader (or both a vertex and fragment shader) of encoder 2, then you need to wait with MTLRenderStageVertex. If you read from the resource only in the fragment shader, you can wait with MTLRenderStageFragment. So you only need two fence encodings. (You could have more if you have multiple untracked resources which have different dependencies.)
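In Swift terms, the worst case described above (fragment shader of encoder 1 writes, vertex shader of encoder 2 reads) looks like this (a sketch; all identifiers are placeholders):

```swift
import Metal

// Sketch of the two-encoding case: encoder 1 writes a shared resource from
// its fragment shader, encoder 2 reads it in its vertex shader, so the
// fence is updated after the fragment stage and waited on before the
// vertex stage.
func encodeStagedFence(device: MTLDevice,
                       commandBuffer: MTLCommandBuffer,
                       pass1: MTLRenderPassDescriptor,
                       pass2: MTLRenderPassDescriptor) {
    guard let fence = device.makeFence() else { return }

    if let encoder1 = commandBuffer.makeRenderCommandEncoder(descriptor: pass1) {
        // ... draw calls whose fragment shader writes the shared resource ...
        encoder1.updateFence(fence, after: .fragment)
        encoder1.endEncoding()
    }

    if let encoder2 = commandBuffer.makeRenderCommandEncoder(descriptor: pass2) {
        encoder2.waitForFence(fence, before: .vertex)
        // ... draw calls whose vertex shader reads the shared resource ...
        encoder2.endEncoding()
    }
}
```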

Hello Dan


Thank you very much for your detailed response which made the topic much clearer to us.


Concerning B), the sample I was referring to is the one on the page 'Image Filter Graph with Heaps and Fences'. The functions 'executeFilterGraph' and 'drawInMTKView' both create a new command buffer; the first one is used for blit/compute and the second one for rendering. The render encoder uses a fence to wait for the compute encoder. Is the conclusion thus that the sample doesn't handle the fence business correctly, or does it show a use case where the use of fences with multiple command buffers is allowed?


Concerning D): It was already clear to us prior to this discussion that two fence encodings per encoder are sufficient IF we do complete resource tracking. The problem is that we don't want to do full resource tracking. Instead we would like to do trivial fencing by default and possibly optimize things if we have certain knowledge about resource dependencies. Now on iOS the vertex stage of encoder 2 can run in parallel with the fragment stage of encoder 1 (we have verified this using Metal System Trace). If we were to implement the fences according to the rules stated in your answer, we would update after the fragment stage and wait before the vertex stage, simply because we in general don't have knowledge about resource dependencies.


But doesn't that mean that the vertex shader will wait until the fragment shader of the previous encoder has fully executed? Then the parallelism between fragment and vertex shader would be broken, and such a setup would have to be avoided unless absolutely necessary. So what we want to achieve is that all vertex stages are executed strictly sequentially, and the same for the fragment stages, while allowing for parallel execution of the fragment shader and vertex shader of two consecutive encoders. That's how we came up with the double-fence idea, and we are still wondering what the standard approach is for trivial fencing on iOS without breaking parallelism.

Concerning B): 'Image Filter Graph with Heaps and Fences' does not wait for a signal on a separate command buffer. If command buffers are executed on the same queue, no explicit dependency tracking is necessary. However, if command buffers are executed in parallel on separate queues (or even separate devices), the app must perform some dependency tracking. It cannot use fences for this case, however. It must use events to track dependencies between command buffers on separate queues.


Concerning D): To allow vertex shader 2 to execute in parallel with fragment shader 1, you need to ensure you're not writing to something in fragment shader 1 that needs to be read by vertex shader 2, since this would require a fence to update after fragment shader 1 and wait before vertex shader 2.

Hello Dan


Thanks again for your reply!


B) "If command buffers are executed on the same queue, no explicit dependency tracking is necessary."


Why is the sample then using fences at all? The whole point of fences is to deal with dependency tracking, and the whole point of the sample is to demonstrate this. Isn't your answer saying exactly the same thing as 'If command buffers are executed on the same queue, fences are not necessary'?


D) You misunderstood my question, which was about a very specific scenario and not about the basics of fencing. If you are still interested in answering that question and need more information, please let me know which part of my detailed explanation wasn't clear.

Concerning C): It is clear that encoders can be reordered when using MTLResourceHazardTrackingModeUntracked, but what happens to dispatch calls inside the same compute encoder? I am not talking about macOS 10.14+ where dispatchType was introduced. The API provides no synchronization methods for this case, so it's logical to assume they have to be executed serially. However, it seems I've come up with a sample which proves this assumption wrong on Intel HD 4000 (even without using MTLResourceHazardTrackingModeUntracked). If that is the case, how can these calls be synchronized? I have tried splitting every dispatch call into its own encoder, but this did not work (is Intel caching encoders somehow?).

Automatic dependency tracking is the default behaviour, but the sample allocates objects from a heap, which causes them to be untracked. It uses the untracked behaviour to save memory: texture objects are created and destroyed multiple times during the execution of the command buffer, but they alias the same memory from the heap (otherwise each texture would need its own chunk of memory even if one is destroyed before the other, since Metal doesn't know that the memory can be freed as soon as the object is destroyed).


So, in general, if memory consumption isn't a concern or you're not repurposing memory during the execution of a command buffer, there isn't much need to use the untracked behaviour, which would require the use of fences. The automatic tracking behaviour is the default.
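The aliasing pattern from the sample boils down to something like this (a Swift sketch with arbitrary sizes and formats; the encoder work is elided):

```swift
import Metal

// Sketch: two transient textures alias the same heap memory.
// makeAliasable() marks the first texture's memory as reusable, and a
// fence must then order the GPU work, because Metal no longer tracks the
// hazard between the aliased allocations.
func encodeAliasedTextures(device: MTLDevice) {
    let texDesc = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba8Unorm, width: 256, height: 256, mipmapped: false)
    texDesc.storageMode = .private

    let heapDesc = MTLHeapDescriptor()
    heapDesc.storageMode = .private
    heapDesc.size = device.heapTextureSizeAndAlign(descriptor: texDesc).size

    guard let heap = device.makeHeap(descriptor: heapDesc),
          let fence = device.makeFence(),
          let texA = heap.makeTexture(descriptor: texDesc) else { return }

    // ... encoder 1 writes texA, then calls updateFence(fence) ...
    texA.makeAliasable() // texB below may now reuse texA's heap memory

    guard let texB = heap.makeTexture(descriptor: texDesc) else { return }
    // ... encoder 2 calls waitForFence(fence) before writing texB ...
    _ = (texB, fence)
}
```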

Reads and writes to resources from kernels within the same compute encoder should always be tracked. If that's not what you're seeing, I would consider that a bug in Metal or the driver and you should create a report at bugreport.apple.com. (Respond with the radar number you get and I can take a quick look.)

Hello Dan


Going back to the original question 'Do fences work across command buffers':


Originally I was explaining that the sample we have been discussing uses fences across command buffers which, if true, would invalidate your initial answer:


'Fences do not work across command buffers.'


The two command buffer creations can be found on lines 326 and 383 in AAPLRenderer.m, and it is easy to see that there is a wait on a fence in the render encoder (command buffer 2) which is updated in the blit/compute encoder (command buffer 1). Both command buffers are in the same queue, but they are still two different command buffers, and that's what my original question is all about.


You claimed:


'Image Filter Graph with Heaps and Fences' does not wait for a signal on a separate command buffer.


We have an immediate contradiction here between your statement and mine, and I would be happy if you could help resolve it, which will hopefully bring us nearer to the final answer.

Thanks for the clarification! This thread is so far the most complete piece of documentation on MTLFences 🙂.

BTW it seems that slightly changing the sample leads to system crashes on Intel HD 4000, macOS 10.13. Should I still report it if I'm unable to reproduce it on 10.14 (which I will try on Monday)?

I would also really appreciate it if you could take a look at some other buffer-related Intel bugs I'm going to report (I discovered about 5 of them this week while playing around with heaps in our project and wanted to report them all in a batch after retesting on 10.14).

This is also true for resources in render encoders, with the exception of render targets, right?

It seems this bug was fixed in 10.14, so I am not sure whether I should report it. However, I found a bug with fences and setVertexBytes (I could only reproduce it on Intel HD 4000): 45125195. I also reported the bunch of other bugs I mentioned before, after confirming they reproduce on 10.14: 45087107, 45086977, 45087046, 45087238. And as a bonus: 45087625, 45090141, 39681865, 39686945.

Sorry, I should have been clearer.


1) You must use a fence to synchronize untracked resources across command buffers from the same queue. So if command buffer 1 writes to an untracked texture and you later execute command buffer 2, which reads from it, you need a fence.


2) However, you cannot use a fence to synchronize untracked resources accessed by two command buffers running in parallel from separate queues (you need to use an MTLEvent for this). So if command buffers 1 and 2 created from different queues have both been committed, and command buffer 1 writes to a texture (regardless of whether it is tracked or untracked) and command buffer 2 needs to read what's written by command buffer 1, you cannot use a fence but rather need to use an event.
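Rule 1 can be sketched as follows (Swift, placeholder names; for rule 2 the fence would be replaced by the MTLEvent signal/wait shown earlier in the thread):

```swift
import Metal

// Sketch of rule 1: an untracked texture written in command buffer 1 and
// read in command buffer 2 on the *same* queue, ordered with a fence. The
// commit order of the buffers determines which update a wait observes.
func encodeAcrossBuffersSameQueue(device: MTLDevice,
                                  queue: MTLCommandQueue,
                                  untrackedTexture: MTLTexture) {
    guard let fence = device.makeFence(),
          let cb1 = queue.makeCommandBuffer(),
          let cb2 = queue.makeCommandBuffer() else { return }

    if let blit = cb1.makeBlitCommandEncoder() {
        // ... write untrackedTexture ...
        blit.updateFence(fence)
        blit.endEncoding()
    }
    cb1.commit() // committed first, so its update precedes cb2's wait

    if let compute = cb2.makeComputeCommandEncoder() {
        compute.waitForFence(fence)
        // ... read untrackedTexture ...
        compute.endEncoding()
    }
    cb2.commit()
}
```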

Dan... Here's my scenario: I want to pass the output of one MTKView draw call to another on macOS 10.13.6. I scoop the output MTLTexture from a kernel shader and save it in a rotating, 2-deep, semaphore-protected queue (an NSMutableArray of MTLTexture objects) held as a property on the original MTKView. In the completion handler of that view's command buffer, I essentially dispatch a draw call on a second MTKView (with a different command queue and buffer) for further processing on that ancillary view, with my queue as input. All access to the queue is semaphore-protected.


Based on your explanation ("[It] must use events to track dependencies between command buffers on separate queues."), I would assume that MTLEvent usage would be required here (rather than semaphores), and that MTLFence usage would be wholly inappropriate.
I'm currently attempting to implement this with semaphores, and it works about 95% of the time, but my semaphores are occasionally blocking otherwise. My semaphore wait/signal calls are being flagged as 'race conditions' by the thread sanitizer, but that's another story.

Hey Dan,


Thank you so much for helping clarify things here.


According to the documentation of MTLCommandBuffer:

All command buffers sent to a single command queue are guaranteed to execute in the order in which the command buffers were enqueued.


This would suggest that no synchronization is required between command buffers from the same queue (just like you mentioned in your initial answer where you wrote that there's no dependency tracking necessary in this situation).


However, the sample code in question, as well as your second answer here, suggests that there is a need to synchronize resources within the same queue (when they use MTLResourceHazardTrackingModeUntracked, of course). How does that square with the documentation of MTLCommandBuffer, or even the naming convention of MTLCommandQueue?


Thanks!