Hello everyone
The documentation on fences in Metal is very limited and we are having difficulty figuring out how to use them correctly in more complicated setups. We use heaps, so fences are mandatory according to the documentation. Here are some questions for which we currently don't have answers, and we would be interested to hear other opinions.
A) Does the exact location of updateFence/waitForFence in the encoder matter?
The documentation states:
"Fences are evaluated at command encoder boundaries. Waits occur at the beginning of an encoder and updates occur at the end of the encoder."
This could be read as 'place them wherever you like (except for the one rule stated in the docs) and we will make sure the evaluation happens at the boundaries'. After looking at the disassembly of these functions, I have some doubts that things work that way in practice. These calls go straight down to the driver, which immediately inserts a token into the current command stream. This doesn't prove anything, but it is suggestive.
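To make the question concrete, here is a minimal sketch of the pattern we are unsure about. All variables (`device`, `queue`, `stagingBuffer`, `heapBuffer`, `byteCount`, `passDescriptor`) are assumed to exist; the point is only where the fence calls sit relative to the encoder boundaries. Note that on a render encoder the wait/update can additionally be scoped to a pipeline stage via `MTLRenderStages`:

```swift
import Metal

let fence = device.makeFence()!
let commandBuffer = queue.makeCommandBuffer()!

// Encoder 1: writes into a heap resource, then signals the fence.
// Per the documentation the update takes effect at the *end* of this
// encoder, regardless of where in the encoding sequence the call appears.
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.copy(from: stagingBuffer, sourceOffset: 0,
          to: heapBuffer, destinationOffset: 0, size: byteCount)
blit.updateFence(fence)
blit.endEncoding()

// Encoder 2: waits on the fence. On a render encoder the wait can be
// scoped to a stage; here only the fragment stage is blocked.
let render = commandBuffer.makeRenderCommandEncoder(descriptor: passDescriptor)!
render.waitForFence(fence, before: .fragment)
// ... draw calls reading heapBuffer ...
render.endEncoding()

commandBuffer.commit()
```

The open question is whether moving `updateFence`/`waitForFence` earlier or later inside the same encoder could ever change behavior, given that the driver appears to emit a token at the call site.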
B) Do fences work across command buffers in the same queue?
Apparently yes: there are samples using fences across multiple command buffers. On the other hand, I have found differing opinions in forum threads.
There is certainly some complication, since waits apparently only consider those updates that were encoded prior to the wait. With multiple command buffers this scheme only makes sense if we think of the command buffers as being chronologically ordered, which is indeed the case when they are part of the same queue, the order being given by the order of commits. If all of this holds, the conclusion would be that fences only work (if at all) across two command buffers in one direction, namely with the updates in the first committed command buffer and the corresponding waits in the second committed command buffer. Does all this make sense?
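The one-directional scheme described above would look roughly as follows. This is a sketch under our reading of the docs, not a confirmed pattern; `device`, `queue`, and the encoded work are assumed, and both command buffers come from the same queue with `cbA` committed first:

```swift
import Metal

let fence = device.makeFence()!

// First committed command buffer: produces data, updates the fence.
let cbA = queue.makeCommandBuffer()!
let blit = cbA.makeBlitCommandEncoder()!
// ... copies into heap resources ...
blit.updateFence(fence)          // update encoded (and committed) first
blit.endEncoding()
cbA.commit()

// Second committed command buffer: consumes the data, waits on the fence.
let cbB = queue.makeCommandBuffer()!
let compute = cbB.makeComputeCommandEncoder()!
compute.waitForFence(fence)      // wait encoded after the update
// ... dispatches reading the heap resources ...
compute.endEncoding()
cbB.commit()
```

The reverse direction (wait committed before the update) is exactly the case we suspect cannot work, since the wait would not yet "see" the update at encoding time.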
C) What exactly can potentially execute in parallel, making fences a necessity?
On our iMac Pro we have tried to force a situation where parallel execution leads to corrupted results due to missing fences. Our tests involved render encoders and blit encoders, but no matter what we did, the later encoder was always executed strictly after the former, even if we artificially made the first encoder run much longer than the second (e.g. by adding a large loop to the shader). If all encoders always ran strictly one after another, fences would be superfluous. But since the fence API exists for all three encoder types, there must be situations where multiple encoders can run in parallel. We first thought that different encoder types could potentially run in parallel, but for render/blit encoders this doesn't seem to happen on our iMac Pro. Has anyone observed such parallel execution on Macs? On iOS we know that the vertex and fragment shaders of two consecutive encoders can run in parallel, but we don't know about blit and compute encoders.
One reason for this question is obviously that we would like to exploit the possibility of parallel execution across encoders, e.g. by allowing texture uploads (blits) and render passes to run in parallel. We know that Unreal Engine has a sophisticated fencing setup exactly to allow such asynchronous blits, therefore we would be interested to know which hardware/drivers support this kind of async blit. It's also important for testing purposes: our code without any fences has run perfectly glitch-free on multiple Macs so far, but if there is any hardware with more parallel execution, things can change quickly and we will need to test the code on such hardware.
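For discussion, here is a rough sketch of the kind of async-upload fencing we have in mind: let a blit overlap with an earlier render pass and only synchronize where a real hazard exists. The fence names (`frameFence`, `uploadFence`) and all resource variables are ours, purely hypothetical, and this is not a claim about how Unreal structures it:

```swift
import Metal

let frameFence  = device.makeFence()!
let uploadFence = device.makeFence()!

let cb = queue.makeCommandBuffer()!

// Render pass N: does not read the fresh upload; signals frameFence
// once its fragment work (which may still read old heap memory) is done.
let render = cb.makeRenderCommandEncoder(descriptor: passN)!
// ... draws ...
render.updateFence(frameFence, after: .fragment)
render.endEncoding()

// Upload blit: on hardware that runs encoders in parallel this could
// overlap with render pass N; it waits only because it may overwrite
// heap memory that pass N still reads.
let blit = cb.makeBlitCommandEncoder()!
blit.waitForFence(frameFence)
blit.copy(from: stagingBuffer, sourceOffset: 0,
          to: heapBuffer, destinationOffset: 0, size: byteCount)
blit.updateFence(uploadFence)
blit.endEncoding()

// Render pass N+1: consumes the upload, so only its fragment stage waits.
let renderNext = cb.makeRenderCommandEncoder(descriptor: passN1)!
renderNext.waitForFence(uploadFence, before: .fragment)
// ... draws reading heapBuffer ...
renderNext.endEncoding()

cb.commit()
```

Whether this actually buys any overlap is exactly what we cannot observe on our iMac Pro, hence the question about which hardware executes encoders in parallel.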