GPU Synchronisation?

Hi,


I'm using MTLTexture.replaceRegion to update textures dynamically. On macOS the call returns as soon as the data has been copied on the CPU side, but the texture is then synchronised with the GPU asynchronously. (As far as I can tell.)


This updating is happening in a worker thread where we do not have a command encoder available.
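
Roughly, the upload on that thread looks like this (simplified; the parameter names are just placeholders):

```swift
import Metal

// Simplified version of the worker-thread upload: copy one decoded frame
// into an existing texture.
func upload(frameBytes: UnsafeRawPointer,
            width: Int,
            height: Int,
            bytesPerRow: Int,
            into texture: MTLTexture) {
    let region = MTLRegionMake2D(0, 0, width, height)
    // Returns once the data has been copied on the CPU side.
    texture.replace(region: region,
                    mipmapLevel: 0,
                    withBytes: frameBytes,
                    bytesPerRow: bytesPerRow)
}
```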


How can I know when that GPU synchronisation is complete so that the texture can be used for rendering?

Replies

You can sample or read from a texture in a shader as soon as replaceRegion has returned.


There are some special considerations necessary if you're reading texture data back, or if you're continually changing the texture's contents. So if you have a specific use case in mind, beyond just creating a static texture to render with, I can give you more info.

Hi Dan,


The situation is this:


I have a cache of decoded video frames, and I'm selectively uploading the frames most likely to be rendered next into textures on the GPU. Textures are pulled from a cache (or created if the cache is empty) and their contents updated. This happens in a worker thread.


This system is working fine on the following platforms:
- Windows and Android with OpenGL
- iOS/tvOS with Metal (the decoder outputs Metal-compatible buffers, and textures are just mapped without any copying)
- macOS with OpenGL


On macOS with Metal, however, I'm seeing frames being displayed out of order.


As I see it, it could be one of two things:


1) Textures are reused (and updated) while they're still queued to be rendered for a previous frame.

2) Textures are used before they've been uploaded to the GPU.


I've largely discarded point 1 because I would expect to see the same issue on the other platforms. (This could be an incorrect assumption.) That leaves point 2, hence my question. The OpenGL renderer/uploader sets up a fence for each upload, so I know a texture is definitely available before attempting to render it.


Typically there are three or four video frames available for display at any given time. When a frame is deemed no longer needed, its textures are returned to the cache.

As mentioned, continually changing a texture's contents (as for video) does require some special considerations.


Unlike OpenGL, Metal requires apps to explicitly manage synchronization to ensure that data uploaded to textures and buffers is available to the GPU at the time the command buffer intended to use that data is executing. This is typically done with a semaphore that the CPU waits on at the beginning of the frame and that is signaled when the command buffer has completed (we have a number of samples that do this, in case you haven't seen them).
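
A minimal sketch of that pattern (the setup and names here are placeholders rather than code from a specific sample):

```swift
import Metal
import Dispatch

let maxFramesInFlight = 3
let inFlightSemaphore = DispatchSemaphore(value: maxFramesInFlight)

func drawFrame(commandQueue: MTLCommandQueue) {
    // Block until the GPU has finished with one of the frames currently in flight.
    inFlightSemaphore.wait()

    guard let commandBuffer = commandQueue.makeCommandBuffer() else {
        inFlightSemaphore.signal()
        return
    }

    // ... write this frame's data and encode the render pass here ...

    // Signal once the GPU has finished executing this command buffer, so the
    // CPU knows it may safely overwrite that frame's resources again.
    commandBuffer.addCompletedHandler { _ in
        inFlightSemaphore.signal()
    }
    commandBuffer.commit()
}
```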


If you already think you're managing synchronization properly, here are a couple of questions.


1) Can I have some clarification on this: "I have a cache of decoded video frames, and I'm selectively uploading the most likely frames". Are you uploading decoded video data to textures in this cache? Or are you uploading to separate textures? If separate textures, what's the cache for?

2) Are these textures created using shared storage or managed storage?

3) What GPUs are you seeing this problem with? (AMD or Intel Graphics).


Also, in general for video: if you're decoding video frames with the CPU, I recommend decoding them into a Metal buffer and creating a texture from that buffer using MTLBuffer.newTextureWithDescriptor. This won't fix your problem, and will likely make it worse, but once you have the timing issues resolved it will avoid the CPU copy of the frame that occurs when you call replaceRegion.
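
Roughly, that looks like the following (the pixel format, sizes, and alignment handling here are just for illustration):

```swift
import Metal

// Create a linear texture that aliases a region of an MTLBuffer, so decoded
// frame data written into the buffer is visible to the texture without a copy.
func makeFrameTexture(device: MTLDevice, width: Int, height: Int) -> MTLTexture? {
    let pixelFormat = MTLPixelFormat.bgra8Unorm          // assumed format
    let bytesPerPixel = 4

    // Row stride must respect the device's linear-texture alignment.
    let alignment = device.minimumLinearTextureAlignment(for: pixelFormat)
    let bytesPerRow = ((width * bytesPerPixel + alignment - 1) / alignment) * alignment

    guard let buffer = device.makeBuffer(length: bytesPerRow * height,
                                         options: .storageModeShared) else { return nil }

    let descriptor = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: pixelFormat,
                                                              width: width,
                                                              height: height,
                                                              mipmapped: false)
    descriptor.storageMode = buffer.storageMode   // must match the buffer
    descriptor.usage = .shaderRead

    // The texture shares the buffer's memory; the decoder writes into buffer.contents().
    return buffer.makeTexture(descriptor: descriptor,
                              offset: 0,
                              bytesPerRow: bytesPerRow)
}
```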

Hi Dan,


I figured there should be some synchronisation required, which is why I was asking my question in the first place. 🙂
I noticed the examples which show how to synchronise on the main thread; however, my problem really is that this is all happening in a worker thread. The synchronisation shouldn't be tied to the frame rate of the main thread, nor should it really be dependent on a command buffer. (Should uploads require any GPU execution resources?) As it happens, I don't have full access to the view resources used for rendering (this is deep inside a library). So if that's the only way, then I guess I'm just going to have to put in an artificial delay before using the texture, which seems hackish to me.

To answer your questions: I have raw frame data stored in one cache, and a set of reusable textures in another. When a frame needs to be uploaded, I fetch a suitable texture from the texture cache, and update its contents with the data from the frame cache. When that texture is no longer needed, it's returned to the texture cache for later reuse.
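
Conceptually the texture side looks something like this (a simplified sketch; the names are mine, not from the actual code):

```swift
import Metal
import Dispatch

// Simplified sketch of the texture-reuse scheme described above.
final class TextureCache {
    private var free: [MTLTexture] = []
    private let lock = DispatchQueue(label: "texture-cache")

    // Fetch a reusable texture, or create one if the cache is empty.
    func checkout(device: MTLDevice, descriptor: MTLTextureDescriptor) -> MTLTexture? {
        if let texture = lock.sync(execute: { free.popLast() }) {
            return texture
        }
        return device.makeTexture(descriptor: descriptor)
    }

    // Return a texture once its frame is no longer needed.
    func giveBack(_ texture: MTLTexture) {
        lock.sync { free.append(texture) }
    }
}
```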


Textures on macOS are created with managed storage.


I'm seeing this issue on an NVIDIA GPU. (Actually it's not my Mac but a colleague's - the Mac I'm using for development is too old to support Metal, apparently.)


I'm actually decoding with the hardware decoder via the Core Video system library. On iOS it outputs Metal-compatible buffers, and textures are mapped onto that memory in situ. For some reason that wasn't working on macOS. (It's disabled for now anyway; I forget why.) As an aside, I'm working on a software decoder at the moment which will render this whole system obsolete, but until that's up and running I need to get this fixed.
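
The iOS mapping is essentially the usual CVMetalTextureCache route, something like this (simplified; the pixel format is an assumption, and in the real code the texture cache would be created once and reused):

```swift
import CoreVideo
import Metal

// Map a decoded CVPixelBuffer to an MTLTexture in place, without copying.
func mapPixelBuffer(_ pixelBuffer: CVPixelBuffer,
                    device: MTLDevice) -> MTLTexture? {
    // (In the real code the texture cache is created once and reused.)
    var cache: CVMetalTextureCache?
    guard CVMetalTextureCacheCreate(kCFAllocatorDefault, nil, device, nil, &cache) == kCVReturnSuccess,
          let textureCache = cache else { return nil }

    var cvTexture: CVMetalTexture?
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    let status = CVMetalTextureCacheCreateTextureFromImage(kCFAllocatorDefault,
                                                           textureCache,
                                                           pixelBuffer,
                                                           nil,
                                                           .bgra8Unorm,   // assumed format
                                                           width,
                                                           height,
                                                           0,             // plane index
                                                           &cvTexture)
    guard status == kCVReturnSuccess, let mapped = cvTexture else { return nil }

    // The returned MTLTexture aliases the pixel buffer's memory.
    return CVMetalTextureGetTexture(mapped)
}
```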

You can think of the GPU as processing commands in parallel with your code, and it will do so whether that code runs on the main thread or a worker thread.


The synchronization in our samples (such as what's done in the CPU and GPU Synchronization sample) is there to synchronize data reads and writes between the CPU and GPU, not to synchronize between multiple CPU threads. So it doesn't matter whether you're issuing Metal commands on a worker thread; you still need to perform this type of synchronization. The main difference between your case and that sample is that you're writing to a texture while the sample writes to a buffer. So while that sample uses 3 buffers to implement a "ring buffer" feeding the GPU, you'd need to use 3 textures to implement this ring.
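
Adapted to textures, that ring looks roughly like this (the names and frame handling are placeholders, not code from the sample):

```swift
import Metal
import Dispatch

// Sketch of a three-deep texture ring, mirroring the buffer ring in the sample.
final class TextureRing {
    private let semaphore: DispatchSemaphore
    private let textures: [MTLTexture]
    private var index = 0

    init(textures: [MTLTexture]) {
        self.textures = textures
        self.semaphore = DispatchSemaphore(value: textures.count)
    }

    func draw(frameBytes: UnsafeRawPointer, bytesPerRow: Int, region: MTLRegion,
              commandQueue: MTLCommandQueue) {
        // Wait until the GPU has released one of the in-flight textures.
        semaphore.wait()

        let texture = textures[index]
        index = (index + 1) % textures.count

        // Safe to overwrite: no queued command buffer still reads this texture.
        texture.replace(region: region, mipmapLevel: 0,
                        withBytes: frameBytes, bytesPerRow: bytesPerRow)

        guard let commandBuffer = commandQueue.makeCommandBuffer() else {
            semaphore.signal()
            return
        }

        // ... encode a render pass that samples `texture` ...

        commandBuffer.addCompletedHandler { _ in
            // The GPU is done with this texture; it may be reused for a later frame.
            self.semaphore.signal()
        }
        commandBuffer.commit()
    }
}
```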

Hi Dan,


I understand what a GPU is, and how it works.


What I still don't understand is why the Metal API forces me to synchronise worker-thread uploads (which don't require any GPU processing) in the main thread. No data upload, whether to a buffer or a texture, should require any GPU processing. At least, not by the GPU itself. The card that holds the GPU will (most likely) be a PCI bus master, and the upload request should be converted into a DMA request for the controller chip on the card. I can accept that a driver might have to marshal PCI command requests through a single thread for communication with the card, but that still doesn't explain why there's no facility in the Metal API for synchronising uploads within a single thread. (I.e., why can't the driver marshal such requests in both directions?)


Basically, it would be a million times more helpful if I could attach a synchronisation object to a texture or buffer when I know that that buffer or texture is going to be updated, and just be able to query that object from any thread to know when the transfer is complete. If I had that, the rest of our system (which is already providing the necessary synchronisation between many CPU threads) would just work.
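
Something along these lines, purely for illustration (nothing like this exists in Metal as far as I can tell; it's just the shape of what I mean):

```swift
import Metal

// Hypothetical: the kind of per-resource upload fence described above.
// Metal does not provide this object; this only illustrates the idea.
protocol UploadFence {
    /// True once the data most recently written to the resource is visible to the GPU.
    var isComplete: Bool { get }
    /// Block the calling thread (any thread) until the transfer has completed.
    func wait()
}

// Imagined usage from the worker thread:
//   let fence = texture.attachUploadFence()        // hypothetical API
//   texture.replace(region: region, mipmapLevel: 0,
//                   withBytes: bytes, bytesPerRow: stride)
//   ...
//   if fence.isComplete { /* safe to hand the texture to the renderer */ }
```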


To be honest, if Metal makes simple things like this harder than they should be, fewer people will be inclined to put the effort into supporting it natively.

Sorry, didn't mean to be patronizing (but many don't understand that).


You shouldn't have to perform buffer/texture uploads on the main thread. What happens when you perform these uploads on another thread? Is old data read by the GPU?


If the resource uses shared memory, you should just have to write the data (either to the contents pointer of a buffer, or using replaceRegion on a texture). If the resource uses managed memory, for a buffer you do need to call didModifyRange after you write to the contents pointer. But it shouldn't matter what thread you do this from.
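
For a managed buffer that's just the following (the sizes and data here are placeholders):

```swift
import Metal

// Write new data into a managed buffer and tell Metal which range changed,
// so the driver can mirror it to the GPU's copy.
func updateManagedBuffer(_ buffer: MTLBuffer, with data: [Float]) {
    let byteCount = data.count * MemoryLayout<Float>.stride
    data.withUnsafeBytes { src in
        buffer.contents().copyMemory(from: src.baseAddress!, byteCount: byteCount)
    }
    // Required for .storageModeManaged resources on macOS after a CPU write.
    buffer.didModifyRange(0..<byteCount)
}
```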

Hi Dan,


I finally found some time earlier this week to test out some more things on the machine in question.


It turns out, rather embarrassingly for me, that my first assumption wasn't actually correct. I.e., we do appear to be reusing textures before the GPU has finished rendering with them. I removed all caching and reuse of textures from the system, and the problem seems to have disappeared.


Exactly how this hasn't been spotted on the other systems, I don't know. All of the code related to this part of the system is completely cross-platform. (I don't know anything about the renderer setup in the application layer, so it's possible the Metal path is triple buffered where the OpenGL path is double buffered, or something like that, which might explain it.)


Either way though, it looks like I need to track when a frame (and its resources) has been retired, and not reuse anything until after that point.
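
Probably something like registering a completion handler on the command buffer that last uses a frame's textures, assuming the renderer can hand that back to me (a rough sketch; the names are mine):

```swift
import Metal

// Sketch: treat a frame's textures as retired only once the GPU has finished
// the command buffer that rendered them, and only then return them for reuse.
func encodeFrame(commandBuffer: MTLCommandBuffer,
                 frameTextures: [MTLTexture],
                 returnToCache: @escaping (MTLTexture) -> Void) {
    // ... encode the render pass that samples `frameTextures` ...

    commandBuffer.addCompletedHandler { _ in
        // Safe point: the GPU no longer references these textures.
        frameTextures.forEach(returnToCache)
    }
    commandBuffer.commit()
}
```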


Thanks for your help, anyway.