We have a video rendering engine we're currently transitioning from OpenGL to Metal.
Things are working fine and the overall performance is great, but we’re hitting a bottleneck that we also hit with OpenGL: downloading pixel data from the GPU to host memory.
Our app offers playback functionality through third-party manufacturers’ PCIe and Thunderbolt video devices. The respective SDK hands us a buffer pointer and it’s our job to fill that buffer with image data.
Since we’re compositing on the GPU, we always need a GPU->CPU transfer at some point, and that transfer takes several milliseconds.
With an RGBA 8-bit buffer we measured about 25 ms for a 4K video frame, which is a deal breaker for 50 or 60 Hz playback (a frame budget of 20 ms or about 16.7 ms). We can of course try moving this to YUV to get the number of bytes down, but it will still be a problem when we go to e.g. 10 bits.
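To put those numbers in perspective, here is a small sketch of the arithmetic in plain C (which also compiles inside an Objective-C file). The function names are made up for illustration, and it assumes "4K" means 3840 x 2160 with tightly packed rows, which may not match the actual frame layout.

```c
#include <stddef.h>

/* Bytes in one tightly packed frame (assumes no row padding). */
static size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

/* Time available per frame, in milliseconds, at a given refresh rate. */
static double frame_budget_ms(double hz) {
    return 1000.0 / hz;
}

/* Effective transfer rate in GB/s (1 GB = 1e9 bytes) for a measured copy time. */
static double throughput_gb_per_s(size_t bytes, double milliseconds) {
    return ((double)bytes / 1e9) / (milliseconds / 1000.0);
}
```

Under those assumptions, frame_bytes(3840, 2160, 4) is 33,177,600 bytes, so a 25 ms readback corresponds to roughly 1.33 GB/s, while the budget at 50 Hz is 20 ms for the whole frame, not just the copy.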
But since we're rather new to Metal it may be that we're missing something. Our current way of getting the pixel data is this:
[_resultTexture getBytes:data
            bytesPerRow:_resultTexture.width * 4
             fromRegion:MTLRegionMake2D(0, 0, _resultTexture.width, _resultTexture.height)
            mipmapLevel:0];
We have also tried copying the data with an MTLBlitCommandEncoder, as described here: https://developer.apple.com/documentation/metal/onscreen_presentation/reading_pixel_data_from_a_drawable_texture, with similar results.
Since the manufacturers of these devices are now starting to support 8K workflows as well, I'm really starting to wonder whether there's a better and much faster way of getting the content of a GPU texture into a host memory block.
Any help would be appreciated!
The best way to avoid the data copy is https://developer.apple.com/documentation/metal/mtlbuffer/1613852-maketexture (newTextureWithDescriptor:offset:bytesPerRow: in Objective-C). This lets you create a linear Metal texture that shares storage with a buffer, so no extra copy is required. However, if the destination pointer comes from an external source, this is going to be difficult. You might be able to use https://developer.apple.com/documentation/metal/mtldevice/1433382-makebuffer (newBufferWithBytesNoCopy:length:options:deallocator: in Objective-C) to create a buffer from that pointer, provided it satisfies the requirements: it must be page-aligned, lie within a single VM region, etc.
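Before trying the no-copy route with the pointer the SDK hands you, it may be worth checking the alignment requirements up front. The following is only a sketch in plain C (usable from Objective-C as is); the helper names is_page_aligned and align_up are hypothetical, and it covers only the page-alignment part, not the single-VM-region requirement. The row alignment to pass to align_up is an assumption here; in a real app you would query it from the device, for example via minimumLinearTextureAlignmentForPixelFormat:.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* System page size in bytes. */
static size_t page_size(void) {
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* True if both the start address and the length are multiples of the
   page size, as the no-copy buffer call requires. This does NOT verify
   that the memory lies within a single VM region. */
static bool is_page_aligned(const void *ptr, size_t length) {
    size_t page = page_size();
    return ptr != NULL && length > 0 &&
           ((uintptr_t)ptr % page) == 0 && (length % page) == 0;
}

/* Round value up to the next multiple of alignment (a power of two),
   e.g. to pad bytesPerRow for a buffer-backed linear texture. */
static size_t align_up(size_t value, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

If is_page_aligned fails for the SDK's buffer, the no-copy path is not available for that pointer and a copy (or a buffer you allocate yourself, if the SDK allows supplying one) is still needed. Padding rows with align_up only helps if the video device accepts the resulting row stride.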