We recently noticed that copying pixel data from a Metal texture to memory is a lot slower on the new iPhones equipped with the A14 Bionic.
We tracked down the guilty function on MTLTexture and found that getBytes(_:bytesPerRow:from:mipmapLevel:) runs 8 to 20 times slower than on two-year-old iPhones (iPhone XR). To measure how long it takes, we used signposts.
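For anyone who wants to reproduce the measurement, a signpost interval around the call might look roughly like the sketch below. This is not the exact code from our project: the subsystem string, the `readPixels` helper, and the 4-bytes-per-pixel assumption (e.g. `.bgra8Unorm`) are placeholders for illustration.

```swift
import Metal
import os.signpost

// Hypothetical log handle; the subsystem name is a placeholder.
let readbackLog = OSLog(subsystem: "com.example.texture-readback",
                        category: .pointsOfInterest)

/// Copies the whole texture into CPU memory and wraps the getBytes call
/// in a signpost interval so its duration shows up in Instruments.
/// Assumes a CPU-accessible texture with 4 bytes per pixel.
func readPixels(from texture: MTLTexture) -> [UInt8] {
    let bytesPerRow = texture.width * 4
    var pixels = [UInt8](repeating: 0, count: bytesPerRow * texture.height)
    let region = MTLRegionMake2D(0, 0, texture.width, texture.height)

    os_signpost(.begin, log: readbackLog, name: "getBytes")
    pixels.withUnsafeMutableBytes { raw in
        texture.getBytes(raw.baseAddress!,
                         bytesPerRow: bytesPerRow,
                         from: region,
                         mipmapLevel: 0)
    }
    os_signpost(.end, log: readbackLog, name: "getBytes")
    return pixels
}
```

The interval then appears under the Points of Interest track in Instruments, which makes it easy to compare devices.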
We've created a small demo project that converts an MTLTexture to a CVPixelBuffer: https://github.com/alikaragoz/UsingARenderPipelineToRenderPrimitives
The interesting part is located at this line: https://github.com/alikaragoz/UsingARenderPipelineToRenderPrimitives/blob/41f7f4385a490e889b94ee2c8913ce532a43aacb/Renderer/MetalUtils.swift#L40
Do you guys have an idea about what could be the issue?
I submitted feedback and got this reply:
This likely has to do with the internal representation of the texture data, which on certain newer Apple Silicon GPUs can be compressed so as to save on bandwidth and power. However, when the CPU needs to make a copy into user memory (i.e. via getBytes), it needs to perform decompression, which is likely the perf issue you found. There are several ways to deal with this; the best one depends on how the texture is being used by your application, which we don't know, so we'll just list a few options:
- Instead of using getBytes into user memory, allocate an MTLBuffer of the same size and issue a GPU blit from the texture into the buffer right after the texture contents you want to get have been computed on the GPU. Then, instead of calling getBytes, just read through the .contents pointer of the buffer. Additional tip for this case: create the MTLBuffers once and reuse them from a pool to avoid repeated resource creation and destruction.
- Keep using getBytes as you already do. However, make the GPU change the representation of the texture to be friendly to the CPU after the texture contents have been computed on the GPU. See https://developer.apple.com/documentation/metal/mtlblitcommandencoder/2966538-optimizecontentsforcpuaccess. This burns some GPU cycles, but is probably the least intrusive change. To avoid burning the GPU cycles, see the next option.
- Adjust the texture creation (this assumes you are creating the MTLTexture instance in your code; if it occurs elsewhere outside of your control, this option may not be possible). On the MTLTextureDescriptor, set this property to NO: https://developer.apple.com/documentation/metal/mtltexturedescriptor/2966641-allowgpuoptimizedcontents. This will make the GPU never use a compressed internal representation for this texture (you lose the GPU bandwidth/power benefits, but if your use case involves frequent CPU access, it can be a good tradeoff).
Since all of these options are essentially performance tradeoffs, you should review the app performance before and after the change to verify you see the expected upside, and no (or acceptable) downsides elsewhere.
(end)
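The first option (blit into a buffer, then read `.contents`) can be sketched roughly as follows. This is my reading of the reply, not code from Apple: the `encodeReadback` helper is a hypothetical name, a 4-bytes-per-pixel format such as `.bgra8Unorm` is assumed, and the buffer is allocated on every call only to keep the example short (the reply recommends reusing buffers from a pool).

```swift
import Metal

/// Encodes a GPU copy of `texture` into a CPU-visible MTLBuffer.
/// Hypothetical helper; assumes 4 bytes per pixel and a 2D texture.
func encodeReadback(of texture: MTLTexture,
                    into commandBuffer: MTLCommandBuffer,
                    device: MTLDevice) -> MTLBuffer? {
    let bytesPerRow = texture.width * 4
    let length = bytesPerRow * texture.height

    // Shared storage so the CPU can read the result through contents().
    // A real app would take this buffer from a reusable pool instead.
    guard let buffer = device.makeBuffer(length: length,
                                         options: .storageModeShared),
          let blit = commandBuffer.makeBlitCommandEncoder() else {
        return nil
    }

    // The GPU does the copy, so the CPU never touches the texture's
    // internal (possibly compressed) layout.
    blit.copy(from: texture,
              sourceSlice: 0,
              sourceLevel: 0,
              sourceOrigin: MTLOrigin(x: 0, y: 0, z: 0),
              sourceSize: MTLSize(width: texture.width,
                                  height: texture.height,
                                  depth: 1),
              to: buffer,
              destinationOffset: 0,
              destinationBytesPerRow: bytesPerRow,
              destinationBytesPerImage: length)
    blit.endEncoding()
    return buffer
}
```

The blit has to be encoded on a command buffer that runs after the rendering work, and the bytes behind `buffer.contents()` are only valid once that command buffer has completed (for example inside an `addCompletedHandler` block, or after `waitUntilCompleted()`).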
So I built a demo project to test the solutions; you can check it here: GitHub
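For reference, the second and third options from the reply boil down to very little code. The sketch below is a minimal illustration under my own assumptions, not the demo project's exact code: the helper names, the pixel format, and the usage flags are placeholders.

```swift
import Metal

/// Option 3: create the texture with GPU-optimized (compressed) contents
/// disabled, so getBytes does not need a decompression step.
/// Hypothetical helper; pixel format and usage are assumptions.
func makeCPUFriendlyTexture(device: MTLDevice,
                            width: Int,
                            height: Int) -> MTLTexture? {
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .bgra8Unorm,
        width: width,
        height: height,
        mipmapped: false)
    descriptor.usage = [.renderTarget, .shaderRead]
    descriptor.allowGPUOptimizedContents = false
    return device.makeTexture(descriptor: descriptor)
}

/// Option 2: keep the texture as it is, but ask the GPU to convert it to a
/// CPU-friendly layout before getBytes is called. Encode this after the
/// render pass that produces the texture contents.
func encodeOptimizeForCPU(_ texture: MTLTexture,
                          in commandBuffer: MTLCommandBuffer) {
    guard let blit = commandBuffer.makeBlitCommandEncoder() else { return }
    blit.optimizeContentsForCPUAccess(texture: texture)
    blit.endEncoding()
}
```

With option 2, getBytes should only be called once the command buffer containing the optimize pass has completed. As the reply says, both are tradeoffs, so it is worth measuring before and after on the devices you care about.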