Yeah, it turns out the private texture formats work exactly how CVPixelBuffer should work when getting an MTLTexture: you pass the format into CVMetalTextureCacheCreateTextureFromImage for plane 0, ignore plane 1, and the resulting MTLTexture handles all of the YUV->RGB conversion for you. It also has a fairly significant performance/power benefit.
Here's an example of their usage: https://gist.github.com/shinyquagsire23/81c86f4bf670aaa68b5804080ff964a0. Might not be kosher for App Store submission, but if it's good enough for WebKit it's good enough for me. If Apple doesn't want it used directly they should provide an actual API/abstraction tbh.
I attempted to read the packed 10-bit data out with the following code (does not work):
half readPackedY(texture2d<uint> in_tex_y, uint2 xyPx, uint stride) {
    // Linear pixel index, and the byte offset of its 4-pixel/5-byte group
    uint idx = xyPx.x + xyPx.y * stride;
    uint idxPackedBase = (idx / 4) * 5;
    uint which = idx % 4;

    // Per-position lookup tables: which two bytes of the 5-byte group hold
    // this pixel's bits, and how to mask/shift them back together
    uint px0Lut[4] = {0, 1, 2, 3};
    uint px0Mask[4] = {0xFF, 0x3F, 0xF, 3};
    int px0Shift[4] = {2, 4, 6, 8};
    uint px1Lut[4] = {1, 2, 3, 4};
    uint px1Mask[4] = {0xC0, 0xF0, 0xFC, 0xFF};
    int px1Shift[4] = {6, 4, 2, 0};

    // Map the two byte offsets back to texture coordinates
    uint px0Idx = idxPackedBase + px0Lut[which];
    uint px1Idx = idxPackedBase + px1Lut[which];
    uint2 px0XY = uint2(px0Idx % stride, px0Idx / stride);
    uint2 px1XY = uint2(px1Idx % stride, px1Idx / stride);

    // Reassemble the 10-bit value from the high and low byte
    uint px0 = in_tex_y.read(px0XY).r & px0Mask[which];
    uint px1 = in_tex_y.read(px1XY).r & px1Mask[which];
    uint px = (px0 << px0Shift[which]) | (px1 >> px1Shift[which]);
    return half(float(px) / 1023.0);
}
However, Metal stores the texture in a swizzled layout (seemingly 4px x 4px tiles), so the linear indexing above reads the wrong bytes, and I'd rather not reverse-engineer the swizzle and bake undefined behavior into my app.
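For reference, here's a CPU-side Python sketch of the linear (un-swizzled) packing the shader above assumes: every 4 ten-bit pixels occupy 5 bytes, most significant bits first. This is only the layout the code was written against; the actual in-memory layout of the private format is undocumented and, per the swizzling issue, not linear.

```python
def pack4(px):
    """Pack four 10-bit values into 5 bytes, MSB-first."""
    assert len(px) == 4 and all(0 <= p <= 0x3FF for p in px)
    bits = 0
    for p in px:
        bits = (bits << 10) | p          # concatenate 4 x 10 = 40 bits
    return bits.to_bytes(5, "big")

def unpack_pixel(data, idx):
    """Mirror of the shader's readPackedY indexing for a flat byte buffer.
    (The shader's px1 mask is redundant with the right shift, so it's
    omitted here.)"""
    base = (idx // 4) * 5                # 5-byte group holding this pixel
    which = idx % 4
    px0_lut   = [0, 1, 2, 3]
    px0_mask  = [0xFF, 0x3F, 0x0F, 0x03]
    px0_shift = [2, 4, 6, 8]
    px1_lut   = [1, 2, 3, 4]
    px1_shift = [6, 4, 2, 0]
    b0 = data[base + px0_lut[which]] & px0_mask[which]
    b1 = data[base + px1_lut[which]]
    return (b0 << px0_shift[which]) | (b1 >> px1_shift[which])
```

Round-tripping `pack4` through `unpack_pixel` checks out, which is what convinced me the indexing math itself is fine and the swizzle is the problem.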
Signs seem to be pointing to 'actually there is no API for this' because WebKit has its own fancy MTLPixelFormatYCBCR10_420_2P_PACKED (boooooooo private texture formats 🍅🍅🍅)
I've also had this issue on this network on TF 2.9.1 (both tensorflow-macos and built from source), with tensorflow-metal 0.4.1 and 0.5.0. It's mitigable by restarting the Python script every 10-30 epochs (usually corresponding to ~20 minutes of wall time), transferring the weights, and then resuming training, but sometimes it would hang right off the bat. Eventually I had to switch to CPU training, and I can also concur that it seems to be faster for some reason.
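A minimal sketch of that restart-and-resume workaround, running each training chunk in a fresh child process so GPU state is torn down between chunks and weights carry over via a checkpoint file. The child code here is a stand-in for the real training script (which would call model.load_weights()/model.save_weights() around its fit loop); the chunk sizes are the ones that worked for me, not anything principled.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the actual training script: loads the checkpoint if one
# exists, "trains" for a chunk of epochs, and saves the checkpoint again.
CHILD = r"""
import json, os, sys
ckpt = sys.argv[1]
state = {"epoch": 0}
if os.path.exists(ckpt):
    with open(ckpt) as f:
        state = json.load(f)
state["epoch"] += 10          # stand-in for ~10 epochs of model.fit()
with open(ckpt, "w") as f:
    json.dump(state, f)
"""

def train_in_chunks(total_epochs=30, epochs_per_chunk=10):
    """Run training as a series of short-lived child processes.

    If the Metal backend hangs, only the current child needs killing
    (the timeout handles that), and the next chunk resumes from ckpt.
    """
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    for _ in range(total_epochs // epochs_per_chunk):
        subprocess.run([sys.executable, "-c", CHILD, ckpt],
                       check=True, timeout=1800)
    return ckpt
```

In practice you'd add retry logic around the `subprocess.run` call for the chunks that hang immediately, but this is the basic shape of the mitigation.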
Hardware info: Mac Studio w/ M1 Max, 10 CPU cores 24 GPU cores, 32GB RAM
Software info: macOS 12.4, Python 3.9.13 (homebrew)