I need to deal with very wide images, beyond the 16384-pixel width limit that Metal textures have. So I've resorted to using Metal buffers to reorganize my pixel data. Even using an MTLTexture for output was failing due to purported issues with getBytes and texture synchronization on NVIDIA hardware.
Anyway, below is a code snippet that works just fine for converting 32-bit RGBA data (with forced alpha) into my desired compact form. Both my input and output buffers are formatted for 32-bit RGBA data. My output buffer is actually a CVPixelBuffer.
If I change the noted line (the one that computes ptr) to deal with 24-bit BGR data instead, multiplying by 3 bytes/pixel instead of 4, all I get is a black image.
I'm baffled as to why things are failing.
kernel void stripe_Kernel(device const uchar *inBuffer [[ buffer(0) ]],
                          device uchar4 *outBuffer [[ buffer(1) ]],
                          device const ushort *imgWidth [[ buffer(2) ]],
                          device const ushort *imgHeight [[ buffer(3) ]],
                          device const ushort *packWidth [[ buffer(4) ]],
                          uint2 gid [[ thread_position_in_grid ]])
{
    const ushort imgW = imgWidth[0];   // e.g. 18000+
    const ushort imgH = imgHeight[0];  // e.g. 2048
    const ushort packW = packWidth[0]; // e.g. 1024
    uint32_t posX = gid.x;
    uint32_t posY = gid.y;
    uint32_t sourceX = ((int)(posY/imgH)*packW + posX) % imgW;
    uint32_t sourceY = (int)(posY%imgH);
    uint32_t ptr = (sourceY*imgW + sourceX)*4; // change this to "*3" for 24-bit
    uchar4 pixel = uchar4(inBuffer[ptr],inBuffer[ptr+1],inBuffer[ptr+2],255);
    outBuffer[posY*packW + posX] = pixel;
}
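For comparison, here is a minimal CPU sketch in plain C of the same index mapping, with the bytes-per-pixel factor as a parameter. It is only a reference for checking what the kernel should produce for a small test image, not a replacement for it. The names stripe_pixel_ref and stripe_ref, and the outH parameter (number of output rows), are hypothetical and not part of my actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU reference for one output pixel of stripe_Kernel.
   bytesPerPixel is 4 for the working RGBA case, 3 for packed 24-bit input.
   The output is always 4 bytes per pixel with alpha forced to 255. */
static void stripe_pixel_ref(const uint8_t *inBuffer, uint8_t *outBuffer,
                             uint32_t imgW, uint32_t imgH, uint32_t packW,
                             uint32_t posX, uint32_t posY,
                             uint32_t bytesPerPixel)
{
    assert(bytesPerPixel == 3 || bytesPerPixel == 4);

    /* Same arithmetic as the kernel: pick the stripe, then the pixel in it. */
    uint32_t sourceX = ((posY / imgH) * packW + posX) % imgW;
    uint32_t sourceY = posY % imgH;

    size_t src = ((size_t)sourceY * imgW + sourceX) * bytesPerPixel;
    size_t dst = ((size_t)posY * packW + posX) * 4;

    /* Channel order is copied as-is, exactly like the kernel does. */
    outBuffer[dst + 0] = inBuffer[src + 0];
    outBuffer[dst + 1] = inBuffer[src + 1];
    outBuffer[dst + 2] = inBuffer[src + 2];
    outBuffer[dst + 3] = 255;
}

/* Fill a packW x outH output image (hypothetical helper). */
static void stripe_ref(const uint8_t *inBuffer, uint8_t *outBuffer,
                       uint32_t imgW, uint32_t imgH, uint32_t packW,
                       uint32_t outH, uint32_t bytesPerPixel)
{
    for (uint32_t y = 0; y < outH; y++)
        for (uint32_t x = 0; x < packW; x++)
            stripe_pixel_ref(inBuffer, outBuffer, imgW, imgH, packW,
                             x, y, bytesPerPixel);
}
```

Running this on a tiny image (say 5x2 pixels with packW = 2) and comparing against the GPU output should show whether the "*3" indexing itself is at fault or whether the problem is elsewhere, for example in how the buffer is handed to Metal.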
I should mention that I allocate the input buffer like this:
posix_memalign((void **)&diskFrame, 0x4000, imgHeight*imgWidth*4);
I've even left this as-is (still sized for 4 bytes/pixel) when dealing with 24-bit data, thinking I might be having memory alignment issues on the GPU.
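To make the sizing explicit, here is a hedged sketch in C of an allocation helper that works for either layout and rounds the byte count up to a multiple of the 0x4000 alignment. The rounding matters only under an assumption not stated above: that the memory is later wrapped in a no-copy MTLBuffer, which as far as I know expects a page-aligned length as well as a page-aligned pointer. The helper names round_up_to_page and alloc_frame are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_ALIGN ((size_t)0x4000) /* same alignment as in the call above */

/* Round n up to the next multiple of FRAME_ALIGN. */
static size_t round_up_to_page(size_t n)
{
    return (n + FRAME_ALIGN - 1) / FRAME_ALIGN * FRAME_ALIGN;
}

/* Allocate an aligned frame buffer for 3 or 4 bytes per pixel.
   On success returns the pointer and stores the padded length in *outLen;
   returns NULL on failure. The caller releases the memory with free(). */
static uint8_t *alloc_frame(size_t imgWidth, size_t imgHeight,
                            size_t bytesPerPixel, size_t *outLen)
{
    assert(bytesPerPixel == 3 || bytesPerPixel == 4);
    assert(outLen != NULL);

    size_t len = round_up_to_page(imgWidth * imgHeight * bytesPerPixel);
    void *p = NULL;
    if (posix_memalign(&p, FRAME_ALIGN, len) != 0)
        return NULL;

    *outLen = len;
    return (uint8_t *)p;
}
```

The padded length returned in outLen is what would be passed as the buffer length when creating the MTLBuffer, so that the pointer and the length agree regardless of whether the data is 24-bit or 32-bit.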
I was previously using the Accelerate framework to convert my 24-bit source data to 32-bit source data (inBuffer) for use by Metal, but doing this conversion on the GPU should, anecdotally, be about 3x faster. This code could be much shorter if, say, inBuffer were also declared as uchar4 or uint32_t, but I'm demonstrating a failure case.
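For clarity about what that CPU-side step does, here is a portable sketch in plain C of the 24-bit to 32-bit expansion with forced alpha. It is a stand-in to illustrate the operation, not the Accelerate call I was actually using, and the function name expand_24_to_32 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand tightly packed 3-byte pixels into 4-byte pixels, forcing the
   fourth byte (alpha) to 255. Channel order is preserved as-is.
   src must hold pixelCount*3 bytes, dst must hold pixelCount*4 bytes. */
static void expand_24_to_32(const uint8_t *src, uint8_t *dst,
                            size_t pixelCount)
{
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < pixelCount; i++) {
        dst[i * 4 + 0] = src[i * 3 + 0];
        dst[i * 4 + 1] = src[i * 3 + 1];
        dst[i * 4 + 2] = src[i * 3 + 2];
        dst[i * 4 + 3] = 255;
    }
}
```

The kernel above is meant to fold this expansion into the striping pass, which is why it reads inBuffer byte by byte instead of as uchar4.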