Metal kernel issues with 24-bit data...

I need to deal very wide images, beyond the 16384-wide limit that Metal has. So, I've resorted to using Metal buffers to reorganize my pixel data. Even using a MTLTexture for output was failing due to purported issues with getBytes and texture syncronization on NVidia hardware.


Anyways, below is a code snippet that works just fine for converting 32-bit RGBA data (with forced alpha) into my desired compact form. Both my input and output buffers are formatted for 32-bit RGBA data. My output buffer is actually a CVPIxelBuffer.


If I change line 18 to deal with 24-bit BGR data instead -- the noted one, multiplying by 3 bytes/pixel instead of 4, all I get is a black image.


I'm baffled as to why things are failing.



kernel void stripe_Kernel(device const uchar *inBuffer [[ buffer(0) ]],
                          device uchar4 *outBuffer [[ buffer(1) ]],
                          device const ushort *imgWidth  [[ buffer(2) ]],
                          device const ushort *imgHeight [[ buffer(3) ]],
                          device const ushort *packWidth  [[ buffer(4) ]],
                          uint2 gid [[ thread_position_in_grid ]])
{
    const ushort imgW  = imgWidth[0];  // eg. 18000+
    const ushort imgH  = imgHeight[0]; // eg. 2048
    const ushort packW = packWidth[0]; // eg. 1024

    uint32_t posX = gid.x;
    uint32_t posY = gid.y;

    uint32_t sourceX = ((int)(posY/imgH)*packW + posX) % imgW;
    uint32_t sourceY = (int)(posY%imgH);

    uint32_t ptr = (sourceY*imgW + sourceX)*4;  // change this to "*3" for 24-bit

    uchar4 pixel = uchar4(inBuffer[ptr],inBuffer[ptr+1],inBuffer[ptr+2],255);
    outBuffer[posY*packW + posX] = pixel;
}


I should mention that I allocate the input Buffer thusly:

posix_memalign((void *)&diskFrame,0x4000,imgHeight*imgWidth*4);


I've even left this as-is when dealing with 24-bit data thinking I may be having memory alignment issues on the GPU.


I was previously using the Accelerate framework to convert my 24-bit source data to 32-bit source data (inBuffer) for use by Metal, but doing this conversion on the GPU should anecdotally be about 3x faster. This code could be much shorter if, say, inBuffer was also defined to be uchar4 or uint32_t, but I'm demonstrating a failure case.

Replies

I tried hard to find a bug in your code and couldn't. So it is either that something else is wrong (for example, the way you're invoking the kernel for 32/24 bit data, or 24 bit data order, or whatever else) or it is a Metal bug. What I'd do it to try to prove the latter. For example: 1) Test it on different GPU/device, if it works somewhere else, then it is a bug 2) Test result of that line 18 in "24" bit version, for example change line 19 to Uchar4 pixel = uchar4(ptr%256, ptr/256,255,255); and look at the patterns it produces. You can also inspect outBuffer contents in Metal debugger - remember to clear it to some distinct pattern first, so you'll see where your kernel really writes to and what? Try to write some constant/pattern from kernel, too or maybe sourceX/sourceY values. You have to break "black box" and try to determine what it is really doing. Good luck! Michał

I was having a very similar problem. I was trying to use Metal to convert an BGRA8unorm format to a compact BGR888 format.

I created a MTLBuffer for compact BGR888:

_tryBufferSize = _inputTexture.width * _inputTexture.height * 3;    // note *3 for BGR888
size_t aligned_size = ((_tryBufferSize + 4095) / 4096) * 4096;
int ret = posix_memalign(&_tryBufferRaw, 4096, aligned_size);
NSAssert(ret == 0 && _tryBufferRaw != NULL, @"posix_memalign failed.");
_tryBuffer = [_device newBufferWithBytesNoCopy:_tryBufferRaw
                                        length:aligned_size
                                       options:MTLResourceStorageModeShared
                                   deallocator:^(void * _Nonnull pointer, NSUInteger length) { free(pointer); }];

and compute with: (inTexture BGRA8888 -> outBuffer BGR888) (reduced demo code)

kernel void
tryKernel(texture2d<half, access::read>  inTexture  [[texture(0)]],
          device uchar3                  *outBuffer [[buffer(0)]],
          uint2                          gid        [[thread_position_in_grid]])
{
    if((gid.x >= inTexture.get_width()) || (gid.y >= inTexture.get_height()))
    {
        return;
    }

    uint index = gid.y * inTexture.get_width() + gid.x;
    outBuffer[index] = uchar3(1, 1, 1);    // note uchar3 for BGR888
}

This code failed.

The following code is OK to run:

_tryBufferSize = _inputTexture.width * _inputTexture.height * 4;
...
device uchar4 *outBuffer [[buffer(0)]],
...
outBuffer[index] = uchar4(1, 1, 1, 1); // uchar2 also worked fine.

Then I came upon this 4-year old question.


MSLS 2.2 Vector Data Types specified (Table 2.3):

uchar2 size 2B

uchar3 size 4B

uchar4 size 4B

which is not the way I thought it was.

It turns out Metal still supports a very straight-forward way to access non-2B-aligned element:

define the element layout first:

typedef struct __attribute__((__packed__))
{
    unsigned char r;
    unsigned char g;
    unsigned char b;
} TryRGB888;

then compute works just as charm:

_tryBufferSize = _inputTexture.width * _inputTexture.height * 3;    // CPU side memory alloc size, *3 for BGR888
...
device TryRGB888 *outBuffer [[buffer(0)]],    // shader side
...
uint index = gid.y * inTexture.get_width() + gid.x;
outBuffer[index].r = 1;
outBuffer[index].g = 1;
outBuffer[index].b = 1;

I found similar issues and I simplified the case to as below:

Given an array unit32_t* inA, I want to output an array with each element increased by 1. Every thing works until the array length becomes 1024 * 1024 * 4, when the output array becomes all 0. It works even when the array length is 1024 * 1024 * 4 - 1.

And, somehow I increase the array size to 1024 * 1024 * 4 + 128 * 128, it works again... as a really weird workround.

Could anyone explain why 1024 * 1024 * 4 is a special number?

Thanks


    kernel void increase_array(
        /* param idx 0 - setBuffer */
        device const uint32_t* inA,
        
        /* param idx 1 - setBuffer */
        device uint32_t* result,
        
        /* the thread index */
        uint index [[thread_position_in_grid]]
       )
       {
           // the for-loop is replaced with a collection of threads, each of which
           // calls this function.
           result[index] = index;
       }