What is the difference between a float2 and a packed_float2?

I'm trying to use a simd float2 inside a struct to send to a shader, and it works if the receiving struct in the Metal file is typed as a packed_float2. Things do not look right when I try to use a float2 in my receiving struct in Metal. I checked the alignment documentation and noticed that the alignment for a packed_float2 is 4 bytes but its size is 8 bytes. The simd float2 type's size is also 8 bytes. Is the simd float2 type not aligned to 8 bytes? Since the sizes are the same, what does it matter if the shader reads it 8 bytes at a time or 4 bytes at a time? Why does this generate garbage?


I'm also having similar problems with float3 simd types. I have structs with just float3 types, not packed_float3 types, and they work fine: I can send data from my Swift program to the shader without problems. But I have a separate struct with mixed types (float, float2 and float3; it is the one mentioned above for the float2 problems), and everything in this struct needs to be a packed type. I'm trying to follow the documentation as best as possible, but there really isn't much of this that's documented. Is there some useful documentation or resource guide to help me through my alignment issues? Thank you.

Accepted Reply

Hello


Your original structure (with float2/float3s instead of packed types) would give something like:


offset 0: zoom

offset 4: near

offset 8: far

offset 12: padding 4 bytes inserted by compiler because float2 cannot begin at offset 12 - it has to be multiple of 8

offset 16: winResolution.x - here float2 can begin

offset 20: winResolution.y

offset 24: padding 8 bytes inserted by compiler because float3 cannot begin at offset 24 - it has to be multiple of 16

offset 32: cameraRotation.x - here float3 can begin

offset 36: cameraRotation.y

offset 40: cameraRotation.z

offset 44: padding inherent to float3 type

offset 48: cameraTranslation.x - no extra padding needed here, 48 is 3 * 16

offset 52: cameraTranslation.y

offset 56: cameraTranslation.z

offset 60: padding inherent to float3 type

offset 64: useCamera


Lots of padding - the struct ends up 16 bytes larger than it needs to be. The simplest solution is to rearrange your structure a bit, so that the compiler doesn't have to insert that much padding. So I'd do:

struct RenderInfo {
    float near;
    float far;
    float2 winResolution;
    float3 cameraRotation;
    float3 cameraTranslation;
    float zoom;
    bool useCamera;
};

Now, if I am not mistaken (got 6 month old daughter, not getting enough sleep, so be careful), that will look like:

offset 0: near

offset 4: far

offset 8: winResolution.x - no padding needed, as the offset is a multiple of 8

offset 12: winResolution.y

offset 16: cameraRotation.x - no padding needed, as the offset is a multiple of 16

offset 20: cameraRotation.y

offset 24: cameraRotation.z

offset 28: padding inherent to float3 type - sizeof( float3 ) is 16

offset 32: cameraTranslation.x - no padding needed, as the offset is a multiple of 16

offset 36: cameraTranslation.y

offset 40: cameraTranslation.z

offset 44: padding inherent to float3 type, just like offset 28

offset 48: zoom

offset 52: useCamera

See: no extra padding between members, 16 bytes saved. One caveat though - I've had various problems with bool variables on Intel GPUs and ended up using 4-byte ints for boolean values instead. YMMV, but definitely put that bool at the very end.
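The two layouts above can be checked on the host with a small C++ sketch. It assumes hypothetical stand-in types `Float2` and `Float3` that use `alignas` to mimic the size and alignment of Metal's float2 (8 and 8) and float3 (16 and 16). They are not Metal's own definitions, and the names `RenderInfoOriginal`, `RenderInfoReordered` and `verifyRenderInfoLayouts` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side stand-ins that mimic Metal's size/alignment rules.
struct alignas(8)  Float2 { float x, y; };      // like float2: size 8, alignment 8
struct alignas(16) Float3 { float x, y, z; };   // like float3: 12 data bytes + 4 padding

static_assert(sizeof(Float2) == 8 && alignof(Float2) == 8, "float2 stand-in");
static_assert(sizeof(Float3) == 16 && alignof(Float3) == 16, "float3 stand-in");

// Member order from the original struct, with unpacked types.
struct RenderInfoOriginal {
    float zoom;
    float near;
    float far;
    Float2 winResolution;
    Float3 cameraRotation;
    Float3 cameraTranslation;
    bool useCamera;
};

// Member order suggested in the reply.
struct RenderInfoReordered {
    float near;
    float far;
    Float2 winResolution;
    Float3 cameraRotation;
    Float3 cameraTranslation;
    float zoom;
    bool useCamera;
};

// Checks the offsets listed in the reply against what the compiler produces.
inline void verifyRenderInfoLayouts() {
    assert(offsetof(RenderInfoOriginal, winResolution) == 16);     // 4 bytes of padding before it
    assert(offsetof(RenderInfoOriginal, cameraRotation) == 32);    // 8 bytes of padding before it
    assert(offsetof(RenderInfoOriginal, cameraTranslation) == 48);
    assert(offsetof(RenderInfoOriginal, useCamera) == 64);

    assert(offsetof(RenderInfoReordered, winResolution) == 8);
    assert(offsetof(RenderInfoReordered, cameraRotation) == 16);
    assert(offsetof(RenderInfoReordered, cameraTranslation) == 32);
    assert(offsetof(RenderInfoReordered, zoom) == 48);
    assert(offsetof(RenderInfoReordered, useCamera) == 52);
}

// Total sizes are rounded up to the largest member alignment (16).
static_assert(sizeof(RenderInfoOriginal) == 80, "original order");
static_assert(sizeof(RenderInfoReordered) == 64, "reordered, 16 bytes smaller");
```

Calling `verifyRenderInfoLayouts()` from any test or `main` runs the offset checks; the `static_assert` lines already fail the build if a size is off.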

Hope that clears it up a bit. If not, get any C manual (AFAIK classic K&R has nice explanation of this) and read up on struct layout/alignment.

Regards

Michal

Replies

A packed type guarantees that all of its floats sit next to each other in adjacent memory locations and that it only needs the alignment of a single float, so the result is always the one you would expect. A non-packed vector type has a stricter alignment, so when you store values one after another the compiler may insert padding (unused bytes) in front of it, and float3 also leaves unused space for an extra float. Packed types are therefore a safe bet when you fill a buffer by hand, because they never add padding between two stored locations. Performance-wise, it is better to do as few reads/writes as possible, so one 8-byte read is usually preferred to two 4-byte reads from memory.

It seems to me that the Metal compiler follows standard C structure layout rules, and therefore all one needs to know is each type's size and alignment. It would be easier if you had posted your exact structs, but my guess is that you're doing something like:


struct Example0 {
    float singleFloat;
    float2 doubleFloats;
};

vs

struct Example1 {
    float singleFloat;
    packed_float2 doubleFloats;
};


Now we have sizeof( float2 ) == sizeof( packed_float2 ) == 8 bytes. But alignment of float2 is 8 != alignment of packed_float2 which is 4. Therefore, Example0 in memory looks like this:

offset 0: singleFloat

offset 4: four padding bytes inserted by compiler because float2's alignment is 8

offset 8: first float of float2

offset 12: second float of float2

And sizeof( Example0 ) will be 16 bytes


Example1 is different:

offset 0: singleFloat

offset 4: first float of packed_float2 (because it is 4-aligned)

offset 8: second float of packed_float2

In this case, sizeof( Example1 ) will be 12 bytes
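The difference between the two examples can be reproduced on the host with a minimal C++ sketch. `Float2` and `PackedFloat2` are hypothetical stand-ins that only mimic the size and alignment of Metal's float2 and packed_float2; they are assumptions for illustration, not the real Metal types.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the two Metal types (same size, different alignment).
struct alignas(8) Float2 { float x, y; };   // like float2: size 8, alignment 8
struct PackedFloat2      { float x, y; };   // like packed_float2: size 8, alignment 4

struct Example0 {
    float singleFloat;
    Float2 doubleFloats;        // must start at a multiple of 8
};

struct Example1 {
    float singleFloat;
    PackedFloat2 doubleFloats;  // may start at any multiple of 4
};

static_assert(sizeof(Float2) == sizeof(PackedFloat2), "both are 8 bytes");
static_assert(alignof(Float2) == 8, "float2 stand-in is 8-aligned");
static_assert(alignof(PackedFloat2) == 4, "packed_float2 stand-in is 4-aligned");

// Returns how many padding bytes the compiler put between the two members.
inline std::size_t paddingInExample0() {
    return offsetof(Example0, doubleFloats) - sizeof(float);
}
inline std::size_t paddingInExample1() {
    return offsetof(Example1, doubleFloats) - sizeof(float);
}

static_assert(sizeof(Example0) == 16, "4 + 4 padding + 8");
static_assert(sizeof(Example1) == 12, "4 + 8, no padding");
```

If the host writes 12 tightly packed bytes but the shader struct is declared like `Example0`, the shader reads `doubleFloats` starting at offset 8 and so picks up the second float plus whatever follows it, which is the garbage described in the question.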


Hope that helps, post your exact struct layout as well as host language data access otherwise.

Regards

Michal

Hi Michal,


Thank you for your answer. My struct looks like this:


struct RenderInfo {
    float zoom;
    float near;
    float far;
    packed_float2 winResolution;
    packed_float3 cameraRotation;
    packed_float3 cameraTranslation;
    bool useCamera;
};


The code that creates the buffer for it and loads data into it is here:


    func createRenderInfoBuffer(device: MTLDevice) {
        // Sizes of the packed shader types.
        let floatSize = MemoryLayout<Float>.size
        let packedFloat2Size = floatSize * 2
        let packedFloat3Size = floatSize * 3
        let boolSize = MemoryLayout<Bool>.size

        var minBufferSize = floatSize * 3 // zoom, near, far
        minBufferSize += packedFloat2Size // winResolution
        minBufferSize += packedFloat3Size * 2 // cameraRotation, cameraTranslation
        minBufferSize += boolSize // useCamera
        // alignBufferSize is a helper that is not shown in this post.
        let bufferSize = alignBufferSize(bufferSize: minBufferSize, alignment: floatSize)

        renderInfoBuffer_ = device.makeBuffer(length: bufferSize, options: [])
    }

and


    func setRenderInfo(frameInfo: FrameInfo) {
        var renderInfo = RenderInfo(
                zoom: frameInfo.zoom,
                near: frameInfo.near,
                far: frameInfo.far,
                winResolution: frameInfo.viewDimensions,
                cameraRotation: frameInfo.cameraRotation,
                cameraTranslation: frameInfo.cameraTranslation,
                useCamera: frameInfo.useCamera)
        if let buffer = renderInfoBuffer_ {
            let pointer = buffer.contents()

            // Memory layout for the packed shader types:
            let floatSize = MemoryLayout<Float>.size
            let packedFloat2Size = floatSize * 2
            let packedFloat3Size = floatSize * 3
            let boolSize = MemoryLayout<Bool>.size

            // Copy each field back to back; packed types need no padding in between.
            memcpy(pointer, &renderInfo.zoom, floatSize)
            var offset = floatSize
            memcpy(pointer + offset, &renderInfo.near, floatSize)
            offset += floatSize
            memcpy(pointer + offset, &renderInfo.far, floatSize)
            offset += floatSize
            memcpy(pointer + offset, &renderInfo.winResolution, packedFloat2Size)
            offset += packedFloat2Size
            memcpy(pointer + offset, &renderInfo.cameraRotation, packedFloat3Size)
            offset += packedFloat3Size
            memcpy(pointer + offset, &renderInfo.cameraTranslation, packedFloat3Size)
            offset += packedFloat3Size
            memcpy(pointer + offset, &renderInfo.useCamera, boolSize)
        }
    }

The code as written works here since I'm using packed floats, but I wanted to try and get it to work without the packed variants.


Assuming I align everything to the 16-byte alignment of an unpacked float3, it sounds like you're suggesting I add padding between each memcpy, or does Metal already make the alignment adjustments? It's trivial to create a buffer with a size of 16 bytes * 7 and to always advance the pointer offset by 16 bytes on each memcpy, but that doesn't work. I've tried creating the buffer with a size of 16 bytes * 7 and advancing the offset by 16 bytes between each copy to leave enough padding between the types, and I've also tried just creating the buffer at that size without changing how the offset advances, but at each turn, not having the packed float variants produces garbage. I just wish I understood this better.


Thank you!

Thank you Michal! Now I understand. I was misunderstanding alignment: I assumed it contributed to padding after a type. I didn't understand that alignment means the byte offset at which a type starts, relative to the beginning of the struct, has to be a multiple of its alignment. Huge breakthrough for me. Thank you so much.
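The rule from this thread (each member starts at the next offset that is a multiple of its own alignment) can be sketched as a small C++ helper. This is only an illustration under stated assumptions: `Field`, `alignUp`, `computeOffsets` and the two `...RenderInfoOffsets` functions are hypothetical names, and the size/alignment pairs are the ones quoted in the thread (float 4/4, float2 8/8, float3 16/16, packed_float2 8/4, packed_float3 12/4, bool 1/1).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size and alignment of one struct member, in bytes.
struct Field { std::size_t size; std::size_t alignment; };

// Rounds offset up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Returns the offset of every field in declaration order,
// followed by the total struct size as the last element.
inline std::vector<std::size_t> computeOffsets(const std::vector<Field>& fields) {
    std::vector<std::size_t> result;
    std::size_t offset = 0;
    std::size_t maxAlignment = 1;
    for (const Field& field : fields) {
        offset = alignUp(offset, field.alignment);  // padding goes BEFORE the member
        result.push_back(offset);
        offset += field.size;
        if (field.alignment > maxAlignment) maxAlignment = field.alignment;
    }
    // The struct size is rounded up to the strictest member alignment.
    result.push_back(alignUp(offset, maxAlignment));
    return result;
}

// zoom, near, far, winResolution, cameraRotation, cameraTranslation, useCamera
inline std::vector<std::size_t> unpackedRenderInfoOffsets() {
    return computeOffsets({{4, 4}, {4, 4}, {4, 4}, {8, 8}, {16, 16}, {16, 16}, {1, 1}});
}

// Same member order, but with packed_float2 / packed_float3.
inline std::vector<std::size_t> packedRenderInfoOffsets() {
    return computeOffsets({{4, 4}, {4, 4}, {4, 4}, {8, 4}, {12, 4}, {12, 4}, {1, 1}});
}
```

With the unpacked types the destination offsets for each memcpy would be 0, 4, 8, 16, 32, 48 and 64 rather than a fixed 16-byte step, which is why advancing by 16 bytes per field did not match what the shader expected. With the packed types the offsets are 0, 4, 8, 12, 20, 32 and 44, the same positions the Swift code above writes to.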