newBufferWithBytesNoCopy:pointer is not 4096 byte aligned.

Hi,

I am trying to run metal on Xcode 7 beta, OSX 10.11 beta. I am getting this runtime error when I use a buffer to get results out of kernel. Here is the code.

int main(int argc, const char * argv[]) {
    NSError *errors;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> library = [device newDefaultLibrary];
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
    const unsigned length = 100000;
  
    float In1[length], In2[length], Out[length];
  
    for(unsigned i=0;i<length;i++){
        In1[i] = 1.0f;
        In2[i] = 1.0f;
    }
  
    id<MTLCommandBuffer> cb_add = [commandQueue commandBuffer];
    id<MTLComputeCommandEncoder> ce_add = [cb_add computeCommandEncoder];
    id<MTLFunction> func_add = [library newFunctionWithName:@"add" ];
    id<MTLComputePipelineState> pipe_add = [device newComputePipelineStateWithFunction:func_add  error:&errors];
    [ce_add setComputePipelineState:pipe_add];
    MTLSize threadsPerGroup = {100, 1, 1};
    MTLSize numThreadgroups = {1, 1, 1};
  
    id<MTLBuffer> In1d = [device newBufferWithBytes:In1 length:length*sizeof(float) options:0];
    id<MTLBuffer> In2d = [device newBufferWithBytes:In2 length:length*sizeof(float) options:0];
    id<MTLBuffer> Outd = [device newBufferWithBytesNoCopy:Out length:length*sizeof(float) options:0 deallocator:nil];
  
    [ce_add setBuffer:In1d offset:0 atIndex:0];
    [ce_add setBuffer:In2d offset:0 atIndex:1];
    [ce_add setBuffer:Outd offset:0 atIndex:2];
  
  
    [ce_add dispatchThreadgroups:numThreadgroups threadsPerThreadgroup:threadsPerGroup];
    [ce_add endEncoding];
    [cb_add commit];
  
    NSLog(@"%f",Out[8]);
  
  
    return 0;
}


The kernel is a simple add


kernel void add(const device float *In1 [[buffer(0)]],
                const device float *In2 [[buffer(1)]],
                device float *Out [[buffer(2)]],
                uint tid [[thread_position_in_grid]]){
    Out[tid] = In1[tid] + In2[tid];
}

Accepted Reply

BTW, I've been looking at the docs a bit and I think that valloc() is probably the way to go instead of posix_memalign(). If I undestood it correctly, posix_memalign() might be a bit dumb in the way it ensures that the pointers are aligned so it can waste memory.

Replies

I'd guess you should make sure that your arrays are page-aligned.

Hi,

How should I do that?

Thank you! 🙂

With posix_memalign...

If you want too keep it easy and dirty (no ehat allocation), you can use the __attribute__((aligned(4096))) compiler attribute. But then I would declare the variables globally:


float In1[length] __attribute__((aligned(4096)));


etc.

Hi, I am getting the same runtime error

Seems like one cannot use malloc and its derivatives to do host memory allocation for using newBufferWithBytesNoCopy. vm_allocate or mmap should be used. But, if I use newBufferWithLength how do I do memory copy back to CPU?


Also, I have used MTLResourceOptions: MTLResourceStorageModeShared. It does not seem to have working properly.

MTLResourceStorageModeShared works for me... And I use standard malloc with managed buffer.

I guess (I hope) that Metal copies the data you sent via newBufferWithBytes into its own memory representation.

With a managed buffer you have to synchronize your buffer CPU copy with blit synchronizeResource (see MTLBlitCommandEncoder).

With a shared buffer, a simple waitUntilCompleted allows synchronization (see MTLCommandBuffer).


)Everythign works just fine when you read the documentation carefully. If you want to use newBufferWithBytesNoCopy, your allocated buffer storage needs to be perfeclty page-aligned — both the start and the end. So specifying alignment of the start (with __attribute__((aligned(4096))) is not enough, you also need to allocate a multiply of 4K bytes. A quick and dirty code using statically allocated arrays:


const unsigned length = 100000;
const unsigned length_pagealigned = (length/4096 +1)*4096;
float In1[length_pagealigned] __attribute__((aligned(4096)));
float In2[length_pagealigned] __attribute__((aligned(4096)));
float Out[length_pagealigned] __attribute__((aligned(4096)));
int main(int argc, const char * argv[]) {
  
    NSError *errors;
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLLibrary> library = [device newDefaultLibrary];
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
  
    for(unsigned i=0;i<length;i++){
        In1[i] = 1.0f;
        In2[i] = 1.0f;
    }
  
    id<MTLCommandBuffer> cb_add = [commandQueue commandBuffer];
    id<MTLComputeCommandEncoder> ce_add = [cb_add computeCommandEncoder];
    id<MTLFunction> func_add = [library newFunctionWithName:@"add" ];
    id<MTLComputePipelineState> pipe_add = [device newComputePipelineStateWithFunction:func_add  error:&errors];
    [ce_add setComputePipelineState:pipe_add];
    MTLSize threadsPerGroup = {100, 1, 1};
    MTLSize numThreadgroups = {1, 1, 1};
  
    id<MTLBuffer> In1d = [device newBufferWithBytes:In1 length:sizeof(In1) options:0];
    id<MTLBuffer> In2d = [device newBufferWithBytes:In2 length:sizeof(In2) options:0];
    id<MTLBuffer> Outd = [device newBufferWithBytesNoCopy:&Out[0] length:sizeof(Out) options:0 deallocator:nil];
  
    [ce_add setBuffer:In1d offset:0 atIndex:0];
    [ce_add setBuffer:In2d offset:0 atIndex:1];
    [ce_add setBuffer:Outd offset:0 atIndex:2];
  
  
    [ce_add dispatchThreadgroups:numThreadgroups threadsPerThreadgroup:threadsPerGroup];
    [ce_add endEncoding];
    [cb_add commit];
  
    [cb_add waitUntilCompleted];
  
    NSLog(@"%f",Out[8]);
  
  
    return 0;
}


Note that I also added [cb_add waitUntilCompleted], without it you won't see any changes to the buffer.


In a real application, you will probably want to use one of the malloc* functions that gives you page-allocated buffers.

Hi,

Thank you very much! 🙂


The code you have posted fails for larger memories. But, good enough to make a point! 🙂

const unsigned length = 100000;
const unsigned length_pagealigned = (length/4096 +1)*4096;
float In1[length_pagealigned] __attribute__((aligned(4096)));
float In2[length_pagealigned] __attribute__((aligned(4096)));
float Out[length_pagealigned] __attribute__((aligned(4096)));

Thank you @jcookie!


I have tried this and it's working.!

    const unsigned length = 4096*2048;
    size_t size = sizeof(float)*length;
    float *In1, *In2, *Out;

    posix_memalign((void**)&Out, 4096, size);   
    In1 = (float*)malloc(size);
    In2 = (float*)malloc(size);

Thank you @iPerKard


Do you know why the 4KB constraint on shared memory? It forces usage of more memory than needed.

Hi,

Sure. Thank you! 🙂


What if I allocate with newBufferWithLength (Out buffer), do the computation and transfer the data back?

Thanks! 🙂

To enable efficient synrhonization between different physical copies, the region needs to be page-aligned. A page is the minimal unit of memory under the management of the kernel, and its also the unit of memory that the hardware operates on. You might want to read about virtual memory and how OS X memory management works. The memory is allocated in 4KB blocks anyway, so you are not realy using more memory then needed, and even if you are — the difference is basically zero. BTW, if I am not mistaken, malloc() on OS X will always return a page-aligned block when you ask for more then few KB, but I am not 100% sure. I think I saw it somewhere in the docs.

This is the first time I am looking at the metal forum. I have not considered the above suggestions yet. The following is what I am currently using. Note that I am setting up an input buffer rather than an output buffer that you asked about. This might be useful to others.


Note that I think you need to free the memory allocated by posix_memalign.


My abbreviated code is in a class:


public class MetalPrecomputedRewardsManager {
  
    var precomputedRewards:[Int8]


    var precomputedRewardsByteSize:Int = 0
    var precomputedRewardsVoidPtr:UnsafeMutablePointer<Void> = nil
    var precomputedRewardsInt8Ptr: UnsafeMutablePointer<Int8>
    var precomputedRewardsInt8BufferPtr:UnsafeMutableBufferPointer<Int8>
  
    deinit {
        free(precomputedRewardsVoidPtr)
    }
  
    public init() {
        let documentsDirectory = NSSearchPathForDirectoriesInDomains(.DocumentDirectory, .UserDomainMask, true)[0] as! NSString
        let dataPath = documentsDirectory.stringByAppendingPathComponent("...whatever...")
        if let precomputedRewardData = NSData(contentsOfFile: dataPath) {
            let count = precomputedRewardData.length / sizeof(Int8)
            precomputedRewards = [Int8](count: count, repeatedValue: 0)
            precomputedRewardData.getBytes(&precomputedRewards, length:count * sizeof(Int8))
        }
        else {
            let count = 0x4000
            precomputedRewards = [Int8](count: count, repeatedValue: 0)
        }
      
        // wch: testing creating shared memory
        // wch: note must free this memory eventually !!!!!!!
        let alignment16K = 0x4000 // 16K aligned
        let precomputedRewardsCount = ((precomputedRewards.count + alignment16K - 1) / alignment16K) * alignment16K // wch: note need to align length also
        self.precomputedRewardsByteSize = precomputedRewardsCount * sizeof(Int8)
        let error = posix_memalign(&self.precomputedRewardsVoidPtr, alignment16K, precomputedRewardsByteSize)
        if error != 0 {
            println("posix_memalign error=\(error)")
        }
      
        // wch: testing accessing
        var precomputedRewardsVoidPtr = COpaquePointer(self.precomputedRewardsVoidPtr)
        self.precomputedRewardsInt8Ptr = UnsafeMutablePointer<Int8>(precomputedRewardsVoidPtr)
        self.precomputedRewardsInt8BufferPtr = UnsafeMutableBufferPointer(start: self.precomputedRewardsInt8Ptr, count: precomputedRewardsCount)
        // wch: fill precomputedRewards with data
        for index in self.precomputedRewardsInt8BufferPtr.startIndex ..< self.precomputedRewardsInt8BufferPtr.endIndex {
            if index >= precomputedRewards.count {
                break
            }
            self.precomputedRewardsInt8BufferPtr[index] = precomputedRewards[index]
        }
    }
}

BTW, I've been looking at the docs a bit and I think that valloc() is probably the way to go instead of posix_memalign(). If I undestood it correctly, posix_memalign() might be a bit dumb in the way it ensures that the pointers are aligned so it can waste memory.

Hi,

I am looking for Obj-C + Metal.

Thank you! 🙂