Why is my sorting shader for Metal on my iOS device angry at me?

I hope someone here can give me some insight, because I am at my wits' end.

I have been trying to learn Metal for the past couple of months. In the process, I came across some examples and articles on sorting networks and decided to try implementing them in Metal.

Now the problem is, if I run the code on my Mac, everything is fine. But if I run the same code on my iDevice (iPad Pro with LiDAR), I get all sorts of errors I do not understand, or the sorted data comes back corrupted and all wrong.

Typical Error
Code Block
2021-02-17 12:13:11.218394-0500 METAL_ISSUE[97650:6709092] [GPUDebug] Invalid device load executing kernel function "bitonicSort" encoder: "0", dispatch: 0, at offset 384
file:///Users/staque/Development/OTHER/METAL_ISSUE/METAL_ISSUE/Shaders.metal:77:40 - bitonicSort()
<MTLBufferArgument: 0x28006d200>
        Name = floats 
        Type = MTLArgumentTypeBuffer 
        Access = MTLArgumentAccessReadWrite 
        LocationIndex = 0 
        IsActive = 1 
        ArrayLength = 1 
        TypeInfo = 
            DataType = MTLDataTypePointer 
            ElementType = MTLDataTypeFloat  
            Access = MTLArgumentAccessReadWrite 
            Alignment = 4 
            DataSize = 4
        Alignment = 4 
        DataSize = 4 
        DataType = MTLDataTypeFloat
buffer: "<unknown>"


You can pretty much drop these into the default Xcode Metal Game template app.


Shader (slightly modified to track the indexes of the floats).
Code Block
/*
 Based on code from https://github.com/tgymnich/MetalSort
 Rewritten to make it more understandable.
 */
kernel void bitonicSort(device float *floats [[ buffer(0) ]],
                        device int *uInts [[ buffer(1) ]],
                        constant int &p [[ buffer(2) ]],
                        constant int &q [[ buffer(3) ]],
                        uint gid [[ thread_position_in_grid ]])
{
    int pMinusQ = p - q;
    int distance = 1 << pMinusQ;
    uint gidShiftedByP = gid >> p;
    // True: Increasing / False: Decreasing
    bool direction = (gidShiftedByP & 2) == 0;
    uint gidDistance = (gid & distance);
    bool isGidDistanceZero = (gidDistance == 0);
    uint gidPlusDistance = (gid | distance);
    bool isLowerIndexGreaterThanHigher = (floats[gid] > floats[gidPlusDistance]);
    if (isGidDistanceZero && isLowerIndexGreaterThanHigher == direction) {
        float temp = floats[gid];
        floats[gid] = floats[gidPlusDistance];
        floats[gidPlusDistance] = temp;
        int temp2 = uInts[gid];
        uInts[gid] = uInts[gidPlusDistance];
        uInts[gidPlusDistance] = temp2;
    }
}
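For anyone trying to follow the network itself, here is a plain-Python sketch of the same (p, q) passes, with the inner loop standing in for the GPU threads. It mirrors the kernel's compare-and-swap logic on the CPU; the function and variable names are mine, not from the shader:

```python
def bitonic_pass(floats, idxs, p, q):
    """One (p, q) pass; each loop iteration plays the role of one GPU thread."""
    distance = 1 << (p - q)
    for gid in range(len(floats)):
        # Bit p+1 of gid picks the direction of this run: True = ascending.
        ascending = ((gid >> p) & 2) == 0
        partner = gid | distance
        # Only the lower index of each pair does the compare-and-swap.
        if (gid & distance) == 0 and (floats[gid] > floats[partner]) == ascending:
            floats[gid], floats[partner] = floats[partner], floats[gid]
            idxs[gid], idxs[partner] = idxs[partner], idxs[gid]

def bitonic_sort(floats, idxs):
    """Run the full pass schedule; len(floats) must be a power of two."""
    logn = len(floats).bit_length() - 1
    for p in range(logn):
        for q in range(p + 1):
            bitonic_pass(floats, idxs, p, q)
```

Within a single pass, every pair of indices is disjoint, which is why the GPU can run one thread per element with no synchronization inside a pass, as long as exactly one thread exists per element.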


The call.
Code Block
func runSort() {
    let device = MTLCreateSystemDefaultDevice()!
    let commandQueue = device.makeCommandQueue()!
    let library = device.makeDefaultLibrary()!
    let sortFunction = library.makeFunction(name: "bitonicSort")!
    let pipeline = try! device.makeComputePipelineState(function: sortFunction)
    let setRange = 0..<1024
    var floatData = [Float]()
    var uintData = [UInt32]()
    // Build the Float and index data backward to form the worst-case scenario for sorting.
    for value in stride(from: Float(setRange.upperBound-1), to: Float(setRange.lowerBound-1), by: -1.0) {
        floatData.append(value)
    }
    for value in stride(from: setRange.upperBound-1, to: setRange.lowerBound-1, by: -1) {
        uintData.append(UInt32(value))
    }
    print(floatData)
    print("")
    print(uintData)
    guard let logn = Int(exactly: log2(Double(floatData.count))) else {
        fatalError("data.count is not a power of 2")
    }
    for p in 0..<logn {
        for q in 0..<p+1 {
            let floatDataBuffer = device.makeBuffer(bytes: &floatData,
                                                    length: MemoryLayout<Float>.stride * floatData.count,
                                                    options: [.storageModeShared])!
            floatDataBuffer.label = "floatDataBuffer"
            let uintDataBuffer = device.makeBuffer(bytes: &uintData,
                                                   length: MemoryLayout<UInt32>.stride * uintData.count,
                                                   options: [.storageModeShared])!
            uintDataBuffer.label = "uintDataBuffer"
            let threadgroupsPerGrid = MTLSize(width: floatData.count, height: 1, depth: 1)
            let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)
            // Use 32-bit values to match the shader's `constant int` arguments.
            var n1 = Int32(p)
            var n2 = Int32(q)
            let commandBuffer = commandQueue.makeCommandBuffer()!
            let encoder = commandBuffer.makeComputeCommandEncoder()!
            encoder.setComputePipelineState(pipeline)
            encoder.setBuffer(floatDataBuffer, offset: 0, index: 0)
            encoder.setBuffer(uintDataBuffer, offset: 0, index: 1)
            encoder.setBytes(&n1, length: MemoryLayout<Int32>.stride, index: 2)
            encoder.setBytes(&n2, length: MemoryLayout<Int32>.stride, index: 3)
            encoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
            encoder.endEncoding()
            commandBuffer.commit()
            commandBuffer.waitUntilCompleted()
            let dataPointer = floatDataBuffer.contents().assumingMemoryBound(to: Float.self)
            let dataBufferPointer = UnsafeMutableBufferPointer(start: dataPointer, count: floatData.count)
            floatData = Array(dataBufferPointer)
            let dataPointer2 = uintDataBuffer.contents().assumingMemoryBound(to: UInt32.self)
            let dataBufferPointer2 = UnsafeMutableBufferPointer(start: dataPointer2, count: uintData.count)
            uintData = Array(dataBufferPointer2)
        }
    }
    print(floatData)
    print("")
    print(uintData)
}


If anyone has a clue what I should be doing, I am all ears, because I need help.

Thanks in advance.

Stan



Stan,

This is a lot of code to wade through.

Can you summarize (maybe with some pseudo code) how you expect this algorithm to work and how your implementation differs from what you started with?
Hi Stan

Did you find a solution? I am having the same problem, though with a fragment shader. The strange thing is that it only happens on one of my three iOS devices, the newest one.
The code stops with the failure, but if I continue execution the app runs fine, though the "Invalid device load executing fragment function" error comes up every frame. I cannot seem to debug my way through it.

Thanks in advance,
Michael
Hey Michael,
  1. Thanks for posting. I was going nuts thinking I was the only one this was happening to.

  2. My official answer is 'No and Yes', I will explain.

No: Because I still do not understand what is going wrong. The problem always seems to manifest as the size of the data I send into the function grows: 16 or 64 inputs are fine, but anything larger is not. The only common thread is that the problem has something to do with how the compiler plans the buffer data alignment; as to what exactly, I am not sure. Whenever the problem happens, I can rewrite a couple of lines of code, later revert those lines back to the original, and everything will be fine, for no rhyme or reason.

However,

Yes: I can suggest a few things you can try, which might resolve the problem. But if you look for cause-and-effect reasoning for why they work, it might drive you more insane than the original problem.

(Prior to the suggestions.) Implement the GPU debugging from the recent WWDC sessions. This will probably not give you the answer, but it may help a little in pointing to the cause of the issue.

Suggestions:

First: Rename the shader function. It can be something as small as changing 'trivialFunctionName()' to 'trivialFunctionName_a()', but it needs to be changed so the shader compiler is forced to rebuild the function.

Second: Clean the build, even going so far as to manually delete the derived data folder.

Third: (This seems to be significant.) Completely quit Xcode and restart it before you rebuild. (Do this especially if Xcode has been open a long time.)

Fourth: Once Xcode is reopened, rebuild the project while you still have the 'trivialFunctionName_a()' version of the shader name.

Fifth: Once it is built, you can rename the function back to its original name and rebuild the project again (maybe even clean the build again before you do).

Optionally: While you are doing all that, comment out everything in the shader function down to the bare minimum, and build all the buffers (the makeBuffer() calls and the lines copying data into the buffers) on the Swift side, close to where you encode the GPU command, just so you can make sure everything matches up.
 
If you do all that and it starts working with the bare minimum, you can begin adding everything back in.

The closest I can figure is that the problem lies somewhere in how one of the compilers (Swift's or Metal's, probably Metal's) handles reserving the placement of the function in memory, especially if Xcode has been open a long time. You have to force the compiler to rebuild the function to get it to align everything correctly. (I admit this is really a wild guess, but it is the only thing that describes the problem from my CS knowledge.)

If you are lucky, everything will work, but I will admit this issue will drive you nuts. You can write the EXACT same code in a brand-new project, before you do any of the suggestions, and everything will work fine. Or you can revert your original code back to what you had in the first place (after you take all the steps), and everything will be fine. But the whole time, Xcode (even with the GPU debugging) will be no help. It will probably report an '(IOAF code 11)' error and that is it.

I wish I could give a more definite answer, but I hope my suggestions can get you past the problem. Those steps are the only things that have worked for me.

Stan
THANK YOU Stan !!

That was really helpful, to the degree that I actually found and fixed my problem. It turned out that I SOMETIMES had an index out of bounds, but only on certain devices and at certain resolutions, which I used your advice to track down.

Cheers
Michael


Today, I ran into the same error: "Invalid device load executing vertex function". I'm using a GPU kernel to encode draw calls.

After reading this thread, the solution turned out to be to run the app under a different scheme.

So, yes, as Stan has mentioned earlier, this error might be related to shader compilation.

Actually, throughout the day I saw Xcode recompiling parts of the codebase even though I hadn't touched any code between runs, something I've never seen it do.

I do have to mention that the Xcode schemes are managed by AppCode (JetBrains) so there might be some incompatibility here. Having said that, the Xcode/AppCode combination has always worked fine for me.

I think you are dispatching too many threads, which causes the invalid read of the input float buffer.

let threadgroupsPerGrid = MTLSize(width: floatData.count, height: 1, depth: 1)
let threadsPerThreadgroup = MTLSize(width: pipeline.threadExecutionWidth, height: 1, depth: 1)

floatData.count seems to be 1024, and pipeline.threadExecutionWidth is probably 32, so the grid size is 1024 * 32 = 32768 threads, and gid runs from 0 to 32767 in the shader, but you only have 1024 items in the input array. This should cause both invalid reads and writes. On macOS, invalid reads come back as 0 and invalid writes are silently dropped. On iOS, they are not silently ignored.

I suggest you use the dispatchThreads method to dispatch floatData.count threads, with some reasonable threadgroup size, e.g. 512x1x1.
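To make the arithmetic above concrete, here is a small sketch of the over-dispatch (the threadExecutionWidth of 32 is an assumption; the real value depends on the device):

```python
element_count = 1024            # floatData.count in the question
thread_execution_width = 32     # assumed pipeline.threadExecutionWidth

# dispatchThreadgroups launches (threadgroups * threadsPerThreadgroup) threads.
total_threads = element_count * thread_execution_width
print(total_threads)            # threads actually launched per pass

# Every thread with gid >= element_count reads and writes past the buffer.
out_of_bounds = total_threads - element_count
print(out_of_bounds)            # threads touching invalid memory
```

With dispatchThreads you would instead pass the total grid size (1024 x 1 x 1) directly and let Metal split it into threadgroups; alternatively, an early-return bounds check in the kernel (with the element count passed in as another constant) makes over-dispatch harmless.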
