GPU Hang Error

Hi all,


I'm using Metal's compute command encoders for the calculations in my neural network, and I'm getting the following error:


Error Domain=MTLCommandBufferErrorDomain Code=2 "Caused GPU Hang Error (IOAF code 3)" UserInfo={NSLocalizedDescription=Caused GPU Hang Error (IOAF code 3)}


It happens when I try to pass "big" textures to my convolution layers ("big" means 1024x768 on iPhone 6/6 Plus and 1920x1080 on iPhone 6S/6S Plus). As far as I know, these sizes aren't actually too big for the mentioned devices. Does anyone know where I can look for a solution?

Answered by wcm in 195240022

Calling waitUntilCompleted after every commit is discouraged, since it unnecessarily serializes the operation of the GPU and CPU.

Does this same error occur when processing smaller images? If so, there may be a genuine hang in your code (perhaps caused by an infinite loop). Otherwise, you may be exceeding the maximum time allotted for command buffer execution and need to split your work into multiple (perhaps overlapping) grids if possible. Do you see this error immediately after you dispatch the work, or is there a delay of several seconds before it's reported?
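To make the "multiple (perhaps overlapping) grids" suggestion concrete, here is a minimal sketch in plain C (it also compiles as Objective-C) of how a large texture could be cut into overlapping tiles, each of which could then be encoded as its own, shorter dispatch. The names Tile and make_tiles are hypothetical, and the halo parameter assumes a convolution that needs a border of kernel-radius pixels around each tile; none of this comes from the original poster's code.

```c
#include <assert.h>
#include <stddef.h>

/* One rectangular piece of the full image, in pixels (hypothetical type). */
typedef struct {
    int x, y;          /* top-left corner of the region to read */
    int width, height; /* size of the region to read, halo included */
} Tile;

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * Splits an image_w x image_h image into tiles whose interior is at most
 * tile_w x tile_h, each extended by `halo` pixels on every side and clamped
 * to the image bounds. Writes at most `capacity` tiles into `out` and
 * returns the total number of tiles needed.
 */
size_t make_tiles(int image_w, int image_h, int tile_w, int tile_h, int halo,
                  Tile *out, size_t capacity)
{
    assert(image_w > 0 && image_h > 0 && tile_w > 0 && tile_h > 0 && halo >= 0);
    size_t count = 0;
    for (int y = 0; y < image_h; y += tile_h) {
        for (int x = 0; x < image_w; x += tile_w) {
            int x0 = max_int(x - halo, 0);
            int y0 = max_int(y - halo, 0);
            int x1 = min_int(x + tile_w + halo, image_w);
            int y1 = min_int(y + tile_h + halo, image_h);
            if (count < capacity) {
                out[count] = (Tile){ x0, y0, x1 - x0, y1 - y0 };
            }
            count++;
        }
    }
    return count;
}
```

For a 1920x1080 input with 512x512 tiles, this yields a 4x3 grid of tiles. Whether smaller dispatches actually avoid the hang depends on how long each one runs on the device, so the tile size is something to tune rather than a known-good value.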

No, the error occurs only when I try to process big textures. I am already committing and waiting for command buffer completion after every single command encoder, and I see the error after several commits, with a delay of several seconds. If there is a maximum time allotted for command buffer execution, I guess you're right and I need to split the work into multiple grids. Thanks a lot for the quick response.

I made some changes in my encode chain, but now I have the same error for smaller textures as well.

Here is a "working" code example: it works for small textures but throws the GPU hang error for big ones.


/* setup first layer */
[commandEncoder endEncoding];
commandEncoder = [commandBuffer computeCommandEncoder];

/* setup second layer */
[commandEncoder endEncoding];
commandEncoder = [commandBuffer computeCommandEncoder];

/* setup third layer */
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
commandBuffer = [self.commandQueue commandBuffer];
commandEncoder = [commandBuffer computeCommandEncoder];

// etc


As you can see, I commit after every ~3 layers of my neural network. This works perfectly for small textures (small means textures with <= 1600x990 resolution on iPhone 6S Plus and 1920x1080 on iPhone 7/7 Plus). But when I try to commit the command buffer after every layer, like this:


/* setup first layer */
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
commandBuffer = [self.commandQueue commandBuffer];
commandEncoder = [commandBuffer computeCommandEncoder];

/* setup second layer */
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
commandBuffer = [self.commandQueue commandBuffer];
commandEncoder = [commandBuffer computeCommandEncoder];

/* setup third layer */
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
commandBuffer = [self.commandQueue commandBuffer];
commandEncoder = [commandBuffer computeCommandEncoder];

// etc


I get the GPU hang error even for a 512x384 texture, and I can't understand what causes it.


P.S. In the second example, I usually get the error after the 2nd or 3rd commit call.

Accepted Answer

Calling waitUntilCompleted after every commit is discouraged, since it unnecessarily serializes the operation of the GPU and CPU.

I don't know how this answer solves the problem. Waiting for the CPU shouldn't crash the GPU.

https://forums.developer.apple.com/thread/77728


Can you give me some help? Thank you!

I started getting this error in an app. I eventually tracked it down to indexing beyond the end of a buffer passed into a fragment shader, due to an off-by-one error.
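For anyone hunting a similar bug, here is a rough CPU-side sketch in plain C of the kind of bounds guard that prevents an out-of-range buffer access. It is only an analogue under stated assumptions: it models a launch padded up to a whole number of groups, not the fragment shader mentioned above, and group_count and scale_all are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Number of groups needed to cover `count` elements (rounds up). */
size_t group_count(size_t count, size_t group_size)
{
    assert(group_size > 0);
    return (count + group_size - 1) / group_size;
}

/*
 * CPU stand-in for a kernel launched over a padded grid: it runs
 * groups * group_size "threads" but touches only indices below `count`.
 * Returns the number of elements written.
 */
size_t scale_all(float *data, size_t count, size_t group_size, float factor)
{
    size_t threads = group_count(count, group_size) * group_size;
    size_t written = 0;
    for (size_t gid = 0; gid < threads; gid++) {
        if (gid >= count) {
            continue; /* guard: no element for this thread; using > here would be the off-by-one */
        }
        data[gid] *= factor;
        written++;
    }
    return written;
}
```

With 10 elements and a group size of 4, 12 "threads" run but only indices 0 through 9 are written. On the GPU, the same comparison would sit at the top of the shader, with the element count passed in as a parameter.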