Metal graphics start to glitch badly then disappear altogether (only on OS X)

I've got a Metal app which runs on both iOS and OS X. The OS X version starts glitching out -- sometimes almost immediately, sometimes after a few minutes. It flickers parts of the screen black, then eventually the whole screen is flickering black, and then it all goes black.


I don't get any errors on the console -- does this sound familiar? How would I go about debugging this? If I press the "Capture GPU Frame" button, the black disappears and it seems to capture a normal, correct frame. Sometimes after I do this I see this on the console:


Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code 1)


Any suggestions?


Thanks!

Bob

Replies

Nobody? This has been very frustrating -- there's clearly a problem but normally no error is even produced. I have no idea what's causing it. Can't find anything online about it.

These sorts of problems are indeed frustrating. There are a number of potential causes. Inconsistent behaviour could be the result of undefined behaviour such as reading uninitialized memory, reading outside the bounds of an array, or depending on specific multithreaded timing (and not locking or synchronizing between CPU threads when you should be). For some of these errors you can enable the options under "Diagnostics" in your Xcode scheme. For instance, "Malloc Guard Edges" will help you determine whether you're accessing memory out of bounds (and crash immediately so you can see the stack), and "Malloc Scribble" will write values to all allocated memory (so if you see consistent results with that enabled, you're probably reading uninitialized memory).


There are fewer tools for catching undefined behaviour in Metal or in shaders, but here are some things to look out for:

The IOAF error you're seeing can happen if you're reading or writing outside the bounds of an MTLBuffer. So make sure that you're performing any pointer arithmetic or indexing into the buffer correctly.
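
As a rough illustration (the element type, count, and storage mode below are placeholders, not from your app), sizing the buffer from the element stride and keeping every index below the count is the sort of thing to double-check:

import Metal

// Hedged sketch: size the buffer from the element stride, and never index past vertexCount.
func makeVertexBuffer(device: MTLDevice, vertexCount: Int) -> MTLBuffer? {
    let length = MemoryLayout<SIMD4<Float>>.stride * vertexCount
    guard let buffer = device.makeBuffer(length: length, options: .storageModeShared) else {
        return nil
    }
    let vertices = buffer.contents().bindMemory(to: SIMD4<Float>.self, capacity: vertexCount)
    for i in 0..<vertexCount {    // indices stay in 0..<vertexCount, never past the end
        vertices[i] = SIMD4<Float>(0, 0, 0, 1)
    }
    return buffer
}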


Another common mistake is improper CPU and GPU synchronization. For instance, writing to a buffer from the CPU before the GPU has read the data you expected it to read. For this you can call waitUntilCompleted on every command buffer, which will cause the CPU to wait until the GPU is finished with a submitted buffer. (This is not something you should add in shipping code, but it could help you determine what may be wrong.)
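
For what it's worth, a debugging-only sketch of that looks something like this (the submitAndWait name and the error print are just my illustration, not something your code needs to match):

import Metal

// Debugging-only sketch: block the CPU after every submission so writes can't
// race ahead of GPU reads, and surface any GPU-side error immediately.
func submitAndWait(queue: MTLCommandQueue) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }
    // ... encode your render or compute passes here ...
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()    // CPU waits for the GPU to finish this buffer
    if let error = commandBuffer.error {
        print("Command buffer failed: \(error)")
    }
}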

Thanks for the reply -- yeah it is kind of driving me crazy. I had tried the "waitUntilCompleted" trick, but it didn't change anything. I also turned on all of those Malloc Guard/Scribble/Zombie settings in my scheme and didn't see any change.


It's so weird -- usually the first time I run the program in the day, it goes for several minutes before glitching out. But then once it has "warmed up", it glitches almost immediately. It seems like some kind of interference between my various shaders, maybe. For example, I have a simple user interface and then I can draw lines into the "canvas". Only when I draw lines does it glitch. However, if I leave the interface out and ONLY draw lines, no glitches.


I tried recording with Instruments using the "Metal Trace" settings, and I've stared at the GPU frame capture, but I don't really know what I am doing. Nothing jumps out at me, except that after the program has glitched out and started showing those errors, the CPU usage spikes in the Instruments trace.


I've just moved on to other things for now, but it makes development difficult only being able to run the program for a few seconds at a time before it hangs. So any other suggestions you have would be greatly appreciated!


EDIT: Actually what I say above about the line drawing interfering with the UI is incorrect. If I wait, it eventually starts glitching with the UI by itself and no lines drawn. It does seem like it happens more quickly the more elements are onscreen, but that's about all I can deduce from the situation so far.

A question -- I've seen explanations of using triple buffering for MTL buffers. Am I correct in thinking that is only necessary/useful when the buffers are changing? If I have a buffer which never changes, there's no need to have three versions of it around, is there? Just trying to scour my program to try to find problems.

That's correct. If you're not changing data in a buffer, you only need one buffer. However, if you're doing any sort of animation like changing the positions of polys or updating a matrix, you'd probably need to triple buffer and ensure you're properly synchronizing CPU writes to the buffers.
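
The usual pattern for that synchronization is a semaphore that keeps the CPU at most a few frames ahead of the GPU. A rough sketch (the names, the frame count of 3, and the uniformBuffers array are illustrative, not from your code):

import Dispatch
import Metal

// Hedged sketch of triple buffering: a semaphore keeps the CPU at most
// maxFramesInFlight frames ahead, and each frame writes into its own buffer.
let maxFramesInFlight = 3
let frameSemaphore = DispatchSemaphore(value: maxFramesInFlight)
var frameIndex = 0

func drawFrame(queue: MTLCommandQueue, uniformBuffers: [MTLBuffer]) {
    frameSemaphore.wait()                       // wait until one of the buffers is free
    let uniforms = uniformBuffers[frameIndex]
    // ... write this frame's data into uniforms.contents() ...

    guard let commandBuffer = queue.makeCommandBuffer() else { return }
    // ... encode draws that read `uniforms` ...
    commandBuffer.addCompletedHandler { _ in
        frameSemaphore.signal()                 // the GPU has finished reading this buffer
    }
    commandBuffer.commit()
    frameIndex = (frameIndex + 1) % maxFramesInFlight
}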

Ok so that isn't it.


What about MTLVertexDescriptor? Is it necessary to specify one with each pipeline state? The tutorial I followed when I was learning Metal did not mention or use MTLVertexDescriptor, so my code doesn't either. I don't see much documentation online on how to use it -- for example, how to set up the layouts, attributes, step functions, etc.

This glitch problem is killing my soul. With no errors and no way to determine where the glitch is coming from I feel like I will not be able to resolve this. When I first wake up the machine and try the program, often it takes ten minutes or more for the glitch to first appear. But once it has started appearing, it happens almost immediately when I launch the program. What kind of bug does that? It seems to depend on the heat of the machine or something.


I screen-recorded a few of these in case that would help anyone diagnose it. I've got two rows of little beveled boxes for a palette at the bottom of the screen, and then I start drawing lines. After a few seconds it glitches out and becomes unresponsive. Although like I say, sometimes, usually when I first start the machine, I can draw for ten minutes, hundreds of lines or more without it glitching. But once it has started to do it, it usually happens quickly.


It seems to be exacerbated by the combination of the line rendering and the palette rendering. I've removed all the other screen elements. If I only draw lines, then it does not seem to happen. If I only draw the palette, it does happen but only after a long time, like if it has been sitting there for 20 minutes or so. Also it may look like changing colors has something to do with it, but it doesn't. It starts glitching whether or not I do that.


http://www.flatblackfilms.com/Glitch.mov

http://www.flatblackfilms.com/Glitch2.mov

http://www.flatblackfilms.com/Glitch3.mov

You do not need to use a vertex descriptor. However, using one does offer some performance advantages and makes it easier to programmatically define your vertex layout.


There are samples using vertex descriptors, such as LOD with Function Specialization, but they all use MetalKit to build a vertex descriptor from a Model I/O vertex descriptor (which is very similar to a Metal vertex descriptor, so you may be able to figure out how it works anyway).
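
To give a rough idea of building one by hand (the formats, offsets, and buffer index below describe a hypothetical position + texture coordinate layout, not your actual data):

import Metal

// Hedged sketch: a vertex made of a float4 position followed by a float2
// texture coordinate, interleaved in vertex buffer 0.
let vertexDescriptor = MTLVertexDescriptor()

// Attribute 0: position
vertexDescriptor.attributes[0].format = .float4
vertexDescriptor.attributes[0].offset = 0
vertexDescriptor.attributes[0].bufferIndex = 0

// Attribute 1: texture coordinate, right after the position
vertexDescriptor.attributes[1].format = .float2
vertexDescriptor.attributes[1].offset = MemoryLayout<Float>.stride * 4
vertexDescriptor.attributes[1].bufferIndex = 0

// Layout of buffer 0: advance one vertex (6 floats) per step
vertexDescriptor.layouts[0].stride = MemoryLayout<Float>.stride * 6
vertexDescriptor.layouts[0].stepRate = 1
vertexDescriptor.layouts[0].stepFunction = .perVertex

// Attach it to the pipeline descriptor before creating the pipeline state
let pipelineDescriptor = MTLRenderPipelineDescriptor()
pipelineDescriptor.vertexDescriptor = vertexDescriptor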

This is a hunch, but it looks like you might have a zombie object causing havoc. I would turn on the Address Sanitizer and zombie object detection.


It looks like bugs I have seen with autoreleased memory (where the failure happens a while later).


Your glitch always happens after selecting another color (perhaps something is not being retained?).


You can also use Instruments, but I find the Address Sanitizer to be very helpful for debugging.

Thanks! I turned on the sanitizer and it caught something! I get a stack buffer overflow when I call device.makeBuffer() to allocate an MTLBuffer for a projection matrix. However, it's not clear why that should cause an error. There's a ton of output in the console window, but I don't really know how to make sense of it. Do you know what causes stack buffer overflows?


EDIT: Actually it turns out I get a lot of these errors. Every time I copy memory to an MTLBuffer and the size is not 16-byte aligned, it gives me one of these. I wasn't aware that that was a problem, as I have seen many tutorials do this. I know the buffers on the GPU side must be 16-byte aligned, but it seems that the source buffers also have to be the correct size? For example, I've seen suggestions to use 'stride' instead of size -- but doing so would cause this kind of overflow.


And actually after going through and fixing all of these, I still get the glitch, with no complaints from the sanitizer. So I still haven't fixed the problem 😟

The C function to create aligned memory on the host side is posix_memalign().


This code allocates 512 bytes aligned to a page boundary (4096 alignment).


int *pool = NULL;
posix_memalign((void **)&pool, 4096, 512);

// ... initialize the buffer contents ...
// ... write it to the GPU device ...

free(pool); // VERY important: make sure you free the buffer after you have finished using it


Remember to use free()!


This article shows how to use posix_memalign from a Swift app: https://stackoverflow.com/questions/27365905/pointer-memory-alignment-to-16k-in-swift-for-metal-buffer-creation
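
Roughly, the Swift version of the same idea looks like this (a sketch only -- the page size, length, and storage mode here are assumptions, and Metal's no-copy path generally wants both the pointer and the length to be page-aligned):

import Foundation
import Metal

// Hedged Swift sketch: page-aligned host allocation wrapped in an MTLBuffer
// without copying, freed by the deallocator when the buffer is released.
let pageSize = 4096
let length = pageSize                           // round your real data size up to a page multiple
var raw: UnsafeMutableRawPointer? = nil
posix_memalign(&raw, pageSize, length)

if let pointer = raw, let device = MTLCreateSystemDefaultDevice() {
    // ... initialize the buffer contents through `pointer` ...
    let buffer = device.makeBuffer(bytesNoCopy: pointer,
                                   length: length,
                                   options: .storageModeShared,
                                   deallocator: { ptr, _ in free(ptr) })
    _ = buffer
}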


If you use structs in your Metal C code, I recommend specifying their alignment explicitly, like this:


struct rgba
{
    float r;
    float g;
    float b;
    float a;
} __attribute__ ((aligned (16)));

Thanks -- I'm confused, though, about when exactly 16-byte alignment is needed. Sometimes I get errors/warnings, and sometimes I don't get errors but I still get the glitching/hanging.


Say I have an array of vertices I'm passing to a shader, and they are constructed like this:

typedef struct
{
    float4 position [[position]];
    float2 texCoord;
} ColorInOut;


That struct is only 24 bytes -- is that going to cause a problem if I pass in an odd number of vertices?