tensorflow-metal freezing nondeterministically during model training

I am training a model using tensorflow-metal, and training (along with the whole application) freezes. The behavior is nondeterministic. I believe the problem is with Metal, (1) because of the contents of the backtraces below, and (2) because when I run the same code on a machine with non-Metal TensorFlow (using a GPU), everything works fine.
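For reference, here is a minimal sketch (not my actual training code) of how to confirm that the Metal plugin's GPU is actually registered, and how to hide it so the same code can be run CPU-only as a comparison. Note that set_visible_devices must be called before any ops execute.

import tensorflow as tf

# On an Apple Silicon machine with tensorflow-metal installed, this should
# list one GPU device registered by the Metal plugin.
print(tf.config.list_physical_devices("GPU"))

# To check whether a freeze is GPU-specific, the GPU can be hidden so the
# same training code runs CPU-only. Must be called before any ops execute.
# tf.config.set_visible_devices([], "GPU")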

I can't share my code publicly, but I would be willing to share it with an Apple engineer privately over email if that would help. It's hard to create a minimal reproduction since my program is somewhat complex and the bug is nondeterministic, but the bug does appear pretty reliably.

It looks like the problem might be in some Metal Performance Shaders init code.

The state of everything (backtraces, etc.) when the program freezes is attached.

I should note: these logs are from 2.6.0, but the same bug was happening under 2.8.0; I downgraded to see whether it would go away. I am upgrading back to 2.8.0 now, and if/when the bug reappears I will post more backtraces.

Follow-up: I just got the same freeze on 2.8.0. Backtraces are attached; they look the same.

Hi @andmis!

Thanks for reporting this and providing the backtraces. Based on those, it looks like a thread is left waiting, causing a hang in the program. We are already investigating a similar issue (https://developer.apple.com/forums/thread/702174), and I'm hoping the root cause there is the same as for the one you are seeing. We are working on a fix, and I'll update here once it's out so you can try it with your code. If the issue persists after that, we can look into ways to debug your specific case.

I'm seeing something similar when training the SwinTransformerV2Tiny_ns model from https://github.com/leondgarse/keras_cv_attention_models. After roughly 4075 training steps it pretty reliably seems to just give up on using the GPU: GPU memory and utilization drop off, and CPU usage also stays low. You can see the steps/sec absolutely tank in the training logs:

FastEstimator-Train: step: 3975; ce: 1.2872236; model_lr: 0.00022985446; steps/sec: 4.19;
FastEstimator-Train: step: 4000; ce: 1.3085787; model_lr: 0.00022958055; steps/sec: 4.2;
FastEstimator-Train: step: 4025; ce: 1.3924551; model_lr: 0.00022930496; steps/sec: 4.19;
FastEstimator-Train: step: 4050; ce: 1.4702798; model_lr: 0.0002290277; steps/sec: 4.16;
FastEstimator-Train: step: 4075; ce: 1.2734954; model_lr: 0.00022874876; steps/sec: 0.05;

GPU Memory Utilization over time (about 30% during training, then it just cuts out; the first dip is an evaluation step during training, after which training resumes and then cuts out)

GPU Utilization over time (about 100% during training, then it just stalls out; the first dip is an evaluation step during training, after which training resumes and then cuts out)

After the GPU gives up, the terminal no longer responds to attempts to kill the training with Ctrl-C.
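In case it helps reproduction, a stripped-down loop along these lines (a stock Keras model and synthetic data, not the actual FastEstimator / SwinTransformerV2Tiny_ns pipeline) is roughly what I'd run to get past the step count where the stall shows up:

import tensorflow as tf

# Generic stand-in model; the real run uses SwinTransformerV2Tiny_ns via FastEstimator.
model = tf.keras.applications.MobileNetV2(weights=None, classes=10)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Synthetic data so the loop doesn't depend on a dataset.
x = tf.random.uniform((256, 224, 224, 3))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32).repeat()

# Run well past the ~4000-step mark where the slowdown was observed.
model.fit(ds, steps_per_epoch=1000, epochs=5)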

Hi @TortoiseHam

Could you test whether tensorflow-macos==2.9.2 and tensorflow-metal==0.5.1 solve your issue? Those releases include multiple bug fixes addressing GPU hangs that will hopefully resolve this for you.
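After upgrading, a quick way to confirm which wheels the running interpreter actually picked up (assuming Python 3.8+ for importlib.metadata) is something like:

import importlib.metadata
import tensorflow as tf

# Verify that the intended tensorflow-macos / tensorflow-metal versions are installed.
print("tensorflow-macos:", importlib.metadata.version("tensorflow-macos"))
print("tensorflow-metal:", importlib.metadata.version("tensorflow-metal"))
print("tensorflow runtime:", tf.__version__)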
