I think I'm facing this same issue, but using a transformer model (https://developer.apple.com/forums/thread/703081?answerId=719068022#719068022)
I'm seeing something similar when training the SwinTransformerV2Tiny_ns model from https://github.com/leondgarse/keras_cv_attention_models. After roughly 4075 training steps it pretty reliably just gives up on using the GPU: GPU memory usage and utilization drop off, and CPU usage also stays low. You can see the steps/sec absolutely tank in the training logs:
FastEstimator-Train: step: 3975; ce: 1.2872236; model_lr: 0.00022985446; steps/sec: 4.19;
FastEstimator-Train: step: 4000; ce: 1.3085787; model_lr: 0.00022958055; steps/sec: 4.2;
FastEstimator-Train: step: 4025; ce: 1.3924551; model_lr: 0.00022930496; steps/sec: 4.19;
FastEstimator-Train: step: 4050; ce: 1.4702798; model_lr: 0.0002290277; steps/sec: 4.16;
FastEstimator-Train: step: 4075; ce: 1.2734954; model_lr: 0.00022874876; steps/sec: 0.05;
[Attached plot] GPU memory utilization over time: about 30% during training, then it just cuts out. (The first dip is an evaluation step during training; training then resumes before cutting out.)
[Attached plot] GPU utilization over time: about 100% during training, then it just stalls out. (The first dip is an evaluation step during training; training then resumes before cutting out.)
After the GPU gives up, the terminal no longer responds to attempts to kill the training with Ctrl-C.
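For anyone trying to reproduce this without my full training harness, here's a minimal sketch of the kind of run that triggers it for me. Only the model (SwinTransformerV2Tiny_ns from keras_cv_attention_models) matches my actual setup; the input shape, class count, and random data below are placeholders just to keep the GPU busy for several thousand steps:

import numpy as np
import tensorflow as tf
from keras_cv_attention_models import swin_transformer_v2

# Placeholder input shape / class count; pretrained=None skips the weight download.
model = swin_transformer_v2.SwinTransformerV2Tiny_ns(
    input_shape=(224, 224, 3), num_classes=10, pretrained=None)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# Random data purely to drive the GPU; run long enough to get well past ~4000 steps.
x = np.random.rand(512, 224, 224, 3).astype("float32")
y = np.random.randint(0, 10, size=(512,)).astype("int64")
model.fit(x, y, batch_size=8, epochs=300)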
If anyone's still facing issues, there are up-to-date installation steps (for TF and Torch) here: https://github.com/fastestimator/fastestimator/issues/1224
This might be more involved than you need for your particular use case (it involves putting conda inside a virtual environment to keep it isolated from the rest of the system), but it has worked for at least two people.
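Once you've set up an environment following those steps, a quick sanity check to confirm TensorFlow actually sees the Apple GPU before kicking off a long training run:

import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # should list the Apple GPU device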