I think I'm facing this same issue, but using a transformer model (https://developer.apple.com/forums/thread/703081?answerId=719068022#719068022)
I'm seeing something similar when training the SwinTransformerV2Tiny_ns model from https://github.com/leondgarse/keras_cv_attention_models. After roughly 4075 training steps it pretty reliably just gives up on using the GPU: GPU memory usage and utilization drop off, and CPU usage also stays low. You can see the steps/sec absolutely tank in the training logs:
FastEstimator-Train: step: 3975; ce: 1.2872236; model_lr: 0.00022985446; steps/sec: 4.19;
FastEstimator-Train: step: 4000; ce: 1.3085787; model_lr: 0.00022958055; steps/sec: 4.2;
FastEstimator-Train: step: 4025; ce: 1.3924551; model_lr: 0.00022930496; steps/sec: 4.19;
FastEstimator-Train: step: 4050; ce: 1.4702798; model_lr: 0.0002290277; steps/sec: 4.16;
FastEstimator-Train: step: 4075; ce: 1.2734954; model_lr: 0.00022874876; steps/sec: 0.05;
[Attached plot] GPU memory utilization over time: about 30% during training, then it just cuts out. (The first dip is an evaluation step during training; training then resumes before cutting out.)
[Attached plot] GPU utilization over time: about 100% during training, then it just stalls out. (The first dip is an evaluation step during training; training then resumes before cutting out.)
After the GPU gives up, the terminal no longer responds to attempts to kill the training with Ctrl-C.
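For anyone trying to reproduce this without my full training harness, here's a minimal sketch of the kind of run that triggers it for me. Only the model (SwinTransformerV2Tiny_ns from keras_cv_attention_models) matches my actual setup; the input shape, class count, and random data below are placeholders just to keep the GPU busy for several thousand steps:

import numpy as np
import tensorflow as tf
from keras_cv_attention_models import swin_transformer_v2

# Placeholder input shape / class count; pretrained=None skips the weight download.
model = swin_transformer_v2.SwinTransformerV2Tiny_ns(
    input_shape=(224, 224, 3), num_classes=10, pretrained=None)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

# Random data purely to drive the GPU; run long enough to get well past ~4000 steps.
x = np.random.rand(512, 224, 224, 3).astype("float32")
y = np.random.randint(0, 10, size=(512,)).astype("int64")
model.fit(x, y, batch_size=8, epochs=300)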
If anyone's still facing issues, there are up-to-date installation steps (for TF and Torch) here: https://github.com/fastestimator/fastestimator/issues/1224
This might be more involved than you need for your particular use case (it involves putting conda inside a virtual environment to keep it isolated from the rest of the system), but it has worked for at least two people.
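Once you've set up an environment following those steps, a quick sanity check to confirm TensorFlow actually sees the Apple GPU before kicking off a long training run:

import tensorflow as tf
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))  # should list the Apple GPU device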