Same problem here.
Training simply hangs at the same epoch every time for a given training loop (tested 5 times on two different training loops).
The code cell (in JupyterLab) keeps showing the [*] symbol, indicating the kernel is busy, while CPU and GPU usage sit at around 0.
Another peculiar behaviour: if a training loop always gets stuck at the 47th of 50 epochs, changing the number of epochs to 45 lets that loop finish. However, the next training loop after it then gets stuck at its second epoch (47 - 45 = 2, lol), even though it trains a different model and is therefore a different training loop. So the hang seems to track the total number of epochs run in the session rather than any particular model.
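For anyone who wants to see where the kernel is sitting when this happens, here is a minimal sketch using the standard-library `faulthandler` module. The helper name `hang_watchdog`, the log file name, and the 600-second timeout are assumptions made up for illustration, not part of TensorFlow or tensorflow-metal. It only dumps the Python stack of every thread once the wrapped block has run longer than the timeout; it shows where Python is waiting, not why, and it does not fix the hang. It writes to a real file because the notebook's stderr may not expose a file descriptor.

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_path="hang_traceback.log"):
    """Dump all thread tracebacks to log_path if the block exceeds timeout_s.

    Hypothetical helper: the dump repeats every timeout_s seconds until the
    block exits, so a genuinely hung cell keeps appending to the log.
    """
    with open(log_path, "a") as log_file:
        # faulthandler needs a file object with a real file descriptor.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            yield log_path
        finally:
            # Cancel before the file is closed so no dump hits a closed fd.
            faulthandler.cancel_dump_traceback_later()


# Usage sketch (model and train_ds are placeholders for your own objects):
#
# with hang_watchdog(600):
#     model.fit(train_ds, epochs=50)
```

If the cell hangs, the log file can be inspected from a separate terminal while the kernel still shows [*].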
Device:
macOS 12.6.2 on M2
Software:
python==3.10.8
numpy==1.23.2
tensorflow-deps==2.10.0
tensorflow-estimator==2.11.0
tensorflow-macos==2.11.0
tensorflow-metal==0.7.0
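To make reports like the one above easier to compare, the version list can be generated rather than typed by hand. This is a rough sketch using only the standard library; `collect_versions` and `format_report` are hypothetical names, and the package list simply mirrors the one in this post. Packages installed through conda without pip metadata (possibly `tensorflow-deps`) may show up as not found.

```python
import platform
from importlib import metadata

# Package names copied from the list in this post.
PACKAGES = [
    "numpy",
    "tensorflow-deps",
    "tensorflow-estimator",
    "tensorflow-macos",
    "tensorflow-metal",
]


def collect_versions(packages):
    """Map each package name to its installed version, or None if not found."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(packages=PACKAGES):
    """Build a pip-style report, one 'name==version' entry per line."""
    lines = [f"python=={platform.python_version()}"]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name}=={version}" if version else f"{name} (not found)")
    return "\n".join(lines)


print(format_report())
```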
RngReadAndSkip isn't registered in tensorflow-metal==0.7.0
What are the differences between 0.7.0 and 0.5.1, and should I downgrade to 0.5.1?