No, I explicitly observe the GPU/CPU loads with Performance Monitor, and explicitly set tf.device.
In contrast, on a decent Linux GPU cluster a Tesla V100 outperforms the CPU on the same code by roughly 10×.
This is definitely an issue with tensorflow-metal, at least on macOS 11.6.
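For reference, this is roughly how I pin the run to a device and time an epoch. It's a minimal sketch, not my actual code: the model, shapes, and batch size here are placeholders chosen to keep the example small.

```python
import time

import numpy as np
import tensorflow as tf


def build_model():
    # Tiny stand-in LSTM; real runs use much larger layers and batches.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(16, 8)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])


def time_epoch(device, batch_size=64):
    """Train one epoch on random data under an explicit device placement."""
    x = np.random.rand(256, 16, 8).astype("float32")
    y = np.random.rand(256, 1).astype("float32")
    with tf.device(device):  # e.g. "/CPU:0" or "/GPU:0"
        model = build_model()
        model.compile(optimizer="adam", loss="mse")
        start = time.perf_counter()
        model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
        return time.perf_counter() - start


cpu_s = time_epoch("/CPU:0")
print(f"CPU epoch: {cpu_s:.3f}s")
```

Swapping `"/CPU:0"` for `"/GPU:0"` on the same script is how the CPU/GPU gap shows up; the wall-clock difference per epoch is the number being compared above.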
Thank you for pointing this out! I confirm 2–3 s/epoch on a Radeon Pro 580X with `batch_size` between 2^12 and 2^14.
Here's more code that illustrates this issue with another LSTM model: Predict Shakespeare with Cloud TPUs and Keras.
I'd recommend releasing working TF-on-M1/M2 examples like this, just as TF has done for the TPU; it's very helpful to be able to compare against V100 performance when training comparable models.
It's also very helpful when the performance reveals some underlying issue in the TF code.