Wrong results when using tensorflow-metal

After installing tensorflow-metal, the loss does not decrease as it does in CPU-only training or when running the same training in an Nvidia CUDA environment.

The results when training on the M1 Max with Metal are completely useless. I get good results again when I uninstall tensorflow-metal via pip uninstall tensorflow-metal and leave everything else unchanged, but then training is slow and the fans seem louder than when doing GPU training.
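In case it helps others reproduce the comparison, here is a minimal sketch (assuming the standard tf.config API) of how to hide the GPU for a single run instead of uninstalling the plugin:

```python
import tensorflow as tf

# Must run before any op touches the GPU. Hiding the GPU makes this process
# fall back to the CPU, so the loss curves can be compared without
# uninstalling tensorflow-metal.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())
```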

Without tensorflow-metal

With tensorflow-metal

The x-axis in both graphs is the epoch.

OS is macOS 12.0.1.

TensorFlow version is 2.6, since 2.7 crashes.

tensorflow-metal version is 0.3, but the behaviour was the same with 0.2.
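For completeness, a quick check of which versions are actually loaded (the importlib.metadata call assumes Python 3.8+):

```python
import tensorflow as tf
from importlib.metadata import version

print(tf.__version__)                          # 2.6.x here
print(version("tensorflow-metal"))             # 0.3.x here (same behaviour with 0.2)
print(tf.config.list_physical_devices('GPU'))  # the M1 Max GPU shows up when the plugin is installed
```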

The network is an RCNN. I am following this tutorial: https://keras-ocr.readthedocs.io/en/latest/examples/end_to_end_training.html (only the recogniser part, to be exact).
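For reference, a rough sketch of the recogniser-only training from that tutorial (paraphrased, not the exact tutorial code; training_gen and validation_gen stand in for the tutorial's data generators):

```python
import keras_ocr

# Build and compile only the recognition model from keras-ocr.
recognizer = keras_ocr.recognition.Recognizer()
recognizer.compile()

# training_gen / validation_gen come from the tutorial's dataset code and are
# placeholders here; the detector part of the pipeline is not trained.
recognizer.training_model.fit(
    training_gen,
    validation_data=validation_gen,
    epochs=1000,
    steps_per_epoch=100,
    validation_steps=100,
)
```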

I would much appreciate a solution to this issue, since the main reason for buying a 64 GB M1 Max was to use it for neural network training.

Answered by Frameworks Engineer in 698287022

Accepted Answer

Hi @gtsoukas,

Thanks for reporting this. I have the issue reproduced and can verify that there seems to be a problem with the GPU training. I'm looking into it and will update here once we have a solution available.

I've been running into a similar issue with my model training, and I found that it depends on the model I use (see the answer I've added in the link). The MobileNetV3Small model performs poorly, but a custom image classification model that I define works great on the same dataset. @gtsoukas, I wonder if there is any overlap in the layers between your model and mine that could explain this issue.
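To make that comparison concrete, here is a sketch of how the two runs might be set up, assuming an image-classification dataset (train_ds and val_ds are placeholder tf.data datasets of image/label pairs):

```python
import tensorflow as tf

def make_mobilenet(num_classes, input_shape=(224, 224, 3)):
    # Pretrained backbone that reportedly trains poorly under tensorflow-metal.
    base = tf.keras.applications.MobileNetV3Small(
        input_shape=input_shape, include_top=False, weights="imagenet")
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(base.input, out)

def make_custom(num_classes, input_shape=(224, 224, 3)):
    # Small hand-rolled CNN that reportedly behaves correctly on the same data.
    return tf.keras.Sequential([
        tf.keras.layers.Input(input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Train both models on the same data and compare the loss curves.
for build in (make_mobilenet, make_custom):
    model = build(num_classes=10)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=5)
```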
