I've been experiencing this issue with the official faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu TensorFlow model since I started experimenting with tensorflow-metal 0.2.0. I tested 0.4.0 last week with the same training data and the problem persists. When the model is trained on the M1 CPU or on other hardware, predictions come out fine. With tensorflow-metal, the training results visualised in TensorBoard are nonsense: objects are located at a small number of fixed positions within the images, and prediction scores are substantially greater than 100%.
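For anyone who wants to check for these symptoms without eyeballing TensorBoard, here is a minimal sketch of a sanity check on one image's detections. It assumes the usual Object Detection API output layout, i.e. boxes as normalized [ymin, xmin, ymax, xmax] and scores in [0, 1]; the helper name `check_detections` is my own and not part of any library.

```python
def check_detections(boxes, scores):
    """Return a list of human-readable problems for one image's detections.

    Assumed layout (not verified against every model):
      boxes  - list of [ymin, xmin, ymax, xmax], normalized to [0, 1]
      scores - list of floats, each expected to lie in [0, 1]
    """
    problems = []

    # A confidence score outside [0, 1] (or NaN) cannot be a valid probability.
    for i, score in enumerate(scores):
        if not (0.0 <= score <= 1.0):
            problems.append(f"score[{i}] = {score} is outside [0, 1]")

    for i, (ymin, xmin, ymax, xmax) in enumerate(boxes):
        # Normalized coordinates should stay inside the image.
        if not all(0.0 <= v <= 1.0 for v in (ymin, xmin, ymax, xmax)):
            problems.append(f"box[{i}] has coordinates outside [0, 1]")
        # A box with no area is degenerate.
        elif ymin >= ymax or xmin >= xmax:
            problems.append(f"box[{i}] is degenerate (zero or negative area)")

    return problems


if __name__ == "__main__":
    # Made-up values for illustration only, not real model output.
    for line in check_detections([[0.1, 0.1, 0.5, 0.5]], [3.2]):
        print(line)
```

Running the same check on detections from a CPU run and from a tensorflow-metal run of the same checkpoint should make the difference easy to quantify, if the outputs really are as broken as they look.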
Hi @radagast. There are other reports of wrong results from Metal. The thread "TensorFlow model predictions are incorrect on M1 GPU" has confirmations from 3 people (including myself) that this is a problem. Thank you for providing the Apple engineers with a reproducible Jupyter notebook. Given the significance of such a flaw in the Metal library, I am hoping that the engineers will be able to respond soon. The reports I cited above are from 3 weeks ago.
A follow-up. I uninstalled the metal package and re-ran the training and evaluation using the M1 CPU instead. Like @AdkPete, I was able to get sensible output. I am seeing reports from other users, such as @radagast, who are experiencing similar incorrect output from Metal and were able to provide a reproducible recipe (a Jupyter notebook) for the Apple engineers to work with. The engineers haven't acknowledged that there is a problem yet, though.
I am experiencing the same issue with a custom dataset. I am using the faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu model from the TensorFlow Model Zoo.
When I trained under Ubuntu 20.04 using the CPU, the eval results came out okay.
When I trained on the same data using Metal on macOS 12.0.1, the prediction results from the training evaluation were garbage. They don't align with the class names either. TensorBoard shows the bounding boxes without class labels, and most bounding boxes are scattered in the bottom-left corner.
I also re-ran just the TensorFlow eval using Metal on the checkpoints from the Ubuntu CPU training (which gave good results on Ubuntu), and the output was also garbage. I used the TensorFlow model_main_tf2.py script to train and evaluate in all experiments.
So I can also reproduce your issue @AdkPete. Something isn't right here.
Versions for me are:
tensorflow-macos 2.6.0
tensorflow-metal 0.2.0
tf-models-official 2.6.0
tensorflow-deps 2.6.0
python 3.8.12
macOS 12.0.1
16" M1 Max MacBook Pro