Tensorflow-metal runs extremely slow

I am comparing my M1 MBA with my 2019 16" Intel MBP. The M1 MBA has tensorflow-metal, while the Intel MBP has TF directly from Google.

Generally, the same programs run 2-5 times FASTER on the Intel MBP, which presumably has no GPU acceleration.

Is there anything I could have done wrong on the M1?

Here is the start of the Metal run:

Metal device set to: Apple M1

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB

2022-01-19 04:43:50.975025: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-19 04:43:50.975291: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
2022-01-19 04:43:51.216306: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Epoch 1/10
2022-01-19 04:43:51.298428: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Hi @ahostmadsen

Thanks for reporting the issue. Do you have a sample script we could use to study the issue in case there is some performance issue? Another possibility is that the model size or batch sizes used when running the scripts are too small to take full advantage of the GPU and amortize the time cost in dispatching the data to the GPU.
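One quick way to test that amortization hypothesis yourself is to time the same small model at two different batch sizes and see whether the per-epoch gap narrows. A minimal sketch, using synthetic data in place of MNIST so it runs without a download (layer sizes match the kind of small model discussed in this thread):

```python
import time
import tensorflow as tf

# Synthetic stand-in for MNIST: 2048 grayscale 28x28 images, 10 classes.
x = tf.random.uniform((2048, 28, 28))
y = tf.random.uniform((2048,), maxval=10, dtype=tf.int32)

for batch_size in (32, 1024):
    model = tf.keras.models.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='sgd',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s per epoch")
```

If the large-batch run is disproportionately faster on the Metal device than on the CPU, per-batch dispatch overhead is a likely culprit.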

This is a simple program I just downloaded to test. Each epoch takes about 6 s on the M1 MBA but only 1 s on the Intel MBP. All my programs run slowly, though. Yes, the examples I have been running are fairly small.

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='sgd', loss=loss_fn)
model.fit(x_train, y_train, epochs=10)

Now I tried a tutorial example from Google:

https://www.tensorflow.org/tutorials/quickstart/advanced

That one runs about twice as fast on my M1 MBA as on my Intel MBP. Perhaps the example in my previous post is not well suited to the GPU? One would then hope that the Metal framework could choose to run it on the CPU instead (in my experience the M1 is about twice as fast as Intel at scientific computations on the CPU).
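Until the plugin makes that choice automatically, you can pin small models to the CPU yourself. A sketch using standard TensorFlow device placement (nothing here is metal-specific):

```python
import tensorflow as tf

# Option 1: hide the GPU for the whole process.
# Must run before any op executes, i.e. at the top of the script.
tf.config.set_visible_devices([], 'GPU')

# Option 2: scope individual sections to the CPU explicitly.
with tf.device('/CPU:0'):
    x = tf.random.normal((4, 4))
    y = tf.matmul(x, x)

print(y.device)  # the device string ends with 'CPU:0'
```

With either approach the rest of the script is unchanged, so it is easy to A/B a small model on CPU versus the Metal GPU.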

Anyway, I think I will upgrade my 16" Intel MBP to a 16" M1 MBP, hoping that the TF metal framework continues to be developed.

Hi. I am having the same problem. Even with an encoder-decoder architecture, the M1 runs 5 times slower than Intel, and it could not find any GPU:

print("Num GPUs available:", len(tf.config.experimental.list_physical_devices('GPU')))

and it outputs "Num GPUs available: 0".
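For the zero-GPU case, a slightly fuller check can help tell a missing plugin apart from a broken install. A sketch (the Metal plugin only registers a GPU device when tensorflow-metal is installed in the same environment as tensorflow):

```python
import importlib.metadata
import tensorflow as tf

print("TF version       :", tf.__version__)
print("Physical devices :", tf.config.list_physical_devices())

# Check whether the tensorflow-metal package is present at all.
try:
    print("tensorflow-metal :", importlib.metadata.version("tensorflow-metal"))
except importlib.metadata.PackageNotFoundError:
    print("tensorflow-metal : not installed in this environment")
```

If the package is missing, `pip install tensorflow-metal` in the same virtual environment; if it is present but no GPU is listed, the plugin and TF versions may be incompatible.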

I bought this Mac because of its speed, and now it is even slower. How can I fix this?

I have the same problem as mentioned above.

Hello. Based on my observations, tensorflow-metal slows processing down instead of speeding it up on an M1 Pro MacBook. This is easy to explain: the GPU is not optimised for neural-network operations, while the "normal" processor has cores optimised for ML.

Log from a simple Python program, WITH the tensorflow-metal plugin:

Epoch 16/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0844 - accuracy: 0.9747
Epoch 17/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0818 - accuracy: 0.9756
Epoch 18/20
1875/1875 [==============================] - 8s 4ms/step - loss: 0.0794 - accuracy: 0.9759

(tinyML-env) remco@Remcos-MBP tinyML % pip uninstall tensorflow-metal
Found existing installation: tensorflow-metal 0.4.0

And without it:

Epoch 16/20
1875/1875 [==============================] - 1s 731us/step - loss: 0.0880 - accuracy: 0.9727
Epoch 17/20
1875/1875 [==============================] - 1s 736us/step - loss: 0.0845 - accuracy: 0.9742
Epoch 18/20
1875/1875 [==============================] - 1s 733us/step - loss: 0.0821 - accuracy: 0.9747
Epoch 19/20
1875/1875 [==============================] - 1s 727us/step - loss: 0.0807 - accuracy: 0.9750

Maybe you can try this too.

I have the same problem with my LSTM model on an Apple M2. I followed the steps at https://developer.apple.com/metal/tensorflow-plugin/ to set up my environment, and the model runs extremely slowly. How can I fix it?

Also, I got this message while running the model: Failed to get CPU frequency: 0 Hz

It seems there is no consensus on how to resolve this. I upgraded my Mac to Sonoma, the latest OS to date; TensorFlow then needed to be updated along with all its dependent libraries, and at that point it runs EXTREMELY slowly. I have been searching all over for a solution but have not been able to find one. Any help or direction would be greatly appreciated.

I want to raise the same issue. Two Apple Silicon computers, a Mac Studio and an MBP; the only difference is the new OS. They performed almost identically before the upgrade to Sonoma. Now the machine running Sonoma is 5-6 times slower on the same Python code, and the lack of GPU use seems to account for most of the difference.

Apple... seriously... why?!?

Is there any update?

I have run into the same issue: even the example code from https://developer.apple.com/metal/tensorflow-plugin/ runs 4x slower than on the CPU!

Looks like the version of Python matters. My environment: MacBook Pro 14-inch, 2021, M1 Pro, 16 GB

Using this code example I've created two different virtual environments:

  • Python 3.8.19
  • Python 3.11.9

Results

  • Python 3.8 (CPU)
Epoch 1/5
782/782 [==============================] - 403s 513ms/step - loss: 4.8157 - accuracy: 0.0648
  • Python 3.8 (GPU)
Epoch 1/5
2024-07-22 21:35:48.809586: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
782/782 [==============================] - 64s 77ms/step - loss: 4.9219 - accuracy: 0.0574

  • Python 3.11 (CPU)
Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 435s 544ms/step - accuracy: 0.0480 - loss: 5.0793
  • Python 3.11 (GPU)
Epoch 1/5
2024-07-22 21:48:42.497240: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
782/782 ━━━━━━━━━━━━━━━━━━━━ 412s 472ms/step - accuracy: 0.0487 - loss: 5.1804

I did not include the results for Python versions between 3.8 and 3.11, but the behavior is the same: slow. It looks like tensorflow-metal utilizes the Apple Silicon GPU well only in Python 3.8 🤷‍♂️
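When comparing virtual environments like this, it helps to record exactly which interpreter, architecture, and TF build each run used, so results are not accidentally mixed up. A minimal report sketch:

```python
import platform
import sys

# Minimal environment report to attach when comparing runs across venvs.
print("Python :", sys.version.split()[0])
print("Arch   :", platform.machine())  # 'arm64' on Apple Silicon, 'x86_64' on Intel
try:
    import tensorflow as tf
    print("TF     :", tf.__version__)
except ImportError:
    print("TF     : not installed")
```

Note in particular the Arch line: an x86_64 Python under Rosetta will run TensorFlow without any Metal acceleration at all, which can masquerade as a "slow GPU".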
