Why is it so slow?

I posted this issue to the TensorFlow and Keras teams. Since they couldn't reproduce it, they suggested I raise it with Apple.

I'm using a MacBook Air with the M1 chip. The OS version is Big Sur 11.4.

which python
/Users/dmitry/Applications/Miniforge3/bin/python

I ran the following code twice, once with tensorflow-macos and once with tensorflow_macos.

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])

predictions = model(x_train[:1]).numpy()
tf.nn.softmax(predictions).numpy()

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_fn(y_train[:1], predictions).numpy()

model.compile(optimizer='sgd', loss=loss_fn)
model.fit(x_train, y_train, epochs=100)

With tensorflow-macos and Python 3.9, I got this:

Epoch 1/100 1875/1875 [==============================] - 8s 4ms/step - loss: 0.7026

Epoch 2/100 1875/1875 [==============================] - 8s 4ms/step - loss: 0.3872

Epoch 3/100 1875/1875 [==============================] - 8s 4ms/step - loss: 0.3284

Epoch 4/100 1875/1875 [==============================] - 8s 4ms/step - loss: 0.2891

Epoch 5/100 1875/1875 [==============================] - 8s 4ms/step - loss: 0.2622

With tensorflow_macos in a Python 3.8 environment:

Epoch 1/100 1875/1875 [==============================] - 1s 276us/step - loss: 1.2181

Epoch 2/100 1875/1875 [==============================] - 1s 270us/step - loss: 0.4678

Epoch 3/100 1875/1875 [==============================] - 1s 269us/step - loss: 0.3935

Epoch 4/100 1875/1875 [==============================] - 1s 271us/step - loss: 0.3507

Epoch 5/100 1875/1875 [==============================] - 1s 270us/step - loss: 0.3231

Why is tensorflow-macos so much slower than tensorflow_macos? Did I miss something?

I get the same behavior on my MacBook Air with M1.

With tensorflow-metal (GPU):

Init Plugin
Init Graph Optimizer
Init Kernel
Num GPUs Available: 1
Metal device set to: Apple M1
2021-07-27 14:56:50.637472: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-07-27 14:56:50.639552: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-07-27 14:56:50.928184: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
2021-07-27 14:56:51.027610: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1875/1875 [==============================] - 16s 8ms/step - loss: 0.2980 - accuracy: 0.9137
Epoch 2/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.1446 - accuracy: 0.9563
Epoch 3/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.1079 - accuracy: 0.9676
Epoch 4/5
1875/1875 [==============================] - 15s 8ms/step - loss: 0.0894 - accuracy: 0.9722
Epoch 5/5
1875/1875 [==============================] - 16s 8ms/step - loss: 0.0746 - accuracy: 0.9767
2021-07-27 14:58:08.389344: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
313/313 - 3s - loss: 0.0763 - accuracy: 0.9766

With tensorflow-macos alone (CPU):

Num GPUs Available: 0
2021-07-27 15:40:12.555884: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/5
1875/1875 [==============================] - 1s 481us/step - loss: 0.2988 - accuracy: 0.9139
Epoch 2/5
1875/1875 [==============================] - 1s 492us/step - loss: 0.1416 - accuracy: 0.9581
Epoch 3/5
1875/1875 [==============================] - 1s 484us/step - loss: 0.1072 - accuracy: 0.9674
Epoch 4/5
1875/1875 [==============================] - 1s 493us/step - loss: 0.0876 - accuracy: 0.9736
Epoch 5/5
1875/1875 [==============================] - 1s 492us/step - loss: 0.0734 - accuracy: 0.9779
313/313 - 0s - loss: 0.0721 - accuracy: 0.9782

Similar here. With tensorflow-metal, performance is 6x slower than pure CPU.

One thing to note is that this model is fairly small, so it doesn't make good use of the GPU. You should be able to get better performance by increasing the batch size (for example, batch_size=1024); see the sketch below.
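
For illustration, here is a minimal sketch of that change, reusing model, x_train, and y_train from the snippet at the top of the thread (batch_size=1024 is just an example value):

# The default batch_size is 32; a larger batch keeps the GPU busier per step
# and amortizes the per-step dispatch overhead.
model.fit(x_train, y_train, epochs=5, batch_size=1024)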

People, please go through the tensorflow-metal requirements here: https://developer.apple.com/metal/tensorflow-plugin/

It clearly states that macOS 12.0 (Monterey) is required. There is no point in trying it on Big Sur, as it will fall back to a very unoptimised CPU path.
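
As a quick sanity check (a sketch using the standard TensorFlow device APIs), you can confirm whether the Metal plugin is actually exposing a GPU rather than silently falling back to the CPU:

import tensorflow as tf

# With tensorflow-metal installed on a supported macOS version, a GPU entry
# should show up here; an empty list means everything will run on the CPU.
print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_physical_devices('CPU'))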

tensorflow-metal was built on MPSGraph's inference enhancements and is only meant to be used with GPUs. You can learn more in the WWDC 2021 session here: https://developer.apple.com/videos/play/wwdc2021/10152/

I get the exact same results on macOS 12. I can't test the old library now, since it was only built for Big Sur, but I remember it not being this slow.

Same here, running on macOS 12 beta 5. tensorflow-metal is much slower, although it seems to reach better accuracy.

I'm new to TensorFlow, so maybe I'm not doing it right? Or maybe it's just another bogus "feature" of the Mac?

On CPU:

1875/1875 [==============================] - 1s 371us/step - loss: 0.5217 - accuracy: 0.8446
Epoch 2/5
1875/1875 [==============================] - 1s 353us/step - loss: 0.1890 - accuracy: 0.9428
Epoch 3/5
1875/1875 [==============================] - 1s 362us/step - loss: 0.1459 - accuracy: 0.9555
Epoch 4/5
1875/1875 [==============================] - 1s 353us/step - loss: 0.1301 - accuracy: 0.9616
Epoch 5/5
1875/1875 [==============================] - 1s 353us/step - loss: 0.1194 - accuracy: 0.9623

On GPU:

1875/1875 [==============================] - 8s 4ms/step - loss: 0.2965 - accuracy: 0.9130
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1405 - accuracy: 0.9588
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1046 - accuracy: 0.9680
Epoch 4/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0861 - accuracy: 0.9733
Epoch 5/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.0743 - accuracy: 0.9761

Small batch sizes (the default is 32) run much faster on the CPU. That's normal. The GPU takes over once you have a big enough workload; the timing sketch below shows one way to see that crossover.
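
This is only a rough sketch: it rebuilds the same MNIST model used earlier in the thread and times one epoch at two arbitrary batch sizes, so you can compare per-epoch wall time yourself.

import time
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.0

def timed_fit(batch_size):
    # Rebuild the model each time so both runs start from scratch.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)
    ])
    model.compile(optimizer='sgd',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    start = time.perf_counter()
    model.fit(x_train, y_train, epochs=1, batch_size=batch_size, verbose=0)
    return time.perf_counter() - start

# With tensorflow-metal installed, both runs go to the GPU; the larger batch
# spends proportionally less time on per-step dispatch overhead.
print('batch_size=32  :', timed_fit(32), 's')
print('batch_size=1024:', timed_fit(1024), 's')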

This isn't unexpected on any platform, with any device. Sometimes the CPU is faster than the GPU. Sometimes the M1 in my 13" MacBook Air is faster than my Nvidia Quadro or a Tesla K80.

It depends on the workload; it's not specific to tensorflow-metal.

To be 100% sure, disable the GPU and test again:

tf.config.set_visible_devices([], 'GPU')
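
A minimal sketch of how that might look in practice; the only assumption is that it runs before any model or op is created:

import tensorflow as tf

# Hide the GPU from TensorFlow so everything falls back to the CPU.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: this should now print an empty list.
print(tf.config.list_logical_devices('GPU'))

# ...then build and fit the model as usual; it will run entirely on the CPU.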