M1 Mac mini GPU acting strange during tensorflow-metal tests

I recently got an M1 Mac mini and I was doing some testing in TensorFlow. I installed tensorflow-macos to do CPU testing and tensorflow-metal to do GPU testing.

I followed the procedure here: https://developer.apple.com/metal/tensorflow-plugin/ to install tensorflow-metal, and I did not see any warnings or errors during the installation process.
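
(In case it helps anyone reproduce this: a quick way to confirm the plugin registered after the install is something like the snippet below, which just prints the TensorFlow version and the visible GPU devices.)

import tensorflow as tf

# Quick post-install sanity check: with tensorflow-metal installed correctly,
# list_physical_devices('GPU') should report one Metal PluggableDevice.
print("TensorFlow version:", tf.__version__)
print("GPU devices:", tf.config.list_physical_devices('GPU'))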

I was pleased to see that TensorFlow CPU testing on the M1 went smoothly.

I then tested the M1's integrated GPU to see whether everything was working correctly by running this sample TensorFlow code:

import tensorflow as tf
from tensorflow import keras

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

y_train = y_train[:1000]
y_test = y_test[:1000]

x_train, x_test = x_train / 255.0, x_test / 255.0
x_train = x_train[:1000].reshape(-1, 28*28)
x_test = x_test[:1000].reshape(-1, 28*28)

def create_model():
  model = tf.keras.models.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1000, activation='relu'),
    keras.layers.Dense(10)
  ])

  model.compile(optimizer='adam',
                loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=[tf.metrics.SparseCategoricalAccuracy()])

  return model

# Create a basic model instance
model = create_model()

# Display the model's architecture
model.summary()
predictions = model(x_train[:1]).numpy()

model.fit(x_train, y_train, epochs=10)

loss, acc = model.evaluate(x_test,  y_test, verbose=2)
print("Accuracy: {:5.2}%".format(100*acc))

When I tried to run this, the program stopped entirely with some strange errors that I could trace back to the model.fit line. I have attached an error log file containing the exact terminal output.
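
To narrow this down, one variant I can run is pinning everything to the CPU with a device scope, which bypasses the Metal plugin; a rough sketch of that debugging step (my addition, not part of the sample above):

import tensorflow as tf

# Debugging variant: force the run onto the CPU so the Metal plugin is
# bypassed; if this completes, the failure is specific to GPU training.
with tf.device('/CPU:0'):
    cpu_model = create_model()   # same model-building function as above
    cpu_model.fit(x_train, y_train, epochs=10)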

Switching the optimizer in the model.compile line to 'sgd' (the exact change is sketched after the log below) allows the program to complete, but the reported loss and accuracy are stuck at 1000.00:

2021-07-02 10:27:26.486274: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 2/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 3/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 4/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 5/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 6/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 7/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 8/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 9/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Epoch 10/10
32/32 [==============================] - 0s 5ms/step - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
2021-07-02 10:27:28.218923: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
32/32 - 0s - loss: 1000.0000 - sparse_categorical_accuracy: 1000.0000
Accuracy: 1e+05%
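
For reference, the only change for that run was the optimizer argument in the compile call, roughly:

# Same compile call as in create_model(), with only the optimizer swapped to plain SGD:
model.compile(optimizer='sgd',
              loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.metrics.SparseCategoricalAccuracy()])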

During my testing, I also tried using a pre-trained network and measuring only ML inference performance. Specifically, the inference benchmark I set up ran 10,000 upscaled CIFAR10 images through a pre-trained ResNet50 network. This worked correctly on both the M1 CPU and GPU, but the GPU was about twice as slow as the CPU running the exact same code, which surprised me since I expected the GPU to outperform the CPU. I also found, through sudo powermetrics logs, that GPU power consumption was almost twice that of the CPU: around 10 W on the GPU versus around 5 W on the CPU. (The CPU and GPU tests were run separately.)
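
For context, the inference benchmark was roughly the following sketch (simplified; the 224x224 target size and batch size of 32 are stand-ins here, not necessarily the exact values I used):

import time
import tensorflow as tf

# Load the 10,000 CIFAR10 test images, upscale them on the fly to ResNet50's
# usual input size, and time a single pass of model.predict().
(_, _), (x_test, _) = tf.keras.datasets.cifar10.load_data()

def upscale(image):
    image = tf.image.resize(tf.cast(image, tf.float32), (224, 224))
    return tf.keras.applications.resnet50.preprocess_input(image)

dataset = (tf.data.Dataset.from_tensor_slices(x_test)
           .map(upscale, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.applications.ResNet50(weights='imagenet')

start = time.perf_counter()
model.predict(dataset)
print("Inference time: {:.1f}s".format(time.perf_counter() - start))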

It appears that my issue is specific to training on the GPU; however, I'm wondering whether there is a larger issue here that results in poor optimization on the inference side too.

What I have tried to fix these issues:

  • Full restart
  • Verified on a friend's M1 Mac mini to check whether the issue is specific to my device; they hit exactly the same problem on the training test.

Both tests (training and inference) are completely repeatable and behave in exactly the same way every time I have run them.

Update: I was doing this testing on macOS 11.0. After updating to 11.4, the GPU tests work, and they are much faster than the CPU too (for my application). I'm not sure which specific update fixed the issue; maybe someone could let me know. Also, the tensorflow-metal page says the required OS is macOS 12.0, which is only available as a beta, and I'm not sure why that is.
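
For anyone comparing versions, a quick way to record the exact environment a run used is something like the following (looking up the plugin version via importlib.metadata is just one option):

import platform
from importlib.metadata import version

import tensorflow as tf

# Record the OS and package versions, to pin down which macOS /
# tensorflow-metal combination a given result came from.
print("macOS:", platform.mac_ver()[0])
print("tensorflow-macos:", tf.__version__)
print("tensorflow-metal:", version("tensorflow-metal"))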

import tensorflow as tf

tf.config.list_physical_devices()

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

As soon as I try to run a Keras-based model, it dies with:

2021-11-08 19:11:56.350233: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-08 19:11:56.350804: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-11-08 19:11:56.351033: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-11-08 19:11:56.512351: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-11-08 19:11:56.512369: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-11-08 19:11:56.512818: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-11-08 19:11:57.362096: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
