GPU training deadlock with tensorflow-metal 0.5

I am training a model using tensorflow-metal and having training deadlock issue similar to (https://developer.apple.com/forums/thread/703081). Following is a minimum code to reproduce the problem.

import tensorflow as tf

#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128


mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Test configurations are:

  • MacBook Air M1
  • macOS 12.4
  • tensorflow-deps 2.9
  • tensorflow-macos 2.9.2
  • tensorflow-metal 0.5.0

With this configuration and above code, training stops in the middle of 27th epoch (100% as far as I have tested). Interestingly, the problem can not be reproduced if I change any of following.

  1. GPU to CPU
  2. remove Dropout layers
  3. downgrade tensorflow-metal to 0.4

I'm having the same issue, anybody in a year has ever found a solution? I'm really ****** that 3k+ pc is not able to do such basic things that nvidia and others are able to do without any sort of issue, and the only fix is to avoid using GPU with a old tf-metal, which was the whole point of buying the Pro instead of the Air!

I'm having the same issue, anybody in a year has ever found a solution? I'm really "sad" (otherwise the post will be "reviewed") that 3k+ pc is not able to do such basic things that nvidia and others are able to do without any sort of issue, and the only fix is to avoid using GPU with a old tf-metal, which was the whole point of buying the Pro instead of the Air!

Can anyone confirm if this problem persists on tensorflow-metal version 0.7.0?

I also peviously had the same problems with training coming to a near-halt mid-epoch. Today, I (again) followed the steps on https://developer.apple.com/metal/tensorflow-plugin/ and installed tensorflow-deps 2.9.0, tensorflow-macos 2.12.0, and tensorflow metal 0.8.0. So far, I have not experienced any training "deadlocks".

GPU training deadlock with tensorflow-metal 0.5
 
 
Q