I am training a model with tensorflow-metal and running into a training deadlock similar to the one reported here: https://developer.apple.com/forums/thread/703081
The following is minimal code to reproduce the problem.
import tensorflow as tf
#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128
mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]
with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
The test configuration is:
MacBook Air M1
macOS 12.4
tensorflow-deps 2.9
tensorflow-macos 2.9.2
tensorflow-metal 0.5.0
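For anyone comparing setups, here is a small sketch for printing the installed package versions from Python. The helper name `installed_versions` is my own (hypothetical), and it only sees pip-installed distributions; as far as I know tensorflow-deps is a conda package, so it would need to be checked with `conda list` instead.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment (or not installed via pip).
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from the configuration listed above.
    for name, version in installed_versions(
        ["tensorflow-macos", "tensorflow-metal"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```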
With this configuration and the code above, training hangs in the middle of the 27th epoch (every time, as far as I have tested).
Interestingly, the problem cannot be reproduced if I make any one of the following changes:
Switch the device from GPU to CPU
Remove the Dropout layers
Downgrade tensorflow-metal to 0.4
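To help narrow down where the process is stuck, one option is a watchdog built on Python's standard `faulthandler` module. This is only a sketch: `hang_watchdog` is a hypothetical helper name, the interval is arbitrary, and the dump shows Python-level stacks only, so a deadlock inside native Metal plugin code would appear just as the Python call that never returned.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(interval_s, file=sys.stderr):
    """Dump the Python stack of every thread to `file` every `interval_s`
    seconds for as long as the wrapped block is still running."""
    faulthandler.dump_traceback_later(
        interval_s, repeat=True, file=file, exit=False
    )
    try:
        yield
    finally:
        # Stop the periodic dumps once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Intended usage with the reproduction script above (not executed here):
#
# with hang_watchdog(600):
#     model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)
```

The dumps are periodic, so they also appear while training is progressing normally; several consecutive dumps showing the same stack while the progress bar does not move would suggest the process is stuck at that call.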