GPU training deadlock with tensorflow-metal 0.5
I am training a model with tensorflow-metal and hitting a training deadlock similar to the one described in https://developer.apple.com/forums/thread/703081. The following is minimal code to reproduce the problem.

    import tensorflow as tf

    #dev = '/cpu:0'
    dev = '/gpu:0'
    epochs = 1000
    batch_size = 32
    hidden = 128

    mnist = tf.keras.datasets.mnist
    train, _ = mnist.load_data()
    x_train, y_train = train[0] / 255.0, train[1]

    with tf.device(dev):
        model = tf.keras.models.Sequential()
        model.add(tf.keras.layers.Flatten())
        model.add(tf.keras.layers.Dense(hidden, activation='relu'))
        model.add(tf.keras.layers.Dropout(0.3))
        model.add(tf.keras.layers.Dense(hidden, activation='relu'))
        model.add(tf.keras.layers.Dropout(0.3))
        model.add(tf.keras.layers.Dense(10, activation='softmax'))
        model.compile(loss='sparse_categorical_crossentropy',
                      optimizer='adam', metrics=['accuracy'])
        model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

Test configuration:

- MacBook Air M1
- macOS 12.4
- tensorflow-deps 2.9
- tensorflow-macos 2.9.2
- tensorflow-metal 0.5.0

With this configuration and the code above, training stops in the middle of the 27th epoch (every time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any one of the following:

- switch from GPU to CPU
- remove the Dropout layers
- downgrade tensorflow-metal to 0.4
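Since the hang seems to depend on the exact tensorflow-metal version, it may help to capture the environment programmatically when comparing runs. The following is only a sketch using the Python standard library; the helper names `installed_versions` and `environment_report` are made up for illustration. It assumes the packages were installed with pip, so conda-only packages such as tensorflow-deps would not show up here.

```python
from importlib import metadata
import platform

# Package names taken from the test configuration in the post.
# tensorflow-deps is a conda package, so it is not queried here (assumption).
PACKAGES = ("tensorflow-macos", "tensorflow-metal")


def installed_versions(names=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report(names=PACKAGES):
    """Build a short multi-line report suitable for pasting into a bug report."""
    lines = [
        f"machine: {platform.machine()}",
        # mac_ver() returns an empty release string on non-macOS systems.
        f"macOS: {platform.mac_ver()[0] or 'n/a'}",
        f"python: {platform.python_version()}",
    ]
    for name, version in installed_versions(names).items():
        lines.append(f"{name}: {version or 'not installed'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Running this before and after downgrading tensorflow-metal would confirm which plugin version was actually active for each training run.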
Jun ’22