GPU training deadlock with tensorflow-metal 0.5

I am training a model using tensorflow-metal and am hitting a training deadlock similar to the one described in https://developer.apple.com/forums/thread/703081. The following is a minimal code example that reproduces the problem.

import tensorflow as tf

#dev = '/cpu:0'
dev = '/gpu:0'
epochs = 1000
batch_size = 32
hidden = 128


mnist = tf.keras.datasets.mnist
train, _ = mnist.load_data()
x_train, y_train = train[0] / 255.0, train[1]

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(hidden, activation='relu'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)

The test configuration is:

  • MacBook Air M1
  • macOS 12.4
  • tensorflow-deps 2.9
  • tensorflow-macos 2.9.2
  • tensorflow-metal 0.5.0

With this configuration and the above code, training stops in the middle of the 27th epoch (100% of the time, as far as I have tested). Interestingly, the problem cannot be reproduced if I change any of the following (a sketch applying the first two changes to the repro script is shown after the list):

  1. Switch the device from GPU to CPU
  2. Remove the Dropout layers
  3. Downgrade tensorflow-metal to 0.4
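
For reference, here is a minimal sketch of how workarounds 1 and 2 could be applied to the repro script above (the use_dropout flag is an illustrative addition of mine, not part of the original report):

import tensorflow as tf

# Workaround 1: switch the device to CPU; workaround 2: drop the Dropout layers.
dev = '/cpu:0'          # or keep '/gpu:0' and set use_dropout = False
use_dropout = False

mnist = tf.keras.datasets.mnist
(x_train, y_train), _ = mnist.load_data()
x_train = x_train / 255.0

with tf.device(dev):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    for _ in range(2):
        model.add(tf.keras.layers.Dense(128, activation='relu'))
        if use_dropout:
            model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(10, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])

    model.fit(x_train, y_train, batch_size=32, epochs=1000)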

@masa6s

Thanks for reporting the issue and the excellent test script to reproduce it. I can confirm that I have reproduced this locally and found an issue relating to the dropout layer that causes the training to stop. Once we have verified the fix, we will include it in tensorflow-metal==0.5.1.

Same here. (Python 3.9.13, tensorflow-macos 2.9.2, tensorflow-metal 0.5.0)

The same problem.

Python 3.8.9 tensorflow-macos 2.9.2 tensorflow-metal 0.5.0

Hello, I don't know if the cause is the same, but I tried to fine-tune a BERT model and at some point I also get a deadlock after some time (I need to kill the kernel and start over). Whether the deadlock happens depends on the quantity of data I use for fine-tuning. In the case below, training stops in the middle of the 3rd epoch.

My machine:

  • macOS 12.5
  • MacBook Pro, Apple M1 Max

I use:

  • python                    3.10.5
  • tensorflow-macos          2.9.2
  • tensorflow-metal          0.5.0
  • tokenizers                0.12.1.dev0 
  • transformers              4.22.0.dev0

Data: https://www.kaggle.com/datasets/kazanova/sentiment140

Quantity of tweets used: 11200

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                             num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

# tf_train_dataset / tf_validation_dataset are the tokenized Sentiment140 splits
model.fit(tf_train_dataset,
          validation_data=tf_validation_dataset,
          epochs=4,
          )
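
For completeness, a rough sketch of how the tf_train_dataset / tf_validation_dataset inputs might have been built from the Sentiment140 CSV; the file path, label mapping, split size, and batch size below are assumptions for illustration, not the actual preprocessing code:

import pandas as pd
import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Sentiment140 CSV has no header; columns are target, id, date, flag, user, text.
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", header=None,
                 names=["target", "id", "date", "flag", "user", "text"])
df = df.sample(11200, random_state=0)                 # quantity of tweets used above
labels = (df["target"] == 4).astype("int32").values   # 0 = negative, 4 = positive

enc = tokenizer(df["text"].tolist(), padding=True, truncation=True,
                max_length=128, return_tensors="np")

dataset = tf.data.Dataset.from_tensor_slices((dict(enc), labels))
n_val = 1200                                          # assumed validation split
tf_validation_dataset = dataset.take(n_val).batch(16)
tf_train_dataset = dataset.skip(n_val).shuffle(10000, seed=0).batch(16)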

We have now released tensorflow-metal==0.5.1 which addresses multiple memory leak issues leading to GPU hangups. Please give it a try and see if it helps with the problems you are seeing.
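
For anyone verifying the upgrade (e.g. pip install -U tensorflow-metal inside the environment), a small check of which versions are actually active before re-running the repro, using only standard APIs:

import importlib.metadata as md
import tensorflow as tf

print("tensorflow-macos:", tf.__version__)
print("tensorflow-metal:", md.version("tensorflow-metal"))
print("GPU devices:", tf.config.list_physical_devices("GPU"))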

Hi,

I did not see any improvement with TF-MACOS==2.9.2 and TF-METAL==0.5.1 under Python 3.9.13. Please see my latest (relevant) response in the thread https://developer.apple.com/forums/thread/711753.

This is why I am sticking to my old setup of TF-MACOS==2.8.0 and TF-METAL==0.4.0 along with Python 3.8.13. And I am using the CPU-only option, which leaks relatively less memory. Even so, I still have to wait until the end to see all the epochs (merely 3) finish.

Thanks, Bapi

I ran into the same issue. The training would stop at a random epoch with no error or warning when using tensorflow-metal 0.5.1.

The only way I could fix this was to reinstall my environment from scratch following Apple's instructions, but this time using this version of Miniforge3-MacOSX-arm64.sh and tensorflow-metal 0.4.0.

Hi @bahman_n and @karbapi! Thanks for verifying that this issue still persists in 0.5.1. I'll continue looking into the issue to get to the bottom of this.

Hi,

I wish to share a strange thing I noticed apart from this issue (and the memory leak issue for GPU):

**There is a huge gap between two consecutive epochs. An epoch typically takes around 4-5 minutes, but this inter-epoch gap spans around 6-7 minutes. This is possibly because process scheduling on the M1 Ultra is under-optimised.**
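
For reference, a small callback sketch (plain Keras callbacks, nothing tensorflow-metal specific; purely illustrative) that would quantify that inter-epoch gap by timing the end of one epoch against the start of the next:

import time
import tensorflow as tf

class EpochGapLogger(tf.keras.callbacks.Callback):
    """Logs each epoch's duration and the idle gap before the next epoch starts."""

    def __init__(self):
        super().__init__()
        self._epoch_start = None
        self._last_epoch_end = None

    def on_epoch_begin(self, epoch, logs=None):
        now = time.time()
        if self._last_epoch_end is not None:
            print(f"gap before epoch {epoch}: {now - self._last_epoch_end:.1f}s")
        self._epoch_start = now

    def on_epoch_end(self, epoch, logs=None):
        now = time.time()
        print(f"epoch {epoch} duration: {now - self._epoch_start:.1f}s")
        self._last_epoch_end = now

# usage: model.fit(..., callbacks=[EpochGapLogger()])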

Hope this pointer helps with your fix and yields a better resolution in subsequent TF-METAL releases.

Thanks, Bapi

I am really disappointed with my Mac Studio (M1 Ultra, 64c GPU, 128GB RAM). Now I am wondering why I spent so much money on this ****** machine!

Now I am getting an error due to multiprocessing and the training has stopped!

2022-08-26 07:49:49.615373: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-08-26 07:49:49.615621: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Process Keras_worker_SpawnPoolWorker-92576:
Traceback (most recent call last):
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/pool.py", line 109, in worker
    initializer(*initargs)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/site-packages/keras/utils/data_utils.py", line 812, in init_pool_generator
    id_queue.put(worker_proc.ident, block=True, timeout=0.1)
  File "/Users/bapikar/miniforge3/envs/tf28_python38/lib/python3.8/multiprocessing/queues.py", line 84, in put
    raise Full
queue.Full
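
The traceback originates in Keras' multiprocessing worker pool (keras/utils/data_utils.py, init_pool_generator), so one possible mitigation, sketched here under the assumption that the data is fed through a keras.utils.Sequence or generator, is to disable the worker pool in fit(). The model and Sequence below are dummy stand-ins, not the actual training code:

import numpy as np
import tensorflow as tf

class RandomSequence(tf.keras.utils.Sequence):
    """Dummy Sequence yielding random batches, only to illustrate the fit() flags."""
    def __len__(self):
        return 100
    def __getitem__(self, idx):
        x = np.random.rand(32, 28 * 28).astype("float32")
        y = np.random.randint(0, 10, size=(32,))
        return x, y

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

model.fit(
    RandomSequence(),
    epochs=3,
    workers=1,                  # single worker
    use_multiprocessing=False,  # avoid the spawn-based pool that raised queue.Full
)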

Regarding the deadlock, it seems I accidentally found a workaround. You have to include a line that explicitly says you want to use the GPU, especially if, like me, you work cell by cell. Example below:

with tf.device('/gpu:0'):
    <write your model here>

Then here you do other things in your notebook, like batching and such... Then you train your model:

with tf.device('/gpu:0'):
    hist_1 = model_1.fit(...)  # your usual fit() arguments
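
For reference, the same idea applied to the MNIST repro script from the top of the thread, split the way a notebook would run it (the model and data here just mirror that earlier script, not the poster's actual notebook):

import tensorflow as tf

# Cell 1: build and compile the model under an explicit GPU scope.
with tf.device('/gpu:0'):
    model_1 = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model_1.compile(loss='sparse_categorical_crossentropy',
                    optimizer='adam', metrics=['accuracy'])

# Cell 2: other work (loading data, batching, and so on) outside the device scope.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

# Cell 3: train, again under an explicit GPU scope.
with tf.device('/gpu:0'):
    hist_1 = model_1.fit(x_train, y_train, batch_size=32, epochs=10)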

Somehow, this stopped my deadlock. In addition (and I don't know if it is related, but just in case), I stopped using Safari for my Jupyter notebook and switched to Chrome instead (not for this reason, but mainly because Safari kept reloading my "heavy" notebook...).

Hope this helps.

cheers

Hi, Thanks for sharing the info.

However, my issue is a little different (please see the thread on memory leakage: https://developer.apple.com/forums/thread/711753).

My training stops apparently due to memory leakage, and one potential reason (my guess) is a CPU/GPU scheduling issue when memory usage gets too high (say ~125GB out of the 128GB RAM in my system, with no swap being used for whatever reason) on my M1 Ultra machine with a 64-core GPU (Mac Studio).

And FYI, my training setup uses:

with tf.GradientTape() as tape:
       .......
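
For context, a minimal custom training step of the kind that setup implies; the model, optimizer, and loss below are placeholders for illustration, not the actual code:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x, y):
    # Forward pass recorded on the tape, then backprop and an optimizer update.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss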

And I do not use a Jupyter notebook. I run my work from the command line, and my code (structured across multiple files) is written with a text editor such as GVIM.

--Bapi

CPU-only run on Mac Studio (20c CPU, 64c GPU, 128GB RAM). Training is STALLED; perhaps the CPUs are DEAD for some FABULOUS REASONS. Below is the snapshot (with the temperatures of the different cores).

HEY, ANY UPDATE? SHOULD MY 64c GPUs BE ALLOWED TO SIT IDLE?

I am wondering if this is a manifestation of a related problem.

My Python code starts with: from transformers import AutoTokenizer, AutoModel

It then crashes during execution of the following line: model = AutoModel.from_pretrained("bert-base-uncased")

Running from within the PyCharm IDE, I get this error: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Interestingly, this crashes on my Mac mini (Intel i5 with 16GB RAM) but runs fine on my MacBook Air (Apple M1 with 16GB RAM). Both are running macOS Ventura 13.0.1 at the moment.
