The new tensorflow-macos and tensorflow-metal break training

Upgrading tensorflow-macos and tensorflow-metal not only breaks Conv2D with the groups argument, it also makes training unable to finish.

Today, after upgrading tensorflow-macos to 2.9.0 and tensorflow-metal to 0.5.0, my notebook can no longer make progress after training for around 16 minutes.

I tested four times. Each run happily completed around 17 to 18 epochs, at roughly 55 seconds per epoch; after that, it simply stopped making progress.

I checked Activity Monitor; both CPU and GPU usage were at 0 at that point.

I happened to notice a large number of kernel faults in the Console app.

The last one before I force-killed the process:

IOReturn IOGPUDevice::new_resource(IOGPUNewResourceArgs *, struct IOGPUNewResourceReturnData *, IOByteCount, uint32_t *): PID 68905 likely leaking IOGPUResource (count=200000)

PID 68905 is in fact the training process.

I have seen this kind of issue off and on for several months, but it was less frequent, and after a restart my notebook would train successfully. No luck today.

I hope Apple engineers can find the cause and fix it.

I ran my notebook again and observed that training paused once the count reached 200000.

So tensorflow-metal may indeed be leaking kernel resources, which at some point prevents training from continuing.

Hi @wangcheng

Thanks for reporting this issue! Can you confirm whether this issue was reproduced with the same sample script you provided in the other thread? This leak seems to be related to a specific op, so that would help us focus the search.

No, that script only reproduces the groups argument issue.

Here is the script to reproduce this problem. You need to obtain the data from Kaggle first: https://www.kaggle.com/competitions/tabular-playground-series-may-2022/data
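
In outline, the run is just a small Keras MLP trained on the Kaggle CSV. A simplified sketch (not the exact script; the path, column handling, and layer sizes are illustrative):

import pandas as pd
import tensorflow as tf

# Assumes the competition data is unpacked to data/train.csv.
df = pd.read_csv("data/train.csv")
# f_27 is a string column in this dataset; drop it for a minimal repro.
X = df.drop(columns=["id", "target", "f_27"]).values.astype("float32")
y = df["target"].values.astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# On the GPU this stalls after roughly 10 minutes while Console.app fills
# with "likely leaking IOGPUResource" messages; on the CPU it completes.
model.fit(X, y, batch_size=512, epochs=100, validation_split=0.1)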

On my MacBook Pro, after running the script for around 10 minutes, I started to see the leaking IOGPUResource messages in Console.app.

Hardware info: 14-inch M1 Pro, 32 GB, 10-core CPU, 16-core GPU

Software info: macOS 12.4, Python 3.9.7, tensorflow-macos 2.9.2, tensorflow-metal 0.5.0

If I train the model on the CPU, it can finish training, and surprisingly each epoch is nearly 20% faster.
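
For anyone who wants the CPU fallback, one way is to hide the GPU from TensorFlow before any ops run:

import tensorflow as tf

# Hide the Metal GPU so every op falls back to the CPU; this must run
# before any tensors or models are created.
tf.config.set_visible_devices([], "GPU")
print(tf.config.get_visible_devices())  # the GPU should no longer be listed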

I think I'm facing the same issue, but with a transformer model (https://developer.apple.com/forums/thread/703081?answerId=719068022#719068022).

I've also had this issue on this network with TF 2.9.1 (both tensorflow-macos and built from source) and tensorflow-metal 0.4.1 and 0.5.0. It can be mitigated by restarting the Python script every 10-30 epochs (usually corresponding to about 20 minutes of wall time), transferring the weights, and resuming training (a sketch of this pattern follows after my system info), but sometimes it would hang right off the bat. Eventually I had to switch to CPU training, and I can also concur that it seems to be faster for some reason.

Hardware info: Mac Studio with M1 Max, 10 CPU cores, 24 GPU cores, 32 GB RAM

Software info: macOS 12.4, Python 3.9.13 (Homebrew)
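
The save-and-resume pattern looks roughly like this (build_model() and train_ds are placeholders for your own model and dataset):

import os
import tensorflow as tf

CKPT = "checkpoints/last.h5"  # weights-only checkpoint path

model = build_model()  # placeholder: your own model-building function
if os.path.exists(CKPT):
    model.load_weights(CKPT)  # pick up where the killed run left off

# Write the weights after every epoch so a hung process can simply be killed.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(CKPT, save_weights_only=True)

# Train a small slice of epochs per process; an outer shell loop relaunches
# the script until training is actually done.
model.fit(train_ds, epochs=10, callbacks=[ckpt_cb])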

I also get this issue when training a PointNet model (https://keras.io/examples/vision/pointnet/). My setup is tensorflow-macos 2.9.2, tensorflow-metal 0.5.0, and Python 3.9.13 on a Mac Studio M1 Max with 64 GB of memory running macOS 12.4. For the time being I've reverted to an older Ubuntu machine for training, though I suppose I could just use CPU training on the M1.

It would be great to get this resolved because I love using the M1 for everything. It's so fast and quiet.

I came to this forum to report my own problem, but reading recent posts, this one seems nearly identical.

A few days ago I followed the instructions at https://developer.apple.com/metal/tensorflow-plugin/ and was able to run the simple (MNIST) proof-of-life suggested here https://caffeinedev.medium.com/how-to-install-tensorflow-on-m1-mac-8e9b91d93706

Then I tried to run my own Jupyter notebook, which I had been using on Colab. It ran for 16 minutes and then hung; GPU usage dropped to zero and nothing further happened. This sounds very close to what @wangcheng reported last month.

I assume that TensorFlow on Metal on Apple Silicon is not a huge priority for Apple. Still, clearly some skilled engineering effort went into the current version. Which led some of us to buy M1 machines hoping to run our TensorFlow code on it. I write just in hopes this bug can be addressed “soonish.”

Rereading @wangcheng's report from two months ago, it says that training on the CPU can finish and is surprisingly nearly 20% faster per epoch. I tried this in my code and saw what I assume is more typical: on the GPU my epochs were taking about 10 minutes (before it quit with the IOGPUResource leak), whereas on the CPU it seems to take about 45 minutes per epoch.

Still a problem on M2 (with 16GB of unified RAM). And still stuck on tensorflow-macos 2.9 and tensorflow-metal 0.5.0 since newer versions are broken.

Same problem here.

Training simply hangs at the same epoch for a given training loop (tested 5 times on two different training loops). The code cell (in JupyterLab) shows the [*] symbol, indicating the kernel is busy, while CPU and GPU usage sit at around 0.

Another peculiar behaviour: if a training loop always got stuck at the 47th of 50 epochs, changing the number of epochs to 45 allowed it to finish; however, the next training loop after it would then get stuck at its second epoch (47 - 45 = 2, lol), even though it was for a different model and hence a different training loop.

Device:

  • macOS 12.6.2 on M2

Software:

  • python==3.10.8
  • numpy==1.23.2
  • tensorflow-deps==2.10.0
  • tensorflow-estimator==2.11.0
  • tensorflow-macos==2.11.0
  • tensorflow-metal==0.7.0

I am reporting the same issue on a MacBook Air M2 with 16 GB of RAM and a 512 GB drive. Training randomly stops after about 15 minutes while the notebook indicates it is still running, and CPU and GPU usage drop to 0.

My environment:

  1. Python 3.8.15
  2. tensorflow-datasets      4.8.1
  3. tensorflow-estimator     2.9.0
  4. tensorflow-macos         2.9.2
  5. tensorflow-metadata     1.12.0
  6. tensorflow-metal         0.5.0
  7. tensorflow-probability   0.15.0

I have the same problem.
env:

python 3.10.9 (also checked 3.8)
tensorflow-macos         2.11.0
tensorflow-metal         0.7.0

I've also used tensorflow-macos==2.9 and tensorflow-metal==0.5.0/0.5.1.
Still the same.

First: if I use the latest versions, the fit() method fails with this error:

tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:418 : NOT_FOUND: could not find registered platform with id: 0x16997a500

So I worked around it with: from tensorflow.keras.optimizers.legacy import Adam
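
Spelled out, the workaround just compiles with the legacy optimizer (model stands for any built Keras model here; the loss is illustrative):

from tensorflow.keras.optimizers.legacy import Adam

# The legacy optimizer avoids the XLA "registered platform" path that
# fails with tensorflow-macos 2.11 and tensorflow-metal 0.7.0.
model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
)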

Now fit() works, but somewhere in the middle it deadlocks.

CPU fitting works as intended.

This problem is brutal. We want to purchase new Apple Silicon Macs but cannot in the face of such a glaring problem. I hope Apple realizes how big an issue this is and resolves it soon.

As far as I can tell, @tux_o_matic is correct about the only workable solution.

If I recall correctly, tensorflow-metal 0.4.0 didn't stop randomly during training (e.g. the deadlock that @Namalek mentioned) - does anyone know how to get that version? pip can only find 0.5.0 at the earliest, and that has the stalling bug. I am mystified by how this keeps getting updated with broken fixes - even the simple tutorial models don't work.

Unfortunately, I'm on the other side of the issue from @wbattel4607. I bought a Mac Studio with the M1 Ultra only to discover that Apple had effectively nerfed TensorFlow by shipping broken updates and removing the tensorflow-macos < 2.9.0 and tensorflow-metal == 0.4.0 configurations that could actually train models.

OK, it takes a bit of work to get the older and more stable versions up and running. First and foremost, you'll need Homebrew. Then you'll need the version of Python supported by the targeted release; the table showing how to match up archival versions of tensorflow-macos and tensorflow-metal is near the bottom of this page.

You can then use brew to install the legacy Python:

brew install python@3.9

Then use that to create a virtual environment. The commands for my install follow, though double-check the location of your Homebrew:

/opt/homebrew/opt/python@3.9/bin/python3.9 -m venv ~/tensorflow
source ~/tensorflow/bin/activate

With the virtual environment created, you then need the URLs for the old pip installs. Apple prohibits linking external URLs on this forum, but you can look up tensorflow-macos and tensorflow-metal at pypi dot org and find their release history in the left-hand column. Then right-click (or Command-click) the release to copy its link. pip install <url> is an acceptable way to install packages.

Take careful note of the cp38 or cp39 in the filename - this tells you whether you need Python 3.8 or 3.9 for a particular release.

With that, you just need to install from the URLs. In my example, I want tensorflow-macos 2.8 and tensorflow-metal 0.4.0, which did not have the deadlock issue (at least not that I recall; I'll add another comment with a stable configuration if I need to find one).

pip install https://files.pythonhosted.org/packages/4d/74/47440202d9a26c442b19fb8a15ec36d443f25e5ef9cf7bfdeee444981513/tensorflow_macos-2.8.0-cp39-cp39-macosx_11_0_arm64.whl
pip install https://files.pythonhosted.org/packages/d5/37/c48486778e4756b564ef844b145b16f3e0627a53b23500870d260c3a49f3/tensorflow_metal-0.4.0-cp39-cp39-macosx_11_0_arm64.whl
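
A quick sanity check that the downgraded install can actually see the Metal GPU:

import tensorflow as tf

print(tf.__version__)                          # expect 2.8.0
print(tf.config.list_physical_devices("GPU"))  # expect one Metal device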

With that, I am off to the races. I am using tensorflow-macos to build a chatbot AI. The older configuration of tensorflow-macos and tensorflow-metal has the same training time on my setup - about an hour per epoch, which is not bad at all for a model with 82 million parameters and a dataset of hundreds of thousands of scientific papers (this is with the M1 Ultra and batch sizes of 64). TensorFlow on Mac is very powerful, but unfortunately you can't rely on the latest releases or the provided installation instructions to get anything functional.
