I have exactly the same problem! I started noticing really long training times for a simple BLSTM and decided to test the code above. I'm also on a MacBook Air M1 (16 GB) running the macOS 12 beta, with TensorFlow 2.5 with Metal support and Python 3.9.
This completely undermines my work! Apple should do something!
I'm already using macOS 12.0 Beta 8
I'm also having similar problems, but Apple just does not give any support!
Sorry for insisting, but this issue prevents me from using TensorFlow, and I really need it.
Hi there!
That will be difficult, because these are spurious errors that occur during model.fit with the Lookahead optimizer. I'm fine-tuning on big datasets, and my code breaks while fitting to different files in a non-reproducible way: each time I run it, it breaks on a different file and on a different operation.
So the only way for me to share this would be to trim down my code a bit (it will still be big) and also send you one of the datasets (>2 GB), to be sure it also breaks on your side.
I don't think I have any other way I can share this with you. Is that ok?
I'm asking because this will take me some hours to do, time that I don't really have, but I would do it if you could look at the code I'll send.
I'll wait on your feedback.
Please let me know something about this.
This error makes it impossible for me to run my code, and I'm sure the same will happen to anyone on an M1 who needs tensorflow_addons.
In summary:
1- I had fine-tuning code running without problems on my old Mac (it loads a previously trained TCN model and creates one fine-tuned model per file in the dataset);
2- When I bought the new M1, almost 1 year ago, the same code started producing the following error:
Cannot assign a device for operation model/conv_1_convolution/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model/conv_1_convolution/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' assigned_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' resource_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' supported_device_types_=[CPU] possible_devices_=[]
Equal: CPU
AssignSubVariableOp: GPU CPU
AssignVariableOp: GPU CPU
GreaterEqual: GPU CPU
FloorDiv: CPU
Sqrt: GPU CPU
NoOp: GPU CPU
Pow: GPU CPU
Mul: CPU
Cast: GPU CPU
Identity: GPU CPU
SelectV2: GPU CPU
ReadVariableOp: GPU CPU
RealDiv: GPU CPU
Sub: GPU CPU
AddV2: GPU CPU
Const: GPU CPU
Square: GPU CPU
_Arg: GPU CPU
3- I worked around that error by calling tf.config.set_soft_device_placement(True) and forcing with tf.device('/cpu:0'): before any call to TensorFlow. However, during long fine-tuning sessions I inevitably hit, at some random point, the error I reported in the first post (i.e. "Incompatible shapes: [0] vs. [5,40,20]", with varying shapes).
4- I've tried 2 different versions of tf + tf-addons (separate conda environments) and got the same type of errors with both, probably more frequently with the pylast environment. See the attached environment.yml files.
pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0
5- The developers of tensorflow_addons have been really helpful (https://github.com/tensorflow/addons/issues/2578), but they said "As tensorflow-macos and tensorflow-metal are closed source packages we cannot do anything here in the case we cannot reproduce the issue on another platform."
6- The code fine-tunes a TCN network on specific audio files (with annotations), so you really need this data to debug it. Furthermore, the problem happens during long runs, so the dataset must be big for you to run into the issue.
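For reference, the workaround from point 3 can be sketched roughly as follows. This is a minimal illustration, not my actual TCN code: the helper name fit_on_cpu and the callable it receives are made up for the example, and TensorFlow is imported inside the function only so that the helper can be defined before TensorFlow is loaded.

```python
def fit_on_cpu(build_and_fit):
    """Run build_and_fit() with soft placement enabled and ops pinned to the CPU.

    build_and_fit is a hypothetical zero-argument callable that creates (or loads)
    the model and calls model.fit; it is not part of any library.
    """
    import tensorflow as tf  # lazy import: nothing touches TensorFlow before this point

    # Let TensorFlow fall back to another device when an op has no kernel
    # for the requested one, instead of raising a placement error.
    tf.config.set_soft_device_placement(True)

    # Pin everything created inside the block (variables and ops) to the CPU.
    with tf.device('/cpu:0'):
        return build_and_fit()


# Example use (load_model_and_fit is a placeholder for your own code):
# history = fit_on_cpu(load_model_and_fit)
```

The model should be built or loaded inside the callable as well, so that its variables are also created under the CPU device scope.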
So please clone https://github.com/MR-T77/M1_tf_problems (~1.4 GB) and extract it to the same path as the .py file.
You will see:
a stripped down version of my code (problem_TCNv2.py) - just run it as it is; go into run_me() to change dataset or data augmentation.
2 datasets, one big and one small. The code breaks when I'm doing long runs, so I'm pretty sure that if you run the code with the big dataset, the issue will appear.
pretrained_model.h5 - the pretrained network model;
yml files of the conda environments that I tested.
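To check that an environment matches one of the two I tested, a small sketch like the following should report the installed versions of the three relevant packages. It uses only the standard library; the helper name installed_versions is just for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names as they appear in the two conda environments listed above.
PACKAGES = ("tensorflow-macos", "tensorflow-metal", "tensorflow-addons")


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it inside the activated conda environment and compare the output with the versions listed for pylast and py39deps26-source.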
I really hope you can figure out what the problem is, as I really need this code to work.
Please keep me posted.