I have a bug running TensorFlow code on my M1, but reproducing it needs annotated audio data (the TCN network processes audio). How can I share an example dataset? I can't submit GDrive links...
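One way to avoid uploading data at all is to generate a tiny synthetic dataset inside the repro script. The following is only a minimal sketch using the Python standard library: the function name write_click_track, the beat-time annotation format (one time in seconds per line) and all parameter values are assumptions for illustration, and whether synthetic audio triggers the same bug is untested.

```python
import math
import struct
import wave


def write_click_track(wav_path, ann_path, duration_s=5.0, bpm=120, sample_rate=22050):
    """Write a mono 16-bit WAV with a short tone burst on every beat,
    plus a text file with one beat time (in seconds) per line.

    Hypothetical helper: the annotation format is an assumption.
    """
    beat_period = 60.0 / bpm
    n_beats = int(duration_s / beat_period)
    beat_times = [i * beat_period for i in range(n_beats)]
    n_samples = int(duration_s * sample_rate)
    click_len = int(0.02 * sample_rate)  # 20 ms burst per beat

    samples = [0] * n_samples
    for t in beat_times:
        start = int(t * sample_rate)
        for k in range(min(click_len, n_samples - start)):
            # 1 kHz sine burst, well inside the 16-bit range
            samples[start + k] = int(20000 * math.sin(2 * math.pi * 1000 * k / sample_rate))

    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{n_samples}h", *samples))

    with open(ann_path, "w") as ann:
        ann.writelines(f"{t:.3f}\n" for t in beat_times)
    return beat_times
```

Calling write_click_track("clicks.wav", "clicks.beats") a few times with different bpm values would give a small dataset that can be pasted into a post as code instead of a link.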
First of all, as I understand it, this problem is related to TensorFlow Addons. I've been in contact with the tfa developers (https://github.com/tensorflow/addons/issues/2578); the issue only happens on M1, so they think it has to do with Apple's tensorflow-metal.
I've been getting spurious errors during model.fit with the Lookahead optimizer. I'm fine-tuning on big datasets, and my code breaks while fitting to different files in a non-reproducible way: each time I run it, it breaks on a different file and on a different operation.
I can see that these errors are undoubtedly related to the Lookahead optimizer.
Let me try to explain this new info in a clear manner.
I've tried 2 different versions of tf + tf-addons (separate conda environments) and got the same type of errors in both, probably more frequently with the pylast environment:
pylast: tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0
py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0
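A minimal sketch for printing these versions from inside each environment, so they can be pasted into a report without typing them by hand. It uses only the standard library; the helper name installed_versions is hypothetical, and the package names are the ones listed above.

```python
from importlib import metadata

# Package names taken from the environment listings above.
PACKAGES = ("tensorflow-macos", "tensorflow-metal", "tensorflow-addons")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```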
The base code is always the same. I call tf.config.set_soft_device_placement(True) and also wrap every call to TensorFlow in with tf.device('/cpu:0'):, otherwise I get errors. As explained before, my code just loads a model and fine-tunes it on each file of a dataset.
Here is a pair of example error outputs (obtained with the pylast conda environment):
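Because the failures land on a different file and operation each run, it can help to record every failure instead of stopping at the first one. The sketch below is an assumption about how the per-file loop could be wrapped, not the actual script: finetune_one is a hypothetical stand-in for the per-file model.fit call, and finetune_all and summarize are illustrative names.

```python
import traceback
from collections import Counter


def finetune_all(files, finetune_one):
    """Call finetune_one(path) for every file; record failures instead of stopping.

    finetune_one is a hypothetical stand-in for the per-file model.fit call.
    """
    failures = []
    for path in files:
        try:
            finetune_one(path)
        except Exception as exc:  # broad on purpose: every failure should be logged
            failures.append({
                "file": path,
                "error_type": type(exc).__name__,
                "message": str(exc),
                "traceback": traceback.format_exc(),
            })
    return failures


def summarize(failures):
    """Count failures per exception type."""
    return Counter(f["error_type"] for f in failures)
```

Comparing the returned lists across several runs shows whether the same files or the same exception types recur.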
  File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
    history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Detected at node 'Lookahead/Lookahead/update_64/mul_11' defined at (most recent call last):
  File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db
    history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True,
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
    return fn(*args, **kwargs)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1409, in fit
    tmp_logs = self.train_function(iterator)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_function
    return step_function(self, iterator)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1040, in step_function
    outputs = model.distribute_strategy.run(run_step, args=(data,))
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1030, in run_step
    outputs = model.train_step(data)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 893, in train_step
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 539, in minimize
    return self.apply_gradients(grads_and_vars, name=name)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 104, in apply_gradients
    return super().apply_gradients(grads_and_vars, name, **kwargs)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 678, in apply_gradients
    return tf.__internal__.distribute.interim.maybe_merge_call(
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 723, in _distributed_apply
    update_op = distribution.extended.update(
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 706, in apply_grad_to_update_var
    update_op = self._resource_apply_dense(grad, var, **apply_kwargs)
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 130, in _resource_apply_dense
    train_op = self._optimizer._resource_apply_dense(
  File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/rectified_adam.py", line 249, in _resource_apply_dense
    coef["r_t"] * m_corr_t / (v_corr_t + coef["epsilon_t"]),
Node: 'Lookahead/Lookahead/update_64/mul_11'
Incompatible shapes: [0] vs. [5,40,20]
  [[{{node Lookahead/Lookahead/update_64/mul_11}}]] [Op:__inference_train_function_30821]
and
Another error output
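To make the "Incompatible shapes: [0] vs. [5,40,20]" message concrete: element-wise operations such as the mul node in the traceback follow NumPy-style broadcasting, and a dimension of size 0 can only pair with 0 or 1. The sketch below is a plain-Python illustration of that rule, not TensorFlow's own code, and broadcast_shape is a hypothetical helper name. Which operand arrived empty cannot be determined from the traceback alone.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of a and b, or raise ValueError.

    NumPy-style rule: align the shapes from the right; two dimensions are
    compatible if they are equal or if one of them is 1.
    """
    rank = max(len(a), len(b))
    # Pad the shorter shape with leading 1s so both have the same rank.
    padded_a = (1,) * (rank - len(a)) + tuple(a)
    padded_b = (1,) * (rank - len(b)) + tuple(b)
    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"Incompatible shapes: {list(a)} vs. {list(b)}")
    return tuple(result)


if __name__ == "__main__":
    print(broadcast_shape((1,), (5, 40, 20)))   # a scalar-like operand is fine
    try:
        broadcast_shape((0,), (5, 40, 20))      # the shapes from the error above
    except ValueError as err:
        print(err)
```

An empty tensor of shape [0] therefore cannot combine with a [5,40,20] tensor: the last dimensions are 0 and 20, and neither is 1.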
I have a simple TCN model that I've been using for a while. Since switching to an Apple M1, I have been unable to run it.
My issue seems very similar to this and this, and I've also reported it as FB9722799.
On one of these threads Apple acknowledged that "we are aware of this issue and already working on a fix." But that was 3 months ago, which is too long to go without being able to fully use the computer for development!
If I uninstall tensorflow-metal, I can run the code (of course, only on the CPU). If I install tensorflow-metal, I get the following error:
Cannot assign a device for operation model/conv_1_convolution/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model/conv_1_convolution/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0'
...
[[{{node model/conv_1_convolution/Conv2D/ReadVariableOp}}]] [Op:__inference_train_function_15035]
Dear all,
I'm unable to install tensorflow-macos after updating to macOS Monterey (12.0 Beta). Following the latest instructions from tensorflow/apple (https://developer.apple.com/metal/tensorflow-plugin/), I'm using miniforge conda: I create a blank environment and then run conda install -c apple tensorflow-deps, which completes without any error or warning. But when I then run the following, everything breaks.
python -m pip install tensorflow-macos
Tried with python3.8 with the following error (summary, not the full logs):
distutils.errors.CompileError: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for grpcio
Tried with python3.9 with the following error (summary, not the full logs):
distutils.errors.CompileError: command '/usr/bin/clang' failed with exit code 1
----------------------------------------
ERROR: Failed building wheel for grpcio
Tried with force reinstall and no-cache-dir (python -m pip install tensorflow-macos --no-cache-dir --force-reinstall) with the following error:
ERROR: Command errored out with exit status 1: /Users/machine/miniforge3/envs/tf38/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-install-djre1j5j/numpy_48546adcbc9d4c558a4dc32a8e607649/setup.py'"'"'; __file__='"'"'/private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-install-djre1j5j/numpy_48546adcbc9d4c558a4dc32a8e607649/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-record-343ln54c/install-record.txt --single-version-externally-managed --prefix /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal --compile --install-headers /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal/include/python3.8/numpy Check the logs for full command output.
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/a7/81/20d5d994c91ed8347efda90d32c396ea28254fd8eb9e071e28ee5700ffd5/h5py-3.1.0.tar.gz#sha256=1e2516f190652beedcb8c7acfa1c6fa92d99b42331cbef5e5c7ec2d65b0fc3c2 (from https://pypi.org/simple/h5py/) (requires-python:>=3.6). Command errored out with exit status 1: /Users/machine/miniforge3/envs/tf38/bin/python /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-standalone-pip-nmsgrvml/__env_pip__.zip/pip install --ignore-installed --no-user --prefix /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'numpy==1.12; python_version == "3.6"' 'Cython>=0.29; python_version < "3.8"' 'numpy==1.14.5; python_version == "3.7"' 'numpy==1.19.3; python_version >= "3.9"' 'numpy==1.17.5; python_version == "3.8"' pkgconfig 'Cython>=0.29.14; python_version >= "3.8"' Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement h5py~=3.1.0 (from tensorflow-macos) (from versions: 2.2.1, 2.3.0b1, 2.3.0, 2.3.1, 2.4.0b1, 2.4.0, 2.5.0, 2.6.0, 2.7.0rc2, 2.7.0, 2.7.1, 2.8.0rc1, 2.8.0, 2.9.0rc1, 2.9.0, 2.10.0, 3.0.0rc1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.3.0, 3.4.0)
ERROR: No matching distribution found for h5py~=3.1.0
Could anyone point me to a solution? I'm really desperate here, as my work is completely stuck because of this. Thanks in advance.
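The "Failed building wheel" messages mean pip fell back to compiling grpcio, numpy and h5py from source. That usually happens when no prebuilt wheel matches the interpreter. A minimal standard-library sketch for collecting the details that decide this is below. The function name interpreter_report is hypothetical, and whether an architecture or Python version mismatch is the cause here is only an assumption.

```python
import platform
import sys


def interpreter_report():
    """Collect the interpreter details that decide which wheels pip can use."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "machine": platform.machine(),   # 'arm64' for a native Apple Silicon build
        "macos": platform.mac_ver()[0],  # empty string on non-macOS systems
        "executable": sys.executable,    # shows which conda environment is active
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

If "machine" reports x86_64 on an M1, the interpreter is running under Rosetta rather than natively, which would be worth mentioning in the report.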
My computer almost stalls whenever I try to use a Bidirectional layer. I'm on an M1 Mac with tensorflow-macos 2.5, tensorflow-metal 0.1.2, and tensorflow-deps 2.5.0.
Below are 2 short snippets of demo code: one working (without Bidirectional) and one not working (with Bidirectional).
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Embedding, Dense, LSTM
from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import SimpleRNN, Bidirectional, Masking
import tensorflow_addons as tfa
additional_metrics = ['accuracy']
batch_size = 128
embedding_output_dims = 15
loss_function = BinaryCrossentropy()
max_sequence_length = 300
num_distinct_words = 5000
number_of_epochs = 5
optimizer = Adam()  # default optimizer, replaced by RectifiedAdam on the next line
optimizer = tfa.optimizers.RectifiedAdam(learning_rate=0.01, clipnorm=0.5)
validation_split = 0.20
verbosity_mode = 1
def working_demo_LSTM():
    # Load dataset
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words)
    print(x_train.shape)
    print(x_test.shape)
    # Pad all sequences
    padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length,
                                  value=0.0)  # 0.0 because it corresponds with <PAD>
    padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length,
                                       value=0.0)  # 0.0 because it corresponds with <PAD>
    # Define the Keras model
    model = Sequential()
    model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
    model.add(LSTM(10))
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model
    model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics)
    # Give a summary
    model.summary()
    history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs,
                        verbose=verbosity_mode, validation_split=validation_split)
    # Test the model after training
    test_results = model.evaluate(padded_inputs_test, y_test, verbose=False)
    print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
    return True
def nonworking_demo_BLSTM():
    # Load dataset
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words)
    print(x_train.shape)
    print(x_test.shape)
    # Pad all sequences
    padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length,
                                  value=0.0)  # 0.0 because it corresponds with <PAD>
    padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length,
                                       value=0.0)  # 0.0 because it corresponds with <PAD>
    # Define the Keras model
    model = Sequential()
    model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length))
    model.add(Bidirectional(SimpleRNN(units=10, return_sequences=True)))
    model.add(Dense(1, activation='sigmoid'))
    # Compile the model
    model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics)
    # Give a summary
    # model.summary()
    history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs,
                        verbose=verbosity_mode, validation_split=validation_split)
    # Test the model after training
    test_results = model.evaluate(padded_inputs_test, y_test, verbose=False)
    print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
    return True
def main():
    # working_demo_LSTM()
    nonworking_demo_BLSTM()


if __name__ == "__main__":
    main()
I'm getting warnings and the computer stalls whenever I run nonworking_demo_BLSTM(). If I wrap the call in with tf.device('/cpu:0'):, I get 7 secs per epoch. If I don't explicitly select the CPU, I get an ETA of 05:44:30 just for the 1st epoch!
Are these values normal?
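To put the two numbers side by side, here is the arithmetic as a small sketch. The 7 s and 05:44:30 figures come from the runs above. The 25,000-review size of the IMDB training set is an assumption about the Keras dataset, and the variable names are illustrative.

```python
import math


def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


gpu_eta_s = hms_to_seconds("05:44:30")  # ETA reported without the CPU scope
cpu_epoch_s = 7                         # seconds per epoch with tf.device('/cpu:0')
slowdown = gpu_eta_s / cpu_epoch_s

# Assumption: IMDB has 25,000 training reviews, so validation_split=0.20
# leaves 80% of them (20,000 samples) for training.
train_samples = 25_000 * 80 // 100
steps_per_epoch = math.ceil(train_samples / 128)  # batch_size = 128

print(f"{gpu_eta_s} s vs {cpu_epoch_s} s per epoch: about {slowdown:.0f}x slower")
print(f"{steps_per_epoch} steps per epoch, about {gpu_eta_s / steps_per_epoch:.0f} s per step")
```

A gap of roughly three orders of magnitude in the wrong direction is far beyond normal CPU/GPU variation for a model this small.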