Posts

Post not yet marked as solved
4 Replies
1.6k Views
First of all, as I understand that this is a problem related with tensorflow addons, I've been in contact with tfa developers (https://github.com/tensorflow/addons/issues/2578), and this issue only happens in M1, so they think it has to do with Apple tensorflow-metal. I've been getting spurious errors while doing model.fit with the Lookahead optimizer (I'm doing fine-tuning with big datasets, and my code just breaks while fitting to different files, and in a not-reproducible way, i.e. each time I run it it breaks on a different file, and on different operations). I can see that these errors are undoubtedly related to the Lookahead optimizer. Let me try to explain this new info in a clear manner. I've tried with 2 different versions of tf+tfaddons (conda environments), but I got the same type of errors, probably more frequent with the pylast conda environment: pylast:tensorflow-macos 2.9.0, tensorflow-metal 0.5.0, tensorflow-addons 0.17.0 py39deps26-source: tensorflow-macos 2.6.0, tensorflow-metal 0.2.0, tensorflow-addons 0.15.0.dev0 The base code is always the same, I use tf.config.set_soft_device_placement(True) and also with tf.device('/cpu:0'): in every call to tensorflow, otherwise I get errors. As explained before, in my code, I just load a model, and fine-tune it to each file of a dataset. Here are a pair of example error outputs (obtained with the pylast conda environment): File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True, File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler raise e.with_traceback(filtered_tb) from None File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name, tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error: Detected at node 'Lookahead/Lookahead/update_64/mul_11' defined at (most recent call last): File "/Users/machine/Projects/finetune-asp/src/finetune_IMR2020.py", line 138, in finetune_dataset_db history = model.fit(ft, steps_per_epoch=len(ft), epochs=ft_cfg["num_epochs"], shuffle=True, File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler return fn(*args, **kwargs) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1409, in fit tmp_logs = self.train_function(iterator) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1051, in train_function return step_function(self, iterator) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1040, in step_function outputs = model.distribute_strategy.run(run_step, args=(data,)) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 1030, in run_step outputs = model.train_step(data) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/engine/training.py", line 893, in train_step self.optimizer.minimize(loss, self.trainable_variables, tape=tape) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 539, in minimize return self.apply_gradients(grads_and_vars, name=name) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 104, in apply_gradients return super().apply_gradients(grads_and_vars, name, **kwargs) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 678, in apply_gradients return tf.__internal__.distribute.interim.maybe_merge_call( File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 723, in _distributed_apply update_op = distribution.extended.update( File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 706, in apply_grad_to_update_var update_op = self._resource_apply_dense(grad, var, **apply_kwargs) File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/lookahead.py", line 130, in _resource_apply_dense train_op = self._optimizer._resource_apply_dense( File "/Users/machine/miniforge3/envs/pylast/lib/python3.9/site-packages/tensorflow_addons/optimizers/rectified_adam.py", line 249, in _resource_apply_dense coef["r_t"] * m_corr_t / (v_corr_t + coef["epsilon_t"]), Node: 'Lookahead/Lookahead/update_64/mul_11' Incompatible shapes: [0] vs. [5,40,20] [[{{node Lookahead/Lookahead/update_64/mul_11}}]] [Op:__inference_train_function_30821] and Another error output
Posted
by mrt77.
Last updated
.
Post not yet marked as solved
1 Replies
514 Views
I have a simple TCN model that I've been using for a while. Since my change to Apple M1, I am unable to run it. My issue seems very similar to this and this, and I've also reported on FB9722799. On one of these threads Apple recognized that "we are aware of this issue and already working on a fix.". But this was 3 months ago, which is too much time without being able to fully use the computer for development! If I uninstall tensorflow-metal, I can run the code (of course, only in CPU). If I install tensorflow-metal, I get the following error: Cannot assign a device for operation model/conv_1_convolution/Conv2D/ReadVariableOp: Could not satisfy explicit device specification '' because the node {{colocation_node model/conv_1_convolution/Conv2D/ReadVariableOp}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0]. Colocation Debug Info: Colocation group had the following types and supported devices: Root Member(assigned_device_name_index_=2 requested_device_name_='/job:localhost/replica:0/task:0/device:GPU:0' ... [[{{node model/conv_1_convolution/Conv2D/ReadVariableOp}}]] [Op:__inference_train_function_15035]
Posted
by mrt77.
Last updated
.
Post not yet marked as solved
4 Replies
1.3k Views
My computer almost stalls whenever I try to use a Bidirectional layer. I'm using Macos M1 with tensorflow-macos 2.5 tensorflow-metal 0.1.2, tensorflow-deps 2.5.0. Bellow I show 2 short snippets of demo code: one working (without Bidirectional), one not-working (with Bidirectional). import tensorflow as tf from tensorflow.keras.datasets import imdb from tensorflow.keras.layers import Embedding, Dense, LSTM from tensorflow.keras.losses import BinaryCrossentropy from tensorflow.keras.models import Sequential from tensorflow.keras.optimizers import Adam from tensorflow.keras.preprocessing.sequence import pad_sequences from tensorflow.keras.layers import SimpleRNN, Bidirectional, Masking import tensorflow_addons as tfa additional_metrics = ['accuracy'] batch_size = 128 embedding_output_dims = 15 loss_function = BinaryCrossentropy() max_sequence_length = 300 num_distinct_words = 5000 number_of_epochs = 5 optimizer = Adam() optimizer = tfa.optimizers.RectifiedAdam(learning_rate=0.01, clipnorm=0.5) validation_split = 0.20 verbosity_mode = 1 def working_demo_LSTM(): # Load dataset (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) print(x_train.shape) print(x_test.shape) # Pad all sequences padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value=0.0) # 0.0 because it corresponds with <PAD> padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value=0.0) # 0.0 because it corresponds with <PAD> # Define the Keras model model = Sequential() model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) model.add(LSTM(10)) model.add(Dense(1, activation='sigmoid')) # Compile the model model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) # Give a summary model.summary() history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) # Test the model after training test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') return True def nonworking_demo(): # Load dataset (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) print(x_train.shape) print(x_test.shape) # Pad all sequences padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value=0.0) # 0.0 because it corresponds with <PAD> padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value=0.0) # 0.0 because it corresponds with <PAD> # Define the Keras model model = Sequential() model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) model.add(Bidirectional(SimpleRNN(units=10, return_sequences=True))) model.add(Dense(1, activation='sigmoid')) # Compile the model model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) # Give a summary # model.summary() history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) # Test the model after training test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') return True def main(): # working_demo_LSTM() nonworking_demo_BLSTM() if __name__ == "__main__": main() I'm getting the following warnings and the computer stalls whenever I run nonworking_demo_BLSTM() with with tf.device('/cpu:0'): I get 7secs per epoch. If I don't explicitly select CPU, I get a ETA of 05:44:30 just for the 1st epoch! Are these values normal?
Posted
by mrt77.
Last updated
.
Post marked as solved
2 Replies
5.5k Views
Dear all, I'm unable to install tensorflow-macos, after updating to macOS Monterey(12.0 Beta). According to the last instructions from tensorflow/apple (https://developer.apple.com/metal/tensorflow-plugin/), I'm using miniforge conda, create a blank environment and then do conda install -c apple tensorflow-deps, which runs without any error or warning. Then when I try to do the following, everything breaks. python -m pip install tensorflow-macos Tried with python3.8 with the following error (summary, not the full logs): distutils.errors.CompileError: command 'gcc' failed with exit status 1 ---------------------------------------- ERROR: Failed building wheel for grpcio Tried with python3.9 with the following error (summary, not the full logs): distutils.errors.CompileError: command '/usr/bin/clang' failed with exit code 1 ---------------------------------------- ERROR: Failed building wheel for grpcio Tried with force reinstall and no-cache-dir (python -m pip install tensorflow-macos --no-cache-dir --force-reinstall) with the following error : ERROR: Command errored out with exit status 1: /Users/machine/miniforge3/envs/tf38/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-install-djre1j5j/numpy_48546adcbc9d4c558a4dc32a8e607649/setup.py'"'"'; __file__='"'"'/private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-install-djre1j5j/numpy_48546adcbc9d4c558a4dc32a8e607649/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-record-343ln54c/install-record.txt --single-version-externally-managed --prefix /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal --compile --install-headers /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal/include/python3.8/numpy Check the logs for full command output. ---------------------------------------- WARNING: Discarding https://files.pythonhosted.org/packages/a7/81/20d5d994c91ed8347efda90d32c396ea28254fd8eb9e071e28ee5700ffd5/h5py-3.1.0.tar.gz#sha256=1e2516f190652beedcb8c7acfa1c6fa92d99b42331cbef5e5c7ec2d65b0fc3c2 (from https://pypi.org/simple/h5py/) (requires-python:>=3.6). Command errored out with exit status 1: /Users/machine/miniforge3/envs/tf38/bin/python /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-standalone-pip-nmsgrvml/__env_pip__.zip/pip install --ignore-installed --no-user --prefix /private/var/folders/0k/hz9yngm56nz1htdc3c3t3d0c0000gn/T/pip-build-env-1fyu7c9t/normal --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'numpy==1.12; python_version == "3.6"' 'Cython>=0.29; python_version < "3.8"' 'numpy==1.14.5; python_version == "3.7"' 'numpy==1.19.3; python_version >= "3.9"' 'numpy==1.17.5; python_version == "3.8"' pkgconfig 'Cython>=0.29.14; python_version >= "3.8"' Check the logs for full command output. ERROR: Could not find a version that satisfies the requirement h5py~=3.1.0 (from tensorflow-macos) (from versions: 2.2.1, 2.3.0b1, 2.3.0, 2.3.1, 2.4.0b1, 2.4.0, 2.5.0, 2.6.0, 2.7.0rc2, 2.7.0, 2.7.1, 2.8.0rc1, 2.8.0, 2.9.0rc1, 2.9.0, 2.10.0, 3.0.0rc1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.3.0, 3.4.0) ERROR: No matching distribution found for h5py~=3.1.0 Could anyone point me out any solution, I'm really desperate here, as my work is completely stuck because of this. Thanks in advance.
Posted
by mrt77.
Last updated
.