AdamW crashes on tf-macos (tf-nightly-macos 2.14.0-dev20230523)

I initially raised this issue on the TensorFlow forum, and they directed me back here since this is a tf-macos-specific problem [see https://github.com/tensorflow/tensorflow/issues/60673].

When calling Model.compile() with the AdamW optimizer, a warning is thrown saying that v2.11+ optimizers have a known slowdown on M1/M2 devices, and so the backend attempts to fall back to a legacy version. However, no legacy version of the AdamW optimizer exists. In the previous tf-macos version (2.12), this led to an error during Model.compile() [see issue #60652 and https://developer.apple.com/forums/thread/729732]. In the current nightly, this error is no longer thrown; however, after calling model.compile(), the attribute model.optimizer is set to the string 'adamw' instead of an optimizer object.

Later, when we call model.fit(), this leads to an AttributeError, because model.optimizer.minimize() does not exist when model.optimizer is a string.
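
The broken state can be confirmed immediately after compiling (a minimal check, assuming the same imports and model as in the repro below):

# On the affected nightly, model.optimizer is the string 'adamw'
# rather than an AdamW optimizer instance.
model.compile(loss="mse", optimizer=AdamW(learning_rate=1e-3, weight_decay=1e-2))
print(type(model.optimizer))  # <class 'str'> on the affected builds
print(model.optimizer)        # prints 'adamw' instead of an AdamW object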

Expected behaviour: correctly compile the model with either a v2.11+ optimizer (without the slowdown) or a legacy-compatible implementation of the AdamW optimizer, so that the model then trains with a valid AdamW optimizer when calling model.fit().

Note: the warning message suggests using the optimizer located at tf.keras.optimizers.legacy.AdamW, but no such optimizer exists.
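
This is easy to verify by listing the legacy namespace; Adam is there, AdamW is not:

# List the optimizers actually available under the legacy namespace:
import tensorflow as tf
print([n for n in dir(tf.keras.optimizers.legacy) if not n.startswith("_")])
# 'Adam' appears in the list; 'AdamW' does not.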

It would be nice to either be able to use modern optimizers or to have a legacy-compatible version of AdamW, since decoupled weight decay is an important tool in modern ML research and currently cannot be used on Mac.

Standalone code to reproduce the issue

##===========##
##  Imports  ##
##===========##

import sys

import tensorflow as tf
import numpy as np

from tensorflow.keras.models     import Model
from tensorflow.keras.layers     import Input, Dense
from tensorflow.keras.optimizers import AdamW

##===================##
##  Report versions  ##
##===================##
#
# Expected outputs:
# Python version is: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:01:19) [Clang 14.0.6 ]
# TF version is: 2.14.0-dev20230523
# Numpy version is: 1.23.2
#

print(f"Python version is: {sys.version}")
print(f"TF version is: {tf.__version__}")
print(f"Numpy version is: {np.__version__}")

##==============================##
##  Create a very simple model  ##
##==============================##
#
# Expected outputs:
# Model: "model_1"
# _________________________________________________________________
#  Layer (type)                Output Shape              Param #   
# =================================================================
#  Layer_in (InputLayer)       [(None, 2)]               0         
#                                                                 
#  Layer_hidden (Dense)        (None, 10)                30        
#                                                                 
#  Layer_out (Dense)           (None, 2)                 22        
#                                                                 
# =================================================================
# Total params: 52 (208.00 Byte)
# Trainable params: 52 (208.00 Byte)
# Non-trainable params: 0 (0.00 Byte)
# _________________________________________________________________
#

x_in  = Input(2 , dtype=tf.float32, name="Layer_in"    )
x     = x_in
x     = Dense(10, dtype=tf.float32, name="Layer_hidden", activation="relu"  )(x)
x     = Dense(2 , dtype=tf.float32, name="Layer_out"   , activation="linear")(x)
model = Model(x_in, x)
model.summary()

##===================================================##
##  Compile model with MSE loss and AdamW optimizer  ##
##===================================================##
#
# Expected outputs:
# WARNING:absl:At this time, the v2.11+ optimizer `tf.keras.optimizers.AdamW` runs slowly on M1/M2 Macs, please use the legacy Keras optimizer instead, located at `tf.keras.optimizers.legacy.AdamW`.
# WARNING:absl:There is a known slowdown when using v2.11+ Keras optimizers on M1/M2 Macs. Falling back to the legacy Keras optimizer, i.e., `tf.keras.optimizers.legacy.AdamW`.
#

model.compile(
    loss      = "mse", 
    optimizer = AdamW(learning_rate=1e-3, weight_decay=1e-2)
)

##===========================##
##  Generate some fake data  ##
##===========================##
#
# Expected outputs:
# X shape is (100, 2), Y shape is (100, 2)
#

dataset_size = 100
X = np.random.normal(size=(dataset_size, 2))
X = tf.constant(X, dtype=tf.float32)
Y = np.random.normal(size=(dataset_size, 2))
Y = tf.constant(Y, dtype=tf.float32)

print(f"X shape is {X.shape}, Y shape is {Y.shape}")

##===================================##
##  Fit model to data for one epoch  ##
##===================================##
#
# Expected outputs:
# ---------------------------------------------------------------------------
# AttributeError                            Traceback (most recent call last)
# Cell In[9], line 51
#       1 ##===================================##
#       2 ##  Fit model to data for one epoch  ##
#       3 ##===================================##
#    (...)
#      48 #       • mask=None
#      49 #
# ---> 51 model.fit(X, Y, epochs=1)

# File ~/miniforge3/envs/tf_macos_nightly_230523/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
#      67     filtered_tb = _process_traceback_frames(e.__traceback__)
#      68     # To get the full stack trace, call:
#      69     # `tf.debugging.disable_traceback_filtering()`
# ---> 70     raise e.with_traceback(filtered_tb) from None
#      71 finally:
#      72     del filtered_tb

# File /var/folders/6_/gprzxt797d5098h8dtk22nch0000gn/T/__autograph_generated_filezzqv9k36.py:15, in outer_factory.<locals>.inner_factory.<locals>.tf__train_function(iterator)
#      13 try:
#      14     do_return = True
# ---> 15     retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
#      16 except:
#      17     do_return = False

# AttributeError: in user code:

#     File "/Users/Ste/miniforge3/envs/tf_macos_nightly_230523/lib/python3.10/site-packages/keras/src/engine/training.py", line 1338, in train_function  *
#         return step_function(self, iterator)
#     File "/Users/Ste/miniforge3/envs/tf_macos_nightly_230523/lib/python3.10/site-packages/keras/src/engine/training.py", line 1322, in step_function  **
#         outputs = model.distribute_strategy.run(run_step, args=(data,))
#     File "/Users/Ste/miniforge3/envs/tf_macos_nightly_230523/lib/python3.10/site-packages/keras/src/engine/training.py", line 1303, in run_step  **
#         outputs = model.train_step(data)
#     File "/Users/Ste/miniforge3/envs/tf_macos_nightly_230523/lib/python3.10/site-packages/keras/src/engine/training.py", line 1084, in train_step
#         self.optimizer.minimize(loss, self.trainable_variables, tape=tape)

#     AttributeError: 'str' object has no attribute 'minimize'

model.fit(X, Y, epochs=1)

Hello! I have the same problem, but I figured out how to turn off the fallback behavior: just delete the is_arm_mac() call from the condition here: https://github.com/keras-team/keras/blob/5849a0953a644bd6af51b672b32a235510d4f43d/keras/optimizers/__init__.py#LL301C9-L301C9

Of course, this is a temporary and clumsy solution; I'd like to contribute a proper fix. Here is the issue in the Keras repo: https://github.com/keras-team/keras/issues/18224
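
If you'd rather not edit the installed package, the same effect should be achievable with a monkey-patch. A sketch, assuming is_arm_mac is resolved as a module-level name in keras.optimizers (this is version-dependent, hence the hasattr guard):

# Runtime monkey-patch sketch: disable the ARM-Mac check so compile()
# keeps the v2.11+ AdamW instead of falling back to a legacy optimizer
# that does not exist. Verify the helper exists in your Keras version.
import keras.optimizers as keras_optimizers

if hasattr(keras_optimizers, "is_arm_mac"):
    keras_optimizers.is_arm_mac = lambda: False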

Hi, thanks for the suggestion! I was able to follow the same procedure in v2.12 and it now runs, albeit with the known slowdown.

I had some trouble in v2.13, as it looks like they have restructured the code a little. I also tried to implement a legacy-style version of AdamW based on the legacy Adam class that already exists, but unfortunately this also ground to a halt, because at some point the Python class calls an ApplyAdam function that is compiled in C++ (I can modify C++, but I don't fancy trying to modify and recompile TensorFlow!).
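
For reference, the direction I was exploring looked roughly like the untested sketch below: it applies the decoupled weight decay as a separate Python step before delegating to the stock legacy Adam update, so the compiled ApplyAdam kernel itself is untouched. The overridden hooks (_resource_apply_dense, _decayed_lr) are internals of the legacy OptimizerV2 API and may differ across versions.

import tensorflow as tf

class LegacyAdamW(tf.keras.optimizers.legacy.Adam):
    """Untested sketch of a legacy-style AdamW (illustration only)."""

    def __init__(self, weight_decay=1e-2, **kwargs):
        super().__init__(**kwargs)
        self.weight_decay = weight_decay

    def _resource_apply_dense(self, grad, var, apply_state=None):
        # AdamW-style decoupled decay: var <- var - lr * wd * var,
        # applied before the standard (compiled) Adam step.
        lr = self._decayed_lr(var.dtype.base_dtype)
        var.assign_sub(lr * self.weight_decay * var)
        return super()._resource_apply_dense(grad, var, apply_state)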

I would also be happy to help contribute to a more durable solution. The TF forum directed me back here, so let's see if your Keras issue has any luck; I am not 100% sure who is in charge of TF/Apple compatibility.

I found a workaround to make AdamW work on Apple Silicon using the latest versions of tensorflow and tensorflow-addons.

All you need to do is import AdamW from tensorflow_addons.optimizers instead and you should be good.
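
For example, reusing the toy model from the original post, the compile step becomes:

# Use the AdamW implementation from TensorFlow Addons
# (pip install tensorflow-addons) instead of tf.keras.optimizers.AdamW.
from tensorflow_addons.optimizers import AdamW

model.compile(
    loss      = "mse",
    optimizer = AdamW(learning_rate=1e-3, weight_decay=1e-2),
)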

I'm not using it with tensorflow-metal though; there's a huge performance impact (at least 4x slower) 👎

(btw, the "There is a known slowdown on M1/M2 devices... falling back..." message won't appear anymore)

Hope this helps! 🙂

I second this. AdamW is so important these days, as it correctly fixes weight decay. If a quick fix is difficult, it'd be great to simply not fall back to a legacy AdamW; a slow optimizer is much better than a crash.
