TensorFlow with Metal started giving wrong results after upgrading macOS from 12.0.1 to 12.1

After installing the tensorflow-metal PluggableDevice according to "Getting Started with tensorflow-metal PluggableDevice", I tested this DCGAN example: https://www.tensorflow.org/tutorials/generative/dcgan. Everything was working perfectly until I decided to upgrade macOS from 12.0.1 to 12.1. Before the upgrade, the final result after 50 epochs looked like this:

[picture1]

After the upgrade, it looks like this:

[picture2]

I am using:

  • TensorFlow 2.7.0
  • tensorflow-metal 0.3.0
  • Python 3.9
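
Since the behaviour in this thread depends on the exact combination of versions, a short script can collect them for a report. This is only a sketch: the helper names `installed_version` and `environment_report` are made up for illustration, and the package names are the ones mentioned in this thread. It uses only the Python standard library.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("tensorflow", "tensorflow-macos", "tensorflow-metal")):
    """Collect the Python, macOS and package versions into a dict."""
    report = {
        "python": platform.python_version(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos": platform.mac_ver()[0] or None,
    }
    for name in packages:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value if value is not None else 'not found'}")
```

Pasting this output next to a bug description makes it easier to compare setups between replies.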

I hope this question will also help Apple to improve Metal PluggableDevice. I can't wait to use it in my research.

I upgraded to 12.1 today. I just launched a DCGAN run and will let you know. However, I have another model in training (an autoencoder) and haven't noticed any difference since yesterday.

I'm still on epoch 5, on a MacBook Air M1 2020, but it looks fine to me so far. My other training runs are fine too. Maybe you just got bad luck on this run? What about the other intermediate results? Do they all look bad?

Edit: I also get some very bad results sometimes, which is weird. Is there a problem with random number generation? I have a model that relies heavily on random.uniform; I'll check.

Edit again: I need to double-check, but random generation is broken in some situations.

I wrote a minimal test case. This used to generate two different series:

import tensorflow as tf

x = tf.random.uniform((10,))
y = tf.random.uniform((10,))

tf.print(x)
tf.print(y)

Output (both series are now identical):

[0.178906798 0.8810848 0.384304762 ... 0.162458301 0.64780426 0.0123682022]
[0.178906798 0.8810848 0.384304762 ... 0.162458301 0.64780426 0.0123682022]

It works fine on Colab.

It also works fine if I disable the GPU with:

tf.config.set_visible_devices([], 'GPU')
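
The comparison above (GPU visible versus GPU hidden) can be turned into a pass/fail check. The following is a rough sketch, not an official diagnostic: `consecutive_draws_differ` is a hypothetical helper that accepts any zero-argument sampler, and the two samplers here are stand-ins built on Python's `random` module so that the example runs without TensorFlow.

```python
import random


def consecutive_draws_differ(sampler, draws=5):
    """Return True if no two back-to-back calls to sampler() give equal results."""
    previous = sampler()
    for _ in range(draws - 1):
        current = sampler()
        if current == previous:
            return False
        previous = current
    return True


def healthy_sampler():
    # Stand-in for a working random source: a fresh series on every call.
    return [random.random() for _ in range(10)]


def stuck_sampler():
    # Stand-in for the reported symptom: the same series on every call.
    return [0.178906798, 0.8810848, 0.384304762]


print("healthy:", consecutive_draws_differ(healthy_sampler))
print("stuck:", consecutive_draws_differ(stuck_sampler))
```

On a machine with TensorFlow installed, the same helper could be called with `lambda: tf.random.uniform((10,)).numpy().tolist()`, once with the GPU visible and once in a fresh process with it hidden.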

WORKAROUND:

import tensorflow as tf

# Draw from an explicit Generator instead of the global tf.random.uniform.
g = tf.random.Generator.from_non_deterministic_state()
x = g.uniform((10,))
y = g.uniform((10,))
tf.print(x)
tf.print(y)

I have the same problem with tensorflow-metal 0.3.0 and Python 3.9 when running the DCGAN example. The results converge to an almost identical picture that does not resemble a digit. I have tried several times with up to 100 epochs, and it never worked correctly. I have a MacBook Pro M1 2020 running macOS 12.1. The problem seems to be specific to version 12.1 of the operating system.

This issue has been addressed and fixed in tensorflow-metal==0.5.0.

For me it still occurs on Monterey 12.3.1 with the newest versions:

tensorflow-metal==0.5.0
tensorflow-macos==2.9.2

For example, this code still prints the same value every time:

import tensorflow as tf

class CustomLayer(tf.keras.layers.Layer):
    """Pass-through layer that prints one random scalar per call."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, x, training=None):
        # On the affected setup this prints the same value on every call.
        a = tf.random.uniform([])
        tf.print(a)
        return x

# Use only the first 10 MNIST samples to keep the reproduction fast.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test, y_train, y_test = x_train[:10], x_test[:10], y_train[:10], y_test[:10]

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    CustomLayer(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, batch_size=1)
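
One thing that might be worth trying is to combine this layer with the `tf.random.Generator` workaround posted earlier in the thread. The sketch below is an assumption, not a confirmed fix: `GeneratorLayer` is a hypothetical variant of `CustomLayer`, and nobody in this thread has reported whether it behaves correctly on tensorflow-metal. It only shows how a layer can own its generator.

```python
import tensorflow as tf


class GeneratorLayer(tf.keras.layers.Layer):
    """Hypothetical variant of CustomLayer that draws from its own Generator."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Create the generator once, outside any tf.function, because it
        # keeps its RNG state in a tf.Variable.
        self.rng = tf.random.Generator.from_non_deterministic_state()

    def call(self, x, training=None):
        # Each call advances the generator state, so the value should change.
        a = self.rng.uniform([])
        tf.print(a)
        return x


layer = GeneratorLayer()

# Two direct draws from the layer's generator, as in the workaround above.
first = layer.rng.uniform((10,))
second = layer.rng.uniform((10,))
tf.print(first)
tf.print(second)

# The layer itself passes its input through unchanged.
output = layer(tf.zeros((1, 4)))
```

If the printed value still repeats with this layer in the model, that would suggest the problem is not limited to the global `tf.random.uniform` op.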

I am new to deep learning, so at first I thought it must be an error in my code. After testing with a code example that I know works, I discovered that I have a very similar issue. The produced images are nonsensical, often just some noise centred around the middle of the image. I am using the latest version of TensorFlow (2.9.2), the Metal plugin (0.5.0), and macOS Monterey (12.3).

The code I used is here: https://github.com/PacktPublishing/Deep-Learning-with-TensorFlow-2-and-Keras/blob/master/Chapter%206/VanillaGAN.ipynb. Below are images of the results using the Metal plugin at epochs 1 and 5 (higher epoch counts have also been tested).

Running this code in an environment using standard TensorFlow (without the macOS/Metal plugin) does not produce the same error. It also works fine on Google Colab.
