Tensorflow-metal giving garbage results with GAN

I recently wrote some code for a basic GAN (I am learning about neural networks, so I'm not an expert) and got very strange results. Unable to debug it, I tested someone else's code that I know works, and still got the same results. When running a GAN to generate digits from the MNIST dataset, the images produced each epoch are identical to each other and don't resemble digits at all. An example of the images produced can be seen below.

Rerunning the same code on Google Colab, and locally on my machine with standard TensorFlow (i.e. without the metal plugin), gives the expected results: images resembling digits.

The code I used to test this can be found here: https://github.com/PacktPublishing/Deep-Learning-with-TensorFlow-2-and-Keras/blob/master/Chapter%206/VanillaGAN.ipynb

I am using these versions of the relevant software: tensorflow-metal 0.5.0, tensorflow-macos 2.9.2, macOS Monterey 12.3.

I would be grateful if Apple engineers could advise, or give a timeframe for a solution, please.

Answered by Frameworks Engineer in 717955022

Accepted Answer

Hi @90jtip

Thanks for reporting this issue and providing the script to reproduce it. This does look like something going wrong in the GPU implementation on this network. Based on a quick look, the GAN uses fairly simple layers, so I'm hoping the debugging process won't get too cumbersome. I can't make any promises on the timeline of a fix, but in the meantime you can use a with tf.device('CPU:0'): block to enclose parts of your code (or the whole script) and force it to run on the CPU. Since the issue is most likely in the GPU implementation, this should sort out the correctness problem, at the cost of longer runtimes from not using the GPU, but hopefully it will be enough while you're learning and working through the examples in the book.
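
To make that concrete, a minimal sketch of the workaround might look like this (the model and data below are placeholders standing in for the GAN construction and training loop from the notebook):

import tensorflow as tf

# Pin everything built and executed inside this block to the CPU,
# sidestepping the suspect Metal GPU kernels. 'CPU:0' always exists,
# so this is safe on any machine.
with tf.device('CPU:0'):
    # Placeholder model and data -- substitute the actual GAN code here.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    noise = tf.random.normal((64, 100))
    labels = tf.ones((64, 1))
    model.fit(noise, labels, epochs=1, verbose=0)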

Many thanks for your reply.

Currently I am using standard TensorFlow (without the metal plugin, running on the CPU) without any issues. I haven't tried your workaround, but I think you're right that it's the GPU calculations going wrong, so it makes sense that using with tf.device('CPU:0'): would work too. So at least there are two workarounds for anyone else who encounters this issue.
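
For anyone landing on this later: besides the with tf.device('CPU:0'): block, TensorFlow's standard tf.config API can hide the GPU entirely at startup, which should have the same effect without uninstalling the plugin. I haven't tested this myself, so treat it as a sketch:

import tensorflow as tf

# Hide all GPUs from TensorFlow before building any models or ops,
# so everything falls back to the CPU. This must run first, before
# any GPU op has been created.
tf.config.set_visible_devices([], 'GPU')

# Sanity check: this should now print an empty list.
print(tf.config.get_visible_devices('GPU'))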

Thank you for your help!
