GPU much slower than CPU for LSTMs and bidirectional in TensorFlow 2.8

I am trying to run the notebook https://www.tensorflow.org/text/tutorials/text_classification_rnn from the TensorFlow website.

The code uses LSTM and Bidirectional layers.

With the GPU enabled, training takes 56 minutes/epoch.

When using only the CPU, it takes 264 seconds/epoch.
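For reference, the model in that notebook is roughly the following. This is a minimal sketch: the real notebook builds a TextVectorization encoder over the IMDB reviews dataset, so VOCAB_SIZE and the random batch here are placeholders.

```python
import numpy as np
import tensorflow as tf

# Minimal sketch of the tutorial's architecture. VOCAB_SIZE and the
# random integer batch stand in for the notebook's TextVectorization
# encoder and the IMDB reviews dataset.
VOCAB_SIZE = 1000

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# A dummy batch: 2 sequences of length 10 (no zeros, so nothing is masked).
logits = model(np.random.randint(1, VOCAB_SIZE, size=(2, 10)))
print(logits.shape)  # (2, 1)
```

It is the Bidirectional(LSTM(...)) layer in this stack that shows the slowdown on the GPU.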

I am using a 14-inch MacBook Pro (10 CPU cores, 16 GPU cores) with tensorflow-macos 2.8 and tensorflow-metal 0.5.0. I face the same problem with tensorflow-macos 2.9 too.

My environment has:

  • tensorflow-macos 2.8.0
  • tensorflow-metal 0.5.0
  • tensorflow-text 2.8.1
  • tensorflow-datasets 4.6.0
  • tensorflow-deps 2.8.0
  • tensorflow-hub 0.12.0
  • tensorflow-metadata 1.8.0

When I use CNNs, the GPU is fully utilized and 3-4 times faster than the CPU alone.

Any idea where the problem lies when using LSTM and Bidirectional layers?

Post a bug report on the TensorFlow GitHub project.

Same issue here: link

Hi @vasileiosgk

Thanks for reporting the problem and providing a script to reproduce it. I'll take a look at the issue.

My initial read is that the Bidirectional LSTM ends up falling back to the unfused, Python-level implementation of the operation, which is unfortunately intolerably slow when called with the pluggable device (GPU) at the moment. However, the kernel here should be taking the faster path, since it satisfies the "cuDNN conditions" described in the TensorFlow documentation for the op, which allow the fused implementation to be used. So this looks like a bug on the tensorflow-metal side. I'll update here once I've confirmed this to be the case.
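For reference, the "cuDNN conditions" can be checked mechanically. This hypothetical helper (not part of any TensorFlow API) tests a keyword-argument dict against the kwarg-level requirements listed in the tf.keras.layers.LSTM documentation; the docs additionally require that inputs are unmasked or strictly right-padded, and that the outermost layer runs eagerly, which cannot be seen from the kwargs alone.

```python
# Kwarg-level requirements for the fused (cuDNN-style) LSTM kernel,
# per the tf.keras.layers.LSTM documentation. All of these are the
# layer's defaults, so a plain LSTM(units) qualifies.
FUSED_DEFAULTS = {
    "activation": "tanh",
    "recurrent_activation": "sigmoid",
    "recurrent_dropout": 0.0,
    "unroll": False,
    "use_bias": True,
}

def meets_fused_conditions(kwargs):
    """Return True if these LSTM kwargs permit the fused implementation.

    Hypothetical helper for illustration: any key absent from `kwargs`
    is assumed to take its (fused-compatible) default value.
    """
    return all(kwargs.get(key, default) == default
               for key, default in FUSED_DEFAULTS.items())

print(meets_fused_conditions({}))                           # True
print(meets_fused_conditions({"recurrent_dropout": 0.2}))   # False
print(meets_fused_conditions({"activation": "relu"}))       # False
```

The tutorial's model uses plain `LSTM(64)` with all defaults, so it should indeed be eligible for the fused path.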

Hello!

Do you have any update regarding this issue? It looks like LSTMs, and sequential networks in general, are very slow and do not give correct results when using the GPU, as other users have reported here in the forum for various versions of tf-metal.

It would be great if this bug in tf-metal were addressed.

Thank you!

I tried running the notebook with the new tensorflow-metal version 0.5.1.

The problem persists when training on the GPU. Things are fine when using the CPU only.

I tried the following combinations:

  • tensorflow-macos 2.8.0, tensorflow-metal 0.5.1, python 3.8.13
  • tensorflow-macos 2.9.0, tensorflow-metal 0.5.1, python 3.9.13
  • tensorflow-macos 2.9.2, tensorflow-metal 0.5.1, python 3.8.13
  • tensorflow-macos 2.9.2, tensorflow-metal 0.5.1, python 3.9.13

Can you confirm you face the same issue?

Can you please fix the bug in tensorflow-metal? Sequential networks are really important, and GPU-accelerated training would be a real benefit.

I tried running the notebook with tensorflow-macos 2.10.0, tensorflow-metal 0.6.0, and python 3.9.13, and the problem with the GPU still persists.

Things are fine when using the CPU only.
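Until this is fixed, one stopgap (my suggestion, not an official workaround) is to hide the GPU from TensorFlow at the start of the script, so the LSTM layers run on the currently faster CPU path:

```python
import tensorflow as tf

# Stopgap while the tensorflow-metal LSTM bug is open: hide the GPU so
# training stays on the CPU path. This must run before any op has
# initialized the devices (i.e., at the very top of the script).
tf.config.set_visible_devices([], "GPU")

print(tf.config.get_visible_devices("GPU"))  # []
```

This only affects the current process, so no environment changes or reinstalls are needed to switch back.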

The code I am running https://www.tensorflow.org/text/tutorials/text_classification_rnn is from the official TensorFlow website.

Can you confirm you face the same issue and please fix the bug in tensorflow-metal? Sequential networks using LSTMs are really basic and important. 

Same issue here, unfortunately. Is there any news?
