I trained the CRNN model described here (https://keras.io/examples/vision/handwriting_recognition/) in TensorFlow 2.8, both with and without tensorflow-metal 0.4. The model has 424,081 trainable parameters. Regardless of batch size, the GPU is always much slower than the CPU, as shown in the graph below. Surprisingly, GPU training gets even slower as the batch size increases.
Please let me know how I can make GPU training much faster than CPU training.
System: M1 Max 64GB, macOS 12.2.1.
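A minimal sketch of how a CPU-versus-GPU comparison across batch sizes could be timed; it is not the script behind the graph. The helpers best_epoch_seconds, benchmark and make_tf_run_epoch are hypothetical names, as are the build_model and dataset_for callables that stand in for the Keras example's model and data pipeline. Selecting the device with tf.device is an assumption; the numbers above were obtained by running with and without tensorflow-metal installed.

```python
import time


def best_epoch_seconds(run_epoch, repeats=3, warmup=1):
    """Call run_epoch() several times and return the fastest timed run.

    Warm-up calls are discarded so one-off costs such as graph tracing
    do not distort the comparison.
    """
    for _ in range(warmup):
        run_epoch()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_epoch()
        timings.append(time.perf_counter() - start)
    return min(timings)


def benchmark(make_run_epoch, devices, batch_sizes, num_samples):
    """Return {(device, batch_size): samples per second}.

    make_run_epoch(device, batch_size) is a hypothetical factory that
    returns a zero-argument callable which trains for one epoch.
    """
    results = {}
    for device in devices:
        for batch_size in batch_sizes:
            run_epoch = make_run_epoch(device, batch_size)
            seconds = best_epoch_seconds(run_epoch)
            results[(device, batch_size)] = num_samples / seconds
    return results


def make_tf_run_epoch(build_model, dataset_for):
    """Assumed TensorFlow wiring; build_model and dataset_for are placeholders."""
    import tensorflow as tf  # imported lazily so the timing helpers need only the stdlib

    def factory(device, batch_size):
        with tf.device(device):
            model = build_model()  # should return a compiled Keras model
        data = dataset_for(batch_size)  # should return a batched tf.data.Dataset

        def run_epoch():
            with tf.device(device):
                model.fit(data, epochs=1, verbose=0)

        return run_epoch

    return factory
```

With TensorFlow installed, a call such as benchmark(make_tf_run_epoch(build_model, dataset_for), ["/CPU:0", "/GPU:0"], [16, 32, 64, 128], num_samples) would yield one throughput figure per device and batch size, which makes the slowdown at larger batch sizes easy to tabulate.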
P.S. tensorflow-metal versions prior to 0.4 showed differences in the loss trajectory between CPU and Metal. I am happy to report that this has been resolved; see the graph below.