Odd CPU/GPU behaviour in TF-metal on M1 Pro

Hi, I have started experimenting with using my MBP with M1 Pro (10 CPU cores / 16 GPU cores) for TensorFlow.

Two things were odd/noteworthy:

I've compared training models in a tensorflow environment with tensorflow-metal, running the code with either

  • with tf.device('gpu:0'): or
  • with tf.device('cpu:0'):

as well as in an environment without the tensorflow-metal plugin. Specifying the device as CPU in the tf-metal environment almost always leads to much longer training times than specifying the GPU, and also longer than running in the standard (non-metal) environment. In addition, the GPU was drawing quite a lot of power despite TF being told to use the CPU. Is this intended or expected behaviour? If so, it would be preferable to use the non-metal environment whenever a GPU brings no benefit.
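
Roughly how I compare the two device contexts (a sketch; build_model, X_train and Y_train stand in for my actual model factory and data):

import time
import tensorflow as tf

def time_fit(device, build_model, X_train, Y_train, **fit_kwargs):
    """Build and train a fresh model under the given device context; return wall time in seconds."""
    with tf.device(device):
        model = build_model()
        start = time.perf_counter()
        model.fit(X_train, Y_train, verbose=0, **fit_kwargs)
    return time.perf_counter() - start

# Example (same settings for both contexts):
# t_cpu = time_fit('cpu:0', build_model, X_train, Y_train, batch_size=9, epochs=200)
# t_gpu = time_fit('gpu:0', build_model, X_train, Y_train, batch_size=9, epochs=200)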

Secondly, at small batch sizes the GPU power reported in the system stats increases with the batch size, as expected. However, when changing the batch size from 9 to 10 (there seems to be a hard step at exactly this number), GPU power drops by about half and training time doubles. Increasing the batch size beyond 10 again leads to a gradual increase in GPU power; on my model, the GPU power seen at batch size 9 is only reached again at around batch size 50. This makes GPU acceleration at batch sizes between 10 and about 50 rather useless. I've noticed this behaviour on several models, which makes me suspect it is a general tf-metal behaviour. As a result, I've only been able to benefit from GPU acceleration at a batch size of 9 or above roughly 100. Once again, is this intended or to be expected?
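
The batch-size sweep, reduced to the timing part (a sketch; the GPU power itself I read off the system stats by hand):

import time
import tensorflow as tf

def sweep_batch_sizes(build_model, X_train, Y_train, batch_sizes, epochs=10):
    """Rebuild and retrain the same model at each batch size; return seconds per epoch."""
    results = {}
    for bs in batch_sizes:
        with tf.device('gpu:0'):
            model = build_model()
            start = time.perf_counter()
            model.fit(X_train, Y_train, batch_size=bs, epochs=epochs, verbose=0)
        results[bs] = (time.perf_counter() - start) / epochs
    return results

# Example: sweep_batch_sizes(build_model, X_train, Y_train, [7, 8, 9, 10, 11, 12, 25, 50, 100])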

Can you please share the training network where you are observing this?

Thanks for your reply! Different models show these observations to different degrees. First, the model showing the slower execution when using the CPU:

# Imports assumed for this snippet (layer names as in tf.keras, TF >= 2.6)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Normalization

normalizer = Normalization(input_shape=[1024,], axis=None)
normalizer.adapt(X_train)

def test_model():
  model = keras.Sequential([
      normalizer,
      Dense(64, activation='relu'),
      Dense(8, activation='relu'),
      Dense(1)
  ])

  model.compile(loss='mae',
                optimizer=tf.keras.optimizers.Adam(0.001),metrics='mae')
  return model

test=test_model()
with tf.device('cpu:0'):
    History_test=test.fit(X_train,Y_train,batch_size=9,validation_data=(X_test,Y_test),epochs=200)

This training takes 45 s in a tensorflow-metal environment, and Activity Monitor shows about 70% GPU usage, even though TF is told not to use the GPU. Running the same code in a standard tensorflow environment takes 9 seconds.

Regarding the GPU power drop: I restarted my Mac today and the issue is now much less pronounced. Previously I saw literally a 50% drop; now it is in the range of about 20%, which makes it far less of a problem. Nonetheless, here are the code and the power figures:

# Imports assumed for this snippet
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten

input1 = Input((1024,),name='input1')
input2 = Input((1024,),name='input2')
input_shape=(16,217,1)
input3 = Input((input_shape),name='input3')

norm1=layers.LayerNormalization()(input1)
norm2=layers.LayerNormalization()(input2)
dense21= Dense(8,activation='relu')(norm1)
dense22= Dense(8,activation='relu')(norm2)
dense31=Dense(32,activation='relu')(dense21)
dense32=Dense(32,activation='relu')(dense22)

conv_1=Conv2D(64, (3,3), activation='relu', padding="same",input_shape=input_shape)(input3)
maxp_1 = MaxPooling2D(pool_size = (2,2)) (conv_1)
conv_2=Conv2D(128, (3,3), activation='relu', padding="same")(maxp_1)
maxp_2 = MaxPooling2D(pool_size = (2,2)) (conv_2)
conv_2=Conv2D(128, (3,3), activation='relu', padding="same")(maxp_2)
flatten=Flatten()(conv_2)
densePL_1= Dense(128, activation='relu')(flatten)
output= Dense(1, activation='relu')(densePL_1)

concat = layers.concatenate([dense31,dense32,densePL_1])
output_2= Dense(1,activation="relu",name='pred')(concat)
model_concat_test = Model(inputs=[input1,input2,input3], outputs=[output_2])
model_concat_test.compile(loss=["mae"], optimizer="adam",metrics=["mae"])

Historytest = model_concat_test.fit(
    {"input3": X3_train, "input1": X1_train, "input2": X2_train},  # X2_train for input2 (was X1_train, presumably a typo)
    Y_train,
    batch_size=9,
    validation_data=({"input3": X3_test, "input1": X1_test, "input2": X2_test}, Y_test),
    epochs=500)

Batch size   GPU power
7            7 W
8            8 W
9            11 W
10           8 W
11           10 W
12           10 W

A follow-up comment regarding the problem mentioned above:

I changed my input pipeline from numpy arrays to tf.data.Dataset. GPU power, even at small batch sizes, increased to above 20 W, averaging about 21.5 W on my model. Prefetching and caching did not bring any additional benefit. Memory bandwidth also increased by 50%, from an average of 40 GB/s to 60 GB/s. Batch size then affected neither GPU power nor training time.
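
For reference, roughly what the switch looked like (a sketch; X1_train etc. are the numpy arrays from above, and the shuffle buffer size is arbitrary):

import tensorflow as tf

# Build a tf.data pipeline from the numpy arrays instead of passing them to fit() directly.
# (Adding .cache() gave no extra benefit here, since the arrays are already in memory.)
train_ds = (tf.data.Dataset
            .from_tensor_slices(({"input1": X1_train, "input2": X2_train, "input3": X3_train}, Y_train))
            .shuffle(buffer_size=1024)
            .batch(9)
            .prefetch(tf.data.AUTOTUNE))

val_ds = (tf.data.Dataset
          .from_tensor_slices(({"input1": X1_test, "input2": X2_test, "input3": X3_test}, Y_test))
          .batch(9))

Historytest = model_concat_test.fit(train_ds, validation_data=val_ds, epochs=500)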

I also trained on the CPU. With tf.data it is now also able to use the entire CPU at small batch sizes (previously it barely used more than 2 cores). However, even at small batch sizes, I now get 5x acceleration from the GPU.

Conclusion: in order to fully use the Apple Silicon chips for deep learning, the tf.data API appears to be essential; with it, 5x acceleration can be achieved even on relatively small (convolutional) models. Numpy input pipelines appear to be a major bottleneck. I think machine learning users would be grateful if Apple provided more comprehensive documentation highlighting such issues.

However, there are models (like the simple DNN model I posted above) that still run almost entirely on the CPU even when told to run on the GPU. Is that supposed to happen? In this case I would expect it to run faster on the CPU anyway, but there might be situations where one wants to override this.

Also, from the maximum power I've measured with various benchmarks, I believe the sustained 21-22 W GPU power consumption is probably around 85% of the maximum the GPU is capable of. Is there a way to check the "real" GPU utilisation in percent? Since putting the GPU to use requires some tweaks, I would be happy if Apple could provide some numbers to check whether one is fully utilising the GPU.

Also, if the issue I reported is truly due to numpy as the input pipeline, it would be very helpful if Apple published some methods and best practices for building an efficient input pipeline, specifically on the M1 (Pro/Max) chips.

We really do lack documentation indeed. I've had a weird case where the CPU was faster than the GPU too. ^^ I only have the M1 (non Pro/Max).

To fully disable the GPU I use this:

tf.config.set_visible_devices([], 'GPU')

Call it first, before doing anything else.
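
For example (sketch), right at the top of the script:

import tensorflow as tf

# Hide the GPU from TensorFlow so everything runs on the CPU.
tf.config.set_visible_devices([], 'GPU')

# Verify: only CPU devices should be listed now.
print(tf.config.list_logical_devices())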

You might also want to display which device is used for each operation:

tf.debugging.set_log_device_placement(True)

It's very verbose, and the first step is usually mostly CPU (function tracing).

From my experience too: don't use float16 (not faster) and don't use mixed_precision (it falls back to the CPU), at least on my M1.

Give this option a try too:

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0],True)

You did not provide a complete script that could be run on other people's computers.

However, from looking at the partial script above, I can see that the model creation and compilation happen in the default context, which is probably set to GPU, while training is then done under the CPU device. It seems that TensorFlow transfers variables between CPU and GPU all the time during training; at least that is my guess.
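
If that guess is right, building and compiling the model inside the same device context as the training should avoid the transfers (a sketch, reusing the test_model() from above):

import tensorflow as tf

# Create, compile and train the model under the same device context,
# so the variables are placed on the CPU from the start.
with tf.device('cpu:0'):
    test = test_model()
    History_test = test.fit(X_train, Y_train,
                            batch_size=9,
                            validation_data=(X_test, Y_test),
                            epochs=200)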

Besides the performance issues, there is also the problem that training results are still worse when using the GPU than with the CPU only. I test this by installing or uninstalling tensorflow-metal. Any insights on that front? (tensorflow-macos 2.12.2, tensorflow-metal 0.8.0)
