I saw an improvement on my MacBook with macOS 12.1 (compared to 12.0.1). Previously I saw over 12 GB of memory used by an idle Python runtime. Your example shows that it is not really solved.
A comment to add regarding the mentioned problem:
I changed my input pipeline from plain NumPy arrays to tf.data.Dataset. GPU power, even at small batch sizes, increased to above 20 W, averaging about 21.5 W on my model. Prefetching and caching did not bring any additional benefit. The memory bandwidth also increased by 50%, from an average of 40 GB/s to 60 GB/s. The batch size then changed neither GPU power nor training time.
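For reference, the change was essentially the following (a minimal sketch, assuming X_train and Y_train are the in-memory NumPy arrays that were previously passed to model.fit() directly; the batch size is illustrative):

import tensorflow as tf

# Wrap the NumPy arrays in a tf.data pipeline instead of feeding them to fit() directly.
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, Y_train))
            .shuffle(buffer_size=len(X_train))
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

# model.fit(train_ds, epochs=...) then consumes the dataset instead of the raw arrays.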
I also ran the training on the CPU. With tf.data it is now able to use the full CPU power even at small batch sizes (previously, it barely used more than 2 cores). Even so, at small batch sizes I now get a 5x speed-up with the GPU.
Conclusion: to fully use the Apple Silicon chips for deep learning, the tf.data API is essentially mandatory, but with it a 5x speed-up can be achieved even on relatively small (convolutional) models. NumPy pipelines appear to be a major bottleneck. I think machine learning users would be grateful if Apple provided more comprehensive documentation that highlights such issues.
However, there are models (such as the simple DNN model I posted above) that still run almost entirely on the CPU even when explicitly placed on the GPU. Is that supposed to happen? In this case I would expect it to run faster on the CPU anyway, but there may be situations where one wants to override this behaviour.
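By "explicitly placed on the GPU" I mean pinning the fit call with tf.device, as in this sketch (the same pattern as the CPU-pinned run further below, just with the device string swapped):

with tf.device('/GPU:0'):
    # Same fit call as in the CPU example below, only pinned to the GPU.
    History_gpu = test.fit(X_train, Y_train, batch_size=9,
                           validation_data=(X_test, Y_test), epochs=200)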
Also, judging from the maximum power I've measured with various benchmarks, I believe that 21-22 W of GPU power sustained over a longer period is probably about 85% of what the GPU is capable of. Is there a way to check the "real" GPU utilisation in percent? Since putting the GPU to full use requires some tweaks, I would be happy if Apple could provide some figures to check whether one is fully utilising the GPU.
Also, if the issue I reported is really caused by the NumPy input pipeline, it would be very helpful if Apple published best practices on how to build an efficient input pipeline, specifically for the M1 (Pro/Max) chips.
Thanks for your reply! Different models show these observations to different degrees. First, the model that shows slower execution when pinned to the CPU:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Normalization

normalizer = Normalization(input_shape=[1024,], axis=None)
normalizer.adapt(X_train)

def test_model():
    model = keras.Sequential([
        normalizer,
        Dense(64, activation='relu'),
        Dense(8, activation='relu'),
        Dense(1)
    ])
    model.compile(loss='mae',
                  optimizer=tf.keras.optimizers.Adam(0.001),
                  metrics='mae')
    return model

test = test_model()

# Pin the training step to the CPU.
with tf.device('cpu:0'):
    History_test = test.fit(X_train, Y_train, batch_size=9,
                            validation_data=(X_test, Y_test), epochs=200)
This training takes 45 s in a tensorflow-metal environment, and Activity Monitor shows 70% GPU usage even though it is told not to use the GPU. Running the same code in a standard tensorflow environment takes 9 seconds.
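To check where operations are actually placed, TensorFlow's device placement logging can be enabled before the model is built. A minimal sketch (this only shows the placement TensorFlow reports for each op, not how busy the GPU really is):

import tensorflow as tf

# Print the device every operation is assigned to; enable this before building the model.
tf.debugging.set_log_device_placement(True)

# List the devices TensorFlow sees; on Apple Silicon with tensorflow-metal this should
# include both a CPU and a GPU entry.
print(tf.config.list_physical_devices())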
Regarding the GPU power drop: I restarted my Mac today and the issue seems much less pronounced. Previously I saw a drop of literally 50%; now I see something in the range of 20%, which makes it far less of a problem. Nonetheless, here are the code and the power figures:
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten

input1 = Input((1024,), name='input1')
input2 = Input((1024,), name='input2')
input_shape = (16, 217, 1)
input3 = Input(input_shape, name='input3')

# Two dense branches on the 1D inputs
norm1 = layers.LayerNormalization()(input1)
norm2 = layers.LayerNormalization()(input2)
dense21 = Dense(8, activation='relu')(norm1)
dense22 = Dense(8, activation='relu')(norm2)
dense31 = Dense(32, activation='relu')(dense21)
dense32 = Dense(32, activation='relu')(dense22)

# Convolutional branch on the 2D input
conv_1 = Conv2D(64, (3, 3), activation='relu', padding="same", input_shape=input_shape)(input3)
maxp_1 = MaxPooling2D(pool_size=(2, 2))(conv_1)
conv_2 = Conv2D(128, (3, 3), activation='relu', padding="same")(maxp_1)
maxp_2 = MaxPooling2D(pool_size=(2, 2))(conv_2)
conv_3 = Conv2D(128, (3, 3), activation='relu', padding="same")(maxp_2)
flatten = Flatten()(conv_3)
densePL_1 = Dense(128, activation='relu')(flatten)
output = Dense(1, activation='relu')(densePL_1)  # not connected to the model output below

# Merge all branches into a single regression output
concat = layers.concatenate([dense31, dense32, densePL_1])
output_2 = Dense(1, activation="relu", name='pred')(concat)

model_concat_test = Model(inputs=[input1, input2, input3], outputs=[output_2])
model_concat_test.compile(loss=["mae"], optimizer="adam", metrics=["mae"])

Historytest = model_concat_test.fit(
    {"input3": X3_train, "input1": X1_train, "input2": X2_train}, Y_train,
    batch_size=9,
    validation_data=({"input3": X3_test, "input1": X1_test, "input2": X2_test}, Y_test),
    epochs=500)
Batch size - GPU power:
7 - 7 W
8 - 8 W
9 - 11 W
10 - 8 W
11 - 10 W
12 - 10 W
You do not specify the batch size in your fit() arguments. Try setting it to a high number.
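For example (a sketch with placeholder variable names; Keras uses a default batch size of 32 when none is given):

History = model.fit(X_train, Y_train,
                    batch_size=512,  # try a large value; the default is 32 if omitted
                    validation_data=(X_test, Y_test), epochs=200)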