I can report the same behavior here. While many small vanilla models train fine (with Input, Dense, and Dropout layers only), once you add something more advanced such as RNNs, the model stops learning on the Metal GPU. I observed the same behavior with both built-in and custom loss functions.
Once I added:
tf.config.set_visible_devices([], 'GPU') (others suggested uninstalling tensorflow-metal altogether)
I got a decreasing loss and improving predictions, as expected.
Is this issue being investigated on the Apple side? I am considering returning my M1 Max at this point...
Additional Details.
I narrowed the conditions for the issue down to the use of weights when creating the tf.data.Dataset:
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train, weights))
Once the weights are removed (train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))) OR we use a non-custom loss function, learning proceeds.
Does the Dev Team have any insight into this?
Further Detail:
The issue is only present when running on the GPU (M1 Max in my case). Once I block the GPU (with tf.config.set_visible_devices([], 'GPU')), everything works as expected!
This is definitely an implementation issue. Could Apple comment on it, please?
thx,
Mac OS 12.1
tf 2.7
metal 0.3
will provide the example syntax on Saturday
The whole point is that the squeeze seems not to be performed when:
using built-in loss functions, or
not running on the M1 GPU, or
weights are not used in the tf.data.Dataset definition.
Maybe broadcasting is broken in this scenario, or the weights are applied at the wrong moment.
Also, it might be that the custom loss function definition does not assume the existence of the weights while the built-in one does. But if so, then why:
it works with the same custom functions on CPU, on non-M1 Macs, and on Windows
the same error happens when I define the loss function as a class inheriting from Loss
I also added this to an already reported case where TF does not train on the M1 GPU but does on its CPU with no changes to the code. Maybe those are related.
I pinpointed the issue further. The core of the issue is the dimensionality of the weight vector provided to the dataset. In my non-M1 implementations it was (None,) and it worked well. On M1, I need to change it to (None, 1). That said, the change is only required when:
we work with 3-dimensional data (possibly just the output, i.e. the input can be of any dimension, but I did not test that; possibly the dimensionality must be increased further as the dimensionality of the data increases, also not tested), and
we use either a custom loss function or we wrap the built-in one in a custom loss wrapper (using a class or a def has the same effect)
The odd behavior is that my initial explorations as well as my research code work well on the M1 CPU without any modification. The code below fails under the above conditions on both the M1 CPU and GPU. I have not investigated it further.
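To make the shape question above concrete, here is a minimal NumPy-only sketch of how a (N,) weight vector and a (N, 1) weight vector broadcast against a per-sample loss of shape (N, 2), which is the shape you get when an error on (N, 2, 2) targets is averaged over the last axis. This only illustrates plain broadcasting rules; it is an assumption that something similar is involved in the behavior reported here, and the names (per_sample_loss, try_weighting) are made up for the example.

```python
import numpy as np

n = 5  # hypothetical batch size

# Stand-in for a loss already reduced over the last axis of (n, 2, 2) targets.
per_sample_loss = np.ones((n, 2))

w_flat = np.ones(n)        # shape (n,), as used in the non-M1 setups
w_col = np.ones((n, 1))    # shape (n, 1), the variant that works on M1


def try_weighting(loss, w):
    """Return the shape of loss * w, or None if the shapes do not broadcast."""
    try:
        return (loss * w).shape
    except ValueError:
        return None


flat_result = try_weighting(per_sample_loss, w_flat)
col_result = try_weighting(per_sample_loss, w_col)

print("weights (n,):  ", flat_result)
print("weights (n, 1):", col_result)
```

With plain broadcasting, the (n,) vector is aligned against the trailing axis of the loss, so it does not line up with the batch axis, while the (n, 1) vector does. A framework therefore has to expand or squeeze the weights before multiplying, and that step is a plausible place for behavior to differ between backends.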
I also worked with TF 2.8 and experienced the same behavior.
Thx for looking into that. The expected solution is either aligning the behavior across environments, or further investigating the required structure of the weight vector and updating the documentation.
Here is the syntax:
# TEST CONDITIONS:
# breaking conditions: 1,1,3,1,1 and 1,1,3,1,2
dataset_weight = 1 # 0 No, 1 Yes
dw_type = 1 # 1 one-dimensional (N,), 2 two-dimensional (N, 1)
data_shape = 3 # 2 two-dimensional, 3 three-dimensional
gpu = 1 # 0 No, 1 Yes
loss = 1 # 0 built-in loss, 1 custom loss, 2 pseudo custom loss
import numpy as np
import pandas as pd
import sys
"""
if 'tensorflow' in sys.modules:
    print("tensorflow uploaded")
    del sys.modules["tensorflow"]
    del tf
    import tensorflow as tf
else:
    print("tensorflow not uploaded")
    import tensorflow as tf
if gpu == 1:
    pass
else:
    tf.config.set_visible_devices([], 'GPU')
#print("GPUs:", tf.config.list_physical_devices('GPU'))
print("GPUs:", tf.config.list_logical_devices('GPU'))
#print("CPUs:", tf.config.list_physical_devices('CPU'))
print("CPUs:", tf.config.list_logical_devices('CPU'))
"""
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import backend as K
import tensorflow as tf
print("TensorFlow version:", tf.__version__)
batch = 128
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
dataset = pd.read_csv(url, names=column_names,
                      na_values='?', comment='\t',
                      sep=' ', skipinitialspace=True).dropna()
if data_shape == 2:
    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1,2)
    y_train = np.array(dataset[['MPG','Displacement']]).reshape(-1,2)
else:
    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1,2,2)
    y_train = np.array(dataset[['MPG','Displacement']]).reshape(-1,2,2)
if dw_type == 2:
    weight = np.expand_dims(np.ones(x_train.shape[0]), axis = 1)
else:
    weight = np.ones(x_train.shape[0])
#print(dataset)
print(x_train.shape)
print(y_train.shape)
print(weight.shape)
if dataset_weight == 0:
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)
else:
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train, weight)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)
model = Sequential([
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(2)])
loss_tf = tf.keras.losses.MeanSquaredError()
def custom_loss(y_true, y_pred):
    error = y_true-y_pred
    sqr_error = K.square(error)
    mean_sqr_error = K.mean(sqr_error)
    sqrt_mean_sqr_error = K.sqrt(mean_sqr_error)
    return sqrt_mean_sqr_error
def pseudo_custom_loss(y_true, y_pred):
    return loss_tf(y_true, y_pred)
if loss == 0:
    model.compile(optimizer='adam', loss=loss_tf, run_eagerly=True)
elif loss == 1:
    model.compile(optimizer='adam', loss=custom_loss, run_eagerly=True)
else:
    model.compile(optimizer='adam', loss=pseudo_custom_loss, run_eagerly=True)
model.fit(train_data, epochs=2, steps_per_epoch = 3)
print(model.summary())
Updating to tensorflow-metal 0.4.0 fixes the issue (even without the update to TF 2.8)! Thx!