sebtac’s Profile | Apple Developer Forums

Reply to M1 Max GPU fails to converge in more complex models

Updating to tensorflow-metal (TM) 0.4.0 fixes the issue, even without updating to TF 2.8! Thanks!

Machine Learning & AI General

Feb ’22

Reply to TF-Metal Custom Loss Functions do not work

I pinpointed the issue further. The core of it is the dimensionality of the weight vector provided to the dataset. In my non-M1 implementations its shape was (None,) and it worked well. On M1, I need to change it to (None, 1).

That said, the change is only required when both of these hold:
- we work with 3-dimensional data (possibly only the output needs to be 3-dimensional and the input can be of any dimension, but I did not test that; possibly the dimensionality of the weights must be increased further as the dimensionality of the data increases, also not tested);
- we use either a custom loss function or a built-in one wrapped in a custom loss wrapper (using a class or a def has the same effect).

The odd part is that my initial explorations, as well as my research code, work well on the M1 CPU without any modification, while the code below fails under the above conditions on both the M1 CPU and GPU. I have not investigated that further. I also tried TF 2.8 and saw the same behavior.

Thanks for looking into this. The expected solution is either aligned behavior across environments, or further investigation of the required structure of the weight vector plus an update to the documentation.
Here is the code:

# TEST CONDITIONS
# Breaking conditions: 1,1,3,1,1 and 1,1,3,1,2
dataset_weight = 1  # 0 no, 1 yes
dw_type = 1         # 1 one-dimensional weights (N,), 2 two-dimensional weights (N, 1)
data_shape = 3      # 2 two-dimensional data, 3 three-dimensional data
gpu = 1             # 0 no, 1 yes
loss = 1            # 0 no (built-in), 1 yes (custom), 2 pseudo custom loss (wrapped built-in)

import numpy as np
import pandas as pd
import sys

# Disabled block (kept inside a string literal): reloads TensorFlow and optionally hides the GPU.
"""
if 'tensorflow' in sys.modules:
    print("tensorflow uploaded")
    del sys.modules["tensorflow"]
    del tf
    import tensorflow as tf
else:
    print("tensorflow not uploaded")
    import tensorflow as tf

if gpu == 1:
    pass
else:
    tf.config.set_visible_devices([], 'GPU')

#print("GPUs:", tf.config.list_physical_devices('GPU'))
print("GPUs:", tf.config.list_logical_devices('GPU'))
#print("CPUs:", tf.config.list_physical_devices('CPU'))
print("CPUs:", tf.config.list_logical_devices('CPU'))
"""

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import backend as K
import tensorflow as tf

print("TensorFlow version:", tf.__version__)

batch = 128

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
dataset = pd.read_csv(url, names=column_names, na_values='?', comment='\t',
                      sep=' ', skipinitialspace=True).dropna()

# Build 2-D or 3-D features and targets depending on data_shape.
if data_shape == 2:
    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1, 2)
    y_train = np.array(dataset[['MPG', 'Displacement']]).reshape(-1, 2)
else:
    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1, 2, 2)
    y_train = np.array(dataset[['MPG', 'Displacement']]).reshape(-1, 2, 2)

# Sample weights: shape (N, 1) when dw_type == 2, otherwise shape (N,).
if dw_type == 2:
    weight = np.expand_dims(np.ones(x_train.shape[0]), axis=1)
else:
    weight = np.ones(x_train.shape[0])

#print(dataset)
print(x_train.shape)
print(y_train.shape)
print(weight.shape)

# The dataset yields (x, y) or (x, y, weight) depending on dataset_weight.
if dataset_weight == 0:
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)
else:
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train, weight)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)

model = Sequential([
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(2)])

loss_tf = tf.keras.losses.MeanSquaredError()

# Custom loss: root mean squared error reduced to a single scalar.
def custom_loss(y_true, y_pred):
    error = y_true - y_pred
    sqr_error = K.square(error)
    mean_sqr_error = K.mean(sqr_error)
    sqrt_mean_sqr_error = K.sqrt(mean_sqr_error)
    return sqrt_mean_sqr_error

# Pseudo custom loss: the built-in MSE wrapped in a plain function.
def pseudo_custom_loss(y_true, y_pred):
    return loss_tf(y_true, y_pred)

if loss == 0:
    model.compile(optimizer='adam', loss=loss_tf, run_eagerly=True)
elif loss == 1:
    model.compile(optimizer='adam', loss=custom_loss, run_eagerly=True)
else:
    model.compile(optimizer='adam', loss=pseudo_custom_loss, run_eagerly=True)

model.fit(train_data, epochs=2, steps_per_epoch=3)
model.summary()
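To make the (None,) versus (None, 1) distinction concrete, here is a minimal NumPy-only sketch of how the two weight shapes broadcast against a per-sample loss computed from 3-dimensional targets. It illustrates NumPy broadcasting rules only and is not a claim about how Keras or tensorflow-metal apply the weights internally; the sample count n and the helper try_multiply are hypothetical names chosen for illustration.

```python
import numpy as np

n = 196  # hypothetical number of samples (assumption, not taken from the thread)
y_true = np.zeros((n, 2, 2))
y_pred = np.ones((n, 2, 2))

# Squared error averaged over the last axis keeps shape (n, 2).
per_sample = np.mean((y_true - y_pred) ** 2, axis=-1)

w_flat = np.ones(n)                     # shape (n,), like dw_type == 1
w_col = np.expand_dims(w_flat, axis=1)  # shape (n, 1), like dw_type == 2

def try_multiply(loss, weight):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return (loss * weight).shape
    except ValueError:
        return None

flat_result = try_multiply(per_sample, w_flat)
col_result = try_multiply(per_sample, w_col)
print("loss", per_sample.shape, "* weight", w_flat.shape, "->", flat_result)
print("loss", per_sample.shape, "* weight", w_col.shape, "->", col_result)
```

In plain NumPy the (n,) weights cannot be multiplied with an (n, 2) loss because trailing axes are aligned first, while the (n, 1) weights broadcast cleanly. Whether this is what happens on the Metal backend is untested here.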

Machine Learning & AI General

Feb ’22

Reply to TF-Metal Custom Loss Functions do not work

Thanks. My setup: macOS 12.1, TF 2.7, tensorflow-metal 0.3. I will provide example code on Saturday.

The whole point is that the squeeze seems not to be performed when:
- using built-in loss functions, or
- not running on the M1 GPU, or
- weights are not used in the tf.data dataset definition.

Maybe broadcasting is broken in such a scenario, or the weights are applied at the wrong moment. It might also be that the custom loss function definition does not assume the existence of the weights while the built-in one does. But if so, then why:
- does it work with the same custom functions on the CPU, or on non-M1 Macs and Windows?
- does the same error happen when I define the loss function as a class inheriting from Loss?

I also added this to an already reported case where TF does not train on the M1 GPU but does on its CPU with no changes to the code. Maybe those are related.
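The question of when the weights meet the loss can be pictured with a small NumPy sketch. It contrasts a loss reduced to one scalar (as the custom RMSE in this thread does) with a loss that keeps the batch axis, and shows what shape results when a weight vector is multiplied in. This is only an illustration under assumed shapes (4 samples of shape (2, 2)); it does not reproduce what Keras or tensorflow-metal do internally.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(4, 2, 2))  # assumed batch of 4 samples
y_pred = rng.normal(size=(4, 2, 2))
weight = np.ones(4)                  # one weight per sample, shape (4,)

# Reduce everything to a single scalar (root mean squared error).
scalar_loss = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Alternative: keep the batch axis and reduce only the remaining axes.
per_sample_loss = np.mean((y_true - y_pred) ** 2, axis=(1, 2))

# A scalar is broadcast against every weight; per-sample values pair up one to one.
weighted_scalar = scalar_loss * weight
weighted_per_sample = per_sample_loss * weight

print("scalar loss shape:", np.shape(scalar_loss))
print("weighted scalar shape:", weighted_scalar.shape)
print("weighted per-sample shape:", weighted_per_sample.shape)
```

With a scalar loss every sample ends up with the same weighted value, so the weights can only rescale the total; with a per-sample loss each weight affects its own sample.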

Machine Learning & AI General

Feb ’22

Reply to TF-Metal Custom Loss Functions do not work

Further detail: the issue is only present when running on the GPU (M1 Max in my case). Once I block the GPU (with tf.config.set_visible_devices([], 'GPU')), everything works as expected! This is definitely an implementation issue. Could Apple comment on it, please?
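For anyone who wants to try the same workaround, here is a minimal sketch of hiding the GPU from TensorFlow. The helper name hide_gpus is hypothetical, and the import guard is only there so the snippet does nothing harmful on a machine without TensorFlow; the call has to run before any op or tensor initializes the runtime.

```python
try:
    import tensorflow as tf
except ImportError:  # TensorFlow not installed; the sketch then does nothing
    tf = None

def hide_gpus():
    """Hide all GPUs from TensorFlow and return the remaining logical GPUs.

    Returns None when TensorFlow is not available.
    """
    if tf is None:
        return None
    try:
        # An empty device list makes no GPU visible, so ops are placed on the CPU.
        tf.config.set_visible_devices([], 'GPU')
    except RuntimeError as err:
        # Raised when the runtime was already initialized before this call.
        print("Could not change visible devices:", err)
    return tf.config.list_logical_devices('GPU')

remaining = hide_gpus()
print("Logical GPUs after hiding:", remaining)
```

Calling this at the very top of a script, before building any model or dataset, should leave TensorFlow with CPU devices only.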

Machine Learning & AI General

Feb ’22

Reply to TF-Metal Custom Loss Functions do not work

Additional details. I narrowed the conditions for the issue down to the use of weights when creating the tf.data dataset: train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train, weights)). Once the weights are removed (train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))) OR we use a non-custom loss function, the learning proceeds. Does the Dev Team have any insights into this?

Machine Learning & AI General

Feb ’22

Reply to Tensorflow MobileNetV3Small model not training on custom image classification task

I can report the same behavior here. While a lot of small vanilla models train (with Input, Dense, and Dropout layers only), once you add something more advanced, like RNNs, the model's ability to learn disappears on the Metal GPU. I recorded the same behavior for both built-in and custom loss functions. Once I added tf.config.set_visible_devices([], 'GPU') (others suggested uninstalling tensorflow-metal altogether), I got a decreasing loss and improving predictions, as expected. Is this issue being investigated on the Apple side? I am considering returning my M1 Max at this moment...

Graphics & Games General

Feb ’22
