
M1 Max GPU fails to converge in more complex models
We have run into an issue where a more complex model fails to converge on the M1 Max GPU, while it converges on the M1 Max CPU and on non-M1 machines. Performance is the same for CPU and GPU for models with a single RNN, but once we use two RNNs the GPU fails to converge.

The example below uses data that makes no sense for the model architecture, but it shows the same behavior we observe in our production models (which, for obvious reasons, we cannot share here). Specifically:

- The loss goes down to the e-06 range in all conditions except two RNNs on GPU; during training it often even touches the e-07 range. In the double-RNN GPU condition the loss does not go that low and sometimes only reaches the e-05 level.
- On our production data, the double-RNN GPU condition produces a loss of 1.0 that stays essentially flat from the first epoch, while the other conditions often reach the 0.2 level with a clear learning curve.
- In the production model, increasing LSTM_Cells makes the divergence more visible (with this synthetic data it does not). The more complex the model is after the RNN layers, the more visible the issue.

Suspected issues:

- Different numerical precision used in CPU and GPU training; we had to scale the data values down considerably to make the effect visible (with raw data all approaches produce comparable results).
- The vanishing gradient problem is somehow more pronounced on the GPU, as suggested by performance worsening as model complexity increases.

Please let me know if you need any further details. (A small precision-check sketch follows the sample code below.)

Software Stack: macOS 12.1, tensorflow 2.7, tensorflow-metal 0.3; also tested on tensorflow 2.8.

Sample Syntax:

# TEST CONDITIONS (the issue appears with gpu = 1, model_size = 2)
gpu = 1         # 0 CPU, 1 GPU
model_size = 2  # 1 single RNN, 2 double RNN

# PARAMETERS
LSTM_Cells = 64
epochs = 300
batch = 128

import numpy as np
import pandas as pd
import sys
from sklearn import preprocessing

# Re-import tensorflow so the CPU/GPU visibility setting below takes effect
if 'tensorflow' in sys.modules:
    print("tensorflow already imported")
    del sys.modules["tensorflow"]
    import tensorflow as tf
else:
    print("tensorflow not imported yet")
    import tensorflow as tf

if gpu == 1:
    pass
else:
    tf.config.set_visible_devices([], 'GPU')

print("GPUs:", tf.config.list_logical_devices('GPU'))
print("CPUs:", tf.config.list_logical_devices('CPU'))

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
dataset = pd.read_csv(url, names=column_names, na_values='?',
                      comment='\t', sep=' ', skipinitialspace=True).dropna()

scaler = preprocessing.StandardScaler().fit(dataset)
X_scaled = scaler.transform(dataset)
X_scaled = X_scaled * 0.001

# Large values (raw data)
#x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1,2,2)
#y_train = np.array(dataset[['MPG','Displacement']]).reshape(-1,2,2)

# Small values (scaled data)
x_train = np.array(X_scaled[:,2:]).reshape(-1,2,2)
y_train = np.array(X_scaled[:,:2]).reshape(-1,2,2)

print(x_train.shape)
print(y_train.shape)

train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
    .cache().shuffle(x_train.shape[0]).batch(batch).repeat() \
    .prefetch(tf.data.experimental.AUTOTUNE)

if model_size == 2:
    # MINIMAL NOT WORKING (double RNN)
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    encoder_l2 = tf.keras.layers.LSTM(LSTM_Cells, return_state=True)
    encoder_l2_outputs = encoder_l2(encoder_l1_outputs[0])
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l2_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)   # unused in this minimal example
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)   # unused in this minimal example
    flat = tf.keras.layers.Flatten()(dense_2)
    dense_5 = tf.keras.layers.Dense(2*2)(flat)                        # 2*2 outputs to match the [2,2] reshape
    reshape_output = tf.keras.layers.Reshape([2,2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)
else:
    # WORKING (single RNN)
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l1_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)   # unused in this minimal example
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)   # unused in this minimal example
    flat = tf.keras.layers.Flatten()(dense_2)
    dense_5 = tf.keras.layers.Dense(2*2)(flat)                        # 2*2 outputs to match the [2,2] reshape
    reshape_output = tf.keras.layers.Reshape([2,2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)

print(model.summary())

loss_tf = tf.keras.losses.MeanSquaredError()
model.compile(optimizer='adam', loss=loss_tf, run_eagerly=True)
model.fit(train_data, epochs=epochs, steps_per_epoch=3)
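To probe the first suspected issue (different precision on CPU vs. GPU), a minimal diagnostic sketch like the one below might help. It is not part of our production code; it simply runs the same small-magnitude matrix multiplication on both devices and compares the results against a float64 CPU reference, assuming '/CPU:0' and '/GPU:0' are the device names reported by tf.config.list_logical_devices().

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
a_np = rng.standard_normal((256, 256)) * 0.001   # small magnitudes, like the * 0.001 scaling above
b_np = rng.standard_normal((256, 256)) * 0.001

# float64 result on the CPU as the reference
with tf.device('/CPU:0'):
    ref = tf.matmul(tf.constant(a_np, tf.float64), tf.constant(b_np, tf.float64))

# float32 on CPU vs. GPU
with tf.device('/CPU:0'):
    cpu32 = tf.matmul(tf.constant(a_np, tf.float32), tf.constant(b_np, tf.float32))
with tf.device('/GPU:0'):
    gpu32 = tf.matmul(tf.constant(a_np, tf.float32), tf.constant(b_np, tf.float32))

print("CPU float32 vs float64 ref:", tf.reduce_max(tf.abs(tf.cast(cpu32, tf.float64) - ref)).numpy())
print("GPU float32 vs float64 ref:", tf.reduce_max(tf.abs(tf.cast(gpu32, tf.float64) - ref)).numpy())
print("CPU float32 vs GPU float32:", tf.reduce_max(tf.abs(cpu32 - gpu32)).numpy())

If the GPU error against the float64 reference were much larger than the CPU error at these small magnitudes, that would support the precision hypothesis; comparable errors would point elsewhere (for example, at the RNN kernels themselves).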
Replies: 8 · Boosts: 0 · Views: 2.5k · Activity: Feb ’22
Multiprocessing with TF Metal on M1
Hi, I am running into unexpected behavior where TensorFlow Metal with Python multiprocessing does not work with "fork" on M1 machines, although it used to work that way on previous macOS platforms. It does work with "spawn", but that is not the preferred approach for our implementation. Specifically:

- Execution of the code in a subprocess continues until it reaches the creation of the tf.data.Dataset(), where it stops without any error message (the subprocess quits and the rest of the code in main executes).
- The above happens when we use forking, whether the code to be executed in the subprocess is a function in the main script or a sub-module imported into the main script.
- The above happens regardless of whether we use the multiprocessing or the multiprocess module, the latter of which was supposed to fix this kind of behavior on other platforms.

The example code is below; please let me know if you would like any further details. Is this expected behavior or something that needs further investigation? (A spawn-based workaround sketch follows the example.)

Software Stack: macOS 12.1, tensorflow 2.7, tensorflow-metal 0.3; also tested on tensorflow 2.8.

The example code is:

# dataset_in_multiprocessing_issue.py (also usable as the in-main executor)
def executor(worker, params):
    print("START")
    batch = 128

    import numpy as np
    import pandas as pd
    import sys
    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
    column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
    dataset = pd.read_csv(url, names=column_names, na_values='?',
                          comment='\t', sep=' ', skipinitialspace=True).dropna()

    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1,2,2)
    y_train = np.array(dataset[['MPG','Displacement']]).reshape(-1,2,2)
    print(x_train.shape)
    print(y_train.shape)

    print("BEFORE tf.data.Dataset!")
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)) \
        .cache().shuffle(x_train.shape[0]).batch(batch).repeat() \
        .prefetch(tf.data.experimental.AUTOTUNE)
    print("WOW!, AFTER tf.data.Dataset")

    model = tf.keras.models.Sequential()
    model.add(tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2])))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(32, activation='relu'))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(2*2))
    model.add(tf.keras.layers.Reshape([2,2]))
    print(model.summary())

    model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError(), run_eagerly=True)
    model.fit(train_data, epochs=params["epochs"], steps_per_epoch=3)

# CHECK WITH MULTIPROCESSING
# TESTING CONDITIONS: on old Macs 0,0,0 worked; on M1 only 0,1,1 works ?!?!?
package = 0        # 0 multiprocessing, 1 multiprocess
method = 1         # 0 fork, 1 spawn
executor_mode = 1  # 0 from main, 1 from module

if package == 0:
    import multiprocessing as mp
else:
    import multiprocess as mp

if method == 0:
    mp.set_start_method('fork', force=True)
else:
    mp.set_start_method('spawn', force=True)

import sys
if 'dataset_in_multiprocessing_issue' in sys.modules:
    print("yes")
    del sys.modules["dataset_in_multiprocessing_issue"]
    import dataset_in_multiprocessing_issue
else:
    print("No")
    import dataset_in_multiprocessing_issue

params = {"epochs": 300}

if __name__ == '__main__':
    #def executor(worker, params):  -- also tested putting executor here for SPAWN
    processes = []
    for worker in range(mp.cpu_count() - 9):
        if executor_mode == 0:
            p = mp.Process(target=executor, args=(worker, params,))
        else:
            p = mp.Process(target=dataset_in_multiprocessing_issue.executor, args=(worker, params,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print("Fully DONE!")
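For reference, this is the spawn-based shape that does run for us on M1, boiled down to a minimal sketch. It is a workaround rather than a fix for the fork behavior, and it makes two assumptions: the worker lives in an importable module (hypothetically named tf_worker.py here), and everything TensorFlow-related, including the tf.data.Dataset, is created inside the child process.

# tf_worker.py (hypothetical module name; all TF imports stay inside the worker)
def train_worker(worker_id, params):
    import numpy as np
    import tensorflow as tf   # imported only in the child process

    # Toy data built inside the child; real data could be passed in as numpy arrays via args
    x = np.random.randn(params["n"], 2, 2).astype("float32")
    y = np.random.randn(params["n"], 2, 2).astype("float32")
    ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(params["batch"])

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2, 2)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(4),
        tf.keras.layers.Reshape([2, 2]),
    ])
    model.compile(optimizer='adam', loss='mse')
    model.fit(ds, epochs=params["epochs"], verbose=0)
    print(f"worker {worker_id} done")

# main.py
if __name__ == '__main__':
    import multiprocessing as mp
    mp.set_start_method('spawn', force=True)   # fork hangs at tf.data.Dataset creation for us
    import tf_worker

    params = {"n": 256, "batch": 64, "epochs": 3}
    procs = [mp.Process(target=tf_worker.train_worker, args=(i, params)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()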
Replies: 1 · Boosts: 0 · Views: 2.4k · Activity: Feb ’22
TF-Metal Custom Loss Functions do not work
Hi, I am getting the following error in TF on M1 Max when I use a custom loss function (one that I define myself):

2022-02-14 21:23:44.437000: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-14 21:23:44.437119: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
Process Process-82:
Traceback (most recent call last):
  File "/Users/sebtac/miniforge3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/sebtac/miniforge3/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sebtac/Documents/executor_metal.py", line 892, in executor
    history=model.fit(train_data,
  File "/Users/sebtac/miniforge3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/sebtac/miniforge3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Can not squeeze dim[0], expected a dimension of 1, got 512 [Op:Squeeze]

Custom function:

def my_rmse(y_true, y_pred):
    error = y_true - y_pred
    sqr_error = K.square(error)
    mean_sqr_error = K.mean(sqr_error)
    sqrt_mean_sqr_error = K.sqrt(mean_sqr_error)
    return sqrt_mean_sqr_error

model.compile(optimizer=optimizer, loss=my_rmse, run_eagerly=True)
#model.compile(optimizer=optimizer, loss="mae", run_eagerly=True)

Additional details:

- The same does not happen when I use built-in loss functions.
- 512 is the batch size, and batching works fine without the custom loss function.
- It works well when I set the batch size to 1.
- It works well on non-M1 Macs.
- I run the model from within a multiprocessing process.
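For what it is worth, below is a rewrite we could try as a workaround (our own sketch, not an official fix); whether it avoids the Squeeze error on tensorflow-metal is an assumption to verify. Instead of collapsing everything with K.mean(), it keeps the per-sample dimension explicit and lets Keras handle the batch reduction, shown both as a plain function and as a tf.keras.losses.Loss subclass. It assumes the (batch, 2, 2) target shape from the examples above.

import tensorflow as tf

# Plain-function form: reduce only over the non-batch axes and return one value
# per sample; Keras then averages across the batch. Note this averages per-sample
# RMSEs, which differs slightly from a single RMSE over the whole flattened batch.
def my_rmse(y_true, y_pred):
    per_sample_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2])  # shape: (batch,)
    return tf.sqrt(per_sample_mse)

# Equivalent subclass form with an explicit reduction setting.
class RMSE(tf.keras.losses.Loss):
    def __init__(self, name="rmse"):
        super().__init__(reduction=tf.keras.losses.Reduction.AUTO, name=name)

    def call(self, y_true, y_pred):
        return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2]))

# Usage (optimizer as in the original post):
# model.compile(optimizer=optimizer, loss=my_rmse, run_eagerly=True)
# model.compile(optimizer=optimizer, loss=RMSE(), run_eagerly=True)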
Replies: 5 · Boosts: 0 · Views: 1.3k · Activity: Feb ’22