Hi,
I am running into unexpected behavior where TensorFlow Metal with Python multiprocessing does not work with the "fork" start method on M1 Macs, although it did work on previous macOS platforms. It does work with "spawn", but that is not the preferred approach for our implementation. Specifically:
execution of the code in the subprocess proceeds until it reaches the creation of the tf.data.Dataset(), where it stops without any error message (the subprocess quits and the rest of the code in main executes).
the above happens when we use forking, regardless of whether the code to be executed in the subprocess is provided as a function in the main script or in a sub-module imported into the main script.
the above happens regardless of whether we use the multiprocessing or the multiprocess module, the latter of which was supposed to fix this kind of behavior on other platforms.
The example code is below; please let me know if you would like any further details.
Is this expected behavior, or something that needs further investigation?
Software stack: macOS 12.1, TF 2.7, tensorflow-metal 0.3; also tested on TF 2.8.
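Before the full example, here is a stripped-down sketch of the setup as I understand it (my own reduction with a toy array; it is not guaranteed to reproduce the hang exactly): fork a child whose only job is to import TensorFlow and build a tf.data.Dataset.

# Minimal sketch (my own reduction, for illustration only).
import multiprocessing as mp

def child():
    import numpy as np
    import tensorflow as tf
    print("child: before tf.data.Dataset")
    ds = tf.data.Dataset.from_tensor_slices(np.zeros((8, 2), dtype=np.float32))
    print("child: after tf.data.Dataset", ds.element_spec)

if __name__ == '__main__':
    mp.set_start_method('fork', force=True)
    p = mp.Process(target=child)
    p.start()
    p.join()
    print("parent: done")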
The full example code is:
def executor(worker, params):
    print("START")
    batch = 128

    import numpy as np
    import pandas as pd
    import sys
    import tensorflow as tf
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
    column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
    dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True).dropna()

    x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1, 2, 2)
    y_train = np.array(dataset[['MPG', 'Displacement']]).reshape(-1, 2, 2)
    print(x_train.shape)
    print(y_train.shape)

    print("BEFORE tf.data.Dataset!")
    train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)
    print("WOW!, AFTER tf.data.Dataset")

    model = tf.keras.models.Sequential()
    model.add(tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2])))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(32, activation='relu'))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(2*2))
    model.add(tf.keras.layers.Reshape([2, 2]))
    print(model.summary())

    model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError(), run_eagerly=True)
    model.fit(train_data, epochs=params["epochs"], steps_per_epoch=3)
# CHECK WITH MULTIPROCESSING
# TESTING CONDITIONS (package, method, executor_mode):
#   on old Macs 0,0,0 worked
#   on M1 only 0,1,1 works ?!?!?
package = 0        # 0 multiprocessing, 1 multiprocess
method = 1         # 0 fork, 1 spawn
executor_mode = 1  # 0 from main, 1 from module

if package == 0:
    import multiprocessing as mp
else:
    import multiprocess as mp

if method == 0:
    mp.set_start_method('fork', force=True)
else:
    mp.set_start_method('spawn', force=True)

import sys
if 'dataset_in_multiprocessing_issue' in sys.modules:
    print("yes")
    del sys.modules["dataset_in_multiprocessing_issue"]
    import dataset_in_multiprocessing_issue
else:
    print("No")
    import dataset_in_multiprocessing_issue

params = {"epochs": 300}
if __name__ == '__main__':
    # def executor(worker, params):  -- tested putting the executor here for SPAWN
    processes = []
    for worker in range(mp.cpu_count() - 9):
        if executor_mode == 0:
            p = mp.Process(target=executor, args=(worker, params,))
        else:
            p = mp.Process(target=dataset_in_multiprocessing_issue.executor, args=(worker, params,))
        processes.append(p)
        p.start()
    for process in processes:
        process.join()
    print("Fully DONE!")
We have run into an issue where a more complex model fails to converge on the M1 Max GPU, while it converges on its CPU and on non-M1-based Macs.
Performance is the same on CPU and GPU for models with a single RNN, but once we use two RNNs the GPU fails to converge.
That said, the example below is based on nonsensical data for the model architecture used, but it shows the same behavior we observe in our production models (which, for obvious reasons, we cannot share here). Mainly:
the loss goes down to the bottom of the e-06 range in all cases except when we use two RNNs on the GPU; during training we often reach the e-07 level.
for the double-RNN-on-GPU condition, the loss does not go that low, sometimes only reaching the e-05 level.
on our production data, the double RNN on GPU produces a loss of 1.0 that basically stays the same from the first epoch, while the other conditions often reach the 0.2 level with a clear learning curve.
in the production model, increasing the number of LSTM cells made the divergence more visible (with this synthetic data it does not).
the more complex the model is (after the RNN layers), the more visible the issue.
Suspected issues:
different precision used in CPU and GPU training: we had to scale the data values down a lot to make the effect visible (with the raw data, all approaches seem to produce comparable results).
the vanishing gradient problem is somehow more pronounced on the GPU, as indicated by performance getting worse as model complexity increases (see the diagnostic sketch below).
Please let me know if you need any further details.
Software stack: macOS 12.1, TF 2.7, tensorflow-metal 0.3; also tested on TF 2.8.
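To probe the second suspicion, a small diagnostic I would run (my own helper, not part of the sample code below) computes the global gradient norm for a single batch; if it comes out orders of magnitude smaller on the GPU run than on the CPU run, that points at a numerical issue rather than the architecture.

# Rough diagnostic sketch (my own helper; model, loss, and batch come from the
# sample code below): global gradient norm for one batch, for a CPU-vs-GPU comparison.
import tensorflow as tf

def gradient_norm_for_batch(model, loss_fn, x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)
        loss = loss_fn(y_batch, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    return tf.linalg.global_norm([g for g in grads if g is not None])

# usage, after building `model` and `train_data` as in the sample code:
# x_batch, y_batch = next(iter(train_data))
# print(gradient_norm_for_batch(model, tf.keras.losses.MeanSquaredError(), x_batch, y_batch))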
Sample Syntax:
# TEST CONDITIONS:
# condition with the issue: gpu = 1, model_size = 2
gpu = 1         # 0 CPU, 1 GPU
model_size = 2  # 1 single RNN, 2 double RNN

# PARAMETERS
LSTM_Cells = 64
epochs = 300
batch = 128
import numpy as np
import pandas as pd
import sys
from sklearn import preprocessing

#"""
if 'tensorflow' in sys.modules:
    print("tensorflow already loaded")
    del sys.modules["tensorflow"]
    #del tf
    import tensorflow as tf
else:
    print("tensorflow not loaded")
    import tensorflow as tf

if gpu == 1:
    pass
else:
    tf.config.set_visible_devices([], 'GPU')

#print("GPUs:", tf.config.list_physical_devices('GPU'))
print("GPUs:", tf.config.list_logical_devices('GPU'))
#print("CPUs:", tf.config.list_physical_devices('CPU'))
print("CPUs:", tf.config.list_logical_devices('CPU'))
#"""
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Displacement', 'Horsepower', 'Weight']
dataset = pd.read_csv(url, names=column_names,
                      na_values='?', comment='\t',
                      sep=' ', skipinitialspace=True).dropna()

scaler = preprocessing.StandardScaler().fit(dataset)
X_scaled = scaler.transform(dataset)
X_scaled = X_scaled * 0.001

# Large values
#x_train = np.array(dataset[['Horsepower', 'Weight']]).reshape(-1, 2, 2)
#y_train = np.array(dataset[['MPG', 'Displacement']]).reshape(-1, 2, 2)

# Small values
x_train = np.array(X_scaled[:, 2:]).reshape(-1, 2, 2)
y_train = np.array(X_scaled[:, :2]).reshape(-1, 2, 2)

#print(dataset)
print(x_train.shape)
print(y_train.shape)

train_data = tf.data.Dataset.from_tensor_slices((x_train[:, :, :8], y_train)).cache().shuffle(x_train.shape[0]).batch(batch).repeat().prefetch(tf.data.experimental.AUTOTUNE)
if model_size == 2:
    #""" # MINIMAL NOT WORKING
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    encoder_l2 = tf.keras.layers.LSTM(LSTM_Cells, return_state=True)
    encoder_l2_outputs = encoder_l2(encoder_l1_outputs[0])
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l2_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)
    flat = tf.keras.layers.Flatten()(dense_2)  # note: dense_3/dense_4 are not connected to the output
    dense_5 = tf.keras.layers.Dense(2*2)(flat)
    reshape_output = tf.keras.layers.Reshape([2, 2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)
    #"""
else:
    #""" # WORKING
    encoder_inputs = tf.keras.Input(shape=(x_train.shape[1], x_train.shape[2]))
    encoder_l1 = tf.keras.layers.LSTM(LSTM_Cells, return_sequences=True, return_state=True)
    encoder_l1_outputs = encoder_l1(encoder_inputs)
    dense_1 = tf.keras.layers.Dense(128, activation='relu')(encoder_l1_outputs[0])
    dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
    dense_3 = tf.keras.layers.Dense(32, activation='relu')(dense_2)
    dense_4 = tf.keras.layers.Dense(16, activation='relu')(dense_3)
    flat = tf.keras.layers.Flatten()(dense_2)  # note: dense_3/dense_4 are not connected to the output
    dense_5 = tf.keras.layers.Dense(2*2)(flat)
    reshape_output = tf.keras.layers.Reshape([2, 2])(dense_5)
    model = tf.keras.models.Model(encoder_inputs, reshape_output)
    #"""

print(model.summary())

loss_tf = tf.keras.losses.MeanSquaredError()
model.compile(optimizer='adam', loss=loss_tf, run_eagerly=True)
model.fit(train_data,
          epochs=epochs,
          steps_per_epoch=3)
Hi,
I am getting the following error in TF on an M1 Max when I use a custom loss function (one that I define myself):
2022-02-14 21:23:44.437000: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-02-14 21:23:44.437119: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: )
Process Process-82:
Traceback (most recent call last):
  File "/Users/sebtac/miniforge3/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/sebtac/miniforge3/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sebtac/Documents/executor_metal.py", line 892, in executor
    history=model.fit(train_data,
  File "/Users/sebtac/miniforge3/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/sebtac/miniforge3/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Can not squeeze dim[0], expected a dimension of 1, got 512 [Op:Squeeze]
The custom function:

from tensorflow.keras import backend as K

def my_rmse(y_true, y_pred):
    error = y_true - y_pred
    sqr_error = K.square(error)
    mean_sqr_error = K.mean(sqr_error)
    sqrt_mean_sqr_error = K.sqrt(mean_sqr_error)
    return sqrt_mean_sqr_error

model.compile(optimizer=optimizer, loss=my_rmse, run_eagerly=True)
#model.compile(optimizer=optimizer, loss="mae", run_eagerly=True)
Additional details:
the same does not happen when I use the built-in loss functions.
512 is the batch size, and batching works fine without the custom loss function.
it works fine when I set the batch size to 1.
it works fine on non-M1 Macs.
I run the model from within a multiprocessing process.
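One variant I intend to try (my own guess at a workaround, not a confirmed fix) keeps the batch dimension in the custom loss and reduces only over the remaining axes, mirroring how the built-in losses return one value per sample:

# Guessed workaround (my assumption, not a confirmed fix): return one loss value
# per sample by reducing only over the non-batch axes and let Keras handle the
# batch reduction. The axis=[1, 2] choice assumes targets shaped (batch, 2, 2).
import tensorflow as tf
from tensorflow.keras import backend as K

def my_rmse_per_sample(y_true, y_pred):
    error = y_true - y_pred                                # (batch, 2, 2)
    mean_sqr_error = K.mean(K.square(error), axis=[1, 2])  # (batch,)
    return K.sqrt(mean_sqr_error)

# model.compile(optimizer=optimizer, loss=my_rmse_per_sample, run_eagerly=True)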