TensorFlow MobileNetV3Small model not training on custom image classification task

Hi, I'm trying to train a MobileNetV3Small model on a custom image classification pipeline on my M1 MacBook Pro using tensorflow-metal. While the code runs without error, the model doesn't seem to train at all: it predicts the same class for every input after training. I have already run a similar training on the same dataset with torchvision's MobileNetV2 (on a GPU cluster), where I got over 60% accuracy (across 1098 image classes) after 2 epochs. I've included my code below; even evaluating on the training set after training gives poor performance. Any ideas what I could be doing wrong?

import tensorflow as tf

EPOCHS = 1
BATCH_SIZE = 128
LEARNING_RATE = 0.003
SEED = 1220

if __name__ == '__main__':
    # Load training data
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
        '/Volumes/detext/drawings/',
        color_mode="grayscale",
        seed=SEED,
        batch_size=BATCH_SIZE,
        labels='inferred',
        label_mode='int',
        image_size=(200, 300))

    # Get the class names
    class_names = train_ds.class_names
    num_classes = len(class_names)

    # Create model (randomly initialized, single grayscale input channel)
    model = tf.keras.applications.MobileNetV3Small(
        input_shape=(200, 300, 1), alpha=1.0, minimalistic=False,
        include_top=True, weights=None, input_tensor=None, classes=num_classes,
        pooling=None, classifier_activation="softmax",
        include_preprocessing=True)

    # Compile model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

    # Training
    model.fit(x=train_ds, epochs=EPOCHS)

    # Testing: evaluate on the training set (evaluate returns [loss, accuracy])
    results = model.evaluate(x=train_ds)
    print(results)
    model.save('./saved_model3/')

The issue seems to be specific to certain types of operations/layers in TensorFlow, and shows up particularly in the validation accuracy (similar to this issue). When I build my own custom convolutional model like so:

from tensorflow.keras import Sequential, layers

# Image dimensions match the dataset loading above
IMG_HEIGHT, IMG_WIDTH = 200, 300

model = Sequential([
    layers.Rescaling(1./255, input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
    layers.Conv2D(16, 1, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 1, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 1, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes),
    layers.Softmax()
])

training proceeds as expected, with a high validation accuracy as well. Below is the output for the above model:

Found 210454 files belonging to 1098 classes.
Metal device set to: Apple M1 Pro

systemMemory: 32.00 GB
maxCacheSize: 10.67 GB

2021-12-21 12:27:24.005759: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-21 12:27:24.006206: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Found 210454 files belonging to 1098 classes.
Using 31568 files for validation.
2021-12-21 12:27:26.965648: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-21 12:27:26.968717: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-12-21 12:27:26.969214: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - ETA: 0s - loss: 2.1246 - sparse_categorical_accuracy: 0.5273
2021-12-21 12:32:57.475358: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - 353s 214ms/step - loss: 2.1246 - sparse_categorical_accuracy: 0.5273 - val_loss: 1.3041 - val_sparse_categorical_accuracy: 0.6558
2021-12-21 12:33:19.600146: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.

However, the very same code with the MobileNetV3Small model (instead of my custom model) produces the following output:

Found 210454 files belonging to 1098 classes.
Metal device set to: Apple M1 Pro

systemMemory: 32.00 GB
maxCacheSize: 10.67 GB

2021-12-21 12:34:46.754598: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-21 12:34:46.754793: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Found 210454 files belonging to 1098 classes.
Using 31568 files for validation.
2021-12-21 12:34:49.742015: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-21 12:34:49.747397: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-12-21 12:34:49.747606: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - ETA: 0s - loss: 2.4072 - sparse_categorical_accuracy: 0.4672
2021-12-21 12:41:28.137948: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - 415s 252ms/step - loss: 2.4072 - sparse_categorical_accuracy: 0.4672 - val_loss: 21.6091 - val_sparse_categorical_accuracy: 0.0131
2021-12-21 12:41:46.017580: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
/Users/venkat/miniforge3/envs/tf-metal/lib/python3.9/site-packages/keras/utils/generic_utils.py:494: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
  warnings.warn('Custom mask layers require a config and must override '

The validation loss/accuracy is hilariously bad, and I find that the model constantly predicts the same class. My guess is that MobileNetV3Small contains some operations/layers that don't work well with tensorflow-metal for whatever reason, and only Apple engineers can fix this at a low level.
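For reference, here's a minimal sketch of how one can check whether the model has collapsed to a single class (not part of the training script above; variable names are taken from it):

import numpy as np

# Grab one batch from the training set and inspect the predicted classes.
# A collapsed model will produce a single class ID for the whole batch.
for images, labels in train_ds.take(1):
    preds = np.argmax(model.predict(images), axis=-1)
    print("unique predicted classes:", np.unique(preds))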

Hi @venkatg,

I'm looking into this issue at the moment and have tried to reproduce your results. Since I don't have access to your dataset, nor do I know about any special features it might have, I tried it on the 350 Bird Species dataset on Kaggle: https://www.kaggle.com/gpiosenka/100-bird-species .

The first thing I noticed is that the model is quite sizable, with 1,850,693 trainable parameters. The other point of note is that you are not using any pretrained weights to initialize the model, but instead start the training from scratch with randomly initialized weights. While this is of course fine, I would usually expect quite a few weight updates to be needed before the network reaches a meaningful configuration.
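As a side note, if you did want to warm-start from pretrained weights, below is a minimal sketch of one way to do it. This is an assumption on my part: the ImageNet weights only exist for 3-channel inputs, so the grayscale channel would need to be tiled to RGB and a new classification head attached.

import tensorflow as tf

# Hypothetical warm start: tile the single grayscale channel to RGB
# so the ImageNet-pretrained backbone can consume it.
inputs = tf.keras.Input(shape=(200, 300, 1))
x = tf.keras.layers.Concatenate()([inputs, inputs, inputs])
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(200, 300, 3), include_top=False,
    weights='imagenet', include_preprocessing=True)
x = backbone(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)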

So first off, here's the training trajectory I got when training on the GPU with the tensorflow-metal plugin:

While initially the scenario is exactly as you described, with the validation loss and accuracy remaining poor and even degrading while the training loss and accuracy improve, from around 14 epochs onwards the network starts to converge towards a more general solution, which can be seen from the improvement in the validation loss and accuracy. Again, since this is a different dataset than the one you are using, your mileage may vary.
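Concretely, that just means training well past the first few epochs while tracking the validation metrics. A minimal sketch follows; the validation_split value is my assumption, based on the "Using 31568 files for validation" lines in your logs (31568 / 210454 is roughly 0.15):

# Assumed split mirroring the held-out validation set in the logs above
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    '/Volumes/detext/drawings/', color_mode="grayscale", seed=SEED,
    batch_size=BATCH_SIZE, image_size=(200, 300),
    validation_split=0.15, subset="training")
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    '/Volumes/detext/drawings/', color_mode="grayscale", seed=SEED,
    batch_size=BATCH_SIZE, image_size=(200, 300),
    validation_split=0.15, subset="validation")

# Validation metrics only started improving around epoch 14 in my runs,
# so give the randomly initialized network enough epochs to get there.
model.fit(train_ds, validation_data=val_ds, epochs=20)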

However, in order to confirm that this really is just a matter of the randomly initialized network taking a while to start learning a general solution, I ran the same test without the tensorflow-metal plugin. This means it runs on the CPU using the TensorFlow core, which is independent of the Metal implementation and as such is unaffected by any potential bugs we might currently have. That includes the random-op issue mentioned by @gtsoukas in the other thread, which could influence, for example, the random initializations.
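For anyone wanting to reproduce the CPU comparison without uninstalling the plugin, a minimal sketch of hiding the GPU from TensorFlow:

import tensorflow as tf

# Hide the Metal GPU so every op falls back to the CPU implementation.
# This must run before any tensors, datasets, or models are created.
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices())  # should now list only CPU devices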

The training trajectory on the CPU overlaid with the GPU results:

So while the exact numerical values differ slightly between steps, the general flow is the same. For the first 14 epochs the network seems to be mostly reconfiguring its weights in a direction that helps it fit the training set but doesn't generalize to the validation set, before it starts to converge towards a more general solution. This makes me think that the issue is not in the GPU ops the network is made of; rather, the network and the dataset are complicated enough that the randomly initialized weights need more than a few epochs over your dataset to start converging towards a desirable solution. Also note that training is still very much not fully converged after 20 epochs here; I simply stopped at that point since it seemed enough to confirm my suspicion.

As for the custom network you provided, my guess as to why the training and validation metrics track each other more closely is that the network is restricted by its relatively simple structure: it can't fit the training-set noise as effectively early on, and is instead regularized by its architecture towards exploring more generalizable solutions. For the result you saw with MobileNetV2 in torchvision, my first thought would be to check whether some layers have pretrained weights loaded into them, since that tends to speed up convergence to good solutions quite a bit.
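For reference, a minimal sketch of that check on the torchvision side (using the circa-2021 API, where pretrained is a boolean flag; newer torchvision versions use a weights= argument instead):

import torchvision.models as models

# pretrained=True downloads and loads ImageNet weights;
# pretrained=False (the default) leaves the layers randomly initialized.
model_pretrained = models.mobilenet_v2(pretrained=True)
model_scratch = models.mobilenet_v2(pretrained=False)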

Hope this helps! 

I can report the same behavior here. While a lot of small vanilla models train fine (with Input, Dense, and Dropout layers only), once you add something more advanced like RNNs, the model's ability to learn disappears on the Metal GPU. I observed the same behavior with both built-in and custom loss functions.

Once I added tf.config.set_visible_devices([], 'GPU') (others have suggested uninstalling tensorflow-metal altogether), I got a decreasing loss and improving predictions, as expected.

Is this issue actually being investigated on Apple's side? I am considering returning my M1 Max at this point...
