The issue seems to be specific to certain types of operations/layers in Tensorflow, and specifically with respect to the validation accuracy (similar to this issue). When I build my own custom model with convolutions like so:
model = Sequential([
layers.Rescaling(1./255, input_shape=(IMG_HEIGHT, IMG_WIDTH, 1)),
layers.Conv2D(16, 1, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 1, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 1, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Dropout(0.2),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes),
layers.Softmax()
])
training proceeds as expected, with a high validation accuracy as well. Below is the output for the above model:
Found 210454 files belonging to 1098 classes.
Metal device set to: Apple M1 Pro
systemMemory: 32.00 GB
maxCacheSize: 10.67 GB
2021-12-21 12:27:24.005759: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-21 12:27:24.006206: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Found 210454 files belonging to 1098 classes.
Using 31568 files for validation.
2021-12-21 12:27:26.965648: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-21 12:27:26.968717: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-12-21 12:27:26.969214: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - ETA: 0s - loss: 2.1246 - sparse_categorical_accuracy: 0.52732021-12-21 12:32:57.475358: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - 353s 214ms/step - loss: 2.1246 - sparse_categorical_accuracy: 0.5273 - val_loss: 1.3041 - val_sparse_categorical_accuracy: 0.6558
2021-12-21 12:33:19.600146: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
However, the very same code with the MobileNetV3Small model (instead of my custom model) produces the following output:
Found 210454 files belonging to 1098 classes.
Metal device set to: Apple M1 Pro
systemMemory: 32.00 GB
maxCacheSize: 10.67 GB
2021-12-21 12:34:46.754598: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-12-21 12:34:46.754793: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Found 210454 files belonging to 1098 classes.
Using 31568 files for validation.
2021-12-21 12:34:49.742015: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-12-21 12:34:49.747397: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-12-21 12:34:49.747606: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - ETA: 0s - loss: 2.4072 - sparse_categorical_accuracy: 0.46722021-12-21 12:41:28.137948: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1645/1645 [==============================] - 415s 252ms/step - loss: 2.4072 - sparse_categorical_accuracy: 0.4672 - val_loss: 21.6091 - val_sparse_categorical_accuracy: 0.0131
2021-12-21 12:41:46.017580: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
/Users/venkat/miniforge3/envs/tf-metal/lib/python3.9/site-packages/keras/utils/generic_utils.py:494: CustomMaskWarning: Custom mask layers require a config and must override get_config. When loading, the custom mask layer must be passed to the custom_objects argument.
warnings.warn('Custom mask layers require a config and must override '
The validation loss/accuracy is hilariously bad, and I find that the model constantly predicts the same class. My guess is that MobileNetV3Small seems to contain some operations/layers that don't work well with tensorflow-metal
for whatever reason, and only Apple Engineers can fix this problem at a low level.