Considering the beta, in-development status of tensorflow-metal and its current issues (see "Tensorflow-metal selected issues" below, observed by this author in LSTM models alone), and given that it is a closed-source library, it is critical to have open and detailed benchmarks of the quality, accuracy, and performance of tensorflow-metal.
If the M1-type processors are to be trusted and heavily utilized by data scientists and ML engineers, we need a commitment to excellence.
Can Apple create a pip-installable Python package that runs such benchmarks, so results can be compared between tensorflow-metal releases?
Some useful benchmarking resources:
https://github.com/tensorflow/benchmarks
https://github.com/tensorflow/models/tree/master/official
https://github.com/cgnorthcutt/benchmarking-keras-pytorch
https://github.com/tlkh/tf-metal-experiments
Some of these approaches, notably the TensorFlow official models, may require further work to make tensorflow-addons and tensorflow-text available as binary M1 ARM64 packages. The latter is especially hard to compile on M1 Macs, per existing GitHub issues.
Tensorflow-metal selected issues
https://developer.apple.com/forums/thread/695150
https://developer.apple.com/forums/thread/695216
https://developer.apple.com/forums/thread/695134
Summary: Deep learning on large datasets often requires training on hundreds of thousands to millions of records. This is often best done on cloud hardware (i.e. AWS SageMaker, or AWS EC2 instances with many CPUs and state-of-the-art GPU cards such as the Tesla V100/A100), given the much faster time to results. A well-known workflow technique is to save your trained TensorFlow model (i.e. in Keras h5 format or as a TensorFlow SavedModel), and then evaluate it (i.e. run inference) on the test data set, or on new data, on ordinary hardware.
Given just such a Keras h5 model, I tried to evaluate both GRU and LSTM models, pre-trained in AWS SageMaker, locally on an M1-based Mac. While the GRU-based models evaluated with only minor variation between M1 CPU and GPU (cross-checked on other systems), there were radical accuracy errors when evaluating the LSTM-based models on the M1 GPU specifically. This gives me low confidence in using tensorflow-metal with LSTM models for inference.
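The save-in-the-cloud, evaluate-locally workflow described above can be sketched as follows. A tiny untrained LSTM stands in here for the SageMaker-trained model; the shapes, layer sizes, and file name are illustrative only.

```python
import numpy as np
import tensorflow as tf

# "Cloud" side: build (and, in practice, train) the model, then save it
# in Keras h5 format so it can be shipped to other hardware.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 4)),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", metrics=["accuracy"])
model.save("model.h5")

# "Local" side: reload the h5 file and run inference on held-out data.
local = tf.keras.models.load_model("model.h5")
x = np.random.rand(16, 10, 4).astype("float32")
preds = local.predict(x, verbose=0)
print(preds.shape)
```

The issue reported here is that this last `predict` step produces divergent results on the M1 GPU for LSTM layers, while GRU layers behave consistently.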
Steps to Reproduce:
git clone https://github.com/radagast-the-brown/tf2-inference-m1-issue.git
Ensure you have Jupyter Lab or Notebook installed, along with all dependencies listed in the top section of "results-summary.ipynb", i.e.
pip install jupyterlab notebook sklearn keras numpy pandas matplotlib scikitplot
Start jupyter in Terminal
jupyter notebook
Browse to "results-summary.ipynb", open it, and execute. Note: you can install/uninstall tensorflow-metal locally to toggle between M1 GPU and M1 CPU-only execution. You'll need to restart the notebook kernel between runs.
Look at "results-summary-m1-cpu.pdf" and "results-summary-m1-gpu.pdf" for details.
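The CPU/GPU toggle mentioned above amounts to removing or reinstalling the Metal plugin on top of the tensorflow-macos base package:

```shell
# Remove the Metal plugin to fall back to CPU-only execution
pip uninstall -y tensorflow-metal

# Reinstall it to re-enable the M1 GPU (restart the notebook kernel afterwards)
pip install tensorflow-metal
```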
System Details:
Model: MacBook Pro (16-inch, 2021)
Chip: Apple M1 Max
Memory: 64GB
OS: macOS 12.0.1
Key Libraries: tensorflow-metal (0.2), tensorflow-macos (2.6.0), Python (3.9.7)
The official Apple documentation for tensorflow-metal (0.2) (https://developer.apple.com/metal/tensorflow-plugin/) lists "Multi-GPU support" among the features not currently supported.
What does this mean? I could imagine that this means one of the following:
Option 1: Your M1 Mac has many GPU cores (i.e. 32 cores in the case of the M1 Max), but only 1 core is currently usable by tensorflow-metal.
Option 2: All GPU cores are used, but they show up as a single logical device. Consequently, you can't, and don't need to, use Distributed Training (https://www.tensorflow.org/guide/distributed_training), one strategy for using multiple GPUs in TensorFlow (https://www.tensorflow.org/guide/gpu).
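One way to probe the distinction between these two options is to ask TensorFlow how many GPU devices it actually exposes. On an M1 Max with tensorflow-metal installed, my expectation (an assumption, not confirmed by the documentation) is that this reports a single logical device regardless of the chip's GPU core count:

```python
import tensorflow as tf

# List the GPU devices this TensorFlow build exposes. Under Option 2,
# one logical device would appear even on a 32-core M1 Max GPU.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPU devices:", gpus)
```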
Can you clarify? Thanks.
Summary:
Training an LSTM on the M1 GPU is an astounding 168x slower per epoch than on the M1 CPU. This is based on a relatively simple example, chosen for reproducibility:
https://www.machinecurve.com/index.php/2021/01/07/build-an-lstm-model-with-tensorflow-and-keras/#full-model-code
Steps to Reproduce:
git clone https://github.com/radagast-the-brown/tf2-keras-lstm-sample.git
cd tf2-keras-lstm-sample
python lstm.py
Results:
M1 CPU
Compute time: 7s per epoch
Loss: 0.34 - Accuracy: 86%
M1 GPU (tensorflow-metal)
Compute time: > 2h per epoch
I did not allow it to finish.
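Until this is fixed, one possible workaround (an assumption on my part, not an Apple-recommended fix) is to build and train the LSTM under an explicit CPU device scope, so the slow Metal path is never selected even with tensorflow-metal installed. The layer sizes below are illustrative only; `model.fit(...)` would also need to be called inside the same scope:

```python
import tensorflow as tf

# Pin model construction (and, later, training) to the CPU so the slow
# Metal LSTM kernel is never used, even with tensorflow-metal installed.
with tf.device("/CPU:0"):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(100,)),
        tf.keras.layers.Embedding(5000, 16),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
```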
System Details:
Model: MacBook Pro (16-inch, 2021)
Chip: Apple M1 Max
Memory: 64GB
OS: macOS 12.0.1
Key Libraries: tensorflow-metal (0.2), tensorflow-macos (2.6.0), Python (3.9.7)
Summary:
I have noticed low test accuracy during and after training TensorFlow LSTM models on M1-based Macs with tensorflow-metal (GPU). While I have observed this across more than one model, I chose a standard GitHub repo to reproduce the problem: a community educational example based on the Sequential Models course of the Deep Learning Specialization (Andrew Ng).
Steps to Reproduce:
git clone https://github.com/omerbsezer/LSTM_RNN_Tutorials_with_Demo.git
cd LSTM_RNN_Tutorials_with_Demo/SentimentAnalysisProject
python main.py
Results:
Test accuracy (CPU only, without tensorflow-metal): ~83%
Test accuracy (GPU using tensorflow-metal): ~37%
A similar pattern can be observed in the per-epoch accuracy and loss figures during training.
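To produce the CPU-only baseline without uninstalling tensorflow-metal between runs, one alternative (an assumption on my part, not part of the repo above) is to hide the GPU from TensorFlow at process start, before any ops are created, and then rerun the same script:

```python
import tensorflow as tf

# Hide any GPU from TensorFlow for this process; subsequent ops run on
# the CPU, giving a CPU-only accuracy baseline for comparison.
tf.config.set_visible_devices([], "GPU")
print("Visible GPUs:", tf.config.get_visible_devices("GPU"))
```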
System Details:
Model: MacBook Pro (16-inch, 2021)
Chip: Apple M1 Max
Memory: 64GB
OS: macOS 12.0.1
Key Libraries: tensorflow-metal (0.2), tensorflow-macos (2.6.0), Python (3.9.7)