Why Python native on M1 Max is greatly slower than Python on old Intel i5?

I just got my new MacBook Pro with M1 Max chip and am setting up Python. I've tried several combinational settings to test speed - now I'm quite confused. First put my questions here:

  • Why python run natively on M1 Max is greatly (~100%) slower than on my old MacBook Pro 2016 with Intel i5?
  • On M1 Max, why there isn't significant speed difference between native run (by miniforge) and run via Rosetta (by anaconda) - which is supposed to be slower ~20%?
  • On M1 Max and native run, why there isn't significant speed difference between conda installed Numpy and TensorFlow installed Numpy - which is supposed to be faster?
  • On M1 Max, why run in PyCharm IDE is constantly slower ~20% than run from terminal, which doesn't happen on my old Intel Mac.

Evidence supporting my questions is as follows:


Here are the settings I've tried:

1. Python installed by

  • Miniforge-arm64, so that python is natively run on M1 Max Chip. (Check from Activity Monitor, Kind of python process is Apple).
  • Anaconda.: Then python is run via Rosseta. (Check from Activity Monitor, Kind of python process is Intel).

2. Numpy installed by

  • conda install numpy: numpy from original conda-forge channel, or pre-installed with anaconda.
  • Apple-TensorFlow: with python installed by miniforge, I directly install tensorflow, and numpy will also be installed. It's said that, numpy installed in this way is optimized for Apple M1 and will be faster. Here is the installation commands:
conda install -c apple tensorflow-deps
python -m pip install tensorflow-macos
python -m pip install tensorflow-metal

3. Run from

  • Terminal.
  • PyCharm (Apple Silicon version).

Here is the test code:

import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')

and here are the results:

+-----------------------------------+-----------------------+--------------------+
|   Python installed by (run on)→   | Miniforge (native M1) | Anaconda (Rosseta) |
+----------------------+------------+------------+----------+----------+---------+
| Numpy installed by ↓ | Run from → |  Terminal  |  PyCharm | Terminal | PyCharm |
+----------------------+------------+------------+----------+----------+---------+
|          Apple Tensorflow         |   4.19151  |  4.86248 |     /    |    /    |
+-----------------------------------+------------+----------+----------+---------+
|        conda install numpy        |   4.29386  |  4.98370 |  4.10029 | 4.99271 |
+-----------------------------------+------------+----------+----------+---------+

This is quite slow. For comparison,

  • run the same code on my old MacBook Pro 2016 with i5 chip - it costs 2.39917s.
  • another post reports that run with M1 chip (not Pro or Max), miniforge+conda_installed_numpy is 2.53214s, and miniforge+apple_tensorflow_numpy is 1.00613s.
  • you may also try on it your own.

Here is the CPU information details:

  • My old i5:
$ sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Intel(R) Core(TM) i5-6360U CPU @ 2.00GHz
machdep.cpu.core_count: 2
  • My new M1 Max:
% sysctl -a | grep -e brand_string -e cpu.core_count
machdep.cpu.brand_string: Apple M1 Max
machdep.cpu.core_count: 10

I follow instructions strictly from tutorials - but why would all these happen? Is it because of my installation flaws, or because of M1 Max chip? Since my work relies heavily on local runs, local speed is very important to me. Any suggestions to possible solution, or any data points on your own device would be greatly appreciated :)

Probably a dependancy and or compiler issue... they are not all fully optomized for M1 yet i believe, but could be mistaken.

Hi graphitum, i have obtained the same results!! On my MacBook m1 pro results has been : 4.34 secs while on my iMac 27 with i7-10700K the program returned an incredible 0.75532 secs !!

I'm very confused about this !!! We have to see deep inside the question....

As I know, Intel has super powerful math library.. you can see more details at below link.: https://www.pugetsystems.com/labs/hpc/AMD-Ryzen-3900X-vs-Intel-Xeon-2175W-Python-numpy---MKL-vs-OpenBLAS-1560/

I have worked out a workaround solution - How to install numpy on M1 Max, with the most accelerated performance (Apple's vecLib)? Here's the answer as of Dec 6 2021.


Steps

I. Install miniforge

So that your Python is run natively on arm64, not translated via Rosseta.

  1. Download Miniforge3-MacOSX-arm64.sh, then
  2. Run the script, then open another shell
$ bash Miniforge3-MacOSX-arm64.sh
  1. Create an environment (here I use name np_veclib)
$ conda create -n np_veclib python=3.9
$ conda activate np_veclib

II. Install Numpy with BLAS interface specified as vecLib

  1. To compile numpy, first need to install cython and pybind11:
$ conda install cython pybind11
  1. Compile numpy by (Thanks @Marijn's answer) - don't use conda install!
$ pip install --no-binary :all: --no-use-pep517 numpy
  1. An alternative of 2. is to build from source
$ git clone https://github.com/numpy/numpy
$ cd numpy
$ cp site.cfg.example site.cfg
$ nano site.cfg

Edit the copied site.cfg: add the following lines:

[accelerate]
libraries = Accelerate, vecLib

Then build and install:

$ NPY_LAPACK_ORDER=accelerate python setup.py build
$ python setup.py install
  1. After either 2 or 3, now test whether numpy is using vecLib:
>>> import numpy
>>> numpy.show_config()

Then, info like /System/Library/Frameworks/vecLib.framework/Headers should be printed.

III. For further installing other packages using conda

Make conda recognize packages installed by pip

conda config --set pip_interop_enabled true

This must be done, otherwise if e.g. conda install pandas, then numpy will be in The following packages will be installed list and installed again. But the new installed one is from conda-forge channel and is slow.


Comparisons to other installations:

1. Competitors:

Except for the above optimal one, I also tried several other installations

  • A. np_default: conda create -n np_default python=3.9 numpy
  • B. np_openblas: conda create -n np_openblas python=3.9 numpy blas=*=*openblas*
  • C. np_netlib: conda create -n np_netlib python=3.9 numpy blas=*=*netlib*

The above ABC options are directly installed from conda-forge channel. numpy.show_config() will show identical results. To see the difference, examine by conda list - e.g. openblas packages are installed in B. Note that mkl or blis is not supported on arm64.

  • D. np_openblas_source: First install openblas by brew install openblas. Then add [openblas] path /opt/homebrew/opt/openblas to site.cfg and build Numpy from source.
  • M1 and i9–9880H in this post.
  • My old i5-6360U 2cores on MacBook Pro 2016 13in.

2. Benchmarks:

Here I use two benchmarks:

  1. mysvd.py: My SVD decomposition
import time
import numpy as np
np.random.seed(42)
a = np.random.uniform(size=(300, 300))
runtimes = 10

timecosts = []
for _ in range(runtimes):
    s_time = time.time()
    for i in range(100):
        a += 1
        np.linalg.svd(a)
    timecosts.append(time.time() - s_time)

print(f'mean of {runtimes} runs: {np.mean(timecosts):.5f}s')
  1. dario.py: A benchmark script by Dario Radečić at the post above.

3. Results:

+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
|  sec  | np_veclib | np_default | np_openblas | np_netlib | np_openblas_source | M1 | i9–9880H | i5-6360U |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
| mysvd |  1.02300  |   4.29386  |   4.13854   |  4.75812  |      12.57879      |  / |     /    |  2.39917 |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+
| dario |     21    |     41     |      39     |    323    |         40         | 33 |    23    |    78    |
+-------+-----------+------------+-------------+-----------+--------------------+----+----------+----------+

I tested this on my M1 MacBook Pro 13" MacOS Big Sur, using python 3.8, numpy 1.19 (from Apple's TF 2.6.0 dependencies) and got 2.35s average.

Thank You for doing this test. I love Apple eco-system from Jobs. However, this python has been my workhorse for many things. And apparently, m1/m1-max can not make the python more efficient than intel based solution.

Jobs was claiming that the enclosed system could make things more efficient by controlling every detail. This M1/M1 max really makes me think stop to use Apple MacBook. It is immature, no matter how great the hardware as claimed.

really hope Apple can think their original intentions. This is not iPAD for convenient or specific purpose. This is Mac for serious work. Don’t know is any other application encountered same results or not since no live comparison for work.

How do I best install conda? I wiped my hard drive on my M1. I did the command $bash Miniforge3-MacOSX-arm64.sh and next I am tried $ conda create -n pym12022feb python=3.9 and Error is: $ bash: conda: command not found. Many Thanks.

Tried with my new M1 Pro here, 10 CPU, 16 GPU. Python 3.10.2 installed via python website. numpy installed via pip3

Tried with M2 MacBook Air 24GB ram. under 1 sec :)

Instead of all the above, I was able to get ~1 second results on the mysvd.py test by just running:

conda install numpy "libblas=*=*accelerate"

In my new MAC M1.

1)I installed the Minoforge3 (bash Miniforge3-MacOSX-arm64.sh) 2 Initialized a conda base environment (conda init) 3) Installed numpy properly: conda install numpy "libblas=*=*accelerate"

And then the suggested benchmark runs in 1.07172 secs

Why Python native on M1 Max is greatly slower than Python on old Intel i5?
 
 
Q