Bad MPSMatrixMultiplication performance in Big Sur 11.3

Hi,

I'm referring to the simple MPSMatrixMultiplication performance test code provided in another post:

https://developer.apple.com/forums/thread/105534

You can save the code (the second one) in a file, say testMM.swift, and compile it from Terminal by executing:

swiftc -O testMM.swift

Then run the executable in Terminal by executing:

./testMM

The code performs a matrix multiplication using MPSMatrixMultiplication and reports your GPU's calculation performance in GFlops. On my Mac Pro 2019 with an AMD Vega II, I got 3500 GFlops.

Now I ran this test again, recompiling the exact same code after installing macOS Big Sur 11.3 and Xcode 12.5. The performance dropped to only 119 GFlops, i.e. a dramatic performance loss of more than a factor of 30!

Are there any fundamental changes in how Metal Performance Shaders are supposed to be used?
Any idea?

Hi maccan,

This sounds like a significant regression. Can you file a report via Feedback Assistant? If you post the Feedback number here, I can have it routed to the MPS team.

Hi maccan,

Some of our engineers have tried to reproduce this problem but could not. Can you give us some more details about how you tested this? To get an accurate measurement you really need to encode many matrix multiplication operations (most easily done by placing the MPSMatrixMultiplication encode calls in a loop, summing the operation counts, and dividing by the elapsed seconds).
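A minimal sketch of that measurement pattern, assuming an 8192 × 8192 single-precision multiply; the iteration count, buffer setup, and variable names here are illustrative, not the code from the linked thread:

```swift
import Foundation
import Metal
import MetalPerformanceShaders

// Encode many multiplies into one command buffer, then divide the total
// operation count by the elapsed seconds. Matrix size and iteration
// count are illustrative.
let n = 8192
let iterations = 16

guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let cmdBuf = queue.makeCommandBuffer() else {
    fatalError("Metal is not available")
}

let rowBytes = n * MemoryLayout<Float>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes,
                               dataType: .float32)
func makeMatrix() -> MPSMatrix {
    MPSMatrix(buffer: device.makeBuffer(length: n * rowBytes,
                                        options: .storageModePrivate)!,
              descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()

let mm = MPSMatrixMultiplication(device: device,
                                 resultRows: n, resultColumns: n,
                                 interiorColumns: n)

let start = CFAbsoluteTimeGetCurrent()
for _ in 0..<iterations {
    mm.encode(commandBuffer: cmdBuf, leftMatrix: a, rightMatrix: b,
              resultMatrix: c)
}
cmdBuf.commit()
cmdBuf.waitUntilCompleted()
let seconds = CFAbsoluteTimeGetCurrent() - start

// Each N x N product needs N^2 * (2N - 1) floating point operations.
let flops = Double(iterations) * Double(n) * Double(n) * (2.0 * Double(n) - 1.0)
print(String(format: "%.1f GFlops", flops / seconds / 1e9))
```

Because the clock starts before the encode loop, this measures encoding plus execution; amortizing many encodes over one commit reduces the impact of per-dispatch overhead on the reported figure.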
Hi,

I found out that the performance drop in Big Sur 11.3 is due to a poor transfer performance of the data to the GPU.

You can explore this by compiling and running the attached code twice:

First:
Compile it as is: swiftc -O matrixMul.swift and run it by executing: ./matrixMul
On my Mac Pro running macOS Big Sur 11.3.1 I get the following output:

Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on AMD Radeon Pro Vega II
...
Values in matrix C = A * B: 16384.0 uniformly
1'099'444'518'912 floating point operations performed
Elapsed GPU time = 1.92 seconds -> 0.573 Teraflops


Second:
Comment out lines 74 and 75, and uncomment lines 86 and 87 instead.
This shifts the start of the time measurement from the beginning of the encoding procedure to the commit statement, i.e. the elapsed time reported then reflects only the time spent on the calculation on the GPU.
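The two variants can also be combined into a single run by taking a timestamp at both points; the names below are illustrative and not taken from the attached file:

```swift
import Foundation
import Metal
import MetalPerformanceShaders

// Measure encode time and GPU execution time separately in one run
// (illustrative names; a small matrix keeps the sketch quick).
let n = 1024
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let cmdBuf = queue.makeCommandBuffer() else {
    fatalError("Metal is not available")
}

let rowBytes = n * MemoryLayout<Float>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n, rowBytes: rowBytes,
                               dataType: .float32)
func makeMatrix() -> MPSMatrix {
    MPSMatrix(buffer: device.makeBuffer(length: n * rowBytes)!, descriptor: desc)
}
let a = makeMatrix(), b = makeMatrix(), c = makeMatrix()
let mm = MPSMatrixMultiplication(device: device, resultRows: n,
                                 resultColumns: n, interiorColumns: n)

let t0 = CFAbsoluteTimeGetCurrent()   // first variant starts the clock here
mm.encode(commandBuffer: cmdBuf, leftMatrix: a, rightMatrix: b, resultMatrix: c)
let t1 = CFAbsoluteTimeGetCurrent()   // second variant starts the clock here
cmdBuf.commit()
cmdBuf.waitUntilCompleted()
let t2 = CFAbsoluteTimeGetCurrent()

print("encode: \(t1 - t0) s, GPU: \(t2 - t1) s, total: \(t2 - t0) s")
```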

Compile again: swiftc -O matrixMul.swift and run: ./matrixMul
This time I get

Values in matrix A[8192 x 8192]: 1.0 uniformly
Values in matrix B[8192 x 8192]: 2.0 uniformly
Starting calculation on AMD Radeon Pro Vega II
...
Values in matrix C = A * B: 16384.0 uniformly
1'099'444'518'912 floating point operations performed
Elapsed GPU time = 0.164 seconds -> 6.704 Teraflops
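The reported figures are internally consistent: for N = 8192 the operation count is N²(2N − 1), and dividing by the elapsed seconds gives the two Teraflops values above. A quick check in plain Swift:

```swift
// Verify the flop count and Teraflops figures reported above.
// An N x N by N x N multiply takes N^2 * (2N - 1) floating point
// operations (N multiplies and N - 1 adds per output element).
let n = 8192.0
let flops = n * n * (2.0 * n - 1.0)
print(flops)                    // 1'099'444'518'912 operations
print(flops / 1.92 / 1e12)      // ~0.573 Teraflops (encoding + GPU)
print(flops / 0.164 / 1e12)     // ~6.704 Teraflops (GPU only)
```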


As can be seen, the time needed for the encoding / transfer to the GPU is dominant.
This was not the case in macOS versions prior to 11.3!
I got 0.25 seconds on average for the First procedure, i.e. the time for encoding / transfer was much shorter!
It seems that data transfer / encoding in the latest macOS version is far less efficient than in previous versions of macOS. Maybe the underlying framework is now more optimized for the M1 chips, with some drawbacks for the Mac Pro 2019 architecture?

Hope you can reproduce this as well!

Thank you


Thanks for the info, maccan!

Some MPS engineers were able to reproduce the problem and have already made some progress in investigating a fix.

Thanks for investigating this!
