Posts

Post not yet marked as solved
5 Replies
1k Views
Hi, I'm referring to the simple MPSMatrixMultiplication performance test code provided in another post: https://developer.apple.com/forums/thread/105534

You can save the code (the 2nd one) in a file, say testMM.swift, and compile it from a terminal by executing: swiftc -O testMM.swift
Then run the executable in Terminal by executing: ./testMM

The code performs a matrix multiplication using MPSMatrixMultiplication and reports the calculation performance of your GPU in GFlops. On my Mac Pro 2019 with an AMD Vega II, I got 3500 GFlops. I then repeated the test, recompiling the exact same code after installing macOS Big Sur 11.3 and Xcode 12.5. The performance dropped to only 119 GFlops, i.e. a dramatic performance loss of more than a factor of 30! Have there been any fundamental changes in how Metal Performance Shaders are used? Any idea?
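For context, a GFlops figure like the one above is normally derived from the timed multiplication by counting roughly 2·n³ floating-point operations for an n×n by n×n multiply. A minimal sketch of that conversion (the values and variable names here are illustrative, not taken from the linked test code):

import Foundation

// Sketch: convert a measured execution time into GFlops.
// n and executionTime are example values, not results from the test above.
let n = 8192                            // matrix dimension (n x n)
let executionTime = 1.23                // measured GPU time in seconds (example)
let flops = 2.0 * pow(Double(n), 3)     // ~2*n^3 floating-point operations
let gflops = flops / executionTime / 1e9
print("performance =", gflops, "GFlops")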
Posted by maccan.
Post not yet marked as solved
1 Reply
3.1k Views
My MacBook Pro 2018 has two GPUs: an AMD Radeon Pro 555X and an Intel(R) UHD Graphics 630. I assumed the AMD 555X would be superior in performance to the Intel(R) UHD Graphics 630. However, I observed a huge performance difference for Metal Performance Shaders (MPS) between the two GPUs: the Intel GPU runs the simple test code (an MPSMatrixMultiplication) 3 times faster than the AMD 555X.

You can compile the attached code in a Terminal with 'swiftc -O matrixMul.swift' and run it by executing ./matrixMul

In the test code, I can select execution on the AMD 555X with the statement
let device = devices[0] // AMD Radeon Pro 555X
and I get the following:
start calculation on GPU-device <BronzeMtlDevice: 0x1071bf000> name = AMD Radeon Pro 555X
...
GPU execution time = 12.612 seconds

The Intel(R) UHD Graphics 630 is selected by
let device = devices[1] // Intel(R) UHD Graphics 630
and I get
start calculation on GPU-device <MTLIGAccelDevice: 0x10f9c5000> name = Intel(R) UHD Graphics 630
...
GPU execution time = 3.735 seconds

As you can see, the Intel UHD 630 performed the MPSMatrixMultiplication 3 times faster than the AMD 555X. I thought the AMD 555X would be more powerful than the Intel UHD 630, but this test shows the opposite. Any idea?

-------------------- test code

import Metal
import Accelerate
import MetalPerformanceShaders

let devices = MTLCopyAllDevices()
print("available GPUs")
for d in devices {
    print(d)
}
let device = devices[0] // AMD Radeon Pro 555X
//let device = devices[1] // Intel(R) UHD Graphics 630
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!

let n = 8192 // matrix dimension (n x n)
let rowsA = n
let columnsA = n
let rowsB = n
let columnsB = n
let rowsC = n
let columnsC = n

// matrix A data
var arrayA = [Float](repeating: 1, count: rowsA * columnsA)
for i in 0..<arrayA.count {
    arrayA[i] = Float(2 * drand48() - 1)
}

// matrix B data
var arrayB = [Float](repeating: 2, count: rowsB * columnsB)
for i in 0..<arrayB.count {
    arrayB[i] = Float(2 * drand48() - 1)
}

// MTL data buffers for matrices A, B, C
let bufferA = device.makeBuffer(bytes: arrayA, length: rowsA * columnsA * MemoryLayout<Float>.stride, options: [])!
let bufferB = device.makeBuffer(bytes: arrayB, length: rowsB * columnsB * MemoryLayout<Float>.stride, options: [])!
let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride, options: [])!

// matrix descriptors
let descA = MPSMatrixDescriptor(dimensions: rowsA, columns: columnsA, rowBytes: columnsA * MemoryLayout<Float>.stride, dataType: .float32)
let descB = MPSMatrixDescriptor(dimensions: rowsB, columns: columnsB, rowBytes: columnsB * MemoryLayout<Float>.stride, dataType: .float32)
let descC = MPSMatrixDescriptor(dimensions: rowsC, columns: columnsC, rowBytes: columnsC * MemoryLayout<Float>.stride, dataType: .float32)

// MPS matrices backed by the MTL buffers
let matrixA = MPSMatrix(buffer: bufferA, descriptor: descA)
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descB)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descC)

let matrixMultiplication = MPSMatrixMultiplication(device: device, transposeLeft: false, transposeRight: false, resultRows: rowsC, resultColumns: columnsC, interiorColumns: columnsA, alpha: 1, beta: 0)
matrixMultiplication.encode(commandBuffer: commandBuffer, leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)

print("start calculation on GPU-device \(device)")
let start = DispatchTime.now().uptimeNanoseconds
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let end = DispatchTime.now().uptimeNanoseconds
let execTime = String(format: "%.3f", 1e-9 * Double(end - start))

// we look at the result
let rawPointer = matrixC.data.contents()
let count = matrixC.rows * matrixC.columns
let typedPointer = rawPointer.bindMemory(to: Float.self, capacity: count)
let bufferedPointer = UnsafeBufferPointer(start: typedPointer, count: count)

// Print the first few results, to make sure it's not all 0s or NaNs.
print("\nFirst 5 elements:")
for i in 0..<5 {
    print("element \(i) =", bufferedPointer[i])
}
print("...")
print("last element =", bufferedPointer[n * n - 1])
print("...")
print("GPU execution time = \(execTime) seconds")
exit(0)

------------------ end test-code
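One thing worth double-checking before reading too much into the numbers is which entry of MTLCopyAllDevices() actually corresponds to which GPU, since the order of the returned array is not guaranteed. A minimal sketch (not part of the original test code; it relies on macOS-only MTLDevice properties) that labels each device:

import Metal

// Sketch: print each GPU returned by MTLCopyAllDevices() with a few
// properties that help tell the integrated GPU from the discrete one.
// On a dual-GPU MacBook Pro, isLowPower is true for the integrated (Intel) GPU.
for (index, device) in MTLCopyAllDevices().enumerated() {
    print("devices[\(index)]: \(device.name)",
          "lowPower=\(device.isLowPower)",
          "headless=\(device.isHeadless)",
          "removable=\(device.isRemovable)")
}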
Posted by maccan.
Post not yet marked as solved
1 Reply
461 Views
I did a very simple test using Metal Performance Shaders. It is basically an MPSMatrixMultiplication. I compiled it in Terminal with 'swiftc matrixMul.swift'. Then you get the executable called matrixMul. Now execute 'time ./matrixMul' and after approximately 10 seconds you will get the first ten elements of the result matrix printed on screen.

This all works as intended. However, I realized in Activity Monitor (and also because of high GPU fan speed) that somehow the GPU isn't fully released: the GPU monitor in Activity Monitor shows 100% activity for several minutes after program exit. Somehow the matrixMul process keeps the GPU busy after program exit. Am I missing some statement in my code that tells the system to free resources?

Below is the simple test code

-------------------------------------

import Metal
import Accelerate
import MetalPerformanceShaders

let n = 8192
let rowsA = n
let columnsA = n
let rowsB = n
let columnsB = n
let rowsC = n
let columnsC = n

var arrayA = [Float](repeating: 1, count: rowsA * columnsA)
var arrayB = [Float](repeating: 2, count: rowsB * columnsB)
var arrayC = [Float](repeating: 0, count: rowsC * columnsC)

var device: MTLDevice!
device = MTLCreateSystemDefaultDevice()
guard device != nil else {
    fatalError("Error: This device does not support Metal")
}

let bufferA = device.makeBuffer(bytes: arrayA, length: rowsA * columnsA * MemoryLayout<Float>.stride, options: [])!
let bufferB = device.makeBuffer(bytes: arrayB, length: rowsB * columnsB * MemoryLayout<Float>.stride, options: [])!
let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride, options: [])!

let descA = MPSMatrixDescriptor(dimensions: rowsA, columns: columnsA, rowBytes: columnsA * MemoryLayout<Float>.stride, dataType: .float32)
let descB = MPSMatrixDescriptor(dimensions: rowsB, columns: columnsB, rowBytes: columnsB * MemoryLayout<Float>.stride, dataType: .float32)
let descC = MPSMatrixDescriptor(dimensions: rowsC, columns: columnsC, rowBytes: columnsC * MemoryLayout<Float>.stride, dataType: .float32)

var matrixA: MPSMatrix!
var matrixB: MPSMatrix!
var matrixC: MPSMatrix!
matrixA = MPSMatrix(buffer: bufferA, descriptor: descA)
matrixB = MPSMatrix(buffer: bufferB, descriptor: descB)
matrixC = MPSMatrix(buffer: bufferC, descriptor: descC)

let matrixMultiplication = MPSMatrixMultiplication(device: device, transposeLeft: false, transposeRight: false, resultRows: rowsC, resultColumns: columnsC, interiorColumns: columnsA, alpha: 1, beta: 0)

var commandQueue: MTLCommandQueue!
commandQueue = device.makeCommandQueue()
let commandBuffer = commandQueue.makeCommandBuffer()!
matrixMultiplication.encode(commandBuffer: commandBuffer, leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)

print("start calculation on GPU")
let start = DispatchTime.now()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let end = DispatchTime.now()
print("time =", 1e-9 * Double(end.uptimeNanoseconds - start.uptimeNanoseconds), "sec")

// we look at the result
let rawPointer = matrixC.data.contents()
let count = matrixC.rows * matrixC.columns
let typedPointer = rawPointer.bindMemory(to: Float.self, capacity: count)
let bufferedPointer = UnsafeBufferPointer(start: typedPointer, count: count)

// Print the first 10 results, to make sure it's not all 0s or NaNs.
print("\nFirst 10 results:")
for i in 0..<10 {
    print(arrayC[i], bufferedPointer[i])
}

-----------------------------
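One thing sometimes suggested for command-line Metal tools (this is an assumption, not a confirmed fix for the behaviour described above) is to scope the Metal/MPS work in an explicit autoreleasepool, so that transient objects created by the frameworks are drained before the process exits. A minimal sketch of that restructuring, with the benchmark body elided:

import Foundation
import Metal
import MetalPerformanceShaders

// Hypothetical restructuring: run the whole benchmark inside an
// autoreleasepool so temporary Metal/MPS objects are released before exit.
autoreleasepool {
    guard let device = MTLCreateSystemDefaultDevice(),
          let commandQueue = device.makeCommandQueue(),
          let commandBuffer = commandQueue.makeCommandBuffer() else {
        fatalError("Error: This device does not support Metal")
    }
    // ... create the buffers and MPSMatrix objects and encode the
    //     MPSMatrixMultiplication exactly as in the code above ...
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
// The pool has been drained by the time execution reaches this point.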
Posted by maccan.