Metal Performance: Intel(R) UHD Graphics 630 vs. AMD Radeon Pro 555X

My 2018 MacBook Pro has two GPUs:

an AMD Radeon Pro 555X and an Intel(R) UHD Graphics 630.

I expected the AMD 555X to clearly outperform the Intel(R) UHD Graphics 630.


However, I observed a huge performance difference between the two GPUs with Metal Performance Shaders (MPS).

The Intel GPU runs the simple test code below (an MPSMatrixMultiplication) about three times faster than the AMD 555X.


You can compile the attached code in Terminal with 'swiftc -O matrixMul.swift'

and run it by executing ./matrixMul


In the test code, I can select execution on the AMD 555X with the statement

let device = devices[0] // AMD Radeon Pro 555X


and I get the following:


start calculation on GPU-device <BronzeMtlDevice: 0x1071bf000>

name = AMD Radeon Pro 555X

...

GPU execution time = 12.612 seconds


The Intel(R) UHD Graphics 630 is selected by

let device = devices[1] // Intel(R) UHD Graphics 630


and I get


start calculation on GPU-device <MTLIGAccelDevice: 0x10f9c5000>

name = Intel(R) UHD Graphics 630

...

GPU execution time = 3.735 seconds


As you can see, the Intel UHD 630 performed the MPSMatrixMultiplication about three times faster than the AMD 555X.

I thought the AMD 555X would be more powerful than the Intel UHD 630, but this test shows the opposite.

Any idea?


-------------------- test code


import Metal
import MetalPerformanceShaders
import Accelerate

// List the available Metal devices
let devices = MTLCopyAllDevices()
print("available GPUs")
for d in devices {
    print(d)
}

// Select the GPU to run on
let device = devices[0]   // AMD Radeon Pro 555X
//let device = devices[1] // Intel(R) UHD Graphics 630

let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!

let n = 8192 // matrix dimension (n x n)
let rowsA = n
let columnsA = n
let rowsB = n
let columnsB = n
let rowsC = n
let columnsC = n

// matrix A data
var arrayA = [Float](repeating: 1, count: rowsA * columnsA)
for i in 0..<arrayA.count {
    arrayA[i] = Float(2 * drand48() - 1)
}

// matrix B data
var arrayB = [Float](repeating: 2, count: rowsB * columnsB)
for i in 0..<arrayB.count {
    arrayB[i] = Float(2 * drand48() - 1)
}

// MTL data buffers for matrices A, B, C
let bufferA = device.makeBuffer(bytes: arrayA,
                                length: rowsA * columnsA * MemoryLayout<Float>.stride,
                                options: [])!
let bufferB = device.makeBuffer(bytes: arrayB,
                                length: rowsB * columnsB * MemoryLayout<Float>.stride,
                                options: [])!
let bufferC = device.makeBuffer(length: rowsC * columnsC * MemoryLayout<Float>.stride,
                                options: [])!

// Matrix descriptors
let descA = MPSMatrixDescriptor(dimensions: rowsA, columns: columnsA,
                                rowBytes: columnsA * MemoryLayout<Float>.stride,
                                dataType: .float32)
let descB = MPSMatrixDescriptor(dimensions: rowsB, columns: columnsB,
                                rowBytes: columnsB * MemoryLayout<Float>.stride,
                                dataType: .float32)
let descC = MPSMatrixDescriptor(dimensions: rowsC, columns: columnsC,
                                rowBytes: columnsC * MemoryLayout<Float>.stride,
                                dataType: .float32)

// MPS matrices backed by the MTL buffers
let matrixA = MPSMatrix(buffer: bufferA, descriptor: descA)
let matrixB = MPSMatrix(buffer: bufferB, descriptor: descB)
let matrixC = MPSMatrix(buffer: bufferC, descriptor: descC)

// Encode C = A * B
let matrixMultiplication = MPSMatrixMultiplication(device: device,
                                                   transposeLeft: false, transposeRight: false,
                                                   resultRows: rowsC, resultColumns: columnsC,
                                                   interiorColumns: columnsA, alpha: 1, beta: 0)
matrixMultiplication.encode(commandBuffer: commandBuffer, leftMatrix: matrixA,
                            rightMatrix: matrixB, resultMatrix: matrixC)

print("start calculation on GPU-device \(device)")

// Wall-clock time around commit + completion (includes any host/VRAM transfers)
let start = DispatchTime.now().uptimeNanoseconds
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let end = DispatchTime.now().uptimeNanoseconds
let execTime = String(format: "%.3f", 1e-9 * Double(end - start))

// Read back the result
let rawPointer = matrixC.data.contents()
let count = matrixC.rows * matrixC.columns
let typedPointer = rawPointer.bindMemory(to: Float.self, capacity: count)
let bufferedPointer = UnsafeBufferPointer(start: typedPointer, count: count)

// Print the first 5 elements and the last one, to make sure it's not all 0s or NaNs.
print("\nFirst 5 elements:")
for i in 0..<5 {
    print("element \(i) =", bufferedPointer[i])
}
print("...")
print("last element =", bufferedPointer[n * n - 1])
print("...")
print("GPU execution time = \(execTime) seconds")

exit(0)

------------------ end test-code

Replies

Yes, the integrated GPU will be much faster for this. That's because it's quite a light workload, but one that involves a big block of memory. The iGPU can work directly on the data in system memory, while the dGPU has to copy it to VRAM, do the work, and then copy the result back. The actual processing is much faster on the dGPU, but copying across the PCIe bus takes enough time to make it slower overall.
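
One way to check how much of the difference is data movement rather than arithmetic is to read the command buffer's own GPU timestamps. Below is a minimal sketch (assuming macOS 10.15 or later for gpuStartTime/gpuEndTime) that runs the same kind of MPSMatrixMultiplication on every available device and prints both the CPU-observed wall-clock time and the execution window the GPU itself reports. A large gap between the two numbers would suggest time spent outside GPU execution, e.g. scheduling, synchronization, or transfers.

-------------------- timing sketch

import Foundation
import Metal
import MetalPerformanceShaders

// Run one MPSMatrixMultiplication on every available GPU and report
// (a) the wall-clock time seen by the CPU around commit/waitUntilCompleted
// (b) the GPU execution window from gpuStartTime/gpuEndTime (macOS 10.15+).
// Matrix size matches the original test; adjust n as needed.
let n = 8192
let byteCount = n * n * MemoryLayout<Float>.stride

// Random input data, as in the original test
var arrayA = [Float](repeating: 0, count: n * n)
var arrayB = [Float](repeating: 0, count: n * n)
for i in 0..<(n * n) {
    arrayA[i] = Float(2 * drand48() - 1)
    arrayB[i] = Float(2 * drand48() - 1)
}

for device in MTLCopyAllDevices() {
    let queue = device.makeCommandQueue()!

    let bufferA = device.makeBuffer(bytes: arrayA, length: byteCount, options: [])!
    let bufferB = device.makeBuffer(bytes: arrayB, length: byteCount, options: [])!
    let bufferC = device.makeBuffer(length: byteCount, options: [])!

    let desc = MPSMatrixDescriptor(dimensions: n, columns: n,
                                   rowBytes: n * MemoryLayout<Float>.stride,
                                   dataType: .float32)
    let matrixA = MPSMatrix(buffer: bufferA, descriptor: desc)
    let matrixB = MPSMatrix(buffer: bufferB, descriptor: desc)
    let matrixC = MPSMatrix(buffer: bufferC, descriptor: desc)

    let matMul = MPSMatrixMultiplication(device: device,
                                         transposeLeft: false, transposeRight: false,
                                         resultRows: n, resultColumns: n,
                                         interiorColumns: n, alpha: 1, beta: 0)

    let commandBuffer = queue.makeCommandBuffer()!
    matMul.encode(commandBuffer: commandBuffer, leftMatrix: matrixA,
                  rightMatrix: matrixB, resultMatrix: matrixC)

    let wallStart = DispatchTime.now().uptimeNanoseconds
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    let wallEnd = DispatchTime.now().uptimeNanoseconds

    let wallTime = 1e-9 * Double(wallEnd - wallStart)                    // total time seen by the CPU
    let gpuTime = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime  // GPU execution window only

    print(device.name,
          "- wall-clock:", String(format: "%.3f s", wallTime),
          "| GPU-only:", String(format: "%.3f s", gpuTime))
}

------------------ end timing sketch

Another option, if you want to take the PCIe transfer out of the measurement entirely on the discrete GPU, would be to blit the input matrices into .storageModePrivate buffers in a separate command buffer first, and only then time the command buffer that does the multiplication.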