Does MetalFX utilize the NPU?

I heard that MetalFX TAA utilizes the NPU.

I wonder whether spatial scaling also utilizes the NPU.

Answered by philipturner in 727535022

I highly doubt that MetalFX utilizes the ANE. More information is in https://developer.apple.com/forums/thread/707667. The reason is that switching contexts between accelerators incurs a lot of overhead, and the latency might be several milliseconds. Even if the Neural Engine has higher throughput, it's harder to access and less programmable. Furthermore, Apple GPUs, starting with the Apple7 generation, have hardware acceleration for matrix multiplication. It's called simdgroup_matrix and is documented in the MSL specification. It increases ALU utilization from 25% to 80%. The fact that MetalFX is limited to Apple7 and Apple8 GPUs, the only GPUs with simdgroup_matrix, further supports this hypothesis.

More on how powerful simdgroup_matrix is: the M1 Max has a GPU with 10 TFLOPS of F32 throughput. Doubling that gives 20 TFLOPS F16, and 80% of that is 16 TFLOPS F16. This is more processing power than the A14/M1's ANE, which is 11 TFLOPS F16. This could explain why Apple currently limits MetalFX to high-end Macs, where the GPU is more powerful than the ANE. On an A14/A15, it might be more power-efficient to run an image-upscaling CoreML model on the ANE.
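
The arithmetic behind that estimate can be checked with a short script. This is only a sketch that plugs in the figures quoted in this post (10 TFLOPS F32, the assumed 2x F16 rate, 80% utilization, 11 TFLOPS for the ANE); none of them are measured here, and the variable names are just for illustration.

```python
# Back-of-the-envelope estimate using only the figures quoted in the post.
M1_MAX_GPU_F32_TFLOPS = 10.0          # quoted GPU peak throughput, F32
F16_SPEEDUP = 2.0                     # the post's assumption: F16 at twice the F32 rate
SIMDGROUP_MATRIX_UTILIZATION = 0.80   # quoted ALU utilization with simdgroup_matrix
ANE_F16_TFLOPS = 11.0                 # quoted A14/M1 ANE throughput, F16


def effective_gpu_f16_tflops(f32_peak, f16_speedup, utilization):
    """Peak F16 rate scaled by the fraction of ALU time actually used."""
    return f32_peak * f16_speedup * utilization


gpu_f16 = effective_gpu_f16_tflops(
    M1_MAX_GPU_F32_TFLOPS, F16_SPEEDUP, SIMDGROUP_MATRIX_UTILIZATION
)
gpu_to_ane_ratio = gpu_f16 / ANE_F16_TFLOPS

print(f"Estimated GPU matrix throughput: {gpu_f16:.1f} TFLOPS F16")
print(f"GPU / ANE ratio: {gpu_to_ane_ratio:.2f}x")
```

Under these assumptions the GPU estimate comes out roughly 1.45 times the quoted ANE figure, which is the comparison the argument rests on.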

MetalFX TAA actually does use the ANE. I was ray tracing at 120 Hz, and the GPU was using only 2 W (frame time was 4 milliseconds out of 8.3). It was running at the lower 300-500 MHz clock speeds to decrease power consumption beyond what you'd think is possible. However, the ANE was also drawing 80 mW. The ANE was at 0 W when MetalFX was off, and went to 80 mW at exactly the moment MetalFX turned on. I also got some error messages about the ANE in the Xcode console whenever I used Metal Frame Capture.
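
To put those numbers in proportion, here is a small sketch that derives the frame budget and the ANE's share of the measured power. It uses only the figures reported above (120 Hz, 4 ms of GPU time, 2 W GPU, 80 mW ANE); the variable names are illustrative, and other components such as the CPU and DRAM are not counted.

```python
# Figures reported in the post; nothing here is measured by this script.
REFRESH_HZ = 120
GPU_BUSY_MS = 4.0      # GPU time spent per frame
GPU_POWER_W = 2.0      # GPU power while rendering
ANE_POWER_W = 0.080    # ANE power once MetalFX was enabled (80 mW)

# Time available per frame at the given refresh rate.
frame_budget_ms = 1000.0 / REFRESH_HZ

# Fraction of each frame interval during which the GPU was busy.
gpu_busy_fraction = GPU_BUSY_MS / frame_budget_ms

# ANE share of the combined GPU + ANE power.
ane_power_share = ANE_POWER_W / (GPU_POWER_W + ANE_POWER_W)

print(f"Frame budget: {frame_budget_ms:.2f} ms")
print(f"GPU busy: {gpu_busy_fraction:.0%} of the frame")
print(f"ANE share of GPU+ANE power: {ane_power_share:.1%}")
```

The ANE accounts for only a few percent of the combined power here, so whatever work it does for MetalFX is small next to the GPU's.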

This could explain why, in Apple's MetalFX video, they stress giving you the ability to overlap work from different frames. I imagine the ANE has very high access latency, or some peculiarities in how it's accessed. MetalFX has a pipeline that runs and automatically finishes in time for your next frame submission. It executes work sporadically throughout the entire frame, presumably to hide some kind of latency. It might be shuffling work back and forth between the GPU and ANE.
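
A toy model shows why overlapping frames would hide that kind of latency. This is a sketch under stated assumptions, not a description of how MetalFX is actually scheduled: the 4 ms render time comes from the measurement above, while the 5 ms upscaler latency is a purely hypothetical value chosen for illustration.

```python
# Toy two-stage pipeline: render on the GPU, then upscale elsewhere.
RENDER_MS = 4.0                   # GPU render time per frame, from the post
UPSCALE_LATENCY_MS = 5.0          # HYPOTHETICAL upscaler latency, illustration only
FRAME_BUDGET_MS = 1000.0 / 120    # 120 Hz target


def serial_interval_ms(render_ms, upscale_ms):
    """Each frame waits for its own upscale before the next render starts."""
    return render_ms + upscale_ms


def overlapped_interval_ms(render_ms, upscale_ms):
    """Frame N+1 renders while frame N upscales; the slower stage sets the pace."""
    return max(render_ms, upscale_ms)


serial = serial_interval_ms(RENDER_MS, UPSCALE_LATENCY_MS)
overlapped = overlapped_interval_ms(RENDER_MS, UPSCALE_LATENCY_MS)

print(f"Serial:     {serial:.1f} ms/frame, fits budget: {serial <= FRAME_BUDGET_MS}")
print(f"Overlapped: {overlapped:.1f} ms/frame, fits budget: {overlapped <= FRAME_BUDGET_MS}")
```

With these numbers the serial schedule misses the 8.3 ms budget while the overlapped one fits. Overlap improves the rate at which frames complete, but each individual frame still takes render plus upscale time from start to finish.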

MetalFX spatial upscaling probably does not use the ANE, because it is compatible with Intel Macs.
