Apple planned to implement hardware ray tracing with the A16 chip in the iPhone 14 Pro, but unfortunately a fatal bug was found in the final design (a quick Google search will lead you to a MacRumors article on this). Unless that happens a second time, it seems plausible that the A17 and M3 will both have hardware ray tracing. However, this is speculation, not professional advice, and not guaranteed to be correct.
The current chips are already quite good at ray tracing. They have a hardware instruction that performs comparison and selection in a single clock cycle, making ray-box intersections faster than on other architectures. Although it's not an "RT core", it does narrow the performance gap with recent desktop GPUs. Performance also depends on the workload; Nvidia's RT acceleration sometimes doesn't provide more than a 2x speedup over shader cores. Smaller acceleration structures and non-triangular geometry are where Apple chips might excel.
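To illustrate what that instruction accelerates, here's the classic ray-box "slab test" written in plain Swift (a sketch for illustration only; on the GPU the equivalent runs inside a shader, and the function and parameter names here are my own):

```swift
import simd

// Slab test: a ray hits an axis-aligned box if the latest "entry" time across
// the three slabs is no later than the earliest "exit" time. Each axis needs a
// min, a max, and a comparison - exactly the compare-and-select pattern that a
// fused single-cycle instruction speeds up.
func rayIntersectsBox(origin: SIMD3<Float>, invDirection: SIMD3<Float>,
                      boxMin: SIMD3<Float>, boxMax: SIMD3<Float>) -> Bool {
    let t0 = (boxMin - origin) * invDirection
    let t1 = (boxMax - origin) * invDirection
    let tNear = simd_reduce_max(simd_min(t0, t1))  // latest entry
    let tFar  = simd_reduce_min(simd_max(t0, t1))  // earliest exit
    return tNear <= tFar && tFar >= 0
}
```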
The way Apple designed MetalFX seems to hide the latency of accessing the Neural Engine. I noticed that the GPU's clock speed sits consistently at ~400 MHz with light rendering workloads; heavier workloads push it to higher clock speeds and decrease MetalFX upscaling time proportionally. The key observation is that the Apple GPU runs the clock as low as possible while still finishing your frame in time, to minimize power consumption. You can overlap a lot of your own computation with the MetalFX upscaling, or with the gaps between commands issued by MetalFX (see the sketch after the next paragraph).
Here's another way to look at it. My M1 Max GPU can theoretically consume 50 W. MetalFX takes 4 ms when I'm rendering at 120 Hz (8.3 ms/frame). However, it is not consuming (4/8.3 x 50 ≈) 24 W; it is consuming 2 W. The total latency also doesn't change much with different resolutions; I mostly use 768x768 -> 1536x1536.
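Here's a minimal sketch of the overlap idea, assuming the MetalFX spatial upscaler, the 768x768 -> 1536x1536 sizes mentioned above, and an rgba16Float format (texture contents and the actual overlapped workload are omitted):

```swift
import Metal
import MetalFX

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

func makeTexture(width: Int, height: Int) -> MTLTexture {
    let td = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba16Float, width: width, height: height, mipmapped: false)
    td.usage = [.shaderRead, .shaderWrite, .renderTarget]
    td.storageMode = .private
    return device.makeTexture(descriptor: td)!
}
let renderedFrame = makeTexture(width: 768, height: 768)
let upscaledFrame = makeTexture(width: 1536, height: 1536)

let desc = MTLFXSpatialScalerDescriptor()
desc.inputWidth = 768;   desc.inputHeight = 768
desc.outputWidth = 1536; desc.outputHeight = 1536
desc.colorTextureFormat = .rgba16Float
desc.outputTextureFormat = .rgba16Float
let scaler = desc.makeSpatialScaler(device: device)!

// Upscaling goes on its own command buffer...
let upscaleCommands = queue.makeCommandBuffer()!
scaler.colorTexture = renderedFrame
scaler.outputTexture = upscaledFrame
scaler.encode(commandBuffer: upscaleCommands)
upscaleCommands.commit()

// ...while independent work (e.g. for the next frame) goes on another, so the
// GPU can schedule it in the gaps the upscaler leaves.
let overlapCommands = queue.makeCommandBuffer()!
// encode your own render/compute passes here
overlapCommands.commit()
```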
MetalFX TAA actually does use the ANE. I was ray tracing at 120 Hz, and the GPU was using only 2 W (frame time was 4 milliseconds out of 8.3). It was using the lower 300-500 MHz clock speeds to decrease power consumption beyond what you'd think is possible. However, it also drove the ANE to 80 mW. The ANE sat at 0 watts when MetalFX was off, and jumped to 80 mW the moment MetalFX turned on. I also got some error messages about the ANE in the Xcode console whenever I used Metal Frame Capture.
This could explain why, in Apple's MetalFX video, they stress giving you the ability to overlap work from different frames. I imagine the ANE has very high latency to access, or some peculiarities in how it's accessed. MetalFX has a pipeline that runs and automatically finishes in time for your next frame submission. It executes work sporadically throughout the entire frame, presumably to hide some kind of latency. It might be shuffling work back and forth between the GPU and ANE.
MetalFX spatial upscaling probably does not use the ANE, because it is compatible with Intel Macs.
According to xcrun metal-opt, some Metal-supported devices implement mesh shaders through emulation, while others support them natively. I'm not sure which group the M1 falls into.
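As far as I know there's no public API that reports whether mesh shaders are emulated, but you can at least print which GPU families a device claims as a rough proxy (the choice of families below is my own assumption, and the newer cases need the iOS 16 / macOS 13 SDK):

```swift
import Metal

// Print which GPU families this device claims. This doesn't reveal whether
// mesh shaders are emulated - it's only a rough proxy for comparing devices.
let device = MTLCreateSystemDefaultDevice()!
let families: [(String, MTLGPUFamily)] = [
    ("metal3", .metal3), ("apple7", .apple7), ("apple8", .apple8), ("mac2", .mac2)
]
for (name, family) in families {
    print(name, device.supportsFamily(family))
}
```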
In macOS Ventura they removed the ability to call MTLCreateSystemDefaultDevice() from command-line apps. You have to use MTLCopyAllDevices().first! instead. However, this problem shouldn't exist on iOS, or on versions older than v16.0.0.
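A minimal sketch of the workaround, assuming a macOS command-line target:

```swift
import Metal

// Fall back to MTLCopyAllDevices() when MTLCreateSystemDefaultDevice()
// returns nil (e.g. in a macOS Ventura command-line tool).
#if os(macOS)
let device = MTLCreateSystemDefaultDevice() ?? MTLCopyAllDevices().first!
#else
let device = MTLCreateSystemDefaultDevice()!
#endif
print(device.name)
```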
It seems that Metal ray tracing uses HWRT on AMD GPUs. As for Apple GPUs, Apple has been working with Imagination for years on energy-efficient HWRT (according to rumors). Hopefully we'll see HWRT debut with the M2 Pro or M3 chip and the Apple9 family.
This repository: https://github.com/philipturner/metal-float64
The drop-downs in this comment: https://github.com/openmm/openmm/issues/3847#issuecomment-1317731445
You are correct that the best approach is 32-bit integer instructions, not double-single. Apple silicon has 64-bit integers, but they're so slow that 32-bit integers will be faster. If you are interested in helping me finish the metal-float64 library, that would be great.
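For anyone curious what the 32-bit decomposition looks like, here's a hedged sketch of a 64-bit add built from 32-bit halves, written in plain Swift for illustration; the type and function names are mine, and metal-float64 would express the same add-with-carry in MSL with 32-bit uints:

```swift
// A 64-bit unsigned add expressed as two 32-bit adds plus a carry.
struct UInt64Emulated {
    var lo: UInt32
    var hi: UInt32
}

func add(_ a: UInt64Emulated, _ b: UInt64Emulated) -> UInt64Emulated {
    let (lo, carry) = a.lo.addingReportingOverflow(b.lo)
    let hi = a.hi &+ b.hi &+ (carry ? 1 : 0)
    return UInt64Emulated(lo: lo, hi: hi)
}

// Sanity check against native 64-bit arithmetic.
let x: UInt64 = 0xFFFF_FFFF_1234
let y: UInt64 = 0x0000_0001_8765
let ex = UInt64Emulated(lo: UInt32(truncatingIfNeeded: x), hi: UInt32(truncatingIfNeeded: x >> 32))
let ey = UInt64Emulated(lo: UInt32(truncatingIfNeeded: y), hi: UInt32(truncatingIfNeeded: y >> 32))
let sum = add(ex, ey)
assert((UInt64(sum.hi) << 32 | UInt64(sum.lo)) == x &+ y)
```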
A 4x performance penalty seems optimistic; I don't think it's physically possible. My theoretical calculations came out to >27x for multiplication and >>4x for addition, not including the time to process exponents.
This is also true of AMD and their public relations around ROCm, and the exact opposite of NVIDIA, which provides very reliable support for CUDA users. The disparity in high-quality support is why many people try porting CUDA code to HIP, then give up. If Apple wants the M1 GPU to be viable for high-performance computing, they need a GPGPU-first API on par with HIP or SYCL. I'm currently working to make that a reality with SYCL.
To Apple's Metal team, here's what we need for the M1 GPUs to be viable for HPC, for the rest of the 21st century. This is what I request from you:
Let the GPU and CPU share the same address space. Graphics APIs typically don't allow this, but compute-oriented APIs (OpenCL SVM, CUDA UVM, SYCL USM) use it quite frequently. Several code bases rely on pointer sharing between CPU and GPU to make implementing GPGPU easier. This should be trivial since the M1 GPU has shared memory in hardware (see the sketch after this list).
Open-source the M1 OpenCL driver and fully document the AIR bytecode representation. We need high-fidelity translation of OpenCL-flavored SPIR-V -> AIR to create "MoltenCL" and a hipSYCL backend.
Open-source some kernels of Metal Performance Shaders, at least enough to create a BLAS library - just the M1 variants of these kernels. I will be using double-precision emulation to create the double-precision counterparts to single-precision functions.
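To make the first request concrete, here's a minimal sketch of what Metal offers today: a shared-storage buffer is a single allocation visible to both processors, but it still has two different addresses, which is exactly the gap between this and OpenCL SVM / CUDA UVM / SYCL USM. (Sizes and names here are arbitrary; gpuAddress assumes a Metal 3 device and SDK.)

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let count = 1024
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// CPU writes directly into the shared allocation; the GPU sees the same bytes
// after the buffer is bound with setBuffer(_:offset:index:) - no copy needed.
let cpuPointer = buffer.contents().bindMemory(to: Float.self, capacity: count)
for i in 0..<count { cpuPointer[i] = Float(i) }

// The two views of the same memory have different addresses today, which is
// why raw CPU pointers can't simply be handed to a GPU kernel.
print("CPU address:", buffer.contents())
print("GPU address: 0x" + String(buffer.gpuAddress, radix: 16))
```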
I am trying to be courteous, but the nature of this comment may give a different impression. Please, could someone on the developer team address this issue? I'm fine with any communication medium.
Some of the comments I made above sounded a little impatient, which might be justified given how infrequently the document is updated, but I could have been more considerate. Setting that aside, there is one major inconsistency between the MSL specification and the Feature Set Tables. In section 6.15.2.6: Atomic Modify Functions (64 Bits), it says "see the Metal Feature Set Tables to determine which GPUs support this feature." No entry in the Tables describes 64-bit atomics, and the link in the PDF doesn't open any website.
This causes major difficulties, because I have to physically test the feature on A15, A14, AMD, and Intel GPUs just to see where it's supported. It seems not to work on Apple7 or Apple8, and I suspect it would fall under "Varies" for Mac2. Could the Tables or the MSL be changed to fix this inconsistency, and to add the A16 to Apple8?
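This is the kind of ad-hoc probe I mean; it's my own workaround, not an official capability query. It tries to compile and build a pipeline for a kernel that uses atomic_max_explicit on atomic_ulong; a failure is a strong hint that the GPU lacks 64-bit atomics:

```swift
import Metal

// Ad-hoc probe for 64-bit atomics. Success may still warrant a runtime test;
// failure strongly suggests the feature is absent on this GPU.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void probe(device atomic_ulong *out [[buffer(0)]]) {
    atomic_max_explicit(out, ulong(1), memory_order_relaxed);
}
"""

let device = MTLCreateSystemDefaultDevice()!
do {
    let library = try device.makeLibrary(source: source, options: nil)
    let function = library.makeFunction(name: "probe")!
    _ = try device.makeComputePipelineState(function: function)
    print("\(device.name): 64-bit atomic max compiled and built a pipeline")
} catch {
    print("\(device.name): 64-bit atomics appear unsupported - \(error)")
}
```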
I highly doubt that MetalFX utilizes the ANE. There's more information at https://developer.apple.com/forums/thread/707667. The reason is that switching contexts between accelerators incurs a lot of overhead, and the latency might be several milliseconds. Even if the Neural Engine has higher throughput, it's harder to access and less programmable. Furthermore, Apple GPUs, starting with the Apple7 generation, have hardware acceleration for matrix multiplication. It's called simdgroup_matrix and documented in the MSL specification. It increases ALU utilization from 25% to 80%. The fact that MetalFX is limited to Apple7 and Apple8 GPUs - the only families with simdgroup_matrix - further supports this hypothesis.
More explanation of how powerful simdgroup_matrix is: the M1 Max has a GPU with 10 TFLOPS F32. Double that is 20 TFLOPS F16; at 80% utilization, that's 16 TFLOPS F16. This is more processing power than the A14/M1's ANE, which is 11 TFLOPS F16. This could explain why Apple currently limits MetalFX to high-end Macs, where the GPU is more powerful than the ANE. On an A14/A15, it might be more power-efficient to use an image upscaling CoreML model on the ANE.
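For reference, this is roughly what the simdgroup_matrix path looks like: one 8x8 half-precision multiply-accumulate per simdgroup. It's only a sketch of the intrinsics (tiling, bounds checks, and dispatch are omitted), and it will only compile on families that expose simdgroup_matrix:

```swift
import Metal

let source = """
#include <metal_stdlib>
using namespace metal;

kernel void gemm_tile_8x8(device const half *A [[buffer(0)]],
                          device const half *B [[buffer(1)]],
                          device half *C [[buffer(2)]]) {
    simdgroup_half8x8 a, b;
    simdgroup_half8x8 c = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);
    simdgroup_load(a, A, 8);                   // 8x8 tile of A, row stride 8
    simdgroup_load(b, B, 8);                   // 8x8 tile of B
    simdgroup_multiply_accumulate(c, a, b, c); // c += a * b
    simdgroup_store(c, C, 8);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
print(library.functionNames)
```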
To reiterate, I DO NOT encourage anybody to try to reverse-engineer closed-source code. I was just stating that as an example, in a theoretical sense.
However, in a theoretical sense (again), we can compare the processing power of processors that do and don't support MetalFX.
A15 with 5 cores is 1.5 TFLOPS F32, or 3.0 TFLOPS F16 (my guess as to what MFX is bottlenecked by).
M1 with 7 cores is 2.2 TFLOPS F32, or 4.4 TFLOPS F16.
A16 could be around 1.9 TFLOPS F32, or 3.8 TFLOPS F16 (based on a recent AnTuTu benchmark). That's SO close to the M1 that Apple might enable it. If not, the next generation (A17) will surpass the 7-core M1.
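The back-of-envelope math behind those numbers, assuming 128 FP32 ALUs per GPU core, 2 FLOPs per FMA, and rough (unofficial) clock speeds:

```swift
// Theoretical TFLOPS = ALUs x 2 FLOPs/FMA x clock. ALU count and clocks below
// are assumptions, not official figures.
func teraflopsF32(cores: Int, clockGHz: Double, alusPerCore: Int = 128) -> Double {
    Double(cores * alusPerCore) * 2.0 * clockGHz / 1000.0
}

print(teraflopsF32(cores: 5, clockGHz: 1.2))   // ~1.5 (A15, 5-core GPU)
print(teraflopsF32(cores: 7, clockGHz: 1.25))  // ~2.2 (M1, 7-core GPU)
```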
I just tried running MetalFX upscaling on my A15 device, and it isn't supported (spatial or temporal). However, the M1 should work, even though it's a lower device family. This is not good, because an iPhone has a weaker GPU and can't render to larger resolutions natively with software ray tracing. MetalFX upscaling would help the most on smaller chips.
Hopefully Apple will enable it on iPhones at some point, once they have a better GPU. Could someone who owns an iPhone 14 Pro tell me whether the A16 supports MFX? Take this with a grain of salt, but my educated guess is that Apple will support MFX on iPhone with the A17/M3 chip series. They will probably have hardware-accelerated ray tracing from Imagination Technologies, and upscaling is especially needed for ray-traced content. For example, Nvidia launched DLSS to complement RTX.
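If you do own an iPhone 14 Pro, this is all it takes to check; MetalFX exposes a supportsDevice(_:) query on both scaler descriptors:

```swift
import Metal
import MetalFX

// Prints whether this device supports MetalFX spatial and temporal upscaling.
let device = MTLCreateSystemDefaultDevice()!
print("Spatial upscaling: \(MTLFXSpatialScalerDescriptor.supportsDevice(device))")
print("Temporal upscaling: \(MTLFXTemporalScalerDescriptor.supportsDevice(device))")
```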
As an alternative, you could technically port AMD FSR to iOS, and that's probably what developers will do anyway. Someone could also partially reverse-engineer MetalFX just to run it on A14/A15, but I don't recommend this (it's not legal). MetalFX might use a neural network (like XeSS), implemented with simdgroup_matrix at half precision. Still, it would be much easier to run the MetalFX that's already optimized for M1 (instead of FSR), and to do so legally.
They're finally up to date! Thanks for adding a description of Apple7 Mac GPUs and the Apple8 iOS GPU! Hopefully, you'll update it more quickly for the Apple9 family with A16/M2 in fall 2022. Those two chips may be produced on different nodes (4-nanometer vs. 3-nanometer), so we'll have to wait and see whether they have the same architecture.
https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf
P.S. Thanks for making the enumeration names reflect Swift. It seems like Objective-C has become a relic of the past, and most developers now use the newer language.
I suggest that you look for help on the Swift Forums in the future. There are plenty of people there who can help you with problems you encounter while learning Swift. There's even a category called "Using Swift", dedicated to posts about learning language features and core Swift frameworks, among other things. Of course, there are also people on the Apple developer forums who can help you out as well.
I went down the same path as you and unfortunately had no choice but to learn Metal. It was a very bad experience, because Apple was too slow to move on from Objective-C to Swift.
That aside, I have something that I think fits your situation: try the tutorial series for ARHeadsetKit. It teaches the fundamental concepts you're thinking of, especially the ones that will be relevant 5 years from now with smart glasses. You have to dig into the tutorial series a bit, though (about halfway).