This has already been implemented in current hardware: Intel integrated GPUs support USM pointers in oneAPI. However, I recently came up with an idea that translates CPU addresses into GPU addresses. It has higher performance but limits the maximum usable memory. So the first point isn't as significant anymore, although it's still a useful feature.
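Purely to illustrate the tradeoff (hypothetical types, not a real oneAPI or Metal API, and not necessarily the exact scheme I have in mind): if the GPU can only address a fixed-size window of the CPU's address space, translation is a single subtraction, which is fast, but any allocation outside the window is unusable, which caps memory.

```swift
// Hypothetical sketch of offset-based CPU-to-GPU address translation.
// Translating costs one subtraction; the window size caps usable memory.
struct GPUAddressWindow {
    let cpuBase: UInt64   // start of the CPU region the GPU can see
    let size: UInt64      // window size = maximum usable memory

    /// Returns the GPU-visible offset for a CPU address, or nil if the
    /// address falls outside the translatable window.
    func translate(cpuAddress: UInt64) -> UInt64? {
        guard cpuAddress >= cpuBase, cpuAddress - cpuBase < size else {
            return nil
        }
        return cpuAddress - cpuBase
    }
}
```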
I see a strange phenomenon where certain posts on the developer forums don't show up, except when I view them under my login (e.g. one of the MetalFX comments). If the comment above was in fact censored, I apologize for addressing someone in such an unprofessional way and prompting that action. My main motivation: Apple is lagging behind other vendors for HPC, and I think we all want that to change.
Also, you can finish MetalFFT on your own if you can access an iPad and download Swift Playgrounds. That’s a benefit of MetalFFT being written entirely in Swift.
I can test it for you. I have both an Apple silicon Mac and an Intel Mac. But I strongly recommend that you thoroughly read through my MetalFFT project first. In fact, if you could port code over from VkFFT to MetalFFT, you'd complete the project. I don't have much time to spend, but we could work out some plan where I test or translate code for you.
If you want the best performance, you need to write native Metal shaders, not go through a virtualized graphics technology like MoltenVK or SPIR-V translation. And as I learned the hard way with MetalFFT, performance can surprise you.
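For anyone curious what "native" means here, this is a minimal sketch of compiling and dispatching a Metal compute kernel directly from Swift. The toy kernel `scaleBuffer` is purely for illustration, nothing from MetalFFT:

```swift
import Metal

// A trivial kernel, compiled from source at runtime. A real FFT kernel
// would be far more involved.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void scaleBuffer(device float *data [[buffer(0)]],
                        constant float &scale [[buffer(1)]],
                        uint index [[thread_position_in_grid]]) {
    data[index] *= scale;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let function = library.makeFunction(name: "scaleBuffer")!
let pipeline = try! device.makeComputePipelineState(function: function)

let count = 1024
let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!
var scale: Float = 2.0

// Encode and dispatch the kernel across `count` threads.
let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.setBytes(&scale, length: MemoryLayout<Float>.stride, index: 1)
encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
```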
Please disregard this reply. The repository’s license has been changed to remove any restrictions possibly mentioned above.
Please disregard the replies above.
Would it be helpful if I linked the file in MetalFFT showing the profiling concerns? I mirrored the FFT sizes @CaptainHarlock gave me, and the benchmarks showed system-level cache thrashing. The cache bottleneck was one reason I gave up on MetalFFT, but Apple might be better suited to investigating it. This is about the GPU implementation, not vDSP, so I don't know if it's relevant.
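To give an idea of the methodology (a simplified sketch, not MetalFFT's actual harness), you can time a memory-bound kernel across power-of-two sizes and watch throughput fall off a cliff once the working set exceeds the system-level cache:

```swift
import Metal

// Times a compute kernel at increasing sizes using the command buffer's
// GPU timestamps. Assumes `pipeline` wraps a memory-bound kernel that
// reads/writes a float buffer at index 0.
func benchmark(device: MTLDevice, pipeline: MTLComputePipelineState) {
    let queue = device.makeCommandQueue()!
    for exponent in 10...24 {
        let count = 1 << exponent
        let buffer = device.makeBuffer(length: count * MemoryLayout<Float>.stride,
                                       options: .storageModeShared)!
        let commandBuffer = queue.makeCommandBuffer()!
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(pipeline)
        encoder.setBuffer(buffer, offset: 0, index: 0)
        encoder.dispatchThreads(MTLSize(width: count, height: 1, depth: 1),
                                threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
        encoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()

        // Approximate bandwidth from bytes touched; a sharp drop between
        // consecutive sizes suggests the working set fell out of cache.
        let seconds = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
        let gbPerSecond = Double(count * MemoryLayout<Float>.stride) / seconds / 1e9
        print("2^\(exponent) elements: \(gbPerSecond) GB/s")
    }
}
```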
The Metal backend is still nowhere near done, but I recommend looking at Swift for TensorFlow's repositories (linked in my reply above) and Swift-Colab.
The Neural Engine can't be used for training. It supports only 16-bit half precision (FP16), not 16-bit bfloat16, and FP16's narrow exponent range makes small gradients underflow to zero. That means gradients can't propagate through it for ML, but the ANE can still be used for inference. If only Apple did what everybody else is doing and added BFloat16 acceleration or GPU matrix cores! Kudos to them for AMX on the CPU at least.
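Here's a quick way to see the problem (Float16 requires Apple silicon on macOS; Swift has no BFloat16 type, so Float stands in for its exponent range):

```swift
// Float16 has a 5-bit exponent, so values below roughly 6e-8 flush to
// zero. bfloat16 shares Float32's 8-bit exponent, so the same value
// survives. Float stands in for bfloat16 here.
let tinyGradient: Float = 1e-10

let asHalf = Float16(tinyGradient)
let asFloat = Float(tinyGradient)

print(asHalf)   // 0.0 — the gradient vanished
print(asFloat)  // 1e-10 — still nonzero, as it would be in bfloat16
```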
Also, you might be interested in the tutorial series on ARHeadsetKit (it's hyperlinked from the page I linked above). Thanks for your interest!
Those don't include AR headset experiences, and in my opinion you should learn content that prepares you for the future of AR headset technology. Even if ARHeadsetKit is never used, learning it gets you in the mindset of making 3D interfaces and accounting for stereoscopic rendering.
ARHeadsetKit already has all the capabilities of XROS, but its potential is never going to be realized because nobody knows about it and Apple isn't behind my effort.
I have a feeling that my last post on this thread didn't get through to you because it wasn't a reply to a comment you made.
I tried looking up the term "shared memory" in that context, and the results kept showing M1's memory sharing between the CPU and GPU instead. DirectX calls it "groupshared", which only complicates things.
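For reference, Metal's term is "threadgroup memory" (CUDA's __shared__, HLSL's groupshared). Here's a toy reduction kernel showing how it's declared in Metal Shading Language, embedded as a Swift source string (my own illustration, not from any Apple sample):

```swift
// Each threadgroup stages its slice of the input into fast on-chip
// threadgroup memory, then tree-reduces it to one partial sum.
let reductionSource = """
#include <metal_stdlib>
using namespace metal;

kernel void partialSums(device const float *input [[buffer(0)]],
                        device float *output [[buffer(1)]],
                        threadgroup float *scratch [[threadgroup(0)]],
                        uint localID [[thread_position_in_threadgroup]],
                        uint groupID [[threadgroup_position_in_grid]],
                        uint globalID [[thread_position_in_grid]],
                        uint groupSize [[threads_per_threadgroup]]) {
    // Stage one element per thread into threadgroup memory.
    scratch[localID] = input[globalID];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction within the threadgroup (power-of-two group size).
    for (uint stride = groupSize / 2; stride > 0; stride /= 2) {
        if (localID < stride) {
            scratch[localID] += scratch[localID + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (localID == 0) {
        output[groupID] = scratch[0];
    }
}
"""
```

On the host side, the scratch allocation comes from encoder.setThreadgroupMemoryLength(groupSize * MemoryLayout<Float>.stride, index: 0) before dispatching.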
Number: FB9797575
I just wrote a short message directing them to the comment this falls under. Is that good enough?