There are two possible solutions. First, Metal has purgeable resources, so you could allocate an arbitrarily large cache and let the OS discard its contents when the system comes under memory pressure.
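Here's a minimal sketch of that pattern, assuming the cache lives in a plain `MTLBuffer` (the names and the size are just placeholders):

```swift
import Metal

// A cache buffer the OS is allowed to reclaim under memory pressure.
// `cacheLength` is an arbitrary example size.
let device = MTLCreateSystemDefaultDevice()!
let cacheLength = 512 * 1024 * 1024 // 512 MB
let cache = device.makeBuffer(length: cacheLength, options: .storageModePrivate)!

// Once the cached data has been written, mark the buffer volatile so the OS
// may discard its contents if memory runs low.
_ = cache.setPurgeableState(.volatile)

// Before reading the cache again, pin it and check whether it survived.
if cache.setPurgeableState(.nonVolatile) == .empty {
    // The contents were purged; regenerate the cached data here.
}
```

The key point is that `setPurgeableState(_:)` returns the previous state, so getting `.empty` back tells you the data has to be regenerated.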
Second, I’m working on a Metal backend for Swift for TensorFlow, which is meant to be general-purpose and might be interesting to you. I recommend reading the most recent issue under tensorflow/swift-apis on GitHub for more context, and the closed issues under dl4s-team/dl4s-evolution for my earlier exploration of the topic.
If you’re talking about access to L1/L2 cache memory, then I may have misunderstood you. Threadgroup memory might serve your purposes: each threadgroup of up to 1024 threads shares its own 32 KB of low-latency memory, which is insanely large. Metal Performance Shaders and other matrix multiplication implementations use it to cut down on main-memory accesses.
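To make that concrete, here's a small sketch of a compute kernel that stages data in threadgroup memory. It does a simple parallel reduction rather than a matrix multiply, and the kernel name and buffer layout are made up for illustration:

```swift
import Metal

// MSL source for a kernel whose threadgroup shares a small scratch tile.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void partialSums(device const float *input  [[buffer(0)]],
                        device float *output       [[buffer(1)]],
                        uint tid   [[thread_position_in_threadgroup]],
                        uint gid   [[thread_position_in_grid]],
                        uint group [[threadgroup_position_in_grid]],
                        uint size  [[threads_per_threadgroup]])
{
    // Low-latency scratch space shared by every thread in this threadgroup
    // (1024 floats = 4 KB, well under the 32 KB limit).
    threadgroup float tile[1024];
    tile[tid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction that only touches threadgroup memory.
    for (uint stride = size / 2; stride > 0; stride /= 2) {
        if (tid < stride) { tile[tid] += tile[tid + stride]; }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (tid == 0) { output[group] = tile[0]; }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "partialSums")!)
```

A tiled matrix multiplication does the same thing with 2D tiles of A and B: each threadgroup loads its tiles into that shared memory once and reuses them many times instead of re-reading main memory.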
There is no equivalent to this on the CPU or ANE, although the CPU’s matrix accelerator (AMX) has one insanely large 1024-word register, which you can only reach indirectly through Accelerate. Apple has tried hard to keep the AMX out of public documentation, but you can find out plenty about it from a Google search.
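In practice, “indirectly through Accelerate” just means calling Apple’s vectorized math libraries and letting them decide where the work runs; whether a given call actually lands on the AMX is opaque to the caller. A trivial example (sizes and values are mine, not from any real workload):

```swift
import Accelerate

// Multiply two 2x2 single-precision matrices with vDSP.
// On Apple silicon, Accelerate may route multiplies like this through the
// AMX, but that routing is entirely hidden from the caller.
let m: vDSP_Length = 2, n: vDSP_Length = 2, p: vDSP_Length = 2
let a: [Float] = [1, 2, 3, 4]   // m x p
let b: [Float] = [5, 6, 7, 8]   // p x n
var c = [Float](repeating: 0, count: Int(m * n))
vDSP_mmul(a, 1, b, 1, &c, 1, m, n, p)
print(c) // [19.0, 22.0, 43.0, 50.0]
```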