There are two possible solutions. First, Metal has purgeable resources, so you could allocate an arbitrarily large cache and let the OS discard its contents when the system comes under memory pressure.
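Here's a minimal sketch of that pattern, assuming the cache lives in a plain `MTLBuffer` (the names and the size are just placeholders):

```swift
import Metal

// A cache buffer the OS is allowed to reclaim under memory pressure.
// `cacheLength` is an arbitrary example size.
let device = MTLCreateSystemDefaultDevice()!
let cacheLength = 512 * 1024 * 1024 // 512 MB
let cache = device.makeBuffer(length: cacheLength, options: .storageModePrivate)!

// Once the cached data has been written, mark the buffer volatile so the OS
// may discard its contents if memory runs low.
_ = cache.setPurgeableState(.volatile)

// Before reading the cache again, pin it and check whether it survived.
if cache.setPurgeableState(.nonVolatile) == .empty {
    // The contents were purged; regenerate the cached data here.
}
```

The key point is that `setPurgeableState(_:)` returns the previous state, so getting `.empty` back tells you the data has to be regenerated.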
Second, I’m working on a Metal backend for Swift for TensorFlow, which is meant to be general-purpose and might be interesting to you. I recommend reading the most recent issue under tensorflow/swift-apis on GitHub for more context, and the closed issues under dl4s-team/dl4s-evolution for my earlier exploration of the topic.
If you’re talking about access to L1/L2 cache memory, then I may have misunderstood you. Threadgroup memory might serve your purposes: each threadgroup of up to 1024 threads shares its own 32 KB of low-latency memory, which is insanely large. Metal Performance Shaders and other matrix multiplication implementations use it to cut down on main-memory accesses.
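To make that concrete, here's a small sketch of a compute kernel that stages data in threadgroup memory. It does a simple parallel reduction rather than a matrix multiply, and the kernel name and buffer layout are made up for illustration:

```swift
import Metal

// MSL source for a kernel whose threadgroup shares a small scratch tile.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void partialSums(device const float *input  [[buffer(0)]],
                        device float *output       [[buffer(1)]],
                        uint tid   [[thread_position_in_threadgroup]],
                        uint gid   [[thread_position_in_grid]],
                        uint group [[threadgroup_position_in_grid]],
                        uint size  [[threads_per_threadgroup]])
{
    // Low-latency scratch space shared by every thread in this threadgroup
    // (1024 floats = 4 KB, well under the 32 KB limit).
    threadgroup float tile[1024];
    tile[tid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction that only touches threadgroup memory.
    for (uint stride = size / 2; stride > 0; stride /= 2) {
        if (tid < stride) { tile[tid] += tile[tid + stride]; }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (tid == 0) { output[group] = tile[0]; }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "partialSums")!)
```

A tiled matrix multiplication does the same thing with 2D tiles of A and B: each threadgroup loads its tiles into that shared memory once and reuses them many times instead of re-reading main memory.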
There is no equivalent to this on the CPU or ANE, although the CPU’s matrix accelerator (AMX) has one insanely large 1024-word register, which you can only reach indirectly through Accelerate. Apple has tried hard to keep the AMX out of public documentation, but you can find out plenty about it from a Google search.
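In practice, “indirectly through Accelerate” just means calling Apple’s vectorized math libraries and letting them decide where the work runs; whether a given call actually lands on the AMX is opaque to the caller. A trivial example (sizes and values are mine, not from any real workload):

```swift
import Accelerate

// Multiply two 2x2 single-precision matrices with vDSP.
// On Apple silicon, Accelerate may route multiplies like this through the
// AMX, but that routing is entirely hidden from the caller.
let m: vDSP_Length = 2, n: vDSP_Length = 2, p: vDSP_Length = 2
let a: [Float] = [1, 2, 3, 4]   // m x p
let b: [Float] = [5, 6, 7, 8]   // p x n
var c = [Float](repeating: 0, count: Int(m * n))
vDSP_mmul(a, 1, b, 1, &c, 1, m, n, p)
print(c) // [19.0, 22.0, 43.0, 50.0]
```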