Large ML Models on the Neural Engine (Billion+ Parameters)

I've been trying to run larger transformer models on the Neural Engine and am having trouble getting beyond roughly a billion parameters.

I have a 774M-parameter model (gpt2-large) that runs entirely on the Neural Engine, but trying to run a 1.5B-parameter model (gpt2-xl) with either the cpuAndNeuralEngine or all compute units results in the model running entirely on the CPU and/or GPU.
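For concreteness, here is roughly how I'm loading the converted model in Python (a minimal sketch; the .mlpackage path is a placeholder for my converted gpt2-xl package):

```python
import coremltools as ct

# Load the converted model, requesting CPU + Neural Engine.
# "gpt2-xl.mlpackage" is a placeholder path for the converted model.
model = ct.models.MLModel(
    "gpt2-xl.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

# predict() works, but Xcode's Core ML performance report shows the
# 1.5B model executing on the CPU/GPU rather than the Neural Engine.
```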

Things I've tried:

  • Quantization using coremltools. The docs say quantized models are de-quantized before being run, and that matches what I see. (A sketch of what I ran follows this list.)
  • Splitting the model (as in ml-stable-diffusion) and running both halves as a pipeline, using the new coremltools pipeline utility (see the second sketch after this list). As far as I can tell, if the first model in the pipeline can run on the Neural Engine it will, but subsequent models run on the CPU (with cpuAndNeuralEngine).
  • Optimizing the model for the Neural Engine as described in ml-ane-transformers. I've tried both the split-einsum and original attention implementations, as in ml-stable-diffusion (a toy sketch of the layout idea is the last one below).
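For the quantization attempt, this is roughly what I ran (a sketch assuming an mlprogram-format model and the coremltools 6.x compression utilities; the paths are placeholders):

```python
import coremltools as ct

mlmodel = ct.models.MLModel("gpt2-xl.mlpackage")

# 8-bit affine weight quantization. Per the docs, the weights are
# de-quantized back to float at load time, so this shrinks the
# .mlpackage on disk but doesn't change the runtime memory footprint.
quantized = ct.compression_utils.affine_quantize_weights(mlmodel)
quantized.save("gpt2-xl-w8.mlpackage")
```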
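For the pipeline attempt, roughly this (a sketch; the chunk paths are placeholders, and make_pipeline is the new utility that ml-stable-diffusion's chunking uses):

```python
import coremltools as ct

# Two halves of gpt2-xl, split the way ml-stable-diffusion chunks its
# UNet (the chunk paths here are placeholders).
chunk1 = ct.models.MLModel("gpt2-xl_chunk1.mlpackage")
chunk2 = ct.models.MLModel("gpt2-xl_chunk2.mlpackage")

# Chain them so chunk1's outputs feed chunk2's matching inputs.
pipeline = ct.utils.make_pipeline(chunk1, chunk2)
pipeline.save("gpt2-xl_pipeline.mlpackage")

# Reload the combined pipeline requesting CPU + Neural Engine. In my
# runs, only the first chunk lands on the ANE; the second stays on CPU.
pipeline = ct.models.MLModel(
    "gpt2-xl_pipeline.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```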
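As I understand it, the core of the ml-ane-transformers changes is keeping activations in a channels-first (B, C, 1, S) layout and expressing linear projections as 1x1 convolutions, with attention optionally split per head (the split-einsum variant). A toy PyTorch sketch of the layout change, my paraphrase rather than the repo's actual code:

```python
import torch
import torch.nn as nn

class ANEFriendlyProjection(nn.Module):
    """A linear projection expressed as a 1x1 conv over a (B, C, 1, S)
    tensor, the layout ml-ane-transformers recommends for the ANE."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Conv2d(d_model, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model, 1, seq_len) instead of (batch, seq_len, d_model)
        return self.proj(x)

x = torch.randn(1, 1600, 1, 64)   # gpt2-xl uses d_model=1600
y = ANEFriendlyProjection(1600)(x)
print(y.shape)                    # torch.Size([1, 1600, 1, 64])
```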

Is this the expected behavior? Are there other strategies or features of Core ML that could help here?
I’m testing on a 2021 M1 Max MacBook Pro running macOS 13.2.1.

I also wonder about the same thing. My guess is that the ANE has a maximum tensor shape/size it can handle, and if any tensor dimension exceeds that maximum, Core ML falls back from the ANE.
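One way to test that guess (a rough sketch with arbitrary sizes; whether the ANE is actually used still has to be checked externally, e.g. with Xcode's Core ML performance report or powermetrics while the model runs):

```python
import torch
import coremltools as ct

def convert_linear(d_model: int) -> ct.models.MLModel:
    """Convert a single d_model x d_model linear layer and load it with
    CPU_AND_NE, to probe whether some dimension threshold pushes it off
    the ANE. Placement must be verified out-of-band (Xcode performance
    report, or e.g. `powermetrics --samplers ane_power`)."""
    layer = torch.nn.Linear(d_model, d_model).eval()
    example = torch.randn(1, d_model)
    traced = torch.jit.trace(layer, example)
    return ct.convert(
        traced,
        inputs=[ct.TensorType(name="x", shape=example.shape)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

# gpt2-large vs gpt2-xl hidden sizes are 1280 vs 1600; go well beyond.
for d in (1280, 1600, 2048, 4096, 8192):
    m = convert_linear(d)
    m.predict({"x": torch.randn(1, d).numpy()})
    print(f"d_model={d}: converted and ran")
```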
