Large ML Models on the Neural Engine (Billion+ Parameters)

I've been trying to run larger transformer models on the Neural Engine, but I'm having trouble getting past the one-billion-parameter mark.

I have a 774M-parameter model (gpt2-large) that runs entirely on the Neural Engine, but a 1.5B-parameter model (gpt2-xl) loaded with either the cpuAndNeuralEngine or all compute units ends up running entirely on the CPU and/or GPU. (How I'm loading it is sketched below.)
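For reference, this is roughly how I'm loading the converted model on the Python side (the file name is a placeholder; on the Swift side I use the equivalent MLModelConfiguration.computeUnits setting):

```python
import coremltools as ct

# Placeholder path; assumes gpt2-xl has already been converted to an mlpackage.
model = ct.models.MLModel(
    "gpt2-xl.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # restrict dispatch to CPU + Neural Engine
)
```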

Things I've tried:

  • Quantization using coremltools (sketched below). The docs say that quantized models are de-quantized before being run, and that matches what I see: the smaller weights don't change which compute units the model runs on.
  • Splitting the model (as in ml-stable-diffusion) and running both halves as a pipeline built with the new coremltools utility (sketched below). As far as I can tell, if the first model in the pipeline can run on the Neural Engine it will, but subsequent models run on the CPU (with cpuAndNeuralEngine).
  • Optimizing the model for the Neural Engine as described in ml-ane-transformers (sketched below). I've tried both the split-einsum and original attention implementations, as in ml-stable-diffusion.
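For the quantization attempt, this is roughly what I ran (the coremltools 6 compression utilities; file names are placeholders):

```python
import coremltools as ct

model = ct.models.MLModel("gpt2-xl.mlpackage")

# 8-bit affine weight quantization (coremltools 6.x compression utilities).
# Per the docs, weights are de-quantized before execution, so this shrinks
# the file on disk but doesn't change where the model actually runs.
quantized = ct.compression_utils.affine_quantize_weights(model, mode="linear_symmetric")
quantized.save("gpt2-xl-quantized.mlpackage")
```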
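And this is roughly how I'm stitching the split halves back together with the new utility (the chunk file names are placeholders from my own splitting step):

```python
import coremltools as ct

chunk1 = ct.models.MLModel("gpt2-xl_chunk1.mlpackage")
chunk2 = ct.models.MLModel("gpt2-xl_chunk2.mlpackage")

# make_pipeline (added in coremltools 6.2) combines the chunks into a single
# pipeline model, with each chunk's outputs feeding the next chunk's inputs.
pipeline = ct.utils.make_pipeline(chunk1, chunk2)
pipeline.save("gpt2-xl_pipeline.mlpackage")
```

I then load the saved pipeline with CPU_AND_NE, the same way as the single-model case above.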
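For context, the core ml-ane-transformers change I applied boils down to a layout swap; this is my own minimal paraphrase of the idea, not code from that repo:

```python
import torch
import torch.nn as nn

# Keep activations as (batch, channels, 1, seq_len) and replace nn.Linear
# with a 1x1 nn.Conv2d, the data format the Neural Engine prefers.
class ANEConvLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.conv = nn.Conv2d(in_features, out_features, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, 1, S) -> (B, C_out, 1, S)
        return self.conv(x)

# Example: a projection at gpt2-xl's hidden size (1600) on a length-64 sequence.
x = torch.randn(1, 1600, 1, 64)
print(ANEConvLinear(1600, 1600)(x).shape)  # torch.Size([1, 1600, 1, 64])
```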

Is this the expected behavior? Are there other strategies or features of Core ML that could help here?
I'm testing on a 2021 M1 Max MacBook Pro running macOS 13.2.1.
