I've been trying to run larger transformer models on the Neural Engine and am having trouble getting anything beyond about a billion parameters to run there. I have a 774M-parameter model (gpt2-large) that runs entirely on the Neural Engine, but trying to run a 1.5B-parameter model (gpt2-xl) with either the cpuAndNeuralEngine or all compute units results in the model running entirely on the CPU and/or GPU.
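For reference, here's roughly how I'm loading the models (the package name is a placeholder):

```python
import coremltools as ct

# Load the converted model and ask Core ML to prefer the Neural Engine.
# "gpt2-xl.mlpackage" is a placeholder for the converted package.
model = ct.models.MLModel(
    "gpt2-xl.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```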
Things I've tried:
- Quantization using coremltools (sketch after this list). The docs say that quantized models are de-quantized before being run, and that matches what I see.
- Splitting the model (as in ml-stable-diffusion) and running both halves as a pipeline built with the new coremltools utility (sketch below). As far as I can tell, if the first model in the pipeline can run on the Neural Engine it will, but subsequent models run on the CPU (with cpuAndNeuralEngine).
- Optimizing the model for the Neural Engine as described in ml-ane-transformers (a toy illustration of the layout change follows this list). I've tried both the split-einsum and original attention implementations, as in ml-stable-diffusion.
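Here's roughly what the quantization step looked like (a minimal sketch assuming the ct.optimize.coreml API from newer coremltools; older versions expose similar calls under compression_utils):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

model = ct.models.MLModel("gpt2-xl.mlpackage")  # placeholder name

# 8-bit linear weight quantization. This shrinks the weights on disk,
# but per the docs they are de-quantized to float before execution.
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
quantized = cto.linear_quantize_weights(model, config=config)
quantized.save("gpt2-xl-int8.mlpackage")
```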
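And the pipeline attempt, sketched with ct.utils.make_pipeline and placeholder chunk names:

```python
import coremltools as ct

# The two halves produced by splitting the model; file names are placeholders.
chunk1 = ct.models.MLModel("gpt2-xl_chunk1.mlpackage")
chunk2 = ct.models.MLModel("gpt2-xl_chunk2.mlpackage")

# Chain the chunks so chunk1's outputs feed chunk2's inputs,
# then save and reload with the desired compute units.
pipeline = ct.utils.make_pipeline(chunk1, chunk2)
pipeline.save("gpt2-xl_pipeline.mlpackage")

model = ct.models.MLModel(
    "gpt2-xl_pipeline.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
```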
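For context, the core of the ml-ane-transformers optimization is a layout change: keep activations in (batch, channels, 1, seq_len) and replace nn.Linear with 1x1 nn.Conv2d, which is the data format the Neural Engine prefers. A toy illustration of that idea (not the repo's actual code):

```python
import torch
import torch.nn as nn

class ANEFriendlyProjection(nn.Module):
    """Projection in the (B, C, 1, S) layout used by ml-ane-transformers."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # 1x1 conv is mathematically equivalent to a per-token linear layer.
        self.proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim, 1, seq_len) rather than (batch, seq_len, embed_dim)
        return self.proj(x)

x = torch.randn(1, 768, 1, 128)
y = ANEFriendlyProjection(768)(x)
print(y.shape)  # torch.Size([1, 768, 1, 128])
```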
Is this the expected behavior? Are there other strategies or features of Core ML that could help here? I'm testing on a 2021 M1 Max MacBook Pro running macOS 13.2.1.