I've been trying to run larger transformer models on the Neural Engine and am having trouble getting beyond models that have a billion parameters.
I have a 774M parameter model (gpt2-large) that runs entirely on the Neural Engine, but trying to run a 1.5B parameter model (gpt2-xl) with either cpuAndNeuralEngine or all compute units results in the model running entirely on the CPU and/or GPU.
Things I've tried:
Quantization using coremltools. The docs say that weights are de-quantized before the model runs, and that matches what I see.
Splitting the model (like in ml-stable-diffusion) and running both parts as a pipeline (built with the new coremltools utility); a sketch of both steps follows this list. As far as I can tell, if the first model in the pipeline can run on the Neural Engine it will, but subsequent models run on the CPU (using cpuAndNeuralEngine).
Optimizing the model for the Neural Engine as described in ml-ane-transformers. I've tried both the split einsum and original attentions as in ml-stable-diffusion.
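For reference, the quantization and pipeline steps look roughly like this (a sketch; the file names and the 8-bit setting are placeholders):

import coremltools as ct
import coremltools.optimize.coreml as cto

# 8-bit weight quantization (weights are de-quantized again before execution)
model = ct.models.MLModel("gpt2-xl.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
)
quantized = cto.linear_quantize_weights(model, config)

# chain the two split chunks with the new pipeline utility
pipeline = ct.utils.make_pipeline(
    ct.models.MLModel("gpt2-xl_chunk1.mlpackage"),
    ct.models.MLModel("gpt2-xl_chunk2.mlpackage"),
)

# reload the pipeline requesting the Neural Engine
pipeline = ct.models.MLModel(
    pipeline.get_spec(),
    weights_dir=pipeline.weights_dir,
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)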
Is this the expected behavior? Are there other strategies or features of CoreML that could help here?
I’m testing on a 2021 M1 Max MacBook Pro running macOS 13.2.1.
In the ml-ane-transformers repo, there is a custom LayerNorm implementation for the Neural Engine-optimized shape of (B,C,1,S).
The coremltools documentation makes it sound like the layer_norm MIL op would support this natively. In fact, the following code works on CPU:
import torch
from coremltools.converters.mil import Builder as mb

B, C, S = 1, 768, 512
g, b = 1, 0

@mb.program(input_specs=[mb.TensorSpec(shape=(B, C, 1, S))])
def ln_prog(x):
    gamma = (torch.ones((C,), dtype=torch.float32) * g).tolist()
    beta = (torch.ones((C,), dtype=torch.float32) * b).tolist()
    return mb.layer_norm(x=x, axes=[1], gamma=gamma, beta=beta, name="y")
However, it fails when run on the Neural Engine, producing results that are scaled by an incorrect factor.
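To see the mismatch, I convert the same program twice and compare the CPU-only output with the Neural Engine path (a sketch; the input is just random data):

import numpy as np
import coremltools as ct

x = np.random.rand(B, C, 1, S).astype(np.float32)

cpu_model = ct.convert(ln_prog, convert_to="mlprogram",
                       compute_units=ct.ComputeUnit.CPU_ONLY)
ane_model = ct.convert(ln_prog, convert_to="mlprogram",
                       compute_units=ct.ComputeUnit.CPU_AND_NE)

cpu_y = cpu_model.predict({"x": x})["y"]
ane_y = ane_model.predict({"x": x})["y"]
print(np.abs(cpu_y - ane_y).max())  # noticeably large when the Neural Engine path is used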
Should this work on the Neural Engine?
I'm using some CoreML models from C++. I've been trying to profile them using the CoreML Instrument in Instruments. It seems that this only works when I sign my binaries with the get-task-allow entitlement.
Is there an easier way? Ideally I'd like to be able to profile a Python program that calls my C++ code and I would rather not re-sign Python.
I am testing the new scaled dot product attention CoreML op on macOS 15 beta 1. Based on the session video, I was expecting to see a speedup when running on the GPU; however, I see roughly the same performance as the same model on macOS 14.
I ran tests with two models:
one that simply repeats y = sdpa(y, k, v) 50 times (sketched below)
gpt2 124M converted from nanoGPT (the only change is not returning loss from the forward method)
I converted both models using coremltools 8.0b1 with minimum deployment targets of macOS 14 and macOS 15. In Xcode, I can see that the new op is used for the macOS 15 target. Running on macOS 15, both converted models take the same time, and that time matches the runtime on macOS 14.
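For reference, the first test case looks roughly like this (a sketch; the shapes are illustrative):

import torch
import coremltools as ct

class RepeatedSDPA(torch.nn.Module):
    def forward(self, q, k, v):
        y = q
        for _ in range(50):
            y = torch.nn.functional.scaled_dot_product_attention(y, k, v)
        return y

# illustrative shapes: (batch, heads, sequence, head_dim)
q = torch.rand(1, 8, 512, 64)
k = torch.rand(1, 8, 512, 64)
v = torch.rand(1, 8, 512, 64)
traced = torch.jit.trace(RepeatedSDPA().eval(), (q, k, v))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=q.shape),
            ct.TensorType(shape=k.shape),
            ct.TensorType(shape=v.shape)],
    minimum_deployment_target=ct.target.macOS15,  # vs. ct.target.macOS14 for the comparison
)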
Should I be seeing performance improvements?
I have several CoreML models that I've set up to run in sequence where one of the outputs from each model is passed as one of the inputs to the next.
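Roughly, the sequence looks like this (a sketch in Python; the chunk file names and feature names are placeholders):

import coremltools as ct

# placeholder chunk files, loaded once up front
chunks = [
    ct.models.MLModel(f"chunk_{i}.mlpackage",
                      compute_units=ct.ComputeUnit.CPU_AND_NE)
    for i in range(4)
]

def run(first_inputs):
    outputs = chunks[0].predict(first_inputs)
    for chunk in chunks[1:]:
        # one output of the previous chunk becomes an input of the next;
        # "hidden_states" is a placeholder feature name
        outputs = chunk.predict({"hidden_states": outputs["hidden_states"]})
    return outputs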
For the most part, there is very little overhead in between each sub-model "chunk":
However, a couple of the models (e.g., the first two above) spend a noticeable amount of time in "Prepare Neural Engine Request". From Instruments, it looks like this time is spent on some sort of model loading.
Given that I'm calling these models in sequence and in a fixed order, is there some way to reduce or amortize this cost? Thanks!