We recently had to change our MLModel's architecture to include custom layers, which means the model can't run on the Neural Engine anymore.
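For context, a custom layer conforms to MLCustomLayer, which only provides a CPU entry point (evaluate) and an optional Metal/GPU entry point (encode); there is no Neural Engine path. A simplified sketch for illustration (MyCustomLayer and its no-op body are placeholders, not our actual layer):

```swift
import CoreML
import Metal

@objc(MyCustomLayer) final class MyCustomLayer: NSObject, MLCustomLayer {
    required init(parameters: [String: Any]) throws { super.init() }

    func setWeightData(_ weights: [Data]) throws {}

    func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
        inputShapes  // shape-preserving in this sketch
    }

    // CPU path: always required.
    func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
        // ... compute the op on the CPU ...
    }

    // Optional GPU path, used when Core ML schedules the layer on Metal.
    func encode(commandBuffer: MTLCommandBuffer,
                inputs: [MTLTexture], outputs: [MTLTexture]) throws {
        // ... encode a compute kernel ...
    }
}
```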
After the change, we observed a large number of crashes being reported on A13 devices. It turns out that memory consumption when running the prediction with the new model on the GPU is much higher than it was on the Neural Engine: the peak memory load used to be ~350 MB, but now it spikes to over 2 GB, leading to a crash most of the time. This only seems to happen on the A13. When we force the model to run on the CPU only, memory consumption is still high, but matches the old model on the CPU (~750 MB peak). All of this was tested on iOS 16.1.2.
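For reference, this is roughly how we restrict the compute units in these tests (MyModel stands in for our generated model class):

```swift
import CoreML

// Load the model with restricted compute units.
// .cpuOnly forces the CPU path; .all lets Core ML choose
// between CPU, GPU, and Neural Engine.
func loadModel(computeUnits: MLComputeUnits) throws -> MyModel {
    let config = MLModelConfiguration()
    config.computeUnits = computeUnits  // e.g. .cpuOnly, .cpuAndGPU, .all
    return try MyModel(configuration: config)
}
```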
We profiled the process in Instruments and found many memory buffers allocated by Core ML that are not freed after the prediction. The allocation stack trace for those buffers is as follows:
We ran the same model on a different (non-A13) device and found the same buffers in Instruments, but there they are only 4 KB each. It seems that Core ML is massively over-allocating memory when running on the A13 GPU.
For now, we limit the model to CPU-only execution on those devices (sketch below), but this is far from ideal. Is there any other model setting or workaround we can use to avoid this issue?
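Our current workaround looks roughly like this; reading the hardware identifier via utsname is one common approach, and the A13 identifier list (iPhone 11 family, iPhone SE 2nd gen, iPad 9th gen) plus MyModel are our placeholders:

```swift
import CoreML
import Darwin

// Hardware model identifier, e.g. "iPhone12,1" (iPhone 11).
func machineIdentifier() -> String {
    var systemInfo = utsname()
    uname(&systemInfo)
    return withUnsafeBytes(of: &systemInfo.machine) { raw in
        String(decoding: raw.prefix(while: { $0 != 0 }), as: UTF8.self)
    }
}

// Devices with an A13: iPhone 11 / 11 Pro / 11 Pro Max,
// iPhone SE (2nd gen), iPad (9th gen).
let a13Identifiers: Set<String> = [
    "iPhone12,1", "iPhone12,3", "iPhone12,5", "iPhone12,8",
    "iPad12,1", "iPad12,2",
]

func loadModelForCurrentDevice() throws -> MyModel {
    let config = MLModelConfiguration()
    // Avoid the GPU over-allocation on A13 by falling back to the CPU.
    config.computeUnits = a13Identifiers.contains(machineIdentifier()) ? .cpuOnly : .all
    return try MyModel(configuration: config)
}
```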