Running a local LLM on Swift Playgrounds

I am trying to run TinyLlama directly in Swift Playgrounds on iOS. I have tried several approaches: libraries such as LLM.swift, swift-transformers, ..., which never worked due to import issues, and importing an exported mlmodel.

For the latter, I followed the article about running Llama 3.1 with Core ML. It was hard to work out how to run inference with it, but I managed to export an mlpackage, which I then added to an Xcode project to generate the mlmodelc (compiled model) and the model class. I had to use the first version described in the article, without optimizations, because the flexible input shapes caused errors during model loading. I was able to run the model to generate one token.
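For reference, this is roughly how I run the single forward pass (a minimal sketch; the resource name, the input/output names `inputIds`, `causalMask`, and `logits`, and the fixed sequence length come from my export and may differ in yours):

```swift
import CoreML

// Load the compiled model bundled as a resource.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU

let url = Bundle.main.url(forResource: "Llama", withExtension: "mlmodelc")!
let model = try MLModel(contentsOf: url, configuration: config)

// The non-optimized export uses a fixed input shape, so the prompt
// has to be padded or truncated to `seqLen` tokens.
let seqLen = 64
let inputIds = try MLMultiArray(shape: [1, NSNumber(value: seqLen)], dataType: .int32)
let causalMask = try MLMultiArray(
    shape: [1, 1, NSNumber(value: seqLen), NSNumber(value: seqLen)],
    dataType: .float16
)
// ... fill inputIds with tokenizer output and causalMask with the usual
// lower-triangular attention mask ...

// One prediction yields logits for every position; greedy decoding takes
// the argmax over the last prompt position to pick the next token.
let inputs = try MLDictionaryFeatureProvider(dictionary: [
    "inputIds": inputIds,
    "causalMask": causalMask,
])
let output = try model.prediction(from: inputs)
let logits = output.featureValue(for: "logits")!.multiArrayValue!
```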

But my biggest problem is that, although the mlmodelc is only 550 MiB, the model loads 24+ GiB of memory, far exceeding what is available on an iOS device.
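For context, the per-process memory budget can be checked at runtime (a quick sketch; `os_proc_available_memory()` is available on iOS 13+):

```swift
import os

// Remaining memory the process may allocate before hitting the system limit.
// On iPads this is typically a few GiB, nowhere near the 24+ GiB the load needs.
let availableBytes = os_proc_available_memory()
print(String(format: "Available: %.2f GiB", Double(availableBytes) / 1_073_741_824))
```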

Is there a way to run LLM inference in Swift Playgrounds at a reasonable speed (even 1 token/s would be sufficient)?
