Does anyone have any insights on how to get Core ML models to run on real-time video with maximum throughput?
Using Metal directly, the standard approach is to use a semaphore to run a double or triple buffering scheme, so that the next request(s) will already be scheduled while the current one is still executing. That way the CPU and GPU never have to wait on each other.
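As a sketch, the buffering pattern I mean looks roughly like this. The Metal command-buffer encoding is replaced by a placeholder closure (in real code you would signal the semaphore from the command buffer's completion handler); the semaphore's initial value sets how many frames may be in flight:

```swift
import Foundation
import Dispatch

// Allow up to 3 frames in flight at once (triple buffering).
let inFlightSemaphore = DispatchSemaphore(value: 3)
let gpuQueue = DispatchQueue(label: "gpu.work")

var completed = 0
let lock = NSLock()

// Placeholder for encoding + committing a Metal command buffer for one frame.
func submitFrame(_ index: Int, completion: @escaping () -> Void) {
    gpuQueue.async {
        // ... encode and commit GPU work for frame `index` here ...
        completion()
    }
}

for frame in 0..<30 {
    // Block if 3 frames are already scheduled but not yet finished.
    inFlightSemaphore.wait()
    submitFrame(frame) {
        lock.lock(); completed += 1; lock.unlock()
        // In real code: signal from MTLCommandBuffer's addCompletedHandler.
        inFlightSemaphore.signal()
    }
}

// Drain the remaining in-flight frames before reading `completed`.
for _ in 0..<3 { inFlightSemaphore.wait() }
print(completed)  // 30
```

Because the loop only blocks when three frames are already queued, the CPU can encode frame N+1 (or N+2) while the GPU is still executing frame N.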
This approach also works for Core ML (and Vision), but I've noticed a few strange things:
1. If you perform more than one request at a time using the same VNCoreMLRequest (but with a new VNImageRequestHandler instance each time), the result is sometimes empty: no error, but no observations either. It seems to work fine if you allocate 2 or 3 VNCoreMLRequest instances and alternate between them.
2. With double or triple buffering, pure Core ML is significantly faster than going through VNCoreMLRequest. Together with point #1, this makes me think that a semaphore plus multiple request instances is not the intended approach for Vision (or for pure Core ML, for that matter).
3. Both Core ML and VNCoreMLRequest are waaaaay slower than a Metal version that uses MPSCNN. Using the MobileNetV1 download from the Apple website I can squeeze out 53 FPS on the iPhone X running iOS 11.3. The exact same model but implemented using MPSCNN does 160 FPS on the same device. I measured similar differences in speed on other devices too. (FPS here is measured by doing a prediction on the same image over and over in a loop and dividing the number of frames by the total elapsed time).
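The workaround from point #1 in sketch form: instead of reusing a single request object across in-flight frames, keep a small pool and rotate through it. The Vision calls are shown as comments; `RequestSlot` is a stand-in type so the rotation logic itself can run anywhere:

```swift
import Foundation

// Stand-in for a VNCoreMLRequest; in real code build these once with your model.
final class RequestSlot {
    let id: Int
    init(id: Int) { self.id = id }
}

// One request per in-flight frame (match the pool size to the semaphore count).
let pool = (0..<3).map { RequestSlot(id: $0) }
var frameIndex = 0

func requestForNextFrame() -> RequestSlot {
    // Round-robin through the pool so no request object is used twice at once.
    let slot = pool[frameIndex % pool.count]
    frameIndex += 1
    // In real code:
    //   let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
    //   try handler.perform([slot.request])   // fresh handler, pooled request
    return slot
}

let ids = (0..<6).map { _ in requestForNextFrame().id }
print(ids)  // [0, 1, 2, 0, 1, 2]
```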
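For reference, the FPS measurement described in point #3 is just a tight loop over the same image; with the actual model call replaced by a placeholder, it looks like this (the `predict` function here is illustrative, not any framework API):

```swift
import Foundation

// Placeholder for the Core ML / MPSCNN prediction on a single frame.
func predict() {
    // Simulate a fixed amount of per-frame work.
    Thread.sleep(forTimeInterval: 0.0005)
}

let frameCount = 50
let start = Date()
for _ in 0..<frameCount {
    predict()  // same image every iteration; we only care about throughput
}
let elapsed = Date().timeIntervalSince(start)
let fps = Double(frameCount) / elapsed
print(String(format: "%.1f FPS", fps))
```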
While I'm happy my Metal implementations are so fast, I am wondering why the Core ML / Vision versions of the same model run so much slower. Am I not using Core ML properly? Is Core ML throttling GPU usage to save battery power? Is Core ML (or Vision) doing extra things that take up a lot of time?
I was just curious if anyone else has some insights to offer. :-)