CNN - style transfer performance

We’ve been trying several publicly available style transfer models on iOS, using Metal Performance Shaders (MPSCNN) and mlmodelzoo CoreML models, and we are not happy with their performance yet.

For MPSCNN we use 512 x 512 output images, and for mlmodelzoo’s models we have to use 480 x 640. On an iPhone 6, the best MPSCNN model gave us something like 0.8 secs/image, while mlmodelzoo’s gave us nearly 5 secs/image. Note that this does not mean CoreML is slower; the model used is the key factor.
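For context, those secs/image figures are just wall-clock time around a single prediction; a minimal sketch of how such a measurement can be taken (the model and input here are placeholders, not our actual networks):

```swift
import CoreML
import Foundation

// Minimal timing sketch; `model` and `input` stand in for whatever
// style transfer model / feature provider is being benchmarked.
func secondsPerImage(model: MLModel, input: MLFeatureProvider) throws -> TimeInterval {
    let start = CFAbsoluteTimeGetCurrent()
    _ = try model.prediction(from: input)
    return CFAbsoluteTimeGetCurrent() - start
}
```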

Looking at some well-known apps out there, we saw that Apple Clips has two filters that really run in real time!

Our best results seem to be close to the Facebook app (which also has a few style transfer filters).

Any idea what models are used in Apple Clips?

Replies

Public domain models are often not optimized for mobile GPUs. Often researchers will use VGGNet for the CNN part of their network, which is really slow (but convenient to use if you're a researcher). One way to optimize the model is to replace the VGGNet layers with something like MobileNet or SqueezeNet, but this does require retraining of the model from scratch.

I wrote a converter for our neural style transfer networks to CoreML, and they run much faster on current iOS devices than even our desktop implementation. So it can't be an issue with CoreML.


Also, I'm almost certain that Clips doesn't use neural style transfer. Those filters look like "normal" image filters. You should check out CoreImage for those.
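For what it's worth, applying a plain CoreImage filter looks like this (CIComicEffect is just an arbitrary built-in filter, not a claim about what Clips actually uses):

```swift
import CoreImage
import UIKit

// Apply a built-in CoreImage filter. CIComicEffect is only an example;
// we don't know which filters Clips really uses.
func applyComicEffect(to image: UIImage) -> UIImage? {
    guard let ciInput = CIImage(image: image),
          let filter = CIFilter(name: "CIComicEffect") else { return nil }
    filter.setValue(ciInput, forKey: kCIInputImageKey)
    guard let output = filter.outputImage else { return nil }
    let context = CIContext()
    guard let cgImage = context.createCGImage(output, from: output.extent) else { return nil }
    return UIImage(cgImage: cgImage)
}
```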

For sure, the issue is not CoreML. The issue is with the model used and its associated graph.

As for Clips, you might be right... not sure, though.

We also use Philm as a benchmark, and they claim to use deep learning for style transfer. Although it runs in real time, their results do not look that good. They seem to apply the filter at low resolution (maybe 256 x 256) and then blend it with the original image.
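If that is indeed what they do, a rough sketch of that "stylize at low resolution, then blend with the original" step using CoreImage might look like this (the dissolve filter and the blend strength are my assumptions, not anything Philm has documented):

```swift
import CoreImage
import CoreGraphics

// Sketch of a "stylize small, then blend with the original" step.
// `stylizedSmall` is the low-resolution style transfer output.
func blend(original: CIImage, stylizedSmall: CIImage, strength: CGFloat = 0.7) -> CIImage? {
    // Upscale the stylized image so it covers the original's extent.
    let scaleX = original.extent.width / stylizedSmall.extent.width
    let scaleY = original.extent.height / stylizedSmall.extent.height
    let upscaled = stylizedSmall.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

    // Cross-fade between the original and the upscaled stylized image.
    guard let dissolve = CIFilter(name: "CIDissolveTransition") else { return nil }
    dissolve.setValue(original, forKey: kCIInputImageKey)
    dissolve.setValue(upscaled, forKey: kCIInputTargetImageKey)
    dissolve.setValue(strength, forKey: kCIInputTimeKey)
    return dissolve.outputImage
}
```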

kerfuffle, do you have benchmarks that you can share?

VGG16 runs at 1-2 FPS on iPhone 6s and makes it very hot. MobileNet runs at 30 FPS on iPhone 6s and only makes it moderately hot. Accuracy of the two models is pretty much the same. MobileNet uses 4M parameters, VGG16 uses 138M parameters.


Although in all fairness, when VGG16 is used as the feature extractor in a larger model it may use fewer fully-connected layers and therefore is a bit faster than full VGG16.

What is your output resolution? Are you following what is described in "Perceptual Losses for Real-Time Style Transfer and Super-Resolution"? We do, and we also use instance normalization as described in "Instance Normalization: The Missing Ingredient for Fast Stylization".
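For reference, instance normalization as described in that paper normalizes each channel of each image with its own spatial mean and variance (my paraphrase of the paper's equations):

```latex
y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^{2} + \varepsilon}}, \qquad
\mu_{ti} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad
\sigma_{ti}^{2} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} \left( x_{tilm} - \mu_{ti} \right)^{2}
```

where t indexes the image in the batch, i the channel, and the remaining indices the spatial positions.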

I am not doing style transfer at all. Just pointing out that a lot of models published by researchers use VGG16 (or 19) as their feature extractor, which is incredibly slow on mobile.

Just a small note to the thread.

CoreML seems to be the best way to go now that it has the option to run on either the GPU or the CPU (>= beta 5). On the iPhone 6 we get better performance running on the CPU, while on the iPhone 7 it is much better to execute on the GPU.

Having said this, you can make the decision at runtime and set the flag appropriately.
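Something along these lines, using the MLPredictionOptions.usesCPUOnly flag (how you decide which devices prefer the CPU is up to you; the Bool here is just a placeholder for that check):

```swift
import CoreML

// Sketch: flip between CPU-only and GPU-allowed execution at runtime.
// Deciding `preferCPU` (e.g. based on the device generation) is left to the caller.
func predict(model: MLModel, input: MLFeatureProvider, preferCPU: Bool) throws -> MLFeatureProvider {
    let options = MLPredictionOptions()
    // `usesCPUOnly = true` restricts CoreML to the CPU; leaving it `false`
    // lets CoreML use the GPU where available.
    options.usesCPUOnly = preferCPU
    return try model.prediction(from: input, options: options)
}
```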