Dynamic Core ML model inference is significantly slower than static model

Device:

iPhone 11

Config:

let configuration = MLModelConfiguration()
configuration.computeUnits = .all
let myModel = try! myCoremlModel(configuration: configuration).model

Set the Range for Each Dimension:

input_shape = ct.Shape(shape=(1, 3,
                              ct.RangeDim(lower_bound=128, upper_bound=384, default=256),
                              ct.RangeDim(lower_bound=128, upper_bound=384, default=256)))
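For context, this shape is passed to ct.convert roughly as follows (a sketch; trace_model and the input name are placeholders for my real traced model):

# Sketch: converting the traced PyTorch model with the range-flexible shape above.
# "trace_model" and the input name are placeholders.
mlmodel = ct.convert(
    trace_model,
    inputs=[ct.TensorType(name="input", shape=input_shape)],
)
mlmodel.save("check_range.mlmodel")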

  • Inference time (average of 100 runs)

At the default size, the dynamic model is as fast as the static model, but at 128x128 and 384x384 it is hundreds of times slower than the corresponding fixed-size models. Is this normal? Is there a good solution?

  • Model init time is too long

Loading the model takes about 2 minutes. Is there a way to speed it up, for example by loading from a cache? Would converting to an mlpackage speed up loading?
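By "converted mlpackage" I mean converting with the ML Program backend, roughly like this sketch (trace_model and input_shape are placeholders for the real traced model and shapes):

# Sketch: convert to an ML Program and save as an .mlpackage instead of an .mlmodel.
mlmodel = ct.convert(
    trace_model,
    inputs=[ct.TensorType(name="input", shape=input_shape)],
    convert_to="mlprogram",
)
mlmodel.save("check.mlpackage")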

Accepted Answer

For models with range flexibility, we currently only support running on the Neural Engine for the input's default shape. Other shapes will be run on either GPU or CPU, which is likely why you are seeing higher latency for non-default shapes.

One other option you have here is to use enumerated flexibility instead of range flexibility. If you only need a smaller set of sizes supported by the model, you can use ct.EnumeratedShapes type to specify each shape the model should support. For enumerated shape flexibility, each shape should be able to run on the Neural Engine. You can read more about the advantages of enumerated shapes here https://coremltools.readme.io/docs/flexible-inputs.

Here is a simple example:

import torch
import torch.nn as nn

import coremltools as ct

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_pre1 = nn.ConvTranspose2d(128, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv_pre2 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)

        self.conv1 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv2 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv3 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv4 = nn.ConvTranspose2d(256, 3, kernel_size=3, stride=2, padding=1, output_padding=1)


    def forward(self, input1, input2):
        y = self.conv_pre1(input2)
        y = self.conv_pre2(y)
        
        x = input1 + y
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        nn_output = torch.clip(x, 0.0, 1.0)
        recon_img_out = torch.ceil(nn_output*255.0-0.5)
        return recon_img_out

model = Model()
model.cuda()


dummy_input_f = torch.randn(1,256, 68, 120, device='cuda')
dummy_input_z = torch.randn(1,128, 17, 30, device='cuda')

torch_model = model.eval()
trace_model = torch.jit.trace(torch_model, (dummy_input_f, dummy_input_z))


# Set the input shapes to use EnumeratedShapes for each input.
input_x1_shape = ct.EnumeratedShapes(shapes=[[1, 256, 8, 8],
                                             [1, 256, 16, 16],
                                             [1, 256, 24, 24]],
                                             default=[1, 256, 16, 16])
input_x2_shape = ct.EnumeratedShapes(shapes=[[1, 128, 2, 2],
                                             [1, 128, 4, 4],
                                             [1, 128, 6, 6]],
                                             default=[1, 128, 4, 4])

input_1=ct.TensorType(name="input_x1", shape=input_x1_shape)  
input_2=ct.TensorType(name="input_x2", shape=input_x2_shape)  
outputs=ct.TensorType(name="output_img")  
# outputs=ct.ImageType(name="output_img", color_layout=ct.colorlayout.RGB)
mlmodel = ct.convert(
    trace_model,
    inputs=[input_1, input_2],
    outputs=[outputs],
)
mlmodel.save("check.mlmodel")

Except for the default shape, the other two are still too slow.

  1. input1 8x8, input2 2x2: 50 ms
  2. input1 24x24, input2 6x6: 50 ms
  3. input1 16x16, input2 4x4 (default): 1.8 ms

Then I changed the model to take a single input by removing input2. The non-default shape inference times speed up a bit, but are still unusual.
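A rough sketch of the single-input conversion (assuming the same layers as above with the input2 / conv_pre branch dropped):

# Sketch: single-input variant of the model above; the exact layers are an assumption.
class SingleInputModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv2 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv3 = nn.ConvTranspose2d(256, 256, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.conv4 = nn.ConvTranspose2d(256, 3, kernel_size=3, stride=2, padding=1, output_padding=1)

    def forward(self, input1):
        x = self.conv1(input1)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        nn_output = torch.clip(x, 0.0, 1.0)
        return torch.ceil(nn_output * 255.0 - 0.5)

single_trace = torch.jit.trace(SingleInputModel().eval(), torch.randn(1, 256, 16, 16))

single_input_shape = ct.EnumeratedShapes(shapes=[[1, 256, 8, 8],
                                                 [1, 256, 16, 16],
                                                 [1, 256, 24, 24]],
                                                 default=[1, 256, 16, 16])
mlmodel_single = ct.convert(
    single_trace,
    inputs=[ct.TensorType(name="input_x1", shape=single_input_shape)],
    outputs=[ct.TensorType(name="output_img")],
)
mlmodel_single.save("check_single.mlmodel")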

Enumerated model inference speed:

  1. input1 8x8: 1.9 ms
  2. input1 24x24: 12.14 ms
  3. input1 16x16 (default): 1.8 ms

For comparison, the 8x8 and 24x24 inference times with fixed-size models are ~0.5 ms and ~4 ms.

Are these results normal? Is it expected that the single-input enumerated model also slows down by 3 to 4 times?
