ML ANE Model unloaded on first call to predict
Hi, I am trying to take advantage of my device's ANE. I have created a model from torch using coremltools and adapted it until the Xcode model performance preview indicates it will run on my device's ANE. When I profile my integration in my app, I can see from the com.apple.ane logs that the model has been loaded on the device:

Timestamp      Type   Process          Category  Message
00:00.905.087  Debug  MLBench (11135)  client    doLoadModel:options:qos:error:: model[0x2804500c0] : success=1 : progamHandle=10 000 241 581 886: intermediateBufferHandle=10 000 242 143 532 : queueDepth=32 :err=

But when I call predict on my model, the ANE model is unloaded and the prediction runs on the CPU:

Timestamp      Type   Process          Category  Message
00:00.996.015  Debug  MLBench (11135)  client    doUnloadModel:options:qos:error:: model[0x2804500c0]=_ANEModel: { modelURL=file:///var/mobile/Containers/Data/Application/0A9F356B-B8C7-4B86-90A5-6812EF48CC94/tmp/math_custom_trans_decoder_seg_0DB63A47-E84E-4887-A606-BC9986B2C662.mlmodelc/ : key={"isegment":0,"inputs":{"extras":{"shape":[2,1,1,1,1]},"memory":{"shape":[128,5,1,1,1]},"proj_key_seg_in":{"shape":[128,39,1,1,1]},"state_in_k":{"shape":[32,1,1,20,2]},"tgt":{"shape":[5,1,1,1,1]},"state_in_v":{"shape":[32,1,1,20,2]},"pos_enc":{"shape":[128,1,1,1,1]}},"outputs":{"attn_seg":{"shape":[1,5,1,4,1]},"state_out_v":{"shape":[32,2,1,20,2]},"output":{"shape":[292,5,1,1,1]},"state_out_k":{"shape":[32,2,1,20,2]},"extras_tmp":{"shape":[2,1,1,1,1]},"proj_key_seg_in_tmp":{"shape":[128,39,1,1,1]},"attn":{"shape":[1,1,1,5,2]},"proj_key_seg":{"shape":[128,1,1,20,1]}}} : string_id=0x70ac000000015257 : program=_ANEProgramForEvaluation: { programHandle=10000241581886 : intermediateBufferHandle=10000242143532 : queueDepth=32 } : state=3 : programHandle=10000241581886 : intermediateBufferHandle=10000242143532 : queueDepth=32 : attr=... : perfStatsMask=0}

I don't see any obvious error messages in com.apple.ane, com.apple.coreml, or com.apple.espresso. Where and for what should I look to understand what is going on, and in particular why the ANE model was unloaded? Thank you.
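One way to narrow a fallback like this down is to check, outside the app, whether the converted package can run with the GPU excluded at all by loading it with coremltools and restricting the compute units while watching the com.apple.ane log. The sketch below is only a smoke test, not a fix; the package path math_custom_trans_decoder_seg.mlpackage and the zero-filled inputs are placeholders for the poster's actual model and data.

import numpy as np
import coremltools as ct
from coremltools.proto import FeatureTypes_pb2 as ft

# Hypothetical path: substitute the actual converted package.
PACKAGE = "math_custom_trans_decoder_seg.mlpackage"

# Restrict Core ML to CPU + Neural Engine so the GPU is never an option;
# running predict while streaming the com.apple.ane log shows whether the
# model itself can stay loaded on the ANE independent of the app integration.
model = ct.models.MLModel(PACKAGE, compute_units=ct.ComputeUnit.CPU_AND_NE)

# Build zero-filled placeholder inputs from the model description
# (not meaningful data, just enough to exercise predict once).
_DTYPES = {
    ft.ArrayFeatureType.FLOAT32: np.float32,
    ft.ArrayFeatureType.DOUBLE: np.float64,
    ft.ArrayFeatureType.INT32: np.int32,
}
spec = model.get_spec()
inputs = {
    inp.name: np.zeros(tuple(inp.type.multiArrayType.shape),
                       dtype=_DTYPES[inp.type.multiArrayType.dataType])
    for inp in spec.description.input
}

print(model.predict(inputs))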
Replies: 2 · Boosts: 0 · Views: 1.7k · Activity: Feb ’23
CoreML, Invalid indexing on GPU
I believe I am encountering a bug in the MPS backend of CoreML. There appears to be an invalid conversion of a slice_by_index + gather operation, resulting in indexing the wrong values on GPU execution. The following Python program, using the coremltools library, illustrates the issue (imports added for completeness):

import tempfile

import numpy as np
import torch

import coremltools as ct
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.mil import types

dB = 20480
shapeI = (2, dB)
shapeB = (dB, 22)

@mb.program(input_specs=[mb.TensorSpec(shape=shapeI, dtype=types.int32), mb.TensorSpec(shape=shapeB)])
def prog(i, b):
    lslice = mb.slice_by_index(x=i, begin=[0, 0], end=[1, dB], end_mask=[False, True], squeeze_mask=[True, False], name='slice_left')
    rslice = mb.slice_by_index(x=i, begin=[1, 0], end=[2, dB], end_mask=[False, True], squeeze_mask=[True, False], name='slice_right')
    ldata = mb.gather(x=b, indices=lslice)
    rdata = mb.gather(x=b, indices=rslice)
    # actual bug in optimization of gather+slice
    x = mb.add(x=ldata, y=rdata)
    # dummy ops to make a bigger graph to run on GPU
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=2.)
    x = mb.mul(x=x, y=.5)
    x = mb.mul(x=x, y=1., name='result')
    return x

input_types = [
    ct.TensorType(name="i", shape=shapeI, dtype=np.int32),
    ct.TensorType(name="b", shape=shapeB, dtype=np.float32),
]

with tempfile.TemporaryDirectory() as tmpdirname:
    model_cpu = ct.convert(prog,
                           inputs=input_types,
                           compute_precision=ct.precision.FLOAT32,
                           compute_units=ct.ComputeUnit.CPU_ONLY,
                           package_dir=tmpdirname + 'model_cpu.mlpackage')
    model_gpu = ct.convert(prog,
                           inputs=input_types,
                           compute_precision=ct.precision.FLOAT32,
                           compute_units=ct.ComputeUnit.CPU_AND_GPU,
                           package_dir=tmpdirname + 'model_gpu.mlpackage')

    inputs = {
        "i": torch.randint(0, shapeB[0], shapeI, dtype=torch.int32),
        "b": torch.rand(shapeB, dtype=torch.float32),
    }

    cpu_output = model_cpu.predict(inputs)
    gpu_output = model_gpu.predict(inputs)

    # equivalent to prog
    expected = inputs["b"][inputs["i"][0]] + inputs["b"][inputs["i"][1]]
    # what actually happens on GPU
    actual = inputs["b"][inputs["i"][0]] + inputs["b"][inputs["i"][0]]

    print(f"diff expected vs cpu: {np.sum(np.absolute(expected - cpu_output['result']))}")
    print(f"diff expected vs gpu: {np.sum(np.absolute(expected - gpu_output['result']))}")
    print(f"diff actual vs gpu: {np.sum(np.absolute(actual - gpu_output['result']))}")

The issue seems to occur in the slice_right + gather operations when executed on the GPU: the wrong items of input "i" are selected. The program outputs

diff expected vs cpu: 0.0
diff expected vs gpu: 150104.015625
diff actual vs gpu: 0.0

This behavior has been tested on a MacBook Pro 14-inch 2023 (M2 Pro) running macOS 14.7, using coremltools 8.0b2 with Python 3.9.19.
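A possible workaround, sketched below with the same shapes as the repro and not verified against the bug: gather once with the full index tensor and slice the gathered result afterwards, so that no slice_by_index output ever feeds a gather.

from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.mil import types

dB = 20480
shapeI = (2, dB)
shapeB = (dB, 22)

@mb.program(input_specs=[mb.TensorSpec(shape=shapeI, dtype=types.int32), mb.TensorSpec(shape=shapeB)])
def prog_workaround(i, b):
    # Gather with the full (2, dB) index tensor: result has shape (2, dB, 22).
    data = mb.gather(x=b, indices=i, axis=0)
    # Slice the gathered result instead of the indices, so the suspect
    # slice_by_index -> gather pattern never appears in the graph.
    ldata = mb.slice_by_index(x=data, begin=[0, 0, 0], end=[1, dB, 22],
                              end_mask=[False, True, True],
                              squeeze_mask=[True, False, False], name='left')
    rdata = mb.slice_by_index(x=data, begin=[1, 0, 0], end=[2, dB, 22],
                              end_mask=[False, True, True],
                              squeeze_mask=[True, False, False], name='right')
    return mb.add(x=ldata, y=rdata, name='result')

Converting this variant with CPU_ONLY and CPU_AND_GPU and diffing the outputs, exactly as in the repro above, would show whether it sidesteps the misindexing.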
Replies: 3 · Boosts: 0 · Views: 393 · Activity: Sep ’24