Hi,
I have a custom object detection CoreML model, and I noticed something strange when using the model with the Vision framework.
I have tried two different approaches to processing an image and running inference on the CoreML model. The first is using CoreML "raw": initialising the model, preparing the input image, and calling the model's .prediction() function to get the model's output.
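For reference, this is roughly what the first approach looks like. MyDetector and its feature names ("image", "confidence", "coordinates") are placeholders for my actual model:

```swift
import CoreML
import CoreVideo

// Sketch of approach 1: calling the generated model class directly.
// "MyDetector" and its input/output names are placeholders.
func rawCoreMLPrediction(on pixelBuffer: CVPixelBuffer) throws {
    let model = try MyDetector(configuration: MLModelConfiguration())
    let output = try model.prediction(image: pixelBuffer)

    // "confidence" is an MLMultiArray of shape [boxes, numClasses]
    // (3 classes in my case); "coordinates" is [boxes, 4].
    print(output.confidence.shape)   // e.g. [N, 3]
    print(output.coordinates.shape)  // e.g. [N, 4]
}
```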
The second is using Vision to wrap the CoreML model in a VNCoreMLModel, creating a VNCoreMLRequest, and using a VNImageRequestHandler to perform the inference. The results of the VNCoreMLRequest are of type VNRecognizedObjectObservation.
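The second approach looks roughly like this, using the same placeholder model class:

```swift
import Vision
import CoreML
import CoreVideo

// Sketch of approach 2: wrapping the model in VNCoreMLModel and
// running it through a VNCoreMLRequest.
func visionPrediction(on pixelBuffer: CVPixelBuffer) throws {
    let vnModel = try VNCoreMLModel(for: MyDetector(configuration: MLModelConfiguration()).model)

    let request = VNCoreMLRequest(model: vnModel) { request, _ in
        guard let observations = request.results as? [VNRecognizedObjectObservation] else { return }
        for observation in observations {
            // boundingBox is in normalized coordinates;
            // labels is sorted by descending confidence.
            print(observation.boundingBox,
                  observation.confidence,
                  observation.labels.first?.identifier ?? "?")
        }
    }

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])
}
```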
The issue I now face is the difference between the outputs of the two methods. The first method gives back the raw output of the CoreML model: confidence and coordinates. Here the confidence is an array whose size equals the number of classes in my model (3 in my case). The second method gives back a boundingBox, a confidence, and labels. However, here the confidence is only the confidence for the most likely class (so the size is equal to 1). Moreover, the confidence I get from the second approach is quite different from the confidence I get with the first approach.
I can use either approach in my application, but I would really like to find out what is going on and understand where this difference comes from.
Thanks!
Hello,
"However, here the confidence is only the confidence for the most likely class (so the size is equal to 1)."
That is not actually what this confidence value represents. Currently, the "overall" confidence value of a VNRecognizedObjectObservation is the sum of the confidences of all of the labels in its labels array.
Note that this is an implementation detail that is subject to change, so you should not rely on this behavior always being the case.
The confidence scores for each individual label should match what you receive when you run inference through CoreML directly.
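For example, you can verify this with something like the following sketch (keeping in mind that the summing behavior is an implementation detail):

```swift
import Vision

// Compare the overall confidence with the sum of the per-label confidences.
func inspect(_ observation: VNRecognizedObjectObservation) {
    let summed = observation.labels.reduce(VNConfidence(0)) { $0 + $1.confidence }
    print("overall:", observation.confidence, "sum of labels:", summed)

    // These per-label confidences are what should line up with the raw
    // CoreML "confidence" multi-array from the first approach.
    for label in observation.labels {
        print(label.identifier, label.confidence)
    }
}
```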