Apple Developer community,
I recently updated Xcode and Core ML from version 13.0.1 to 14.1.2 and am facing an issue with the MLOneHotEncoder in my Core ML classifier. The same code and data that worked fine in the previous version now throw an error during predictions. The error message is:
MLOneHotEncoder: unknown category String [TERM] expected one of
This suggests that MLOneHotEncoder no longer handles unknown strings the way it did in the previous version. Here's a brief overview of my situation:
- Core ML Model: The model is a classifier that uses MLOneHotEncoder for processing categorical data.
- Data: The same dataset is used for training and predictions, which worked fine before the update.
- Error Context: The error occurs at the prediction stage, not during training.
I have checked for data consistency and confirmed that the dataset is the same one I used with the previous version.
Here are my questions:
- Has there been a change in how MLOneHotEncoder handles unknown categories in Core ML version 14.1.2?
- Are there any recommended practices for handling unknown string categories with MLOneHotEncoder in the updated Core ML version?
- Is there a need to modify the model training code or data preprocessing steps to accommodate changes in the new Core ML version?
I would appreciate any insights or suggestions on how to resolve this issue. If additional information is needed, I am happy to provide it.
Thank you for your assistance!
Hi Simon, thanks for writing. When a model with a one-hot encoder trained in CreateML encounters an unknown category, the expected behavior is that an error is thrown. Here is a reference guide for OneHotEncoder in CoreML. Within CoreML, you can set the enum HandleUnknown to either ErrorOnUnknown (expected behavior) or IgnoreUnknown. There are arguments for both approaches, and a lot of it depends on the data you have and the model you're trying to train. A version of CreateML on a previous OS ignored unknown values, so training a new model on a newer OS with the same data might produce the result you're seeing.
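To make the difference between the two HandleUnknown options concrete, here is a minimal pure-Python sketch. It is not Core ML's implementation: the `HandleUnknown` class and the `one_hot` function are hypothetical stand-ins, and the all-zero output for IgnoreUnknown is an assumption made for illustration.

```python
from enum import Enum


class HandleUnknown(Enum):
    # Names mirror the two options discussed above. This enum is a
    # stand-in for illustration, not the Core ML type itself.
    ERROR_ON_UNKNOWN = "ErrorOnUnknown"
    IGNORE_UNKNOWN = "IgnoreUnknown"


def one_hot(value, categories, handle_unknown=HandleUnknown.ERROR_ON_UNKNOWN):
    """Encode `value` as a one-hot list over the known `categories`."""
    if value in categories:
        return [1 if c == value else 0 for c in categories]
    if handle_unknown is HandleUnknown.ERROR_ON_UNKNOWN:
        # Strict policy: refuse to encode a value never seen in training.
        raise ValueError(
            f"unknown category {value!r}, expected one of {categories}"
        )
    # Lenient policy (assumed behavior): no slot is set for an unknown value.
    return [0] * len(categories)
```

With the strict policy, a single unseen string fails the whole prediction, which matches the error in the question. With the lenient policy the prediction goes through, but the model receives no signal for that feature, so a mismatch between training and prediction data can go unnoticed.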
If you're using the exact same dataset for training and predictions, this shouldn't be an issue, because every categorical value in the prediction data should also have been present during training. Would you mind sharing the dataset (or a small portion of it that reproduces the issue) if you feel comfortable, so that we can help isolate what's causing the error? Thanks!
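One way to narrow this down before sharing data is to list the categorical values that appear in the prediction input but not in the training input. The sketch below uses only the Python standard library. The function name `unseen_categories`, the column names, and the sample rows are all hypothetical, and it assumes both datasets can be read as rows keyed by column name.

```python
import csv
import io


def unseen_categories(train_rows, predict_rows, columns):
    """Map each column to the sorted values found only in predict_rows."""
    missing = {}
    for col in columns:
        seen = {row[col] for row in train_rows}
        extra = {row[col] for row in predict_rows} - seen
        if extra:
            missing[col] = sorted(extra)
    return missing


# Hypothetical sample data: "Blue " differs from "blue" in case and
# trailing whitespace, so it counts as a different string.
train = list(csv.DictReader(io.StringIO("color,size\nred,S\nblue,M\n")))
predict = list(csv.DictReader(io.StringIO("color,size\nred,S\nBlue ,M\n")))

print(unseen_categories(train, predict, ["color", "size"]))
# prints {'color': ['Blue ']}
```

The comparison is exact string equality, so differences in case, surrounding whitespace, or encoding show up as unseen values even when the two files look identical at a glance.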