CreateML: MLDataTable & DataFrame differences

It seems that a DataFrame (TabularData framework) can be used in CreateML, instead of an MLDataTable - which makes sense, given the description of the TabularData API. However, there are differences.

One is that when using a DataFrame, the randomSplit method creates a tuple of DataFrame slices, which cannot then be used in MLLinearRegressor without first converting back to DataFrame (i.e. initialising a new DataFrame with the required slice). Using an MLDataTable as the source data, the output from randomSplit can be used directly in MLLinearRegressor.

I'm interested to hear of any other differences and whether the behaviour described above is a feature or a bug.

TabularData seems to have more features for data manipulation, although I haven't done any systematic comparison. I'm a bit puzzled as to why there are 2 similar, but separate, frameworks.

Replies

randomSplit is more efficient in TabularData because it doesn't allocate memory for a copy of the data; it returns slices that effectively point to the existing DataFrame. Until MLLinearRegressor supports taking a slice directly, converting each slice to a DataFrame is the right thing to do.
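A minimal sketch of that conversion, in case it helps. The toy data (y = 2x + 1), the column names "x" and "y", and the seed are made up for illustration, and I'm assuming the first slice holds roughly the requested proportion of rows. It requires macOS 12 or later, where Create ML accepts a DataFrame.

```swift
import CreateML
import TabularData

// Hypothetical toy data for illustration only: y = 2x + 1.
let xs = (0..<100).map { Double($0) }
let ys = xs.map { 2.0 * $0 + 1.0 }

var dataFrame = DataFrame()
dataFrame.append(column: Column(name: "x", contents: xs))
dataFrame.append(column: Column(name: "y", contents: ys))

// randomSplit returns a tuple of DataFrame.Slice values, not DataFrames.
let (trainingSlice, testSlice) = dataFrame.randomSplit(by: 0.8, seed: 42)

// Initialise a standalone DataFrame from each slice before passing it to Create ML.
let trainingData = DataFrame(trainingSlice)
let testData = DataFrame(testSlice)

let regressor = try MLLinearRegressor(trainingData: trainingData, targetColumn: "y")
let metrics = regressor.evaluation(on: testData)
print(metrics.rootMeanSquaredError)
```

The copy only happens at the point where the slice is turned into a DataFrame, so any filtering or further splitting can still be done on the cheap slices first.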

As for other differences, TabularData is more in line with Swift conventions: it adopts protocols like Collection and behaviours like copy-on-write. Going forward, you should use TabularData if that is an option for you.
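To make the Collection and value-semantics point concrete, here is a small sketch. The column names and values are invented for the example; it only shows that a typed column can be used with the usual Collection algorithms and that changing a copy leaves the original DataFrame alone.

```swift
import TabularData

var original = DataFrame()
original.append(column: Column(name: "x", contents: [1.0, 2.0, 3.0]))

// A typed column is a collection of optionals (nil marks a missing value),
// so standard algorithms such as compactMap and reduce apply.
let x = original["x", Double.self]
let total = x.compactMap { $0 }.reduce(0, +)

// Value semantics: appending a column to the copy does not affect the original.
var copy = original
copy.append(column: Column(name: "y", contents: [10.0, 20.0, 30.0]))

print(total, original.columns.count, copy.columns.count)
```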

  • ParsingOptions in MLDataTable has a skipRows option, whereas CSVReadingOptions for DataFrame has no skipRows equivalent.
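One possible workaround, sketched under assumptions: read the whole CSV into a DataFrame and then drop the leading rows with suffix. The file name, its contents and rowsToSkip are hypothetical, and this drops data rows after the header has been parsed, which may not match exactly what MLDataTable's skipRows does with the header line.

```swift
import CreateML
import Foundation
import TabularData

// Hypothetical sample file written to a temporary location for illustration.
let csv = "x,y\n1,2\n3,4\n5,6\n7,8\n"
let url = FileManager.default.temporaryDirectory
    .appendingPathComponent("skip-rows-example.csv")
try csv.write(to: url, atomically: true, encoding: .utf8)

// MLDataTable: skipRows is set on the parsing options.
var tableOptions = MLDataTable.ParsingOptions()
tableOptions.skipRows = 2
let table = try MLDataTable(contentsOf: url, options: tableOptions)
print(table)

// DataFrame: CSVReadingOptions has no skipRows, so read everything
// and keep only the rows after the first rowsToSkip.
let rowsToSkip = 2
let full = try DataFrame(contentsOfCSVFile: url)
let trimmed = DataFrame(full.suffix(full.rows.count - rowsToSkip))
print(trimmed)
```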
