MLRegressor error clarification

I've been playing around with MLDataTable and MLRegressor for the first time. I set up the data so the model would predict a Boolean value. Actually, it was a data table of NHL games, with the first two columns being "home team" and "away team", and the final "result" column recording the Boolean value of "home team won". The model was supposed to predict this value.


Everything worked quite nicely, and the maximum error came out as 0.56. It's been a while since I studied statistics, but I suspect this means that the model's predictions would be correct no more often than a chance guess (which was the expected result).


I'd be grateful if someone more statistically endowed than I am would verify this!

Replies

I have not used MLRegressor, but I have some background in stats.


The definition of maximumError in MLRegressor is:

The largest absolute difference between the expected values and the model's predicted values during testing or training.
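
To make the two metrics discussed in this thread concrete, here is a minimal sketch of how they could be computed by hand from a list of expected values and a list of predictions. The function names and the sample numbers are made up for illustration; this is not Create ML's own implementation, only the arithmetic that the definition describes.

```swift
import Foundation

/// Largest absolute difference between expected and predicted values.
/// (Hypothetical helper, not part of Create ML.)
func maximumError(expected: [Double], predicted: [Double]) -> Double {
    precondition(expected.count == predicted.count, "inputs must have equal length")
    var largest = 0.0
    for (e, p) in zip(expected, predicted) {
        largest = max(largest, abs(e - p))
    }
    return largest
}

/// Square root of the mean of the squared differences.
/// (Hypothetical helper, not part of Create ML.)
func rootMeanSquaredError(expected: [Double], predicted: [Double]) -> Double {
    precondition(expected.count == predicted.count && !expected.isEmpty,
                 "inputs must be non-empty and of equal length")
    var sumOfSquares = 0.0
    for (e, p) in zip(expected, predicted) {
        sumOfSquares += (e - p) * (e - p)
    }
    return (sumOfSquares / Double(expected.count)).squareRoot()
}

// Made-up data: 1 = home team won, 0 = home team lost.
let expected: [Double] = [1, 0, 1, 0]
let predicted: [Double] = [0.6, 0.4, 0.5, 0.5]

print(maximumError(expected: expected, predicted: predicted))
print(rootMeanSquaredError(expected: expected, predicted: predicted))
```

The maximum error only reports the single worst row, while the root mean squared error summarises all rows, which is why the two numbers can tell different stories.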


Since the results can only be 0 or 1, this means that for an expected 1 you always get a prediction of at least 0.44, and for an expected 0 a prediction of at most 0.56.


So an estimator that compares the prediction to 0.5, deciding 0 if it is below 0.5 and 1 if it is above, would already give a pretty robust prediction.
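
The thresholding idea can be sketched as follows. The helper names and the sample values are hypothetical, and the predictions are assumed to be continuous outputs such as a regressor would produce.

```swift
import Foundation

/// Turns a continuous regression output into a 0/1 decision.
/// (Hypothetical helper; 0.5 is the threshold suggested in this thread.)
func decide(_ prediction: Double, threshold: Double = 0.5) -> Int {
    return prediction > threshold ? 1 : 0
}

/// Fraction of rows where the thresholded prediction matches the outcome.
func accuracy(expected: [Int], predictions: [Double], threshold: Double = 0.5) -> Double {
    precondition(expected.count == predictions.count && !expected.isEmpty,
                 "inputs must be non-empty and of equal length")
    var correct = 0
    for (e, p) in zip(expected, predictions) where decide(p, threshold: threshold) == e {
        correct += 1
    }
    return Double(correct) / Double(expected.count)
}

// Made-up outcomes and regression outputs; every absolute error is below 0.56.
let outcomes = [1, 0, 1, 0, 1]
let outputs = [0.58, 0.47, 0.45, 0.52, 0.61]

print(accuracy(expected: outcomes, predictions: outputs))
```

In this made-up sample, two of the five predictions land on the wrong side of 0.5 even though no single error exceeds 0.56, so it is worth measuring the accuracy of the thresholded decisions directly rather than inferring it from the maximum error alone.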


It also means that the winner is really influenced by whether a team plays at home or away (the winner is probably playing at home more often).


So it is not a coin toss: with a toss, the maximum error would be close to 1 (in some cases predicting 0 where it should be 1).


Have you measured the rootMeanSquaredError? That could help in evaluating the error probability of the above estimator.


maximumError was 0.558 and rootMeanSquaredError was 0.498.


So it seems I completely misinterpreted these results. I must dig out my old statistics books!

Could you post the code of your example, so I can play with it a little?


In fact, it is a bit difficult to understand what this model training really means.

What is the output of the model: 0 or 1, or a value between 0 and 1? (It should be the latter; otherwise the maximum error would necessarily be 0 or 1.)

But the point here is that 0.56 is the maximum error, not the average error. If the average error were 0.56, it would probably mean the prediction is a random choice. The maximum error is a different matter.


A rootMeanSquaredError of 0.498 shows that the error is largely distributed between 0 and 0.56; I would guess an average error (if the model can provide it) of about 0.3.
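
One way to put an RMSE close to 0.5 in context is to compare it with a baseline that ignores the teams entirely and always outputs the same number. For 0/1 outcomes, a constant guess of 0.5 is off by exactly 0.5 on every row, so its maximum error, mean absolute error, and root mean squared error are all 0.5. The sketch below checks this on made-up outcomes; the function name and the data are hypothetical.

```swift
import Foundation

/// Error measures for a predictor that outputs the same `guess` for every row.
/// (Hypothetical helper for illustration only.)
func constantBaseline(outcomes: [Double], guess: Double)
    -> (maximumError: Double, meanAbsoluteError: Double, rootMeanSquaredError: Double) {
    precondition(!outcomes.isEmpty, "outcomes must not be empty")
    var largest = 0.0
    var sumOfAbsolutes = 0.0
    var sumOfSquares = 0.0
    for outcome in outcomes {
        let error = abs(outcome - guess)
        largest = max(largest, error)
        sumOfAbsolutes += error
        sumOfSquares += error * error
    }
    let count = Double(outcomes.count)
    return (largest, sumOfAbsolutes / count, (sumOfSquares / count).squareRoot())
}

// Made-up results: 1 = home team won, 0 = home team lost.
let sampleOutcomes: [Double] = [1, 0, 1, 1, 0, 1, 0, 1]
let baseline = constantBaseline(outcomes: sampleOutcomes, guess: 0.5)

print(baseline.maximumError)
print(baseline.meanAbsoluteError)
print(baseline.rootMeanSquaredError)
```

A trained model whose RMSE is only slightly below this baseline has learned little beyond a constant guess, whatever its maximum error happens to be.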

I wonder whether a regressor is best suited to this case. Do you use a linear regressor (that's my guess) or a decision tree?


As the actual values are discrete, a linear regressor is probably not well suited.

Here's the code from the playground I'm using:


import CreateML
import Foundation

// Load the two CSV files from the playground's resources.
let home_wins_A = Bundle.main.url(forResource: "home_wins_A", withExtension: "csv")
let dataTable_A = try MLDataTable(contentsOf: home_wins_A!)
let home_wins_B = Bundle.main.url(forResource: "home_wins_B", withExtension: "csv")
let dataTable_B = try MLDataTable(contentsOf: home_wins_B!)

// Regression: hold out 20% of the rows for evaluation, train on the rest.
let (evaluationTable_A, trainingTable_A) = dataTable_A.randomSplit(by: 0.2, seed: 5)
let regressor = try MLRegressor(trainingData: trainingTable_A, targetColumn: "result")
let regressorEvaluation = regressor.evaluation(on: evaluationTable_A)
regressorEvaluation.maximumError
regressorEvaluation.rootMeanSquaredError

// Classification: same split, with "W"/"L" as the target labels.
let (evaluationTable_B, trainingTable_B) = dataTable_B.randomSplit(by: 0.2, seed: 5)
let classifier = try MLClassifier(trainingData: trainingTable_B, targetColumn: "result")
let classifierEvaluation = classifier.evaluation(on: evaluationTable_B)
classifierEvaluation.classificationError


The file home_wins_A has three integer fields: home, away, and result. The first two are the id numbers of the teams. The last field is 1 if the home team won, and 0 otherwise. The file home_wins_B has three string fields with the same names. The first two are three-letter abbreviations of the team names (e.g. "MTL" for Montreal). The last field is "W" if the home team won, and "L" otherwise.


Both files were generated from the same data set, which lists the 11,434 games played in the NHL since 2010. Unfortunately, I don't see any way to attach the CSV files to this post.


Here are the results I'm now getting:

regressorEvaluation.maximumError = 0.852

regressorEvaluation.rootMeanSquaredError = 0.496

classifierEvaluation.classificationError = 0.45


Many thanks for your explanations: they've been very helpful!