The sole aim of predictive analytics is to make predictions: the better the predictions, the better the model. When predicting whether a customer will buy a product, there are two possible outcomes: buy or not buy. Accuracy is then measured as the number of correctly predicted cases as a share of the total number of cases. For example, if we have 10,000 customers and the model correctly predicts the outcome for 8,700 of them, the model's accuracy is 87%. So why do so many machine learning experts achieve high accuracy while building their models, yet see those models do poorly in production? Let me explain.
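To make the accuracy calculation concrete, here is a minimal Python sketch. The function name `accuracy` and the label lists are made up for illustration; the lists are built only to reproduce the 8,700 out of 10,000 example and are not real customer data.

```python
def accuracy(y_true, y_pred):
    """Share of cases where the predicted label equals the actual label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = buy, 0 = not buy.
# Built so that exactly 8,700 of 10,000 predictions match the actual outcome.
actual = [1] * 10_000
predicted = [1] * 8_700 + [0] * 1_300

model_accuracy = accuracy(actual, predicted)
print(f"Accuracy: {model_accuracy:.0%}")
```

The number is only as trustworthy as the data it is computed on, which is the subject of the rest of this article.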
Common mistakes even experienced professionals make
The title of this article might be misleading, since high accuracy is of course what you want most of the time. The key question is HOW the machine learning professional arrives at that high accuracy. The most important thing when assessing the accuracy of a model is to have extra data that was not used during model building. If you have 10,000 customers in your original dataset, you should split it randomly into a training set and a test set. The training set should be around 80% of the original data and the test set the remaining 20%. A modeller should never, under any circumstances, use the test set during model building, because we want to evaluate the model on unseen data, and the test set is our stand-in for that unseen data. It’s kind of like the future: something we never have access to in real life.
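A random 80/20 split like the one described above could be sketched as follows. This is an illustrative helper using only the Python standard library, not a description of any particular tool; the function name `train_test_split`, the fixed seed, and the integer customer IDs are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_share=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    if not 0 < test_share < 1:
        raise ValueError("test_share must be between 0 and 1")
    shuffled = list(rows)                   # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 10,000 customer IDs standing in for full customer records.
customers = list(range(10_000))
train, test = train_test_split(customers)

print(len(train), len(test))
```

Once the split is made, the test set is put aside and only touched again for the final evaluation.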
One of the most common mistakes even experienced professionals make is to use information from the test set when cleaning the data or during model training. That’s kind of like peeking into the future, which is impossible since, obviously, we never have information from the future. This may sound trivial to some, but it is more common than you might think, and it can lead to significantly inflated accuracy. You might believe you have a model with, for example, 92% accuracy when the true accuracy could be 75% or lower. Keep this in mind when selecting an analytics agency for your business, or when a machine learning expert claims to have achieved outstanding accuracy on a complex problem.
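One way this leak creeps in during data cleaning is through scaling: if the mean and standard deviation used to standardise a feature are computed on the full dataset, the test set has already influenced the training data. The sketch below contrasts the two approaches. The helper names `fit_scaler` and `transform` and the spend figures are invented for illustration.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot scale a constant feature")
    return mean, std


def transform(values, mean, std):
    """Standardise values using previously learned statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical feature: customer spend, already split into train and test.
train_spend = [10.0, 20.0, 30.0, 40.0]
test_spend = [1000.0]

# Leaky: the statistics are computed with the test set included,
# so information from "the future" shapes the training data.
leaky_mean, leaky_std = fit_scaler(train_spend + test_spend)

# Correct: learn the statistics from the training set only,
# then apply those same statistics to the test set.
mean, std = fit_scaler(train_spend)
train_scaled = transform(train_spend, mean, std)
test_scaled = transform(test_spend, mean, std)

print(leaky_mean, mean)
```

The same rule applies to any step that learns something from the data, such as imputing missing values or selecting features: fit on the training set, then apply to the test set.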
We here at Sumo Analytics are aware of these common pitfalls in machine learning. When helping our clients, we strive for the highest accuracy possible while following all the rules for handling data and training models. If we claim our model achieves 92% accuracy, then that’s what you’ll get in production.