Beware the lure of easy data science
Machine learning pitfalls: there are many, and they can be alluringly easy to fall into
Today, being a data scientist is exciting! We are witnessing a large uptake of data science and machine learning solutions across various industries, often resulting in interesting datasets and applications. Further, the landscape of publicly available methods and tooling is extremely varied, sophisticated, and growing by the day. Interesting data and the availability of large-scale, cloud-based, enterprise-level computing power place an impressive toolkit at a data scientist’s fingertips.
One classic and perhaps overused example relates to building machine learning models that classify images, such as the famous cat versus dog identification problem. There are plenty of tutorials out there that teach us how to create production-ready classification models for this problem with state-of-the-art performance. With broadly available frameworks for deep learning and approaches such as transfer learning, it is possible to build – literally within a few hours – a model that tells us whether it is a cat or a dog we are looking at, with impressive accuracy. One may question the usefulness of such a model (who does not need some help in telling a cat apart from a dog, from time to time?). However, the very same techniques can be employed to create models that detect fatigue cracks or dents in coatings, read licence plates, or identify diseases in X-ray images or CT scans – all far more enticing and useful applications.
The landscape I have just depicted is truly amazing, and there is value in the current trend of placing all these wonderful tools and knowledge in the public domain and testing them in different applications. Whilst acknowledging this value, however, I also see a danger in overly accessible data science tooling.
When is machine learning the right tool?
For one thing, we know that if we give ourselves a hammer, everything starts looking like a nail. As exciting as machine learning can get, it is not always the right tool. Over the course of my career, I have developed a rule of thumb for judging whether a given problem is a good candidate for machine learning.
Think about the problem you are trying to solve. Now think about what a solution to this problem would need to do. Better yet, try to write down, on a piece of paper, the sequence of steps this solution should perform. If the sequence you end up with is clear and made up of actions you can articulate precisely, then you probably do not need machine learning. If, on the contrary, there is no single logical sequence that breaks the problem down, and the individual steps are hard to explain, machine learning may be your friend. To give an example, think about what we need to do to distinguish dogs from cats. As easy as this task is – something we have done innately since early childhood – writing down the sequence of steps that makes us succeed at it is excruciatingly difficult, if not impossible. Should we begin with the nose? No wait, maybe with the eyes? Even so, how do we write down what makes a dog’s eyes different from a cat’s? Or maybe it is both nose and eyes taken together, with their size and relative position? And how does this change under different camera angles, exposures, light conditions, backgrounds, and so on? Right there, we have a good candidate for machine learning.
Other applications correctly gauge the need for machine learning but fall short of appreciating the hunger for data that some of the fancier approaches have. Whilst deep learning is all the rage these days and a plethora of such models is up for grabs on GitHub, these models usually originate from fairly large, data-intensive efforts. As such, one has to be aware that copious amounts of effort and data may be needed to customize them to a specific application. Failing to invest that effort may leave you with a useless model.
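One way to make this data hunger concrete is to plot a learning curve: model performance as a function of training-set size. The sketch below does this with scikit-learn on synthetic data; the dataset, the model, and the chosen sizes are all illustrative assumptions, not a recipe.

```python
# Sketch: probing data hunger with a learning curve.
# Synthetic data; the model and sizes are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% .. 100% of training data
    cv=5,
)

# Mean cross-validated score at each training-set size.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} training samples -> validation accuracy {score:.2f}")
```

If the validation score is still climbing at the largest training size, more data would likely help; if it has plateaued, collecting more of the same data probably will not.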
What are the most common machine learning pitfalls?
Indeed, the list of machine learning pitfalls is long, and includes flawed sourcing of training and test datasets, poor selection of model performance metrics, and model overfitting, to name the most common.
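To make one of these pitfalls tangible, the sketch below shows overfitting in miniature: an unconstrained decision tree memorises a noisy training set perfectly while scoring noticeably lower on held-out data. The synthetic dataset and the choice of model are illustrative assumptions; the point is the gap between train and test scores, not the specific numbers.

```python
# Sketch: overfitting made visible by comparing train and test scores.
# Synthetic, deliberately noisy data; model choice is illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy data: only 5 of 20 features are informative, and 20% of labels are flipped.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorises the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train accuracy: {tree.score(X_train, y_train):.2f}")  # perfect fit
print(f"test accuracy:  {tree.score(X_test, y_test):.2f}")    # noticeably lower
```

Evaluating only on training data would make this model look flawless; the held-out score reveals how much of that performance is memorised noise.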
The problem in data science is that, easy as it has become to build machine learning models, building useful models is still difficult. More difficult still is determining whether a given machine learning model delivers the desired value.
We have recently launched a Recommended Practice (RP; if you wonder what a Recommended Practice is, get in touch with us) that provides a framework to help determine the value of a machine learning model. It is inspired by the well-known Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. The RP takes readers through a so-called machine learning assurance case, i.e., a set of claims covering the quality of the model, of its development and deployment, and of the organization responsible for the model. Each claim represents a requirement and is associated with the pieces of evidence needed to confirm that the claim is satisfied.
Evidence of machine learning quality
By and large, the evidence amounts to documentation that is routinely generated during a professional data science project. To exemplify, one claim in our Recommended Practice concerns whether sources for the needed data are available and whether plans for accessing and managing the data exist. Evidence supporting this claim could include a list of data sources and plans covering the collection, annotation, and storage of data, both during model development and post-deployment. Each claim that lacks proper validation is placed in a risk register and scored on a risk scale, to evaluate the potential impact of the missing requirement on the overall quality of the machine learning model.
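As a toy illustration of the claim–evidence–risk mechanics just described, here is a minimal sketch in Python. All class names, the risk scale, and the document names are hypothetical assumptions for illustration; the actual Recommended Practice defines its own structure.

```python
# Toy sketch of an assurance case: claims, evidence, and a risk register.
# All names and the risk scale are hypothetical, not the RP's schema.
from dataclasses import dataclass, field

@dataclass
class Claim:
    description: str
    evidence: list = field(default_factory=list)  # e.g. links to documents

    @property
    def validated(self) -> bool:
        # A claim counts as validated once at least one piece of evidence exists.
        return len(self.evidence) > 0

@dataclass
class RiskRegister:
    entries: dict = field(default_factory=dict)  # claim description -> risk level

    def assess(self, claim: Claim, risk_level: str) -> None:
        # Only claims lacking validation are placed on the register.
        if not claim.validated:
            self.entries[claim.description] = risk_level

register = RiskRegister()
data_claim = Claim("Sources for the needed data are available",
                   evidence=["list-of-data-sources.pdf"])
plan_claim = Claim("Plans for accessing and managing the data exist")

register.assess(data_claim, "high")  # validated -> stays off the register
register.assess(plan_claim, "high")  # no evidence -> registered at 'high' risk
print(register.entries)
# {'Plans for accessing and managing the data exist': 'high'}
```

The design choice to key the register on unvalidated claims mirrors the text above: evidence closes a claim, and only the gaps carry risk forward.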
Granted, determining the quality of a machine learning model is a difficult endeavour, one that many are grappling with. This Recommended Practice is our first attempt at providing a standardized approach in this space, something we feel is needed as machine learning sees ever wider use. We trust and hope our work will help many to ensure that the models they are building and/or using fulfil performance expectations and are fit for their intended purpose. If you want to know more, feel free to reach out to me!