Treat data as an asset and benefit from data smart solutions (part 2)
This is the second part of a mini-series of blog posts. The first part discussed the terms “data as an asset” and “data smart”, while this part provides an example where we played around with machine learning techniques to potentially automate parts of our software support process.
Please note that this example is an experiment. Our team, consisting of my colleagues Wu Bin Wei and Kaijia Han, carried it out within a matter of days, and we would need to spend more time to create a production-ready system. None of us had any prior experience with machine learning. We are a group of people with backgrounds in traditional software engineering, structural analysis, propulsion system design and hydrodynamics. I think our example clearly demonstrates some of the “magic” of machine learning, the low threshold for learning it, and how mature the developer tooling has become. The target audience for this blog post is primarily engineers or scientists who have little or no knowledge about machine learning and the automation of knowledge work. Thus, it is deliberately written in a simple way. We want to embrace “machine learning for the people” and democratise machine learning with this simple example.
A good starting point is to follow the Cross-Industry Standard Process for Data Mining, also called the CRISP-DM cycle. Although our experiment is relatively simple, we will try to apply the various elements of this cycle all the way from business understanding to deployment.
Since DNV GL – Digital Solutions offers a lot of software products, solutions and services, we have a 24/7 global support organization. During an average year we receive in the range of 15,000–20,000 support requests for our products. These requests are handled in what we call “0th”, 1st, 2nd and 3rd line support. The most complicated support cases are handled in 3rd line, as they sometimes involve debugging the software with workspaces or input data coming directly from the end user. However, quite a few cases start in “0th” line because they lack a lot of information. For instance, a request could read “Hey, we experience a problem when using one of your products <product name>” with no further detail, no input files, no screenshots or similar. In such cases a person has to read the text and try to understand which support queue it should go into. So, this was the goal of our simple experiment: could we automate the whole “0th” line support process with good accuracy?
We are using one of the major cloud-based CRM systems on the market, so it is relatively easy to extract some relevant datasets. Our experiment team rarely uses this system, as we normally “talk” with our support teams through other tools like email, Slack, Skype and so on. Thus, we are not that familiar with the database schemas: how many attributes the data has, what is in frequent use, what is not used, what is used in the wrong way, what is tweaked and so on. Accordingly, we held a small workshop with the CRM team to extract the right type and amount of data. We also got a good understanding of the usage and the work process related to support. This was sufficient to understand the relation between the domain (in this case a software support process) and the corresponding features and labels in a data science context. To put it simply, a feature can be one of the columns in your input dataset. Examples could be “contact name”, “product group brand”, “subject”, “description”, “actual case reason” and many more. I guess we have 30–50 features in the dataset. A label represents the final choice. If you, for instance, apply machine learning to recognize types of animals in a dataset of animal images, the label would be “dog”, “cat” or something like that. In our experiment, the labels would typically be “product group brand” or “product name”.
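To make the feature/label distinction concrete, here is a minimal sketch (with made-up column values and product names, not our actual CRM schema) of how such an extract might look as a Pandas DataFrame:

```python
import pandas as pd

# Hypothetical miniature of a CRM extract: every column is a potential
# feature, and "product_group_brand" is the label we want to predict.
# (All names and values below are made up for illustration.)
cases = pd.DataFrame({
    "contact_name": ["A. Smith", "B. Jones", "C. Lee"],
    "subject":      ["Crash on startup", "License question", "Mesh error"],
    "description":  ["App crashes when opening a workspace",
                     "How do I renew my license key?",
                     "Meshing fails on imported geometry"],
    "product_group_brand": ["ProductA", "Licensing", "ProductA"],
})

features = cases.drop(columns=["product_group_brand"])  # the model's inputs
labels = cases["product_group_brand"]                   # the target to predict
print(features.columns.tolist())
print(labels.tolist())
```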
The data preparation phase is a bit more technical and requires relevant tooling. It covers all activities needed to construct the final dataset, i.e. the data that will be fed into the modeling tools, from the initial raw data. The data preparation tasks will normally need to be performed multiple times, and not in any prescribed order. Typical tasks include table, record and attribute selection, as well as transformation and cleaning of data for the modeling tools. In this phase we could draw on experience and know-how from other parts of the DNV GL organization. Luckily, we are around 12,000 people distributed across more than 100 locations, and quite a few of us are involved in some sort of text- or document-mining project. For instance, our Maritime business area has done some exciting work in their DATE project (Direct Access To Technical Expert). After some back and forth we ended up using tooling like Jupyter Notebook, the Natural Language Toolkit (NLTK), Pandas and Matplotlib.
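A minimal sketch of the kind of text cleaning this involves, using only Pandas and the standard library so it stays self-contained (in practice you would use NLTK here; the tiny stop-word set below is just a stand-in for `nltk.corpus.stopwords`, and the example texts are made up):

```python
import re
import pandas as pd

# Hypothetical raw support-case texts as they might arrive from the CRM.
raw = pd.DataFrame({"description": [
    "Hey, we experience a PROBLEM when using <product name>!!",
    "   License server unreachable -- see attached log   ",
]})

STOPWORDS = {"we", "a", "when", "the", "see"}  # stand-in for an NLTK stop-word list

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # strip markup-like fragments
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

raw["clean"] = raw["description"].map(clean)
print(raw["clean"].tolist())
```

Each cleaning step is cheap to re-run, which matters because, as noted above, data preparation is rarely done once and in order.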
We selected and tried out various modelling techniques and calibrated them to optimal values. Typically, there are several techniques for the same data mining problem type, and some techniques have specific requirements on the form of the data. Therefore, stepping back to the data preparation phase is often needed. We had to do some “feature engineering”, i.e. understanding the correlation between features and labels. Then we tried to improve the accuracy (predicting the right labels) by applying bigger datasets and tree structures. To avoid performance issues when training the machine learning models, we had to chunk the data into small datasets per product line (for our software products) and train them individually, as the initial approach ran into long CPU times and memory issues. The tooling applied in this phase was Jupyter Notebook, scikit-learn and Pandas. Even though the input quality varied, from vague to crystal-clear support cases, we ended up with 95% accuracy in predicting the right support queue (the feature is named “product line” in our system).
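A hedged sketch of a text classifier along these lines in scikit-learn, with made-up cases and queue names (our actual feature engineering, model choice and per-product-line chunking were more involved than this):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned case descriptions with their queue labels.
texts = [
    "application crashes opening workspace",
    "meshing fails imported geometry",
    "renew license key",
    "license server unreachable",
    "solver diverges large model",
    "activation code expired",
]
queues = ["ProductA", "ProductA", "Licensing",
          "Licensing", "ProductA", "Licensing"]

# TF-IDF turns free text into numeric features; the classifier maps
# those features to a support queue.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, queues)

print(model.predict(["my license key expired"]))
```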
At this stage in our experiment we had built a model that appeared to have high quality from a data analysis perspective. Before proceeding to the final deployment of the model, it is important to evaluate it more thoroughly and review the steps executed to construct it, to be certain it properly achieves the business objectives, which in this case was to predict the right “queue” for any type of support case. We also reached out to our more experienced data science peers and got advice on more innovative/creative ways of testing the new machine learning algorithms.
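One standard way of evaluating more thoroughly than a single train/test split is cross-validation, sketched here on made-up data (this is a textbook technique, not necessarily what our peers suggested):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled support cases (made up for illustration).
texts = [
    "application crashes on startup",
    "solver diverges on large model",
    "mesh generation fails",
    "export to report hangs",
    "renew my license key",
    "license server unreachable",
    "activation code expired",
    "move license to new machine",
]
queues = ["ProductA"] * 4 + ["Licensing"] * 4

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
# Stratified 4-fold cross-validation: each fold is held out once for
# scoring while the model trains on the rest.
scores = cross_val_score(pipe, texts, queues, cv=4)
print(scores)
```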
If our experiment had been directly related to the more industrial type of machine learning use cases we face on our Veracity data platform, we would have deployed and scaled the new algorithms in a more careful manner. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring or data mining process. However, we just took the simple approach and deployed directly to a Microsoft Azure app service, with a simple front end where you could input your support case text and get the output/prediction in another field. We used tooling like Python Pickle (to serialize/deserialize the trained model) and Flask (roughly “ASP.NET for Python”).
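A minimal sketch of the Pickle + Flask combination, with a hypothetical `/predict` endpoint and toy training data (our actual front end, model and hosting setup differed):

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Train a toy model and serialize it with pickle; in a real deployment
# the pickled model would be written to disk during training.
texts = ["license key expired", "renew license",
         "solver crash", "mesh error crash"]
queues = ["Licensing", "Licensing", "ProductA", "ProductA"]
blob = pickle.dumps(make_pipeline(TfidfVectorizer(),
                                  LogisticRegression()).fit(texts, queues))

app = Flask(__name__)
model = pickle.loads(blob)  # in production: loaded from disk at startup

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify(queue=str(model.predict([text])[0]))

# Exercise the endpoint without running a server.
client = app.test_client()
resp = client.post("/predict", json={"text": "cannot renew my license"})
print(resp.get_json())
```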
Our main conclusion from running this experiment is that “machine learning is for the people”. With a basic understanding of programming (Python or similar), you can actually achieve a lot, because the tools and infrastructure are more or less free. The challenges are primarily related to getting hold of good datasets and understanding the domain or problem to be solved. Some use cases are absolutely relevant for machine learning, while others are not, so you need to figure out a way of finding the right ones. Of course it will become more difficult with more challenging use cases, but the technology is evolving at an impressive speed.
If we were to continue this project (it was just a 10% evening-hobby project), we would probably focus on the following:
- Investigate how to train the model continuously and “feed” it with live data
- Experiment with convolutional neural networks or other approaches (just to learn more about ML)
- More cooperation with domain/support people to improve the algorithms
- Improve the “data awareness culture” to shorten the “distance” from registration to good predictions, i.e. treating data as an asset and focusing more on data quality at an earlier stage
- Provide a far better user experience by implementing bots or similar