Machine learning is a hot topic in modern development. We can find the fruits of data science labor in our digital personal assistants, streaming recommendations, fraud detection, and cancer research. The applications of machine learning are boundless and can bring new solutions to long-standing problems. For .NET developers, it's never been easier to take advantage of machine learning, thanks to the arrival of ML.NET.
In this post, we’ll break down one of the ML.NET samples into its two critical components: Training and Prediction. We’ll be building a console application that allows us to train and store a model. The prediction component will allow us to recall a trained model and predict the classification of new data. Let’s get started!
Machine Learning Basics
For those with limited or no experience with machine learning, we can think of it as a mechanism to categorize any dataset. When we look at data, each data point will have features and labels. A feature is an attribute of the current data instance. For example, we may describe a dog as an animal with four legs. The animal’s four legs are a feature. A label is a known fact of the data instance that categorizes it. For example, in our previous case, we label the data instance as a dog.
Using features and labels, we can build a dataset of known instances. We use this dataset to create a Model. We train the model with a certain percentage of our dataset while withholding a certain portion to test and verify our new model’s accuracy. Once we have a model, we can use it to predict the categorization of any further input with a certain probability. The more training data we have, the more accurate our predictions will be.
There are various types of models we can train:
- Binary Classification: Is this thing true or false?
- Multi-class Classification: What type is this thing?
- Recommendation: Given a set of data, what other things might fit with it?
- Regression: Predicting a future value based on features.
- Time Series Forecasting: Similar to regression, but with time as the main component.
- Anomaly Detection: Is a data point an outlier given our current understanding?
- Clustering: Categorization based on features.
- Ranking: Ordering by "importance."
- Computer Vision: Detection and classification of objects.
In this post, we’ll be building a binary classification engine. It will determine whether a particular sentiment falls into the category of Toxic or Non-toxic. Let’s start with training our model.
We’ll be using an adapted version of the sample found at the official ML.NET repository. To download this solution, you can go to my GitHub repository.
Training Our Sentiment Analysis Model
To train a model, we need a dataset. In this example, we’ll be using a dataset of sentiments pulled from Wikipedia moderators. Be warned: some of the data can be a little nasty. The two most essential columns in our dataset are label and comment. Before we load this dataset, we need to create a data object.
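A minimal version of that data object might look like the following sketch. The column indexes are an assumption here, placing the label in the first column and the comment text in the third, as in the official sample's dataset:

```csharp
using Microsoft.ML.Data;

public class SentimentData
{
    // The known classification: true for toxic, false for non-toxic.
    [LoadColumn(0)]
    public bool Label { get; set; }

    // The raw comment text we want to classify.
    [LoadColumn(2)]
    public string Text { get; set; }
}
```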
This object allows ML.NET to parse the tab-separated value file. We also need our output class, named SentimentPrediction. We’ll be using our prediction class later in this post.
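A sketch of that output class, using the column names ML.NET's binary classification trainers emit by default:

```csharp
using Microsoft.ML.Data;

public class SentimentPrediction
{
    // Maps to the "PredictedLabel" column produced by the trainer.
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    // Calibrated probability that the prediction is true (toxic).
    public float Probability { get; set; }

    // Raw score from the classifier.
    public float Score { get; set; }
}
```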
We can load our training data with the following code.
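Assuming a tab-separated file with a header row, the loading code might look like this sketch (the file path is illustrative):

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 1);

// LoadFromTextFile uses tab as its default separator.
IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(
    "Data/wikiDetoxAnnotated40kRows.tsv",
    hasHeader: true);
```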
After loading our data, we need to split our data into two parts: The training and the testing sets.
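A common choice is to hold back 20% of the rows for testing; a sketch of the split:

```csharp
// Hold back 20% of the data to evaluate the model later.
DataOperationsCatalog.TrainTestData split =
    mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);

IDataView trainingData = split.TrainSet;
IDataView testData = split.TestSet;
```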
Before we can begin training our model, we need to transform our data. ML.NET has built-in transformers. We need to turn our text input into a feature.
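Using ML.NET's built-in text featurizer, turning the comment text into a numeric Features column might look like this:

```csharp
// Convert the raw text column into a numeric feature vector
// that a trainer can consume.
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: nameof(SentimentData.Text));
```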
The next two lines of code allow us to use the label in the dataset to create a classification trainer.
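A sketch of those two lines, using SDCA logistic regression as the trainer (one of several binary classification trainers that would work here):

```csharp
// Create a binary classification trainer that reads the Label
// column and the featurized text.
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
    labelColumnName: "Label",
    featureColumnName: "Features");

// Chain the featurization step and the trainer into one pipeline.
var trainingPipeline = dataProcessPipeline.Append(trainer);
```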
Finally, we can train our model.
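Training is a single Fit call over the training set:

```csharp
// Fit runs the whole pipeline over the training data and
// returns the trained model.
ITransformer trainedModel = trainingPipeline.Fit(trainingData);
```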
Our next optional step sees us verifying our newly trained model.
We can use a console helper in the sample project to write out the results of our test data against our model.
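Without the sample's console helper, a bare-bones evaluation sketch might look like:

```csharp
// Run the trained model over the held-back test set and
// compute binary classification metrics.
IDataView predictions = trainedModel.Transform(testData);

var metrics = mlContext.BinaryClassification.Evaluate(
    predictions, labelColumnName: "Label");

Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"AUC:      {metrics.AreaUnderRocCurve:P2}");
```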
We can see that our model shows a 94.69% accuracy rate against our test data. That’s pretty decent. We only need to train the model once, and we can now save it to disk.
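Saving needs both the model and the input schema; a sketch with an illustrative file name:

```csharp
// Persist the trained model, along with the input schema,
// to a zip file on disk.
mlContext.Model.Save(trainedModel, trainingData.Schema, "SentimentModel.zip");
```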
We will be using this trained model in the next section.
Using Our Trained Model
We may have noticed that training our model can take a bit of time. We don’t want to retrain our model every time we need a prediction. Luckily, ML.NET gives us the ability to store and load trained models.
Given the path to a saved model, we can load it into memory.
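A sketch of the loading step (the path is illustrative and matches the one used when saving):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Load the model and its input schema back from disk.
ITransformer loadedModel = mlContext.Model.Load(
    "SentimentModel.zip", out DataViewSchema modelSchema);
```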
Once loaded, we can use our prediction engine.
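Creating the engine and making a single prediction might look like this sketch (the sample text is illustrative):

```csharp
// A PredictionEngine wraps the model for convenient
// one-at-a-time predictions.
var engine = mlContext.Model
    .CreatePredictionEngine<SentimentData, SentimentPrediction>(loadedModel);

var sample = new SentimentData { Text = "This is a rude comment" };
SentimentPrediction result = engine.Predict(sample);

var label = result.Prediction ? "Toxic" : "Non-toxic";
Console.WriteLine($"{label} ({result.Probability:P0})");
```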
Running our engine shows our prediction model in action.
Note the probability next to each prediction. Our engine scores each outcome; we can think of the score as confidence. In the toxic prediction, we can see that our engine is 97% confident the text is toxic. In the non-toxic prediction, we can see that our model is 82% confident (100 - 18) that the result is not toxic. Remember that the confidence answers the question, “How confident are we that this sentiment is toxic?”
Oakton For A Better ML Experience
Oakton is a library for parsing command-line arguments and building command-line actions. It’s a perfect companion for ML.NET, as it allows us to separate model training from prediction, create cleanup commands, and expose a testing fixture.
To get started, we first need to create a console application project. In the Main method, we want to scan for and register all new Oakton commands.
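A sketch of that Main method, using Oakton's CommandExecutor to scan the current assembly:

```csharp
using System.Reflection;
using Oakton;

internal class Program
{
    private static int Main(string[] args)
    {
        var executor = CommandExecutor.For(_ =>
        {
            // Find and register every Oakton command in this assembly.
            _.RegisterCommands(typeof(Program).GetTypeInfo().Assembly);
        });

        return executor.Execute(args);
    }
}
```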
Next, we can create two commands: TrainingCommand and CheckCommand. Each one has its own input class as well. Let’s look at the TrainingCommand first.
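A sketch of what such a command might look like. The property names, defaults, and descriptions here are illustrative; by Oakton convention, properties ending in Flag become optional command-line flags:

```csharp
using Oakton;

public class TrainingInput
{
    [Description("Path to the training data set")]
    public string DataPathFlag { get; set; } = "Data/wikiDetoxAnnotated40kRows.tsv";

    [Description("Where to write the trained model")]
    public string ModelPathFlag { get; set; } = "SentimentModel.zip";
}

[Description("Train and save the sentiment analysis model")]
public class TrainingCommand : OaktonCommand<TrainingInput>
{
    public override bool Execute(TrainingInput input)
    {
        // Load, featurize, train, evaluate, and save, as in the
        // previous section. Returning true signals success to Oakton.
        return true;
    }
}
```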
We can use Oakton to parse our input arguments. We can run the training command with the following:
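The exact invocation depends on how the commands and flags are named; an illustrative run might look like:

```shell
# Command and flag names are illustrative.
dotnet run -- training --data-path ./Data/wikiDetoxAnnotated40kRows.tsv
```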
The command trains our model using the passed-in dataset, as in the previous section. The advantage of using Oakton is that we can define various usages for our newly created commands. In this case, we have a default output path for our trained model.
Let’s take a look at the CheckCommand.
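A sketch of that command (names and defaults are illustrative), reusing the model-loading and prediction code from earlier:

```csharp
using System;
using Microsoft.ML;
using Oakton;

public class CheckInput
{
    [Description("Path to a trained model; a default is used when omitted")]
    public string ModelPathFlag { get; set; } = "SentimentModel.zip";
}

[Description("Classify typed lines of sentiment as toxic or non-toxic")]
public class CheckCommand : OaktonCommand<CheckInput>
{
    public override bool Execute(CheckInput input)
    {
        var mlContext = new MLContext();
        var model = mlContext.Model.Load(input.ModelPathFlag, out _);
        var engine = mlContext.Model
            .CreatePredictionEngine<SentimentData, SentimentPrediction>(model);

        Console.WriteLine("Type a sentiment (empty line to quit):");
        string line;
        while (!string.IsNullOrWhiteSpace(line = Console.ReadLine()))
        {
            var prediction = engine.Predict(new SentimentData { Text = line });
            var label = prediction.Prediction ? "Toxic" : "Non-toxic";
            Console.WriteLine($"{label} ({prediction.Probability:P0})");
        }

        return true;
    }
}
```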
We can pass in an entirely different trained model or use the default. We will also let the user type as many lines of sentiment as they want.
Using Oakton, we can continue to evolve our ML.NET app with new commands.
Conclusion
ML.NET has been on my list of technologies to try for a while now. Once we break the sample project down into its two components, it’s clear how easy it is to consume trained models. I plan to look at the other classification models and see where we could use them in our existing applications. The addition of Oakton makes it easy to focus on the different parts of the data science workflow and leaves open the possibility of future enhancements.
I hope you found this post helpful. Thanks to the ML.NET team for making such a great library and a vast collection of samples.
Remember you can download this sample project from my GitHub repository.