Machine learning is a hot topic in modern development. We can find the fruits of data science labor in our digital personal assistants, streaming recommendations, fraud detection, and cancer research. The applications of machine learning are boundless and can bring new solutions to long-standing problems. For .NET developers, it's never been easier to take advantage of machine learning with the arrival of ML.NET.
In this post, we’ll break down one of the ML.NET samples into its two critical components: Training and Prediction. We’ll be building a console application that allows us to train and store a model. The prediction component will allow us to recall a trained model and predict the classification of new data. Let’s get started!
Machine Learning Basics
For those with limited or no experience with machine learning, we can think of it as a mechanism for categorizing a dataset. When we look at data, each data point has features and labels. A feature is an attribute of the current data instance. For example, we may describe a dog as an animal with four legs; the animal's four legs are a feature. A label is a known fact about the data instance that categorizes it. In our example, we label the data instance as a dog.
Using features and labels, we can build a dataset of known instances, which we use to create a model. We train the model with a certain percentage of our dataset while withholding the rest to test and verify our new model's accuracy. Once we have a model, we can use it to predict the categorization of new input with a certain probability. Generally, the more training data we have, the more accurate our predictions will be.
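The dog example can be sketched as a tiny data class, one property per feature plus one for the label; the class and property names here are hypothetical, purely for illustration:

```csharp
// A hypothetical data instance: the features describe the animal,
// and the label is the known category we want a model to learn.
public class AnimalInstance
{
    // Feature: the number of legs observed.
    public int LegCount { get; set; }

    // Feature: whether the animal barks.
    public bool Barks { get; set; }

    // Label: the known category of this instance, e.g. "Dog".
    public string Animal { get; set; }
}
```

This is the same shape we'll use later for sentiment: feature columns plus a label column.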
There are various types of models we can train:
- Binary Classification: True or False.
- Multi-class Classification: What type is this thing?
- Recommendation: Given a set of data, what other things might fit into it?
- Regression: Predicting a future value based on the features.
- Time Series Forecasting: Similar to regression, but with time as the main component.
- Anomaly Detection: Is a data point an outlier given our current understanding?
- Clustering: Categorization based on features.
- Ranking: Ordering by “importance”.
- Computer Vision: Detection and classification of objects.
In this post, we’ll be building a binary classification engine. It will determine if a particular sentiment falls into the category of Toxic or Non-toxic. Let’s start with training our model.
We’ll be using an adapted version of the sample found at the official ML.NET repository. To download this solution, you can go to my GitHub repository.
Training Our Sentiment Analysis Model
To train a model, we need a dataset. In this example, we’ll be using a dataset of sentiment pulled from Wikipedia moderators. Be warned, some of the data can be a little nasty. The two most essential columns in our dataset are label and comment. Before we load this dataset, we need to create a data object.
public class SentimentIssue
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(2)]
    public string Text { get; set; }
}
This object allows ML.NET to parse the tab-separated values file. We also need our output class, named SentimentPrediction. We’ll be using our prediction class later in this post.
public class SentimentPrediction
{
    // ColumnName attribute is used to change the column name from
    // its default value, which is the name of the field.
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    // No need to specify ColumnName attribute, because the field
    // name "Probability" is the column name we want.
    public float Probability { get; set; }

    public float Score { get; set; }
}
We can load our training data with the following code.
// Create MLContext to be shared across the model creation workflow objects
// Set a random seed for repeatable/deterministic results across multiple trainings.
var mlContext = new MLContext(seed: 1);
// STEP 1: Common data loading configuration
var dataView = mlContext.Data.LoadFromTextFile<SentimentIssue>(input.DatasetPath, hasHeader: true);
After loading our data, we need to split it into two parts: the training set and the test set.
var trainTestSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
var trainingData = trainTestSplit.TrainSet;
var testData = trainTestSplit.TestSet;
Before we can begin training our model, we need to transform our data, and ML.NET has built-in transformers for this. In our case, we need to turn our text input into a feature vector.
// STEP 2: Common data process configuration with pipeline data transformations
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentIssue.Text));
The next two lines of code allow us to use the label in the dataset to create a classification trainer.
// STEP 3: Set the training algorithm, then create and config the modelBuilder
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features");
var trainingPipeline = dataProcessPipeline.Append(trainer);
Finally, we can train our model.
// STEP 4: Train the model fitting to the DataSet
ITransformer trainedModel = trainingPipeline.Fit(trainingData);
Our next, optional, step is to verify our newly trained model against the test set.
// STEP 5: Evaluate the model and show accuracy stats
var predictions = trainedModel.Transform(testData);
var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label", scoreColumnName: "Score");
We can use a console helper in the sample project to write out the results of our test data against our model.
* Accuracy: 94.69 %
* Area Under Curve: 94.07 %
* Area under Precision recall Curve: 77.40 %
* F1Score: 63.95 %
* LogLoss: .21
* LogLossReduction: .52
* PositivePrecision: .88
* PositiveRecall: .5
* NegativePrecision: .95
* NegativeRecall: 99.32 %
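If you are not using the sample project's ConsoleHelper, the same numbers are available directly on the CalibratedBinaryClassificationMetrics object returned by Evaluate. A minimal sketch (the metrics variable is the one produced in STEP 5):

```csharp
// 'metrics' is the CalibratedBinaryClassificationMetrics from STEP 5.
Console.WriteLine($"Accuracy:  {metrics.Accuracy:P2}");
Console.WriteLine($"AUC:       {metrics.AreaUnderRocCurve:P2}");
Console.WriteLine($"AUPRC:     {metrics.AreaUnderPrecisionRecallCurve:P2}");
Console.WriteLine($"F1 Score:  {metrics.F1Score:P2}");
Console.WriteLine($"Log Loss:  {metrics.LogLoss:0.##}");
```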
We can see that our model shows a 94.69% accuracy rate against our test data. That’s pretty decent. We only need to train the model once, and we can now save it to disk.
// STEP 6: Save/persist the trained model to a .ZIP file
mlContext.Model.Save(trainedModel, trainingData.Schema, outputPath);
We will be using this trained model in the next section.
Using Our Trained Model
We may have noticed that training our model can take a bit of time. We don’t want to retrain our model every time we need a prediction. Luckily, ML.NET lets us store and load trained models.
Given the path to a saved model, we can load it into memory.
var mlContext = new MLContext(seed: 1);
var transformer = mlContext.Model.Load(modelPath, out _);
var engine = mlContext.Model.CreatePredictionEngine<SentimentIssue, SentimentPrediction>(transformer);
Once loaded, we can use our prediction engine.
var loop = true;
Console.CancelKeyPress += (sender, args) =>
{
    // Cancel the process kill so the loop can exit cleanly.
    args.Cancel = true;
    loop = false;
};

while (loop)
{
    Console.Write("$> ");
    var line = Console.ReadLine();
    if (line == null) break; // ReadLine returns null once input ends

    var example = new SentimentIssue { Text = line };
    var prediction = engine.Predict(example);
    var result = prediction.Prediction ? "Toxic" : "Non Toxic";

    Console.WriteLine("=============== Single Prediction ===============");
    Console.WriteLine($"Text: {example.Text} \n" +
                      $"Prediction: {result} ({prediction.Probability})");
}
Running our engine shows our prediction model in action.
$> this is awesome!
=============== Single Prediction ===============
Text: this is awesome!
Prediction: Non Toxic (0.18389213)
$> this sucks
=============== Single Prediction ===============
Text: this sucks
Prediction: Toxic (0.97032255)
Note the probability next to each prediction. Our engine scores every outcome, and we can think of the score as confidence. In the toxic prediction, our engine is 97% confident the text is toxic. In the non-toxic prediction, our model is 82% (100 - 18) confident the result is not toxic. Remember that the probability always answers the question, “How confident are we that this sentiment is toxic?”
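Since Probability always measures how likely the text is to be toxic, the confidence in the predicted class is either p or 1 - p. A minimal sketch of that bookkeeping (the helper name is ours, not part of ML.NET):

```csharp
// Probability is P(toxic). The confidence in the *predicted* class is
// p when the prediction is Toxic and 1 - p when it is Non Toxic.
float ConfidenceInPrediction(bool predictedToxic, float probability) =>
    predictedToxic ? probability : 1f - probability;

// For the two examples above:
// ConfidenceInPrediction(false, 0.18389213f) ≈ 0.816 (Non Toxic)
// ConfidenceInPrediction(true,  0.97032255f) ≈ 0.970 (Toxic)
```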
Oakton for a Better ML Experience
Oakton is a library for parsing command-line arguments and building command-line actions. It’s a perfect companion for ML.NET, as it allows us to separate model training from prediction, create clean-up commands, and expose a testing fixture.
To get started, we first need to create a console application project. In the Main method, we want to scan for and register all new Oakton commands.
class Program
{
    private static async Task<int> Main(string[] args)
    {
        var executor = CommandExecutor.For(_ =>
        {
            _.RegisterCommands(typeof(Program).GetTypeInfo().Assembly);
        });
        return await executor.ExecuteAsync(args);
    }
}
Next, we can create two commands: TrainingCommand and CheckCommand. Each one has its own input class as well. Let’s look at the TrainingCommand first.
using System;
using System.IO;
using Common;
using MachineLearningHelloWorld.Structures;
using Microsoft.ML;
using Oakton;

namespace MachineLearningHelloWorld
{
    public class TrainingInput
    {
        [FlagAlias('i')]
        [Description("training dataset path", Name = "data")]
        public string DatasetPath { get; set; }

        [FlagAlias('o')]
        [Description("training model output (zip file)", Name = "output")]
        public string OutputPath { get; set; }
    }

    public class TrainingCommand : OaktonCommand<TrainingInput>
    {
        public const string DefaultModelPath = "./";

        public static string GetModelPath(string path)
        {
            return Path.Combine(path, "model.zip");
        }

        public TrainingCommand()
        {
            Usage("Default output").Arguments(x => x.DatasetPath);
            Usage("Override output").Arguments(x => x.DatasetPath, x => x.OutputPath);
        }

        public override bool Execute(TrainingInput input)
        {
            if (string.IsNullOrEmpty(input.DatasetPath))
            {
                ConsoleHelper.ConsoleWriteException("a dataset is required to train the model.");
                return false;
            }

            var outputPath = GetModelPath(
                string.IsNullOrEmpty(input.OutputPath)
                    ? DefaultModelPath
                    : input.OutputPath
            );

            // Create MLContext to be shared across the model creation workflow objects
            // Set a random seed for repeatable/deterministic results across multiple trainings.
            var mlContext = new MLContext(seed: 1);

            // STEP 1: Common data loading configuration
            var dataView = mlContext.Data.LoadFromTextFile<SentimentIssue>(input.DatasetPath, hasHeader: true);
            var trainTestSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
            var trainingData = trainTestSplit.TrainSet;
            var testData = trainTestSplit.TestSet;

            // STEP 2: Common data process configuration with pipeline data transformations
            var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentIssue.Text));

            // STEP 3: Set the training algorithm, then create and config the modelBuilder
            var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features");
            var trainingPipeline = dataProcessPipeline.Append(trainer);

            // STEP 4: Train the model fitting to the DataSet
            ITransformer trainedModel = trainingPipeline.Fit(trainingData);

            // STEP 5: Evaluate the model and show accuracy stats
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label", scoreColumnName: "Score");
            ConsoleHelper.PrintBinaryClassificationMetrics(trainer.ToString(), metrics);

            // STEP 6: Save/persist the trained model to a .ZIP file
            mlContext.Model.Save(trainedModel, trainingData.Schema, outputPath);
            Console.WriteLine("The model is saved to {0}", outputPath);

            return true;
        }
    }
}
Oakton parses our input arguments for us. We can run the training command with the following:
dotnet run training data.tsv
The command trains our model using the passed-in dataset, just as in the previous section. The advantage of using Oakton is that we can define various usages for our newly created commands. In this case, we have a default output path for our trained model.
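Because the TrainingCommand constructor declares two usages, Oakton accepts either form; the ./models directory below is purely illustrative:

```shell
# Default output: the model is written to ./model.zip
dotnet run training data.tsv

# Override output: the model is written to ./models/model.zip
dotnet run training data.tsv ./models
```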
Let’s take a look at the CheckCommand.
public class CheckInput
{
    [FlagAlias('m')]
    [Description("path of the trained model")]
    public string ModelPath { get; set; }
}

public class CheckCommand : OaktonCommand<CheckInput>
{
    public CheckCommand()
    {
        Usage(default).Arguments();
        Usage("with model").Arguments(x => x.ModelPath);
    }

    public override bool Execute(CheckInput input)
    {
        var mlContext = new MLContext(seed: 1);

        var modelPath = input.ModelPath.IsEmpty()
            ? TrainingCommand.GetModelPath(TrainingCommand.DefaultModelPath)
            : input.ModelPath.EndsWith(".zip")
                ? input.ModelPath
                : TrainingCommand.GetModelPath(input.ModelPath);

        var transformer = mlContext.Model.Load(modelPath, out _);
        var engine =
            mlContext.Model.CreatePredictionEngine<SentimentIssue, SentimentPrediction>(transformer);

        // The user exits with Ctrl+C
        Console.WriteLine("=============== Zoltar Of Sentiment ===============");
        var loop = true;
        Console.CancelKeyPress += (sender, args) =>
        {
            // Cancel the process kill so the loop can exit cleanly.
            args.Cancel = true;
            loop = false;
        };

        while (loop)
        {
            Console.Write("$> ");
            var line = Console.ReadLine();
            if (line == null) break; // ReadLine returns null once input ends

            var example = new SentimentIssue { Text = line };
            var prediction = engine.Predict(example);
            var result = prediction.Prediction ? "Toxic" : "Non Toxic";

            Console.WriteLine("=============== Single Prediction ===============");
            Console.WriteLine($"Text: {example.Text} \n" +
                              $"Prediction: {result} ({prediction.Probability})");
        }

        return true;
    }
}
We can pass in an entirely different trained model or use the default one. We also let the user type as many lines of sentiment as they want.
dotnet run check
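The second usage declared in the CheckCommand constructor also lets us point at another model by passing its path, or its containing folder, as an argument; the folder below is illustrative:

```shell
# Use the default ./model.zip
dotnet run check

# Load a model from another folder (resolves to ./other-models/model.zip)
dotnet run check ./other-models
```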
Using Oakton, we can continue to evolve our ML.NET app with new commands.
Conclusion
ML.NET has been on my list of technologies to explore for a while now. Once we break the sample project down into its two components, it is clear how easy it is to consume trained models. I plan to look at the other classification models and see where we could use them in our existing applications. The addition of Oakton makes it easy to focus on the different parts of data science and leaves open the possibility of future enhancements.
I hope you found this post helpful. Thanks to the ML.NET team for making such a great library and a vast set of samples.
Remember you can download this sample project from my GitHub repository.