I’ve recently been developing a utility to help validate my blog posts against a set of rules to ensure that I get notified if I accidentally publish an incomplete article. At its core, it’s a set of crawlers that scrape technical content, validate it and then throw some metadata about the content into Cosmos DB (I’m using the Graph API in case you were curious).
It got me wondering whether I could make the tool generic enough to be useful to my fellow CDA colleagues, which opened a Pandora’s box of possibilities in terms of potential features.
One of the features I’ve been exploring is estimating the popularity of an article based on its length, its sentiment score and whether it contains code snippets.
As I initially wrote the crawler for just my site, it was pretty easy to determine the programming language used in a snippet: it was almost always going to be C# or possibly Swift. I could look at the tags on the post and make an educated guess, which yielded accurate results. However, this approach is wholly inadequate when looking at the content my colleagues have been producing, as it’s a wild west of different technology stacks.
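That original heuristic was barely more than a lookup over the post’s tags. A minimal sketch of the idea (the method name, tag values and default are all hypothetical, not the crawler’s actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SnippetLanguageGuesser
{
    // Guess the snippet language from the post's tags.
    // On a single-author blog this worked well: a Swift-related tag
    // tipped the guess, and everything else defaulted to C#.
    public static string Guess(IEnumerable<string> postTags) =>
        postTags.Any(t => t.Equals("swift", StringComparison.OrdinalIgnoreCase) ||
                          t.Equals("ios", StringComparison.OrdinalIgnoreCase))
            ? "Swift"
            : "C#";
}
```

With dozens of authors and stacks in the mix, no amount of tag-matching like this stays accurate, which is what pushed me towards an actual model.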
This prompted me to explore how I could utilise the Visual Studio ML.NET Model Builder preview extension to create a custom model. It aims to make machine learning accessible to those who have limited to no experience with the domain, so I’m the perfect guinea pig! In order to get started, all you need is some training data. In my case, this meant a few evenings manually copying and pasting code snippets into folders.
Once I was approaching ~2,000 code snippets, I fired up Visual Studio. If you want to follow along with the instructions below, you’ll need to grab a copy of the data from my GitHub account.
Assuming you’ve created a new solution, you’ll want to right-click the project in Solution Explorer and click Add > Machine Learning.
You’ll then be presented with the ML.NET Model Builder scenario picker. Here you can explore the myriad of options available for creating ML models with zero experience! In this example, we’re going to use the ‘Text Classification’ scenario.
For some scenarios, you have the ability to utilise cloud compute, and thus you’re able to select either a local environment or the cloud. Here I’m only offered my local environment, but given the minimal amount of training required, my machine is more than good enough.
Next up, we need to supply Model Builder with our training data. Below you can see the contents of the CSV file, which is incredibly basic. I’ve created two columns: the syntax type and an example code snippet.
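If you’re building your own data set, the shape of the file is the important part. Here’s a sketch of the format with illustrative rows (the column names match my set-up, but the snippet values here are made up):

```csv
Language,Snippet
csharp,"public int Add(int a, int b) => a + b;"
swift,"func add(_ a: Int, _ b: Int) -> Int { return a + b }"
```

Snippets containing commas or quotes need the usual CSV escaping (wrap the field in double quotes and double any embedded quotes), which is worth checking before you train.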
Once we’re on the Add Data step, we add the merged.csv file and select the Language column as the column to predict (the label). Once we’ve done this, we’ll see a preview of the data, and if all looks good, then we’re ready to train.
By default, Model Builder will suggest a ‘Time to train’ value, but in my experience the suggestion was inadequate and only yielded errors. I’ve found that 300 seconds (5 minutes) works perfectly for the amount of training data supplied; you might find you can get away with less time if you experiment.
Once trained, you’ll instantly see the training results, which include the accuracy percentage of the model. If you’re happy with the accuracy, then you might want to try the model out by clicking the Evaluate button.
Passing in a C# snippet, I can see that the model is working well enough for my current needs.
I can then have the Model Builder extension generate the C# projects needed to use the custom ML model in production. It’s as simple as clicking ‘Add Projects’.
And with that, I’m ready to start using my custom machine learning model without having written any code. Model Builder has looked at my sample data, picked the averaged perceptron OVA (one-versus-all) trainer and then trained the model, producing an excellent starting point.
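For context, the consumption code Model Builder generates boils down to loading the saved model and wrapping it in a prediction engine. This is only a hedged sketch of that pattern — the actual class names, property names and model file path come from the projects Model Builder generates for you:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input/output classes; the generated versions use
// names matching your CSV columns.
public class ModelInput
{
    public string Snippet { get; set; }
}

public class ModelOutput
{
    // The predicted label produced by the trained pipeline.
    [ColumnName("PredictedLabel")]
    public string Prediction { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();

        // Load the model file saved by Model Builder (path is an assumption).
        ITransformer model = mlContext.Model.Load("MLModel.zip", out _);

        // PredictionEngine suits one-off, single-row predictions like ours.
        var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

        var result = engine.Predict(new ModelInput { Snippet = "public void Foo() { }" });
        Console.WriteLine($"Predicted language: {result.Prediction}");
    }
}
```

In the crawler, a call like `engine.Predict(...)` replaces the old tag-based guess, so the rest of the validation pipeline doesn’t need to change.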