I’ve recently been developing a service to help validate my blog posts against a set of rules, to ensure I’m notified if I accidentally publish an incomplete article. At its foundation, it’s a collection of crawlers that scrape technical text, verify it, and then generate metadata about the content for insertion into Cosmos DB.
It made me wonder if I could produce a tool that’s generic enough to be helpful to my fellow Advocate colleagues at Microsoft. This question opened a Pandora’s box of opportunities and possible features.
One feature I’ve been examining is how to estimate the potential popularity of an article based on its length, sentiment score and the contents of its code snippets.
As I originally created the crawler to work only on my personal blog, it was straightforward to detect the programming language used in a snippet; it was virtually always C# or Swift. I could peek at the tags on the post and make a choice, which provided valid results. However, this technique is altogether inadequate for my colleagues’ content, as they develop with a myriad of different technology stacks.
This prompted me to examine how I could utilise the Visual Studio ML.NET Model Builder preview extension to produce a machine learning model to detect programming syntax. The Visual Studio extension aims to make machine learning available to those who have limited to no experience with ML, so I’m the ideal guinea pig!
Assuming you wish to create your own syntax detection model, all you require is some training data. In my case, I dedicated a few evenings to manually copying and pasting code snippets from GitHub into folders labelled after the language.
Once I was approaching ~2000 code snippets, I opened Visual Studio to get started on creating the ML model. If you want to skip the tedious part of collecting the training data, you can clone the data I collected from GitHub.
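If you’d rather script the merge from labelled folders into a single CSV, a minimal sketch might look like the following. It assumes a hypothetical layout of `training-data/csharp/*.txt`, `training-data/swift/*.txt` and so on — the folder and column names are mine, so adjust them to match your own data.

```csharp
using System.IO;
using System.Text;

class MergeSnippets
{
    // Wrap a field in quotes and double any embedded quotes,
    // so commas and newlines inside snippets don't break the CSV.
    static string Quote(string field) =>
        "\"" + field.Replace("\"", "\"\"") + "\"";

    static void Main()
    {
        var root = "training-data";   // hypothetical folder name
        var output = new StringBuilder();
        output.AppendLine("Language,Snippet");

        // Each sub-folder is named after a language and holds its snippets.
        foreach (var dir in Directory.GetDirectories(root))
        {
            var language = Path.GetFileName(dir);
            foreach (var file in Directory.GetFiles(dir, "*.txt"))
            {
                var snippet = File.ReadAllText(file);
                output.AppendLine($"{Quote(language)},{Quote(snippet)}");
            }
        }

        File.WriteAllText("merged.csv", output.ToString());
    }
}
```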
Assuming you’ve created a new solution, you’ll need to right-click the solution explorer and click Add > Machine Learning.
Visual Studio will then present the ML.NET Model Builder scenario picker. Here you can delve into the countless options for creating ML models! In this case, we will use the ‘Text Classification’ scenario.
For some scenarios, you can utilise cloud computing, and therefore you’re able to choose either a local environment or the cloud. Here I’m only shown a local environment, but given the nominal amount of training needed, my system is more than enough.
Next up, we need to supply Model Builder with the training data. Below you can see the contents of the CSV file, which is simple by design: two columns, the syntax type and an example code snippet.
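For illustration, the shape of the file is roughly this (the snippets here are made-up placeholders, not rows from my actual dataset):

```
Language,Snippet
"csharp","public class Person { public string Name { get; set; } }"
"swift","struct Person { var name: String }"
```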
Once we’re on the Add Data step, we add the merged.csv file and choose the ‘Language’ column as the column to predict. Once we’ve completed this, we’ll see a preview of the data, and if all looks correct, then we’re ready to start training.
By default, Model Builder will propose a ‘Time to train’ value which is too low for this dataset and will simply yield a failure. I’ve found 300 seconds (5 minutes) is adequate for the volume of training data supplied. You might discover you can produce valid results with less time if you experiment.
Once trained, you’ll immediately see the training results, which include the accuracy percentage of the model. If you’re content with the accuracy, then you might wish to test the effectiveness by clicking the ‘Evaluate’ button.
Passing in a C# snippet, I can see that the model performs sufficiently for my current needs.
I can now have the Model Builder extension generate the C# projects required to run the custom ML model in production. It’s as straightforward as clicking the ‘Add Projects’ button.
And with that, I’m ready to use my custom machine learning model without having written any code.
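Consuming the model then boils down to a couple of lines. This is a minimal sketch, assuming the ML model was named ‘SyntaxModel’ — Model Builder names the generated classes after the model, and the `ModelInput` properties after the CSV column headers, so your names may well differ:

```csharp
using System;

class Program
{
    static void Main()
    {
        // Property name follows the CSV column header ("Snippet" in my file).
        var input = new SyntaxModel.ModelInput
        {
            Snippet = "public class Person { public string Name { get; set; } }"
        };

        // Predict loads the trained model and classifies the snippet.
        SyntaxModel.ModelOutput result = SyntaxModel.Predict(input);
        Console.WriteLine($"Detected language: {result.PredictedLabel}");
    }
}
```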
Model Builder has looked at the sample data, picked the best algorithm for training (averaged perceptron OVA, a one-versus-all multiclass trainer), trained the model and then generated all the boilerplate code required to use it. Who could ask for more when seeking a simple starting point?!