
Recognizing Programming Languages using a Neural Network (C#)

This article describes how to use a neural network to recognize programming languages, as an entry for CodeProject's Machine Learning and Artificial Intelligence Challenge.

Introduction

This article is my entry for the language detector part of CodeProject's Machine Learning and Artificial Intelligence Challenge. The goal of the challenge was to train a model to recognize programming languages, based on a provided training dataset with 677 code samples.

I've used C#. The solution has a LanguageRecognition.Core project, a library with the machine learning code, and a LanguageRecognition project, a console application that tests the code. The solution depends on SharpLearning (more specifically, SharpLearning.Neural).

The Algorithm

I decided to go with a neural network to train my model. A neural network takes a vector of floating-point numbers as input: the features of the object we're trying to classify. This vector is also known as the input layer, and a unit of a layer is also known as a neuron. A neural network also has an output layer. In the case of a network that's trained for classification (such as the one for this project), the output layer has as many elements as there are classification categories, and the values indicate which category the network prefers for a given input. (For example, the result of a classification with 3 categories could be [0.15 0.87 0.23], indicating that the network preferred the second category.) Between the input layer and the output layer, you can also have one or more hidden layers, with a number of units that you can choose.

How do you get from one layer to the next? A matrix multiplication is performed with the first layer and a weight matrix, that result goes through an activation function, and then we have the values of the next layer. (The network in this article uses the rectifier, because that is what SharpLearning uses.) That layer is then used to calculate the values for the next layer, and so on. For the last layer, we also apply the softmax function rather than just an activation function (one important difference is that an activation function works independently on each unit of a layer, whereas the softmax function has to be applied to the whole array of values). 'Between' every two layers, there is a different weight matrix. It's really the values of these weight matrices that decide what the output of the network is going to look like. So when a neural network is trained, the values of these weight matrices are adjusted so that the actual outputs match the expected outputs better. SharpLearning uses gradient descent for this (more specifically, mini-batch gradient descent).
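To make the layer-to-layer step concrete, here is a minimal hand-rolled sketch of a single forward step and the softmax function. This is illustrative only: the names are mine, bias terms are omitted, and SharpLearning does all of this internally.

C#
using System;
using System.Linq;

static class ForwardPassSketch
{
    // One step from a layer to the next: matrix multiplication with the
    // weight matrix, then the rectifier (ReLU) applied to each unit
    // independently. (Bias terms are omitted for brevity.)
    internal static double[] Forward(double[] layer, double[,] weights)
    {
        int nextSize = weights.GetLength(1);
        double[] next = new double[nextSize];
        for (int j = 0; j < nextSize; j++)
        {
            double sum = 0;
            for (int i = 0; i < layer.Length; i++)
            {
                sum += layer[i] * weights[i, j];
            }
            next[j] = Math.Max(0, sum); // rectifier
        }
        return next;
    }

    // Softmax works on the whole array at once and turns the output layer
    // into positive values that sum to 1.
    internal static double[] Softmax(double[] layer)
    {
        double[] exps = layer.Select(Math.Exp).ToArray();
        double total = exps.Sum();
        return exps.Select(x => x / total).ToArray();
    }
}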

I am not going to go into the details and mathematics of neural networks, because SharpLearning takes care of that; I'm going to focus on how to apply them to recognizing programming languages. If you're interested in learning more, there is plenty of material available; the links in the previous paragraph are a good start.

I mentioned that a neural network takes a floating-point vector, the features, as input. What are those going to be here? For this challenge, the number of features (and the features themselves) cannot be pre-defined, because that would require assumptions about how many, and which, languages we have to be able to classify. We don't want to make those assumptions; instead, we're going to derive the features from the code samples that we use as training data. Deriving the features is the first step in our training process.

First, the features that appear to be significant for each language are derived separately. I decided to derive three types of features: the most common symbols used in the code, the most common words, and the most common combinations of two consecutive words. Those features seemed most important to me. For example, in HTML, the < and > symbols are significant, and so are keywords such as body, table and div. The keyword import would be significant for both Java and Python, and that's where the combinations help: a combination like import java points to Java, whereas a combination like import os points to Python.

After having derived those features per language, we combine them: we want to tell our neural network about the presence (or absence) of all features that could point to a specific language. The total number of input neurons will be the sum of all symbols, keywords, and combinations selected for each language (duplicates are filtered out, of course; we don't need multiple input neurons for the keyword import, for example). The number of output neurons will be the number of languages present in our training dataset.

Let me clarify that with an example. Imagine there were only 3 languages in the training set: C#, Python and JavaScript. For each of these languages, the 10 most common symbols, 20 most common words, and 30 most common combinations are selected. That's 60 features per language, so 180 for the three languages combined. However, most of the symbols and some of the keywords/combinations will be duplicates. For the sake of this example, let's say that there are 11 unique symbols overall, 54 unique words, and 87 unique combinations; our neural network will then take 11 + 54 + 87 = 152 values as input. Each input value corresponds to one symbol/word/combination, and the value will be the number of occurrences of that symbol/word/combination in an arbitrary piece of code.
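The deduplication is what brings the total down from 180 to 152 here; conceptually it's nothing more than a Distinct over the concatenated per-language feature lists. A made-up miniature version (the real code follows later in TrainingSet.PrepareTraining; this snippet assumes System.Linq):

C#
// Made-up miniature feature lists, just to show the deduplication:
string[] csharpWords = { "using", "namespace", "class", "public" };
string[] javaWords = { "import", "public", "class", "void" };

// "public" and "class" collapse into a single feature each:
string[] uniqueWords = csharpWords.Concat(javaWords).Distinct().ToArray();
// uniqueWords.Length == 6 (not 8): one input neuron per unique word.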

What about the hidden layers, the ones between the input and the output layer? I went with 4 hidden layers: if S is the sum of all symbols, keywords, combinations, and possible output languages, then the hidden layers have S / 2, S / 3, S / 4 and S / 5 units respectively. Why those numbers? Because those gave me some of the best results when testing my model; there isn't much more to it. Using S units in all four layers gave comparable results (perhaps even slightly better on average), but training was much slower.

After having selected the features to use, it's time for the actual training. For the neural network maths, I'm using the SharpLearning library. For each code sample, the previously selected symbols/words/combinations are counted and used as neural network inputs. All to-be-recognized languages get an index, and those indexes will be passed to SharpLearning as outputs for the training data.

When the training is done, we have a model that can recognize the languages of code samples. To predict a language, the input code sample is transformed into an input vector exactly as in the pre-processing of the training samples (i.e., counting certain symbols, words, and combinations), and SharpLearning takes care of the maths to return the index of the predicted programming language.

The Implementation

CharExtensions: Defining 'symbol'

In the previous section, I said we'd select the most common symbols for all given languages as part of the neural network features. Let's first define what a "symbol" means in this context. The following seems like a sensible definition to me: a char is a symbol if it's not a letter, not a digit, not whitespace, and not an underscore (because underscores are perfectly valid in variable names). Translating that into code:

C#
static class CharExtensions
{
    internal static bool IsProgrammingSymbol(char x)
    {
        return !char.IsLetterOrDigit(x) && !char.IsWhiteSpace(x) && x != '_';
    }
}
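For instance (hypothetical calls, just to show the boundary cases):

C#
CharExtensions.IsProgrammingSymbol('{'); // true
CharExtensions.IsProgrammingSymbol('<'); // true
CharExtensions.IsProgrammingSymbol('a'); // false: letter
CharExtensions.IsProgrammingSymbol('7'); // false: digit
CharExtensions.IsProgrammingSymbol('_'); // false: valid in identifiers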

LanguageTrainingSet: Deriving Features Per Language

Next, we'll work on the classes that derive the features from the given code samples. As I said before, we'll first do this per language, then combine the features. The LanguageTrainingSet class takes care of the former and also holds all training samples for one language. It has the following members to keep track of the samples and the symbol/keyword/combination counts:

C#
List<string> samples = new List<string>();
public List<string> Samples { get => samples; }

Dictionary<char, int> symbolCounters = new Dictionary<char, int>();
Dictionary<string, int> keywordCounters = new Dictionary<string, int>();
Dictionary<string, int> wordCombinationCounters = new Dictionary<string, int>();

These collections will be filled when a new training sample is added to the training set. That's what the AddSample method is for:

C#
public void AddSample(string code)
{
    code = code.ToLowerInvariant();

    samples.Add(code);

    var symbols = code.Where(CharExtensions.IsProgrammingSymbol);
    foreach (char symbol in symbols)
    {
        if (!symbolCounters.ContainsKey(symbol))
        {
            symbolCounters.Add(symbol, 0);
        }

        symbolCounters[symbol]++;
    }

    string[] words = Regex.Split(code, @"\W").Where
                     (x => !string.IsNullOrWhiteSpace(x)).ToArray();
    foreach (string word in words)
    {
        if (!keywordCounters.ContainsKey(word))
        {
            keywordCounters.Add(word, 0);
        }

        keywordCounters[word]++;
    }

    for (int i = 0; i < words.Length - 1; i++)
    {
        string combination = words[i] + " " + words[i + 1];
        if (!wordCombinationCounters.ContainsKey(combination))
        {
            wordCombinationCounters.Add(combination, 0);
        }

        wordCombinationCounters[combination]++;
    }
}

Let's walk through this step by step:

  1. The code is converted to lowercase. For case-insensitive languages, not doing this would most likely harm the recognition results. And for case-sensitive languages, it won't matter too much.
  2. The training sample is added to the samples list.
  3. All symbols are extracted from the code sample using LINQ's Where method and the IsProgrammingSymbol method that we created before.
  4. We iterate over all found symbols and for each symbol, we increment the value associated with the symbol in the symbolCounters dictionary.
  5. The code is split on "non-word characters" (that is, all characters that are not A-Z, a-z, 0-9 or underscores) to extract all words.
  6. Exactly like the symbols, the counters in the keywordCounters dictionary are incremented.
  7. We do the same for all combinations of two consecutive words.

When more samples are added using this method, the counters gradually increase and we get a ranking that indicates which keywords appear most often and which appear less often. Eventually, we want to know which keywords, symbols, and combinations appear most, and use those as features for our neural network. To select those, the class has a ChooseSymbolsAndKeywords method. It's internal because we want to be able to call it from other classes in the LanguageRecognition.Core assembly, but not from outside the assembly.

C#
const int SYMBOLS_NUMBER = 10;
const int KEYWORDS_NUMBER = 20;
const int COMBINATIONS_NUMBER = 30;

internal IEnumerable<char> Symbols { get; private set; }
internal IEnumerable<string> Keywords { get; private set; }
internal IEnumerable<string> Combinations { get; private set; }

internal void ChooseSymbolsAndKeywords()
{
    Symbols = symbolCounters.OrderByDescending(x => x.Value).Select
                                    (x => x.Key).Take(SYMBOLS_NUMBER);
    Keywords = keywordCounters.OrderByDescending(x => x.Value).Select
               (x => x.Key).Where(x => !int.TryParse(x, out int _)).Take(KEYWORDS_NUMBER);
    Combinations = wordCombinationCounters.OrderByDescending
                   (x => x.Value).Select(x => x.Key).Take(COMBINATIONS_NUMBER);
}

The point of the .Where call in the keyword selection is to exclude 'keywords' that are just numbers. Those wouldn't be useful at all. Numbers in combination with letters are not excluded (and they shouldn't be; 1px, for example, is still useful).
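A quick illustration of that filter (IsJustANumber is a hypothetical helper wrapping the same int.TryParse test as the code above):

C#
bool IsJustANumber(string word) => int.TryParse(word, out int _);

IsJustANumber("42");   // true  -> excluded from the keywords
IsJustANumber("1px");  // false -> kept
IsJustANumber("0x1f"); // false -> kept (int.TryParse doesn't accept hex notation)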

TrainingSet: Bringing Together LanguageTrainingSets

The TrainingSet class manages all LanguageTrainingSets, so you don't need to worry about those when you use the LanguageRecognition.Core library. When the LanguageRecognizer class (which we'll talk about later) wants to perform the neural network training, the TrainingSet class combines the .Symbols, .Keywords, and .Combinations picked by each LanguageTrainingSet's ChooseSymbolsAndKeywords into TrainingSet.Symbols, TrainingSet.Keywords, and TrainingSet.Combinations: the features that will be used in our neural network.

C#
public class TrainingSet
{
    Dictionary<string, LanguageTrainingSet> languageSets = 
                              new Dictionary<string, LanguageTrainingSet>();
    internal Dictionary<string, LanguageTrainingSet> LanguageSets { get => languageSets; }

    internal char[] Symbols { get; private set; }
    internal string[] Keywords { get; private set; }
    internal string[] Combinations { get; private set; }
    internal string[] Languages { get; private set; }

    public void AddSample(string language, string code)
    {
        language = language.ToLowerInvariant();

        if (!languageSets.ContainsKey(language))
        {
            languageSets.Add(language, new LanguageTrainingSet());
        }

        languageSets[language].AddSample(code);
    }

    internal void PrepareTraining()
    {
        List<char> symbols = new List<char>();
        List<string> keywords = new List<string>();
        List<string> combinations = new List<string>();

        foreach (KeyValuePair<string, LanguageTrainingSet> kvp in languageSets)
        {
            LanguageTrainingSet lts = kvp.Value;
            lts.ChooseSymbolsAndKeywords();
            symbols.AddRange(lts.Symbols);
            keywords.AddRange(lts.Keywords);
            combinations.AddRange(lts.Combinations);
        }

        Symbols = symbols.Distinct().ToArray();
        Keywords = keywords.Distinct().ToArray();
        Combinations = combinations.Distinct().ToArray();
        Languages = languageSets.Select(x => x.Key).ToArray();
    }
}

The PrepareTraining method will be called by the LanguageRecognizer class when it needs to know all features for the network input, and the possible languages for the output.
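
Putting the two classes together, usage from outside the library looks like this (the samples are made up):

C#
TrainingSet trainingSet = new TrainingSet();
trainingSet.AddSample("Python", "import os\nprint(os.getcwd())");
trainingSet.AddSample("C#", "using System;\nclass Program { static void Main() { } }");
// ...and so on for the rest of the training data.

// PrepareTraining is internal; LanguageRecognizer.CreateFromTraining
// (shown below) calls it for us.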

LanguageRecognizer: Training and Prediction

The LanguageRecognizer class is where the actual work happens: the neural network is trained, and we get a model that we can use to predict the language of a code sample. Let's first take a look at the fields of this class:

C#
[Serializable]
public class LanguageRecognizer
{
    NeuralNet network;
    char[] symbols;
    string[] keywords;
    string[] combinations;
    string[] languages;
    ClassificationNeuralNetModel model = null;

First, let's note that the class is Serializable: if you've trained the model and want to reuse it later, you shouldn't have to retrain it; you can just serialize it and restore it. The symbols, keywords, and combinations fields are the features for the neural network input, and languages holds the output categories; all of them are taken from a TrainingSet. NeuralNet is a class from SharpLearning, and so is ClassificationNeuralNetModel; the latter is the trained model and the former is used for the training.

Next, we have a static CreateFromTraining method, taking a TrainingSet and returning an instance of LanguageRecognizer. I decided to go with a static method and not a constructor, because the constructor guidelines say to do minimal work in a constructor, and training the model is not quite "minimal work".

The LanguageRecognizer.CreateFromTraining method constructs the neural network and its layers in the way I described earlier in this article. It goes through all training samples and transforms the code into input vectors. These input vectors are combined into one input matrix, and this matrix is passed to SharpLearning, along with the expected outputs.

C#
public static LanguageRecognizer CreateFromTraining(TrainingSet trainingSet)
{
    LanguageRecognizer recognizer = new LanguageRecognizer();

    trainingSet.PrepareTraining();
    recognizer.symbols = trainingSet.Symbols;
    recognizer.keywords = trainingSet.Keywords;
    recognizer.combinations = trainingSet.Combinations;
    recognizer.languages = trainingSet.Languages;

    recognizer.network = new NeuralNet();

    recognizer.network.Add(new InputLayer(recognizer.symbols.Length + 
               recognizer.keywords.Length + recognizer.combinations.Length));
    int sum = recognizer.symbols.Length + recognizer.keywords.Length + 
              recognizer.combinations.Length + recognizer.languages.Length;
    recognizer.network.Add(new DenseLayer(sum / 2));
    recognizer.network.Add(new DenseLayer(sum / 3));
    recognizer.network.Add(new DenseLayer(sum / 4));
    recognizer.network.Add(new DenseLayer(sum / 5));
    recognizer.network.Add(new SoftMaxLayer(recognizer.languages.Length));

    ClassificationNeuralNetLearner learner = new ClassificationNeuralNetLearner(
        recognizer.network, loss: new AccuracyLoss());

    List<double[]> inputs = new List<double[]>();
    List<double> outputs = new List<double>();

    foreach (KeyValuePair<string, LanguageTrainingSet> languageSet in trainingSet.LanguageSets)
    {
        string language = languageSet.Key;
        LanguageTrainingSet set = languageSet.Value;

        foreach (string sample in set.Samples)
        {
            inputs.Add(recognizer.PrepareInput(sample));
            outputs.Add(recognizer.PrepareOutput(language));
        }
    }

    F64Matrix inp = inputs.ToF64Matrix();
    double[] outp = outputs.ToArray();
    recognizer.model = learner.Learn(inp, outp);

    return recognizer;
}

This method refers to PrepareInput and PrepareOutput. PrepareOutput is very simple: for a given language, it returns the index of that language in the list of known languages. PrepareInput constructs a double[] with the features to feed to the neural network: the count of symbols we care about, keywords we care about, and keyword combinations we care about.

C#
double[] PrepareInput(string code)
{
    code = code.ToLowerInvariant();

    double[] prepared = new double[symbols.Length + keywords.Length + combinations.Length];

    for (int i = 0; i < symbols.Length; i++)
    {
        prepared[i] = code.Count(x => x == symbols[i]);
    }

    string[] codeKeywords = Regex.Split(code, @"\W").Where(x => keywords.Contains(x)).ToArray();

    int offset = symbols.Length;

    for (int i = 0; i < keywords.Length; i++)
    {
        prepared[offset + i] = codeKeywords.Count(x => x == keywords[i]);
    }

    string[] words = Regex.Split(code, @"\W");
    Dictionary<string, int> cs = new Dictionary<string, int>();
    for (int i = 0; i < words.Length - 1; i++)
    {
        string combination = words[i] + " " + words[i + 1];
        if (!cs.ContainsKey(combination))
        {
            cs.Add(combination, 0);
        }
        cs[combination]++;
    }

    offset = symbols.Length + keywords.Length;
    for (int i = 0; i < combinations.Length; i++)
    {
        prepared[offset + i] = cs.ContainsKey(combinations[i]) ? cs[combinations[i]] : 0;
    }

    return prepared;
}

double PrepareOutput(string language)
{
    return Array.IndexOf(languages, language);
}

Lastly, after having created and trained the recognizer, we obviously want to use it to actually recognize languages. That's a very simple piece of code: the input just needs to be turned into an input vector with PrepareInput and passed into SharpLearning's trained model, which gives an index as output.

C#
public string Recognize(string code)
{
    return languages[(int)model.Predict(PrepareInput(code))];
}
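Usage then looks like this (hypothetical snippet; the returned name is lowercase because TrainingSet.AddSample lowercases the language names):

C#
string language = recognizer.Recognize("def greet(name):\n    print('Hello, ' + name)");
Console.WriteLine(language); // most likely "python"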

The Testing

The downloadable LanguageRecognition solution has two projects: LanguageRecognition.Core, a library with all learning-related code, and LanguageRecognition, a console application that trains the recognizer based on the dataset provided by CodeProject. The dataset contains 677 samples; 577 of those are used for training, the remaining 100 for testing how good the model turned out to be.

The test code extracts the code samples, shuffles them, takes the first 577, performs training with those, then tests serialization and de-serialization of the model, and finally performs the prediction testing.

C#
static void Main(string[] args)
{
    // Reading and parsing training samples:
    string sampleFileContents = File.ReadAllText("LanguageSamples.txt").Trim();
    string[] samples = sampleFileContents.Split(new string[] { "</pre>" }, 
                       StringSplitOptions.RemoveEmptyEntries);
    List<Tuple<string, string>> taggedSamples = new List<Tuple<string, string>>();
    foreach (string sample in samples)
    {
        string s = sample.Trim();
        string pre = s.Split(new char[] { '>' }, 2)[0];
        string language = pre.Split('"')[1];
        // The code samples are HTML-encoded because they are in pre-tags:
        s = WebUtility.HtmlDecode(s.Replace(pre + ">", ""));
        taggedSamples.Add(new Tuple<string, string>(language, s));
    }

    // Shuffle once, so the training/test split differs between runs:
    taggedSamples = taggedSamples.OrderBy(x => Guid.NewGuid()).ToList();

    // Setting up training set and performing training:
    TrainingSet ts = new TrainingSet();
    foreach (Tuple<string, string> sample in taggedSamples.Take(577))
    {
        ts.AddSample(sample.Item1, sample.Item2);
    }

    LanguageRecognizer recognizer = LanguageRecognizer.CreateFromTraining(ts);

    // Serialization testing:
    BinaryFormatter binaryFormatter = new BinaryFormatter();
    LanguageRecognizer restored;
    using (MemoryStream stream = new MemoryStream())
    {
        binaryFormatter.Serialize(stream, recognizer);

        stream.Seek(0, SeekOrigin.Begin);
        restored = (LanguageRecognizer)binaryFormatter.Deserialize(stream);
    }

    // Prediction testing:
    int correct = 0;
    int total = 0;
    foreach (Tuple<string, string> sample in taggedSamples.Skip(577))
    {
        if (restored.Recognize(sample.Item2) == sample.Item1.ToLowerInvariant())
        {
            correct++;
        }
        total++;
    }
    Console.WriteLine($"{correct}/{total}");
}

The Results

On average, the accuracy on unseen samples appears to be approximately 85%. The accuracy differs every time you run the test application though, because the code samples are shuffled (so the selected features will be somewhat different) and the neural network is initialized with different random weights every time. Sometimes the accuracy is just below 80%, sometimes just above 90%. I wanted to test with bigger training sets as well, but I did not have the time to gather them. I believe a bigger training set would increase the accuracy though, because it means a better selection of features and better training of the neural network.

History

  • 3rd March, 2018: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

