Contents

Introduction
Download
Winnow in a nutshell
- How Winnow works
- Winnow in Action
Problem formulation
High Level diagram
Breakdown
Evaluation
Areas of improvement
References

Introduction

Context-sensitive spelling correction is the task of fixing spelling errors that result in valid words, such as I’d like to eat desert, where dessert was typed when desert was intended.
These errors will go undetected by conventional spell checkers, which only flag words that are not found in a word list.
Context-sensitive spelling correction involves learning to characterize the linguistic contexts in which different words, such as dessert and desert, tend to occur. The problem is that there is a multitude of features one might use to characterize these contexts: features that test for the presence of a particular word nearby the target word; features that test the pattern of parts of speech around the target word; and so on. In general, the number of features will range from a few hundred to over ten thousands.

Now comes another problem, how do we determine which features to actually consider? How to separate the wheat from the chaff?

Note that I'm not talking about Feature selection, because even after filtering or removing noisy features, you will still end up with a large feature space, while the target concept (a context in which desert can occur in this case) depends only on a small subset.

For this problem, we will employ the Winnow Algorithm.

This article follows the work of ANDREW R. GOLDING and DAN ROTH, A Winnow-Based Approach to Context-sensitive Spelling Correction. (License)

Download

Fork or clone from bitbucket.

Winnow in a nutshell

Put simply, winnow is a machine learning technique for discerning the disjunction, (i.e. the ORing function) from labelled examples.
Concretely, what this means is, given m labelled samples, each with a vector of n Boolean features x = (x₁, . . . , x_n), winnow learns which features to union to get the correct label.

Take the example of email spam filtering, on the basic level, we will take the body of the many emails, and tokenize them into array of words, this will be our features vector.
Each feature (word) is denoted 1 if it exists in the example, 0 if it doesn't.

Now you will end up with this:

x₁	x₂	x₃	x₄	x₅	...	x_n	spam? (yes/no)
$$$	100% free	viagra	the	weight loss	...		f(x)
0	1	0	1	0	...		?

Now you see the problem, the feature set can be very large, with many features set to true (the for example occurs in almost every email) yet the correct labelling depends only on a small subset of features r.

How Winnow works

Initialize the weights w₁ = w₂ = . . . = w_n= 1 on the n variables (features vector).
Given an example x = (x₁, . . . , x_n)
output 1 if

else output 0.
θ is a threshold value we set to our discretion
If the algorithm makes a mistake:
- If it predicts 0 when f (x) = 1, then for each x_iequal to 1, increase the value of w_i.(value was too low, promote contributing features)
- If it predicts 1 when f (x) = 0, then for each x_i equal to 1, decrease the value of w_i.(value was too high, demote contributing features)

Winnow gives us a mistake bound of O(r log n).
Where:
r:relevant features
n:total number of features

Winnow in Action

set θ=6 (features count)

x₁ w₁₌₁	x₂ w₂=1	x₃ w₃=1	x₄ w₄=1	x₅ w₅=1	x₆ w₆=1	prediction of f(x)
1	0	0	0	0	0	0 (Σ_ix_iw_i=1 ≥6?)
0 w₁₌₁	1 w₂=2	0 w₃=1	1 w₄=2	1 w₅=2	1 w₆=2	0 (Σ_ix_iw_i=4 ≥6?) double w_i for x_i=1
0	0	0	0	0	1	0 (2 ≥6?)
0 w₁₌₁	0 w₂=2	0 w₃=1	1 w₄=1	1 w₅=1	1 w₆=1	1 (6 ≥6?) halve w_i for x_i=1
0	0	0	0	1	1	0 (2 ≥6?)
...						...

(mistakes in red; the target f(x) = x2 ∨ x3, n = 6, r = 2)

public class Sample
    {
        public bool[] Features { get; set; }
        public bool Class { get; set; }

        public Sample ToggleClass()
        {
            var invertedSample = (Sample)this.MemberwiseClone();
            invertedSample.Class = !invertedSample.Class;
            return invertedSample;
        }

        public override string ToString()
        {
            StringBuilder sb = new StringBuilder("(" + Class + ")[");

            for (int i = 0; i < Features.Length; i++)
            {
                sb.Append(Convert.ToInt16(Features[i]));

                if (i < Features.Length - 1)
                {
                    sb.Append(',');
                }
            }

            sb.Append("]");

            return sb.ToString();
        }
    }

    /// <summary>
    /// Algorithm for learning monotone disjunction function from labeled examples
    /// </summary>
    public class Winnow : ISupervisedLearning
    {
        #region Member Variables

        private readonly double[] _weights;
        private readonly double _threshold;
        private readonly double _promotion;
        private readonly double _demotion;

        #endregion

        public int Mistakes { get; private set; }

        #region Constructor

        public Winnow(int featuresCount, double threshold, double promotion, double demotion, double initialWeight)
        {
            this._threshold = threshold;
            this._promotion = promotion;
            this._demotion = demotion;
            this._weights = new double[featuresCount];

            for (int i = 0; i < _weights.Length; i++)
            {
                _weights[i] = initialWeight;
            }
        }

        #endregion

        #region ISupervisedLearning

        public void Train(IEnumerable<Sample> samples)
        {
            Train(samples.ToArray());
        }

        public void Train(params Sample[] samples)
        {
            lock (this)
            {
                foreach (var s in samples)
                {
                    bool prediction = Predict(s);

                    if (prediction != s.Class)//prediction was wrong
                    {
                        Mistakes++;

                        if (!prediction && s.Class)
                        {
                            AdjustWeights(s, _promotion);
                        }
                        else
                        {
                            AdjustWeights(s, _demotion);
                        }
                    }
                }
            }
        }

        public bool Predict(Sample sample)
        {
            double sum = GetScore(sample);

            return sum >= _threshold;
        }

        public double GetScore(Sample sample)
        {
            double sum = 0;

            for (int i = 0; i < _weights.Length; i++)
            {
                if (sample.Features[i])
                {
                    sum += _weights[i];
                }
            }

            return sum;
        }

        #endregion

        #region Public Methods

        public override string ToString()
        {
            var sb = new StringBuilder();

            foreach (var item in _weights)
            {
                sb.Append(item);
                sb.Append(",");
            }

            return sb.ToString();
        }

        #endregion

        #region Private Methods

        private void AdjustWeights(Sample s, double adjustment)
        {
            for (int i = 0; i < _weights.Length; i++)
            {
                if (s.Features[i])
                {
                    _weights[i] *= adjustment;
                }
            }
        }

        #endregion

    }

Problem formulation

Right, so now we have an idea how Winnow works, let's get back to the problem at hand; context-sensitive spelling correction as a word disambiguation task.
We will need to feed Winnow 2 things; a vector of Boolean features, and a label; this comprises a single training example.
The label here would be one of 2 or more confused words, together these words form a confusion set.
a confusion set C ={W₁;....;W_n} means that each word W_i in the set is ambiguous with each other word.
Concretely, if C ={hear, here}, then the output label will be either hear or here.
obviously we have many confusion sets; each with their own distinct features vector:
C₁ ={hear, here}
C2 ={by, buy}
C3 ={to, two,too}
and so on...In the code I used some sets from Words Commonly Confused.

So these were the labels, how about the other component (features)?
Given a sentence, we want to extract a particular linguistic pattern in the context of the target word.

We'll use two types of features: context words and collocations.
Context-word features test for the presence of a particular word within ±k words of the target word.
Collocations features test for a pattern of up to ℓ contiguous words and/or part-of-speech tags around the target word. In the code, k was set to 10 and ℓ to 2.

Examples of useful features for the confusion set {weather, whether} include:

(1) cloudy within ±10 words
(2)__to VERB

Feature (1) is a context-word feature that tends to imply weather. Feature 2 is a collocation that checks for the pattern “to VERB” immediately after the target word, and tends to imply whether (as in I don’t know whether to laugh or cry).

High Level diagram

The following operation takes place for a single confusion set

Breakdown

Corpus Preprocessing

Corpus is first split into complete sentences; each sentence is tokenized into list of words and pos tags. Richard Northedge POS tagger was used.

public class Sentence
    {
        public string[] Words { get; set; }
        public string[] POSTags { get; set; }
    }

private IEnumerable<Sentence> PreProcessCorpora(IEnumerable<string> corpora)
        {
            var sentences = new ConcurrentBag();

            Parallel.ForEach(corpora, phrase =>
            {
                string[] tokens = SplitIntoWords(phrase);
                string[] posTags;

                lock (_lock)
                {
                    posTags = _posTagger.Tag(tokens);
                }

                sentences.Add(new Sentence
                {
                    Words = tokens,
                    POSTags = posTags
                });
            });

            return sentences;
        }

Getting context features

Combing through the training corpus (i.e. for each sentence), each time a word in the confusion set occurs, get all possible features for the context—namely, one context-word feature for every distinct word within ±k words.

public class ContextFeaturesExtractor : AbstractExtractor
    {
        public ContextFeaturesExtractor(int k)
            : base(k)
        {

        }

        public override HashSet<string> Extract(string[] tokens, int targetPosition)
        {
            var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            int backward = targetPosition - 1;
            int forward = targetPosition + 1;

            for (int counter = 0; counter < n; counter++)
            {
                if (backward <= -1 && forward >= tokens.Length)
                {
                    break;
                }

                if (backward > -1)
                {
                    AddFeature(features, tokens[backward]);
                    backward--;
                }

                if (forward < tokens.Length)
                {
                    AddFeature(features, tokens[forward]);
                    forward++;
                }
            }

            return features;
        }
    }

Getting collocation features

...and 2 collocations (occurring before and after word) for every way of expressing a pattern of up to ℓ contiguous elements.

public class CollocationFeaturesExtractor : AbstractExtractor
    {
        public CollocationFeaturesExtractor(int l)
            : base(l)
        {

        }

        public override HashSet<string> Extract(string[] posTags, int targetPosition)
        {
            var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

            for (int c = 0; c < n; c++)
            {
                var preceding = string.Empty;
                int backward = targetPosition - (c + 1);

                while (backward < targetPosition && backward > -1)
                {
                    preceding += posTags[backward] + " ";
                    backward++;
                }
                if (!string.IsNullOrEmpty(preceding))
                {
                    preceding += "_";
                    AddFeature(features, preceding);
                }

                string succeeding = "_ ";

                int forward = targetPosition + 1;
                for (int j = 0; j <= c && forward < posTags.Length; j++, forward = targetPosition + j + 1)
                {
                    succeeding += posTags[forward] + " ";
                }
                succeeding = succeeding.TrimEnd();
                if (succeeding != "_")
                {
                    AddFeature(features, succeeding.TrimEnd());
                }
            }

            return features;
        }
    }

Statistics collection

Statistics of occurrence of the features are collected in the process as well, namely, for each feature F we collect for target word W, we record:
N11 : Number of documents (sentences) containing both F and W .
N10 : Number of documents containing F only.
N01 : Number of documents containing W only.
N00 : Number of documents where neither W nor F were found.
N : Total number of documents.

Feature pruning

After these steps, an exhaustive list of all features found in the training corpus is generated, at this point; pruning criteria may be applied to eliminate unreliable or uninformative features. two criteria are used (which make use of the aforementioned statistics of occurrence):
(1) How much information the presence/absence of a feature F contributes to making the correct classification decision on W. (Mutual information)
(2) The presence of the feature is not significantly correlated with the identity of the target word (determined by a chi-square test at the 0.05 significance level).

public class MutualInformation : IFeatureSelector
    {
        #region Member Variables

        private readonly int k;

        #endregion

        #region Constructor

        public MutualInformation(int k)
        {
            this.k = k;
        }

        #endregion

        #region IFeatureSelector

        public IDictionary<string, Stats> Select(IDictionary<string, Stats> terms)
        {
            var features = new Dictionary<string, double>(terms.Count, StringComparer.OrdinalIgnoreCase);

            foreach (var term in terms)
            {
                var value = Calculate(n: term.Value.N,
                    n00: term.Value.N00,
                    n01: term.Value.N01,
                    n10: term.Value.N10,
                    n11: term.Value.N11);
                features.Add(term.Key, value);
            }

            var orderedFeatures = features
               .OrderByDescending(s => s.Value)
               .Select(s => s.Key);
            var prunedFeatures = orderedFeatures
                .Take(k)
               .ToDictionary(key => key, key => terms[key]);

            return prunedFeatures;
        }

        #endregion

        #region Private Methods

        private double Calculate(int n, double n00, double n01, double n10, double n11)
        {
            double n1_ = n10 + n11;
            double n_1 = n01 + n11;
            double n0_ = n00 + n01;
            double n_0 = n10 + n00;

            double sum = 0;
            sum += (n11 / n) * Math.Log((n * n11) / (n1_ * n_1), 2);
            sum += (n01 / n) * Math.Log((n * n01) / (n0_ * n_1), 2);
            sum += (n10 / n) * Math.Log((n * n10) / (n1_ * n_0), 2);
            sum += (n00 / n) * Math.Log((n * n00) / (n0_ * n_0), 2);

            return sum;
        }

        #endregion
    }

 

public class ChiSquaredTest : IFeatureSelector
    {
        #region Member Variables

        private readonly double _significanceLevel;
        private readonly Dictionary<double, double> _distributionTable;

        #endregion

        #region Constructor

        public ChiSquaredTest(double significanceLevel)
        {
            this._significanceLevel = significanceLevel;
            _distributionTable = new Dictionary<double, double> { 
                {0.1,2.71},
                {0.05,3.84},
                {0.01,6.63},
                {0.005,7.88},
                {0.001,10.83}
            };
        }

        #endregion

        #region IFeatureSelector

        public IDictionary<string, Stats> Select(IDictionary<string, Stats> features)
        {
            var prunedFeatures = new Dictionary<string, Stats>(StringComparer.OrdinalIgnoreCase);

            foreach (var feature in features)
            {
                var featureValue = Calculate(feature.Value.N, feature.Value.N00, feature.Value.N01, feature.Value.N10, feature.Value.N11);

                if (featureValue > _distributionTable[_significanceLevel])
                {
                    prunedFeatures.Add(feature.Key, feature.Value);
                }
            }

            return prunedFeatures;
        }

        #endregion

        #region Private Methods

        private double Calculate(int n, double n00, double n01, double n10, double n11)
        {
            var numerator = n * Math.Pow(((n11 * n00) - (n10 * n01)), 2);
            var denominator = (n11 + n01) *
                (n11 + n10) *
                (n10 + n00) *
                (n01 + n00);
            var value = numerator / denominator;
            return value;
        }

        #endregion
    }

Adding labelled samples

A training example consists of a sentence, represented as a set of active features, together with the word W_c in the confusion set that is correct for that sentence.

   private class RoughSample
        {
            public HashSet<string> Features { get; set; }
            public string Word { get; set; }
        }

Training

Now that we have a list of proper training data, we train a classifier for each word in our confusion set.

var cloudClassifiers = new Dictionary<string, ISupervisedLearning[]>(confusionSet.Count());

            foreach (var word in confusionSet)
            {
                var classifiers = new ISupervisedLearning[1];
                int i = 0;

                for (double beta = 0.5; beta < 1 && i < classifiers.Length; beta += 0.1, i++)
                {
                    classifiers[i] = new Winnow.Winnow(features.Length, threshold: 1, promotion: 1.5, demotion: beta, initialWeight: 1);
                }

                cloudClassifiers[word] = classifiers;
            }

Training proceeds in an on-line manner, each example is treated as a positive example for the classifiers for W_c, and as a negative example for the classifiers for the other words in the confusion set.(One Vs All)

Parallel.ForEach(trainingData, sample =>
            {
                var positive = new Sample
                  {
                      Class = true,
                      Features = CreateFeaturesVector(sample.Features, features)
                  };

                Sample negative = positive.ToggleClass();

                foreach (var cloud in cloudClassifiers)
                {
                    var example = cloud.Key == sample.Word ? positive : negative;

                    foreach (var classifier in cloud.Value)
                    {
                        classifier.Train(example);
                    }
                }
            });

Prediction

When a sentence is presented, the system checks the classifiers of each confusion set, if the sentence contains a word from a confusion set, features are extracted and run against the corresponding classifier, and returns the prediction with the highest score.

/// <summary>
        /// Checks every word in given sentence to check its contextually correct
        /// </summary>
        /// <param name="phrase">sentence to check</param>
        /// <returns> a Dictionary of the wrong word position in sentence as Key and its correction as Value</returns>
        public Dictionary<int, string> Predict(string phrase)
        {
            string[] tokens = SplitIntoWords(phrase);
            var correctWords = new Dictionary<int, string>();

            foreach (var comparator in _comparators)
            {
                foreach (var confusedWord in comparator.ConfusionSet)
                {
                    for (int i = 0; i < tokens.Length; i++)
                    {
                        if (!tokens[i].Equals(confusedWord, StringComparison.OrdinalIgnoreCase))
                        {
                            continue;
                        }

                        string[] posTags;

                        lock (_lock)
                        {
                            posTags = _posTagger.Tag(tokens);
                        }

                        var features = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

                        features.UnionWith(_contextFeaturesExtractor.Extract(tokens, i));
                        features.UnionWith(_collocationtFeaturesExtractor.Extract(posTags, i));

                        var sample = new Sample
                        {
                            Features = CreateFeaturesVector(features, comparator.Features)
                        };

                        var predictions = new Dictionary<string, double>(comparator.Cloud.Count);

                        foreach (var classifier in comparator.Cloud.Keys)
                        {
                            predictions[classifier] = comparator
                                                 .Cloud[classifier]
                                                 .Sum(c => c.GetScore(sample));
                        }

                        string correctedWord = predictions.Aggregate((a, b) => a.Value > b.Value ? a : b).Key;

                        if (!tokens[i].Equals(correctedWord, StringComparison.OrdinalIgnoreCase))
                        {
                            correctWords.Add(i, correctedWord);
                        }
                    }

                }
            }

            return correctWords;
        }

Evaluation

In the experiments that follow, the training and test sets were drawn from the written part of the American National Corpus (OANC) (which is about 11 million words, split into random 80-20% by sentence) and 28 confusion sets.

Results
Confusion Set	Accuracy %
where-were	89
to-too-two	95
hour-our	95.4
by-buy-bye	98.7
their-there	90
knew-new	95.98
later-latter	77.9
weather-whether	95
loose-lose	87
hear-here	89
threw-through	98.8
brake-break	95.83
peace-piece	86.7
cite-site-sight	88.79
passed-past	83
sea-see	92
quiet-quit-quite	79
weak-week	88
coarse-course	96.4
fourth-forth	92.9
council-counsel	89
principal-principle	70
vain-vane-vein	79
rain-reign-rein	80
desert-dessert	89.79
complement-compliment	92
plaine-plane	100
waist-waste	87

Overall occuarcy is 94%

Running time for feature extraction + training + testing took on average 59 minutes on my i7, 3.4 GHz, 8 GB RAM machine.

The above results were achieved with pruning disabled, when enabled, accuracy went down and running time increased.

Areas of improvement

In training, for each word there's an array of classifiers, in the code I kept the number to 1, we can add more algorithms like logistic regression or even neural networks, then pick the best result using Weighted Majority Algorithm.
In context features extraction, we can use a dictionary API to properly stem and singularize the tokens.

References