An algorithm of “phrases similarity calculation” and its usage in “autocompletion” feature

VictorSotnikov

4.00/5 (2 votes)

Aug 16, 2013

CPOL

13 min read

23063

384

an approach to the "autocompletion"

Download the Demo - 31.3 KB

Introduction

The article describes an algorithm used to build "autocompletion lists" for a textbox in a GUI:

The article has the Demo application attached with an implementation of the algorithm. The Demo is implemented on C# as a Silverlight Application. The Silverlight application is already deployed on http://files.rsdn.ru/44022/AutocompleteTextBox.html; you can try it right now without downloading and compiling the attached sources.

Audience

The possible audience of the article are developers that want to implement a "smart" autocompletion algorithm in their applications. The word "smart" means here the following: an algorithm that calculates the "relevancy rank" for the matched "auto-suggestion strings", and orders found results by decreasing of the "rank".

Disclaimers and restrictions

The Demo attached is not a ready-to-use "control" you can "drag and drop" into your application. It is rather an example how to implement the algorithm described in the article.

The attached Demo is a Silverlight application; but of course this approach may be used in practically any type of GUI applications – on a Web Page (with JavaScript usage), in a desktop application etc.

The algorithm implementation in the Demo has several "hardcoded" facts that are specific for English language only (a list of the "2^nd class words" like "the"). So the Demo should be used with English queries only. If you want to use the algorithm for other languages, then populate the list with the "2^nd class words" specific for your language.

Background

The idea of the "autocompletion" may be defined as "to suggest an "autocompletion list" when a user is typing a text in a text control". This technique has become a de facto standard to the present time. Below are the screenshots of several well-known applications with this feature:

And so on and so forth.

Microsoft, for example, recommends the Autocompletion use in its "Designing great productivity apps for Windows" article (http://msdn.microsoft.com/en-us/library/windows/apps/hh868273.aspx):

The Autocompletion feature has been added into various GUI controls in various toolkits. For example, .NET Silverlight has a control called AutoCompleteBox (http://msdn.microsoft.com/en-us/library/system.windows.controls.autocompletebox(v=vs.95).aspx). .NET WPF Toolkit contains its own AutoCompleteBox too (http://iserialized.com/using-the-autocompletebox-in-the-wpf-toolkit/). These are good controls; but their "matching filters" are very simple. I.e. the filters that "match" the entered text against the "array of possible values" and make the "autocompletion list". These filters suggest the following "matching rules": Contains, Starts With, Equals (with case-sensitivity or case-insensitivity); and this is all. And these filters, of course, do not try to order the matched values "by decreasing of relevance".

This is why I tried to implement an algorithm that is "smarter" than the ones used in typical "autocompletion controls".

Inputs, outputs and usage scenario of the algorithm

The algorithm supposes there is an application with a text field (a "textbox" etc). The field has a list of "possible values" for the auto-completion feature.

The attached Demo is an example of such application. It issues search queries to a database of movies:

But the nature of the queries does not matter for the algorithm, of course.

When a User of the application starts typing a text in such a field, the algorithm matches the entered text (we call the text as Query) against the Possible Values. The algorithm returns a list of the "matched" Possible Values. The list is sorted in descending order by "relevance". The "relevance" here is a degree of relevance between a Possible Value and given Query.

The application is supposed to display the returned list as the "auto-completion list" for the given Query:

The Main Ideas of the algorithm

The "similarity factor"

The main idea of the algorithm is usage of what we call "similarity factor".

In other words, the algorithm considers the following question: "how similar" are given Possible Value and Query to each other? The algorithm calculates a number called SimilarityRank than represents a "degree of similarity" between them. The output list of the algorithm is ordered by the decreasing SimilarityRank. Thus, a SimilarityRank value calculated for a (PossibleValue, Query) pair represents "relevance" of the PossibleValue to the Query.

Of course, the Possible Values that "are not similar at all" to given Query will not be included into the output list.

An example:

On the picture above we see that the algorithm has "matched" several of Possible Values against the entered Query "st". The value "Streets" has largest SimilarityRank and is at 1^st position of the output list. The value "Streets of Fire" has lower SimilarityRank and is at 2^nd position of the output list; and so on.

How the SimilarityRank is calculated

The main ideas used for the SimilarityRank calculation are the following:

The string comparison (used for the "matching") is case-insensitive
The "matching" of a PossibleValue against a Query is a "word-wise" operation. In other words, both the PossibleValue and Query will be "split" into words.
We suppose a Possible Value "matches" against a Query when all the Query’s words are found in the Possible Value, in the same order.
The "found in the Possible Value" means: a Query’s word should be equal to a Possible Query word; or to be a prefix of a Possible Value.
An example: suppose we have a Query "leading spaces". Then a PossibleValue "the leading and trailing Spaces" matches the Query. But a PossibleValue "spaces that are leading or trailing" does not match the "leading spaces" – both the Query’s words are found in the Possible Value; but not in the proper order.
Another example: suppose we have a Query "lead space". Then a PossibleValue "the leading and trailing spaces" matches the Query. But a PossibleValue "cheerleaders and spaces" does not match the Query – the word "lead" is found in the word "cheerleaders", but not in the very begin, so it is not a "prefix" of the word "cheerleaders".

A "not matched" Possible Value will not be included into the output list of the algorithm.
For every pair of a Query word and its "matching" PossibleValue word we use the following rule: the closer are their lengths to each other, the more "similar" are the words.
An example: suppose we have a PossibleValue word "relativeness" and a Query word "rel". They match each other, all right. But the "degree of similarity" is not very high: length of the "rel" is only 25% of the "relativeness" length.

Now let’s suppose the Query word is "relative" instead of the "rel". The "degree of similarity" between the "relative" and "relativeness" is essentially higher than the "degree of similarity" between the "rel" and "relativeness". Length of the "relative" is 66% of the "relativeness" length.

We call this as "words length factor".
We increase the SimilarityRank for every pair of a Query word and its "matching" PossibleValue word if the following situation is met:
- a Query word contains capital letters;
  and
- the occurrence of the Query word within the Possible Value is exactly (case-sensitively) equal to the Query word;
An example: suppose we have a Query "Main" and two Possible Values: "Maine" and "maine".
They both math the Query; but SimilarityRank of the "Maine" is higher. The reason is: the Query "Main" contains a capital letter. And the substring "Main" that is an occurrence of the Query in the "Maine" is exactly (case-sensitively) equal to the Query; whereas the substring "main" from the second PossibleValue is not.
We call this as "capital letters factor".
For every pair of a Query word and its "matching" PossibleValue word the SimilarityRank depends of a "word class" the Possible Value word belongs to.

If the PossibleValue word belongs to "1^st class", the SimilarityRank is higher. If the Possible Value word belongs to "2^nd class", the SimilarityRank is lower.
The "class" of a word is defined by the following rule (for English only; other languages may require other rules): the "2^nd class" consists of words like articles, prepositions etc. These words are "less important" than others. In some search engines they are called "stop words" and are ignored at all. Our algorithm does not ignore them completely; but considers them as "less important ones". And the "1^st class" is all the other words except the "2^nd class" ones.

In the current algorithm implementation the following words belong to the "2^nd class": "the", "a", "at", "in", "on", "of", "off", "into", "onto", "by".
Of course, the list may be extended.
We call this as "word class factor".
The closer is position of a "matched" a PossibleValue word to begin of the phrase, the higher is the SimilarityRank.

An example: suppose we have a Query "green" and 2 Possible Values: "green light" and "light green". They both math the Query; but SimilarityRank of the "green light" is higher. The reason is: the word "green" is located in 1^st position in the "green light" and only in 2^nd position in the "light green".

We call this as "word position factor".
The closer is length of a Query to length of a "matched" PossibleValue, the higher is the SimilarityRank.

We call this as "phrase length factor".

An example: suppose we have a Query "green" and 2 Possible Values: "green light" and "green light in the window tonight". They both math the Query; but SimilarityRank of the "green light" is higher. The reason is: length of the Query "green" is ~50% of length of the "green light"; and is only ~15% of the "green light in the window tonight".

Details of the algorithm

In the previous section I described the main ideas of the algorithm. In this section I will give the full description – how a numeric value SimilarityRank is calculated for a pair (Query, PossibleValue).

I recommend you to look at the attached sources while you are reading the description below.

Split both the Query and PossibleValue on separate words using the following separators: {space}, {tabulation}, !, ., {comma}, ;, (, ), \, /, +, -, :, ", [, ], ?, {, }, |, {long dash}, {short dash}.
Find all the occurrences of the Query in the PossibleValue. An "occurrence" happens when:
- every word of a Query is equal to a PossibleValue word; or is a prefix of a PossibleValue word;
  and
- the matched PossibleValue words are situated in the same order as their corresponding Query words
The comparison of the words is performed case-insensitively.
An example: suppose we have PossibleValue "Aa b c a bb" and Query "a b". The PossibleValue has two "occurrences" of the "a b": "Aa b c a bb" and "Aa b c a bb".
An example: suppose we have PossibleValue = "a b" and Query = "b a". The PossibleValue has no "occurrences" of the "b a", Indeed, the PossibleValue contains the "a" and "b" words, but not in the order the Query contains them.
If the PossibleValue has no occurrences of the Query, the SimilarityRank is 0. Exit the algorithm.
Calculate a value called PhraseLengthFactor for the given (Query, PossibleValue) pair. See below for the details of the calculation.
Calculate a value called SimilarityRankOfTheOccurence for each occurrence in the following way:

5.1. Calculate a value called WordsSimilarityRank for every pair of (<word of Query>, <word of PossibleValue that matches the word of Query>). It shows a "degree of similarity" between the words.

An example: suppose we have PossibleValue "aa bbb c d" and Query = "a b". Then we calculate a value WordsSimilarityRank for the pair ("a", "aa"); and another value WordsSimilarityRank for the pair ("b", "bbb");

See below for the details of the calculation.

5.2. Calculate a value called WordPositionFactor for every pair of (<word of Query>, <word of PossibleValue that matches the word of Query>).

An example: suppose we have PossibleValue = "aa bbb c d" and Query = "a b". Then we calculate a value WordPositionFactor for the pair ("a", "aa"); and another value WordPositionFactor for the pair ("b", "bbb").

See below for the details of the calculation.

5.3. Calculate the SimilarityRankOfTheOccurence of the given occurrence as the following:
```
SimilarityRankOfTheOccurence = WordsSimilarityRank * WordPositionFactor * PhraseLengthFactor 
```
Now we have a value SimilarityRankOfTheOccurence calculated for each of the "occurrences" in the given (Query, PossibleValue) pair. Find the largest one among all the SimilarityRankOfTheOccurence values. This will be the SimilarityRank of the (Query, PossibleValue) pair.

How the PhraseLengthFactor is calculated

On the input here we have a pair (Query, PossibleValue). They are already "split" to separate words; and can be represented as arrays of words.

We calculate the PhraseLengthFactor as the following (a code below is C#):

double AddendForWordWeightCalculation = 10.0;
double MaxQueryRelativeWeight = 1.0; 
double MinQueryRelativeWeight = 0.5; //Should be in]0, MaxQueryRelativeWeight[ range

double weightOfQuery = 0.0;
for (int i = 0; i < Query.Words.Length; ++i)
{
	weightOfQuery += Query.Words[i].Length + AddendForWordWeightCalculation;
}

double weightOfPossibleValue = 0.0;
for (int i = 0; i < PossibleValue.Words.Length; ++i)
{
	weightOfPossibleValue += PossibleValue.Words[i].Length + AddendForWordWeightCalculation;
}

double prevVal = weightOfQuery / weightOfPossibleValue;

double phraseLengthFactor = MinQueryRelativeWeight + prevVal * (MaxQueryRelativeWeight - MinQueryRelativeWeight);

The resulting PhraseLengthFactor in the given algorithm implementation will be in the [0.5, 1.0] range. You can change the range if you want to increase or decrease the influence of the PhraseLengthFactor on the final SimilarityRank. For example, if you use MaxQueryRelativeWeight = 2.0 instead of the current 1.0, the influence of the PhraseLengthFactor will be higher.

How the WordsSimilarityRank is calculated

On the input here we have a pair of words (WordOfPossibleValue, WordOfQuery). The WordOfQuery is a word of the given Query; the WordOfPossibleValue is a word of the given PossibleValue; and this word "matches" the WordOfQuery. An example: a WordOfQuery is"th" and a "matching" WordOfPossibleValue word is "That". As we stated above, a WordOfQuery must be equal to WordOfPossibleValue OR to be its prefix to "match" it. The comparison here is case-insensitive.

We calculate the WordsSimilarityRank as the following:

Define a constant called IncreasingForUppercases = 1.1

Define a constant called DecreasingFor2ndClassWord = 0.2

WordsSimilarityRank = WordOfQuery.Length / PossibleValue.Length
IF the WordOfQuery contain uppercase letter(s)

AND
the WordOfQuery is equal to WordOfPossibleValue OR is its prefix (in case-sensitive manner)
THEN
WordsSimilarityRank = WordsSimilarityRank * IncreasingForUppercases
IF the WordOfPossibleValue belongs to the "2nd class" (see above about the "word classes")

THEN
WordsSimilarityRank = WordsSimilarityRank * DecreasingFor2ndClassWord

NOTE:

The "WordsSimilarityRank = WordOfQuery.Length / PossibleValue.Length" formula implements what we called above as the "words length factor".

The IncreasingForUppercases here implements what we called above as the "capital letters factor".

The DecreasingFor2ndClassWord here implements what we called above as the "word class factor".

How the WordPositionFactor is calculated

On the input here we have an integer value PositionInPossibleValue. This is a (zero-based) word-wise position of a Query word in a "matched" PossibleValue. For example, we have a Query "th" and a PossibleValue "Color of the night". The word "th" is found in the 2^nd (using zero-based numbering) word of the PossibleValue; so the PositionInPossibleValue is 2 for the given occurrence of the Query.

We calculate the WordsSimilarityRank as the following:

Define a constant called WordPositionFactorBonusFor1stWord = 2.0

Define a constant called WordPositionFactorAddendForCalculation = 10.0

Define a constant called WordPositionFactorMinValue = 0.3;

WordPositionFactor = WordPositionFactorAddendForCalculation / (WordPositionFactorAddendForCalculation + positionInPossibleValue)
IF positionInPossibleValue is 0

THEN
WordPositionFactor = WordPositionFactor * WordPositionFactorBonusFor1stWord
IF WordPositionFactor < WordPositionFactorMinValue

THEN
WordPositionFactor = WordPositionFactorMinValue

NOTE: The WordPositionFactorBonusFor1stWord plays a role of a "bonus" when the Query word is found in the very begin of a PossibleValue. Of course, a WordPositionFactorBonusFor1stWord value must be more than 1.0.

The value WordPositionFactorBonusFor1stWord = 2.0 used in the given implementation is not a "magic number" that cannot be changed. You can, for example, increase it if you want the "bonus for a position in the very begin of the PossibleValue" be more essential.

The WordPositionFactorMinValue must be a positive number < 1.0

The implementation

The Demo attached is a simple implementation of the algorithm on C#. The class AutocompletionEngine.TextSearcher.Processor comprises all the logic of the algorithm. Create an object of the class, populate it with the PossibleValues (using the AddPossibleValue method) and then call the method Search with a Query as its argument. The method returns the "resulting list" of the algorithm – a list of "matched" PossibleValues ordered by decreasing of their SimilarityRank.

The static list AutocompletionEngine.TextSearcher.Word.ArticlesEtc contains the hardcoded list of the "second class" words. If you want the algorithm to work correctly with non-English queries, populate the list with "second-class words" of your language.

The project UnitTest contains a set of NUnit-based automated tests of the AutocompletionEngine.TextSearcher.Processor class. This is why the project requires nunit.framework.dll library.