
Using Ternary DAGs for Spelling Correction

10 Jan 2009 · 9 min read
A new data structure for spellchecking and spelling correction

Introduction

The basic algorithm for spell checking is simple: for each word in the text, check if it's in your dictionary. If it is not, report a misspelling. The dictionary is usually implemented with a hash table. Handling affixes and capital letters adds some complexity, but the program is still straightforward.

Suggesting the correct word is harder. The usual approach is to check all variations of the word obtained by inserting a letter, replacing a letter with another, skipping a letter, or transposing adjacent letters. For the misspelled word "teh", the variations are:

   replacing:             inserting:             skipping:   transposing:
aeh   tah   tea      ateh   taeh   teah   teha      eh           eth
beh   tbh   teb      bteh   tbeh   tebh   tehb      th           the
ceh   tch   tec      cteh   tceh   tech   tehc      te
...                  ...
zeh   tzh   tez      zteh   tzeh   tezh   tehz

The program looks up each variation in the dictionary and reports the ones that are found (in this case, tea, tech, eh, and the). The most probable spelling suggestions are then shown to the user. This approach is used in Ispell and the spelling corrector in Python.
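A sketch of this brute-force variant generation follows; the is_in_dictionary lookup is a hypothetical stand-in for the hash-table probe.

C
#include <stdio.h>
#include <string.h>

/* Hypothetical dictionary lookup; in Ispell-style checkers this is a hash table probe */
extern int is_in_dictionary(const char *word);

/* Print every dictionary word within one edit of the misspelled word
   (assumes the word fits in the buffer) */
static void suggest_brute_force(const char *word) {
    size_t len = strlen(word);
    char buf[64];
    size_t i;
    char c;

    /* Replacing a letter */
    for (i = 0; i < len; i++) {
        strcpy(buf, word);
        for (c = 'a'; c <= 'z'; c++) {
            buf[i] = c;
            if (c != word[i] && is_in_dictionary(buf))
                printf("%s\n", buf);
        }
    }
    /* Inserting a letter */
    for (i = 0; i <= len; i++) {
        memcpy(buf, word, i);
        memcpy(buf + i + 1, word + i, len - i + 1);
        for (c = 'a'; c <= 'z'; c++) {
            buf[i] = c;
            if (is_in_dictionary(buf))
                printf("%s\n", buf);
        }
    }
    /* Skipping a letter */
    for (i = 0; i < len; i++) {
        memcpy(buf, word, i);
        memcpy(buf + i, word + i + 1, len - i);
        if (is_in_dictionary(buf))
            printf("%s\n", buf);
    }
    /* Transposing adjacent letters */
    for (i = 0; i + 1 < len; i++) {
        strcpy(buf, word);
        buf[i] = word[i + 1];
        buf[i + 1] = word[i];
        if (is_in_dictionary(buf))
            printf("%s\n", buf);
    }
}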

Aspell implements a more complicated scheme using pre-compiled information about similarly sounding words. Other ideas for spelling correction are mentioned in the survey by Roger Mitton.

This article outlines a new solution to the problem based on ternary DAGs.

Ternary Search Tries

In his book "Algorithms in C", Robert Sedgewick describes a data structure that is suitable for finding similar words. It's called ternary search trie (TST).

Each node of the tree contains three links: the middle one leads to the tree for the next character; the left and right links lead to nodes for characters that are smaller and larger than the current one. Each level of the TST is essentially a BST (binary search tree) for finding the character in the current position. When the character is found, the program follows its middle link, which leads to the tree of all words starting with the characters scanned so far.
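Such a node can be declared as follows (a minimal sketch; the type and field names are illustrative, not necessarily those used in the attached sources):

C
// One node of a ternary search trie
typedef struct tst_node {
    char ch;                    /* character stored in this node; '\0' marks the end of a word */
    struct tst_node *left;      /* characters smaller than ch at this position */
    struct tst_node *right;     /* characters larger than ch at this position */
    struct tst_node *middle;    /* words that continue with the characters scanned so far */
} TST_NODE;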

[Figure: Ternary search trie]

The figure shows a TST for the words red, read, reading, and reads. The null links are not shown. Because one word may be a prefix of another (e.g., read and reading), you need a special node to denote the end of a word. A node with a null character conveniently fits this purpose (shown as an empty circle in the figure). The word characters are stored in the nodes, so there is no need to keep the word strings in memory, as a hash table does.

You can find more explanations of TSTs in the book by Sedgewick and in his Dr. Dobb's article (co-authored with Jon Bentley).

When the dictionary is stored as a TST, you can traverse the adjacent nodes and quickly find spelling suggestions for a given word. Only the words that are in the TST are checked. For example, when looking for suggestions for "readimg", you don't need to check all of readiAg, readiBg, readiCg, etc. Only the beginnings of existing words (readiLy, readiEr, readiNess, readiNg, etc.) are checked.
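For illustration, a wrong-letter search in this style might look like the sketch below. It assumes the same ST_FILE_NODE layout and report_suggestion callback as the attached spellchecker.c, but it is not the attached implementation.

C
// Find suggestions that differ from the typed word by exactly one wrong letter.
// Only characters that actually occur in the dictionary at the current position
// are tried, because the search follows existing links instead of generating a..z.
static void match_wrong_letter(ST_FILE_NODE* node, const char* word,
                               char* result, int errors_left) {
    if(node == NULL)
        return;

    // The substituted character may be stored anywhere in this level's BST,
    // so visit both subtrees; branches that don't exist are never generated
    match_wrong_letter(node->left, word, result, errors_left);
    match_wrong_letter(node->right, word, result, errors_left);

    if(node->ch == '\0') {
        // End-of-word node: report only if the whole word was consumed
        // and exactly one letter was replaced
        if(*word == '\0' && errors_left == 0) {
            *result = '\0';
            report_suggestion();
        }
        return;
    }
    if(*word == '\0')
        return;   // the dictionary word is longer than the typed word

    if(node->ch == *word) {
        // Exact match of this character
        *result = node->ch;
        match_wrong_letter(node->middle, word + 1, result + 1, errors_left);
    } else if(errors_left > 0) {
        // Use the dictionary character instead of the typed one
        *result = node->ch;
        match_wrong_letter(node->middle, word + 1, result + 1, errors_left - 1);
    }
}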

Stemming

It would be interesting to try a TST for spelling correction, but a plain TST is just too large: a medium-sized English dictionary (100,000 words) requires 320,000 nodes, which is around 5 MB.

Many nodes are taken by affixes: -s, -ed, -ing, un-, etc. The spellcheckers based on hash tables usually strip affixes before searching for a word, so that the hash table contains only root words.

Affix stripping is relatively easy in English because the language makes limited use of inflection. Other languages (e.g., German, Russian, and Finnish) have more cases and declensions: see synthetic language vs. analytic language and stemming algorithms.

Designing a stemming engine that can be customized for any language is a challenging task. Ispell follows this path; you can find the inflection rules of different languages in the .aff files of Ispell dictionaries.

Language                  Number of rules
Estonian                  9249
Finnish (medium-sized)    8861
Polish                    5454
Spanish                   3274
Czech                     2544
Russian                   1077
Danish                    380
German                    361
Brazilian Portuguese      271
Italian                   210
Dutch                     116
English                   45

More rules do not necessarily mean a more complex morphology because some dictionary authors keep inflected variants in a word list, while others add more affixes for them.

Ternary DAGs

With a TST, we can take another approach. Note that a TST "compresses" equal prefixes; e.g., in the figure above, read, reading, and reads use the same nodes for the first four characters. What if we try to compress the suffixes, too? The resulting data structure has only 55,000 nodes instead of 320,000. Let's call it a ternary DAG (a ternary directed acyclic graph).

Miyamoto et al. used this data structure for pattern matching, but it's not limited to storing the suffixes of a word. An arbitrary set of words can be stored in the DAG:

[Figure: Ternary DAG]

The DAG contains the words red, read, reading, reads, and listen, listening, listens. The middle links of N in listen and D in read lead to the same subgraph containing the suffixes -ing, -s, and a zero-length suffix.

Building a ternary DAG

The DAG can be built from a TST. Let's make a hash table of the TST subtrees (a node and all of its children are hashed together). When identical subtrees are found, replace all links to them with links to a single shared subtree.
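A minimal sketch of this merging step, using the TST_NODE type sketched earlier and a hypothetical hash table API (HASH_TABLE, hash_table_find, hash_table_insert):

C
// Recursively replace every subtree with an identical subtree seen earlier.
// 'table' is a hypothetical hash table keyed on (ch, left, right, middle);
// because the children are canonicalized first, two subtrees are identical
// exactly when those four fields are equal.
static TST_NODE* merge_subtrees(TST_NODE* node, HASH_TABLE* table) {
    TST_NODE* existing;

    if(node == NULL)
        return NULL;

    // Canonicalize the children first, so equal subtrees end up
    // pointing to physically identical child nodes
    node->left   = merge_subtrees(node->left, table);
    node->right  = merge_subtrees(node->right, table);
    node->middle = merge_subtrees(node->middle, table);

    existing = hash_table_find(table, node);   // hypothetical lookup by node content
    if(existing != NULL)
        return existing;                       // reuse the shared subtree

    hash_table_insert(table, node);            // this node becomes the canonical copy
    return node;
}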

By using this procedure, you can automatically find common endings. Affix stripping is not necessary, which has multiple advantages:

  • Spelling suggestion search in the TST is simpler when all inflected forms are stored in the dictionary.
  • The program can find more common endings than a human who makes an affix list. The affix list for English usually contains -ed, -ing, and -s. The program will also find -ment in agreement and increment, -logist in biologist and psychologist, etc. It can find "suffixes" that do not make sense from a linguistic viewpoint, e.g., -ead in head and read, but it's not a problem for spellchecking.

The program can find not only single common endings, but also shared subtrees containing multiple endings. In the figure above, there is a subtree containing -ing, -s, and a zero-length suffix.

A shared subtree may itself be part of another shared subtree. Given the words lock, locking, head, heads, headed, heading, and news, the program can compress the TST into the following DAG (the roots are shown as a single node to save space):

[Figure: A subtree may be part of another subtree]

As mentioned above, each level of the TST is a BST. A word list is usually sorted alphabetically, so the BSTs degenerate into chains. A standard balancing algorithm fixes this.
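One simple option (a sketch; create_dictionary.c may do this differently) is to collect the nodes of one level into an array sorted by character and rebuild the BST by picking the median recursively:

C
// Build a balanced BST for one level of the trie from an array of nodes
// that is already sorted by character (nodes[0..count-1]).
// The middle links of the nodes are left untouched.
static TST_NODE* balance_level(TST_NODE** nodes, int count) {
    int mid;
    TST_NODE* root;

    if(count == 0)
        return NULL;

    mid = count / 2;
    root = nodes[mid];                          // the median becomes the root
    root->left  = balance_level(nodes, mid);    // smaller characters
    root->right = balance_level(nodes + mid + 1, count - mid - 1);  // larger characters
    return root;
}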

The attached program (create_dictionary.c) implements these techniques.

Searching

Searching in a ternary DAG is no different from searching in a TST. You can use the recursive algorithm by Sedgewick and Bentley for exact matching.
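For reference, that search can be sketched over the same ST_FILE_NODE layout used below (a simple iterative version, not necessarily the attached implementation):

C
// Returns non-zero if the word is stored in the trie/DAG rooted at 'node'
static int find_exact(ST_FILE_NODE* node, const char* word) {
    while(node != NULL) {
        if(*word < node->ch)
            node = node->left;
        else if(*word > node->ch)
            node = node->right;
        else {
            if(*word == '\0')
                return 1;       // reached the end-of-word node
            node = node->middle;
            word++;
        }
    }
    return 0;                   // fell off the tree: the word is not in the dictionary
}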

A fuzzy matching algorithm for the DAG examines adjacent nodes to find spelling suggestions. Common error patterns (inserting a letter, deleting a letter, etc.) can be expressed elegantly using recursion:

C
// Correct double letters (including multiple errors in one word, e.g. Misisipi)
static void match_doubles(ST_FILE_NODE* node, const char* word, char* result) {
    while(node != NULL) {
        if(*word < node->ch)
            node = node->left;
        else if(*word > node->ch)
            node = node->right;
        else {
            *result = *word;
            if(*result == '\0') {
                // Found the word
                report_suggestion();
                return;
            }
            if(*(word + 1) == *word) {
                // When found two equal letters, try to find a single letter
                match_doubles(node->middle, word + 2, result + 1);
            } else {
                // Otherwise, try to find two equal letters here
                match_doubles(node->middle, word, result + 1);
            }
            // Also find the exact match
            node = node->middle;
            word++;
            result++;
        }
    }
}

The attached spelling corrector (spellchecker.c) handles a single missing letter, an extra letter, a wrong letter, transposed letters, and a missing space.

Performance Comparison

The attached benchmark contains a hash table spelling corrector (benchmark.c), which is similar to the one found in Ispell. It generates all possible variants of the misspelled word, and tries to find them in the hash table.

The tests were run on a Pentium M processor; execution time was measured in ticks using the RDTSC instruction. The dictionary was the medium-sized word list from SCOWL (100,000 English words).

Algorithm      Spell checking time, ticks    Spelling correction time, ticks    Size, MB
               MSVC      GCC                 MSVC        GCC
hash table     900       1400                125'700     167'600                2.2
ternary DAG    1700      1700                81'400      111'200                0.4

Results:

  • The hash table implementation is faster for spellchecking. A balanced tree is usually slower than a hash table for searching because of more memory accesses, so this was expected.
  • The ternary DAG implementation is faster for spelling correction.
  • The ternary DAG takes much less memory than the hash table.

No affix stripping scheme is used in the benchmark (all inflected forms are stored in the hash table). In a real-world spell checker, the hash table would contain only the root words, so it would be smaller, but also slower; the ternary DAG therefore compares even more favorably.

Instead of keeping the words in the hash table, you can store one bit per slot that says whether a word hashing there is correct, as described in the Programming Pearls book and the Bloom filters blog post. In this case, the hash table can be even smaller than the ternary DAG; however, an incorrect word will be reported as correct if its hash value collides with that of a correct word.
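Such a bit-vector check might look like this (a sketch with a single hypothetical hash function; Bloom filters normally combine several to reduce collisions):

C
#define TABLE_BITS (1u << 22)               /* 4 Mbit = 512 KB bit vector */

static unsigned char bit_table[TABLE_BITS / 8];

/* Hypothetical string hash; any good hash function will do */
extern unsigned hash_word(const char *word);

/* Mark a correct word while building the table */
static void mark_correct(const char *word) {
    unsigned h = hash_word(word) % TABLE_BITS;
    bit_table[h / 8] |= (unsigned char)(1u << (h % 8));
}

/* May return 1 for a misspelled word whose hash collides with a correct one */
static int seems_correct(const char *word) {
    unsigned h = hash_word(word) % TABLE_BITS;
    return (bit_table[h / 8] >> (h % 8)) & 1;
}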

Using the Code

Run the create_dictionary program to generate the dictionary. Include spellchecker.c and spellchecker.h in your project. The following functions are declared in the header file:

C
// Open a dictionary, returns NULL in case of error
ST_FILE_NODE * ST_open_dict (const char * filename);

// Returns a non-zero value if the word is present in the dictionary
int ST_find_exact (ST_FILE_NODE * dict, const char * word);

// Enumerate spelling suggestions (fuzzy matches)
void ST_enum_suggestions (ST_FILE_NODE * dict, const char * word,
                          void (* enumproc)(const char*, void*), void * param);

// Close the dictionary
void ST_close_dict (ST_FILE_NODE * dict);

First, you open the dictionary, and then call ST_find_exact for each word of the text. Splitting the text into words is your program's responsibility. If the user wants to see the spelling suggestions, call ST_enum_suggestions to find them. Finally, close the dictionary.
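A minimal usage sketch (the dictionary file name is illustrative; error handling is shortened):

C
#include <stdio.h>
#include "spellchecker.h"

/* Callback invoked once for each suggestion */
static void print_suggestion(const char* suggestion, void* param) {
    (void)param;
    printf("  did you mean: %s\n", suggestion);
}

int main(void) {
    ST_FILE_NODE* dict = ST_open_dict("english.dag");   /* file name is illustrative */
    const char* word = "teh";

    if(dict == NULL)
        return 1;

    if(!ST_find_exact(dict, word)) {
        printf("%s is misspelled\n", word);
        ST_enum_suggestions(dict, word, print_suggestion, NULL);
    }

    ST_close_dict(dict);
    return 0;
}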

The spellchecker_test.c demo program checks the spelling of the text in the console window.

Future Work and Ideas

  • Improving the suggestion engine. The quality of spelling suggestions is far from that of Microsoft Word or Aspell. The program suggests only words that are at an edit distance of 1 from the misspelled word; a better spelling corrector would also check edit distance 2. The suggestions should be sorted from the most probable to the least probable. Sounds-like data and a list of common errors could also be included in the corrector.
  • Inflection rules support. When using the create_dictionary program, you should provide a word list with all the inflected forms (e.g., address, addresses, address's, addressing, addressed). It's not a problem for English, for which the number of forms is relatively small. For synthetic languages, it may be handier to use a word list with roots only and a separate list of inflection rules, like Ispell does.
  • Adding words at runtime. The current program compiles the dictionary into ternary DAG format, and then reads from it. Online construction of the ternary DAG is possible, but requires keeping more information in memory (a hash table for TST, and auxiliary counters in the nodes). In this case, a more complex algorithm for adding words should be implemented to use the existing suffixes in DAG and to keep the BSTs balanced.

Conclusion

A ternary DAG is a compact representation of a spell-checking dictionary; it is faster than a hash table for spelling correction, but not for spellchecking. With a ternary DAG, you can avoid affix stripping, which complicates a spell checker.

License

This article, along with any associated source code and files, is licensed under The zlib/libpng License.


