Introduction

The algorithm and library based upon it were created by your humble servant as a part of a stochastic Markov-chain-like engine generating naturally sounding alphanumeric identifiers used for 'canned' passwords for a bid enterprise web platform. It can as well be used in code transducing, or as a back-tracking fixture for data-mining algorithms working with sequences. I'm also planning to use it to generate melodic voice lines which follow the rules of strict counterpoint for one of my numerous pet projects.

Assume we have a finite alphabet of symbols, say { A, B, C } which is limited but quite sufficient to demonstrate the approach. We also have an incoming stream of characters, appearing one at a time, say

(1) A, A, B, A, C, A, C, C, ...

Now, if our sequence recognizer is set to recognize sequences:

0: [], 1: [A], 2: [B], 3: [C]

Sequence 0 is the initial state when the first symbol is yet to come from the in-stream. Others just correspond to an occurrence of a particular symbol, and we'll be calling them trivial because of their length being < 2. Its recognized sequence indexes (SIs) for the above mentioned symbol in-stream contents are then going to be

0, 1, 1, 2, 1, 3, 1, 3, 3, ...

At this point, this looks more like a basic finite-state transducer. We can observe that in order to establish a sequence recognizer on an alphabet of N characters, we need N+1 trivial states so it can at least recognize each individual symbol from this alphabet. Let's though make things more interesting. We're adding some non-trivial sequences to the above-mentioned recognizer (Note: They are not in any particular order.)

4: [AA], 5: [BA], 6: [AC], 7: [ACC]

Now, for the same in-stream (1), we should recognize SIs as follows:

0, 1, 4, 2, 5, 6, 1, 3, 7, ...

A sequence can be any non-full-length substring for any other sequences, e.g. [AAB], [BAAB], [AABCCC], etc. The only rule for sequences is that they should all be different.

Implementation

My first idea was to use a circular buffer with a length K equal to the length of the longest sequence to be detected. This way, I could keep track of the last K symbols to match them with the predefined sequences. This brute-force approach requires that we match symbols in the buffer with the ones of each sequence in an iterative fashion, and then the longest match wins. The complexity of this approach for each new symbol is O(SUM(sequenceLength)) whereas in theory the ideal completixy of one step should be O(1), since after all, we add but one symbol on each step. This can be achieved by implementing an algorithm to build a state graph for a finite-state machine (FSM) capable of 'accepting' symbols one by one. Each new symbol moves the FSM into a new state (possibly back into the same one). Moving into a new state is just a travel along the corresponding out-edge of the current-state node which is a pure O(1) kind of stuff. But, in order to be able to do so, we need something to establish the Mealy state machine statement.

NEXT_STATE = SOME_FUNCTION(NEXT_SYMBOL, CURRENT_STATE)

This can be done by building a state graph for such an FSM. The state graph will have one initial state, as mentioned above, but no final state, since the our recognizer operates in an on-line fashion. For this example, we'll denote the symbols with their zero-based indexes with apostrophes. Now, let's assume we have a 4-symbol alphabet, and the sequences are as follows:

0: [], 1: [0'], 2: [1'], 3: [2'], 4: [3'], 5: [1'3'], 6: [2'2'2'], 7: [2'2'1'1'], 8: [2'2'2'2'], 9: [2'2'2'3']

Building of the state graph can go in stages as follows.

1. Add Trivial Sequences to the Tree

Note: For all the subsequent pictures, edges are labeled with corresponding character indexes; nodes are labeled with <SI>s<MachineStateIndex>n; machine state index serves as a unique identifier for the node and is assigned in a sequential fashion when the node gets created. Initially, we add nodes for all trivial states, now the resulting tree looks like this:

As we can see, the tree now contains all trivial states which are reachable from the initial state. Nice, but not nearly enough to make our humble FSM operational. This stage can be safely merged with the next one, and is separated out just for the sake of conceptual clarity.

2. Add Non-Trivial Sequences to the Tree

Now, we will iterate through all non-trivial sequences making a prefix tree, or trie out of our trivial tree. For each sequence, we scan through its symbols, starting from the initial node, we add missing edges and nodes to the trie. All atomic characters should be covered by the previous stage though. At this point, some non-leaf nodes will not have associated SIs and are marked with underscores. This is because those nodes do not represent any full-fledged sequence being just intermediary states; this SI absence will be addressed on the next stage. If the last node for a particular sequence already has its SI assigned, then we've came across a non-unique item in the predefined sequence set, and should error out.

3. Write in Missing SIs

At this stage, we need to write in the missing SIs for all nodes. There is a subroutine

public static Node FindNode(this StateGraph graph, IList<int> sequenceReversed)

that plays a key role on the current and next stages. It sifts a given sequence, potentially incomplete, through the trie, and finds the sequence-indexed node corresponding to it, thus finding the longest complete sequence prefixing the current one. Let's give an example of its operation for nodes 6 and 8. For node 6, sequence is 2'2', so the function initially attempts to find the node with an SI for it. It fails, so it chops the first symbol off of the sequence thus getting 2', and then it tries it and finds node 3 with SI 3, so it writes SI 3 into node 6. On the picture below, search iterations are ~~horribly~~ drawn in different colors with the corresponding steps in black.

For node 8, it follows the same approach by chopping 2'2'1 to 2'1 and then to just 1', thus writing SI 2 into node 8.

Now, each node in our trie has its SI (note: SIs are not unique). The only problem is - not all transition edges are in place. Each node is supposed to all set with them for each symbol in the alphabet. Those edges will be connecting it to other non-initial nodes. Therefore, we need one more stage.

4. Add Missing Transitions

We will be using the same FindNode function to fill in the gaps in the transition edges. The process looks rather similar to the previous stage. We iterate though the edges of a node, and if the transition is missing, we add the corresponding symbol to the node sequence (do not confuse with node SI) and then search for the corresponding node which can be itself, adding a transition to it into place. For example, for node 10 transition 2' will be pointing to itself as well as several others. Here is the final graph.

It can also be represented as a matrix of size N cols x M rows where N is the number of symbols in the alphabet, M is the number of nodes/states.

     0'    1'     2'     3'
0    1     2      3      4    
1    1     2      3      4    
2    1     2      3      5    
3    1     2      6      4    
4    1     2      3      4    
5    1     2      3      4    
6    1     8      7      4    
7    1     8     10     11    
8    1     9      3      5
9    1     2      3      5    
10   1     8     10     11    
11   1     2      3      4

Now, the trie has become a full-fledged transition graph which can be consumed by a state machine. Now, let's take a quick glance at the code associated with all the concepts we've gone through.

Using the Code

StateGraph along with its nested class StateGraph.Node and extension methods represent all logic associated with the creation of state graph. To facilitate debugging and unit testing, the above-described state graph creation stages are implemented as separate functions so they can be called as such. The main 'production' constructor unites them.

public StateGraph(int alphabetSize, int[][] sequences)
    : this(alphabetSize)
{
    this.BuildTrie(sequences); // stages 1, 2
    this.WriteInMissingSequenceIndices(); // stage 3
    this.WriteInMissingTransitions(); // stage 4
}

After we are done with the graph, StateMachine becomes the main actor. This class is so straightforward as to list its entire code here.

public class StateMachine
{
    public StateMachine(StateGraph stateGraph)
    {
        Graph = stateGraph
            ?? throw new ArgumentNullException(nameof(stateGraph));

        _currentNode = stateGraph.Root;

        Symbol = StateGraph.DefaultSymbol;
    }

    public readonly StateGraph Graph;
    private StateGraph.Node _currentNode;

    public int Symbol { get; private set; }

    public int State => _currentNode.NodeIndex;

    public void AcceptSymbol(int symbol)
    {
        try
        {
            _currentNode = _currentNode.Transitions[symbol];
        }
        catch (IndexOutOfRangeException e)
        {
            throw new ArgumentException("symbol is out of state range", e);
        }

        Symbol = symbol;
    }

    public int AcceptSymbolAndGetSequenceIndex(int state)
    {
        AcceptSymbol(state);

        return _currentNode.SequenceIndex;
    }

    public void Reset()
    {
        _currentNode = Graph.Root;
        Symbol = StateGraph.DefaultSymbol;
    }

    public int Sequence => _currentNode.SequenceIndex;
}

The code is not explicitly thread safe, but the state graph is treated as a read-only data structure from the moment when its gets fully built and on, so any number of state machines can use it simultaneously from contexts of many threads/tasks. This is how I use it in a real web application running in a Kestrel-based web server farm. State machines get instantiated on a per-request basis, whereas the state graph (in fact, there are several of them in the app) remains shared as a singleton. As long as each state machine gets called on the context of its own thread/task, it is just fine. Otherwise, synchronization has to be taken care of by the calling code. In conclusion, here is a complete code of a very simple application I used to profile the performance of the library.

class Program
{
    static void Main()
    {
        const int alphabetSize = 4;

        var sequences = new[]
        {
            new[] { 1, 3 },
            new[] { 2, 2, 2 },
            new[] { 2, 2, 1, 1 },
            new[] { 2, 2, 2, 2 },
            new[] { 2, 2, 2, 3 }
        };

        for (int i = 0; i < 100000; ++i)
        {
            var g = new StateGraph(alphabetSize, sequences);

            var sm = new StateMachine(g);

            for (int k = 0; k < alphabetSize; ++k)
            {
                for (int j = 0; j < 1000 * k; ++j)
                {
                    sm.AcceptSymbol(k);

                    var sequence = sm.Sequence;
                }
            }
        }
    }
}

Dear readers, please feel free to share any ideas of usage and/or improvement of the library. And yes, the full source code of the library along with all due unit tests is here:

https://github.com/DNemtsov/SequenceRecognizer