Click here to Skip to main content
15,891,607 members
Articles / Programming Languages / C++
Article

Templated Burrows-Wheeler transformation

Rate me:
Please Sign up or sign in to vote.
2.40/5 (5 votes)
28 May 20052 min read 42.8K   607   9   6
An article on how to use the templated class for Burrows-Wheeler transformation.

Introduction

This piece of code is something that I wrote while finishing my master thesis at Aalborg University. We were, among other things, researching on methods to compress inverted indexes in order to keep the disk space requirements low. The attached code demonstrates a way of implementing the Burrows-Wheeler transformation in a templated C++ environment. I so far only tested it with G++ on Sun Solaris -- however the code is platform independent and should be compilable anywhere.

Background

While researching on methods to rearrange data in order to achieve higher compression ratios, I stumbled across a multitude of code snippets on many different codecs. However I never found any code for the Burrows-Wheeler transformation (also known as BWT).

But what is BWT?

BWT is an advanced technique for rearranging data in order to achieve higher compression rates by the following codecs. In general, BWT takes a sequence of symbols as input and tries to order the data by rearranging the positions without tampering the actual values. As a standalone codec, the BWT does nothing but add a slight overhead to the data, however when developing customized compression scehems, several codecs may be set in sequence in order to achieve higher compression ratios. A simple codec scenario would be first to execute the BWT and then execute a Run-Length-Encoding (RLE). A sample application of the BWT followed by RLE follows:

Given the following input to the BWT:

The quick brown fox jumped over the silvermoon. The quick brown fox jumped 
over the silvermoon. The quick brown fox jumped over the silvermoon. The 
quick brown fox jumped over the silvermoon.

The BWT will output the following code:

...kkkknnnnxxxxddddeeeeeeeerrrr.nnnn|       iiiieeeehhhhhhhhppppvvvvvvvv    
TTTTttttuuuussss    cccciiiirrrruuuuwwwwoooooooommmm    rrrrffffmmmm    eeee
eeeebbbb        qqqqjjjjoooolllloooooooo|

Where "|" is the BWT-delimiter symbol.

This code is highly compressible compared to the original text. Below is shown a simple RLE encoding on the above BWT code:

3.#4k#4n#4x#4d#8e#4r#.#4n#|#7 #4i#4e#8h#4p#8v#4T#4t#4u#4s#4 #4c#4i#4r#4u
#4w#8o#4m#4 #8e#4b#4q#4j#3o#4l#8o#|#

Where "#" is the RLE-delimiter symbol.

Applying other more efficient codecs than the RLE will result in better compression ratios.

Using the code

Since the code is templated you can put more or less any kind of data into the collection and perform the BWT. However please make sure that you allocate one symbol as delimiter to be used in the BWT and inverse-BWT.

The first example shows how to transform a short string into easily compress-able BWT form:

// instantiate the BWT and allocate the value 127 as delimiter
cBWTransform<char> transform(127);

// push the string "BANANA" of length 6 onto the BWT-list
transform.push_back("BANANA", 6);

// perform the BWT transformation
cBWTransformationCode<char> code = transform.Transform();

// prepare for the inverse-BWT transformation
// and allocate the value 127 delimiter
// the delimiter value must be
// the same for the inverse-BWT and the BWT
cBWTransformation<char> inverse_transform(127);

// execute the inverse-BWT on the BWT-code from above
inversetransform.InverseTransform(code);

// output the value to std out
for (cBWTransform< char>::iterator i=inversetransform.begin();
                                   i!=inversetransform.end(); i++) {
    cout << *i << " ";
}
cout << endl;

Points of Interest

The above code is written using STL and relies heavily as such on that. Through our master thesis project, we have discovered that being under time pressure makes the STL very useful, however for special purposes one might want to write hand-optimized code instead :-)

For more information on the BWT transformation, please refer to these sources:

History

  • 28/05/05 - Initial version.
  • 28/05/05 - Added (hopefully) explanatory samples.
  • 29/05/05 - Updates ZIP with VS.NET 2003 project.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Web Developer
Denmark Denmark
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionWhat version of VC are you using? Pin
WREY28-May-05 20:17
WREY28-May-05 20:17 
AnswerRe: What version of VC are you using? Pin
Rasmus Kaae28-May-05 21:54
Rasmus Kaae28-May-05 21:54 
GeneralIt would be interesting Pin
Jörgen Sigvardsson28-May-05 6:28
Jörgen Sigvardsson28-May-05 6:28 
GeneralRe: It would be interesting Pin
diilbert28-May-05 6:58
diilbert28-May-05 6:58 
GeneralRe: It would be interesting Pin
Rasmus Kaae28-May-05 8:21
Rasmus Kaae28-May-05 8:21 
GeneralRe: It would be interesting Pin
diilbert28-May-05 11:15
diilbert28-May-05 11:15 
Thanks Smile | :)

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.