Click here to Skip to main content
15,893,564 members
Articles / General Programming / Performance
Tip/Trick

Using a Variation on RLE to Achieve Lossless Compression for Near-consecutive or Grouped Tabular Data

Rate me:
Please Sign up or sign in to vote.
5.00/5 (3 votes)
9 Jun 2023CPOL3 min read 7.6K   2   2
Introducing a lossless compression mechanism for data structured in a table or matrix
In scenarios where there is a requirement for storing masses of loosely consecutive or grouped data, a variation of run-length-encoding that repeats content until overridden can be implemented to achieve a form of lossless compression.

Introduction

I’m working on a project called Track that among other things, requires me to store data about lots of trains in a tabular format. In the UK, most trains (or rolling stock) are ordered by operating companies in batches, leading to groups of almost identical stock. Each unit is numbered in a near consecutive way, and most of the time, this is the only way to distinguish between stock from the same group.

Therefore, when it comes to storing this as a table, there is quite a lot of unnecessary repetition, which can be efficiently removed using a variation on run-length-encoding.

Examples

It’s probably best if I try and explain with some examples. Let’s say we need to store the following data about five trains:

Number Operator Carriages Built
453101 AWR 8 2010
453102 AWR 8 2011
453103 AWR 5 2011
453104 LWR 5 2011
453105 LWR 5 2011

The first thing you might notice is that a lot of the data is grouped, and the only thing that changes every time is the number. This is a perfect candidate for variable-length run-length-encoding implementation.

Number Operator Carriages Built
453101 AWR 8 2010
+ / / +
+ / 5 /
+ LWR / /
+ / / /

As you can see, the first record is stored in full, this sets the defaults for each of the following cells. Any occurrence where data is not filled (represented with a slash /) will be replaced with the previous value, which at the start will be taken from the first record.

Occurrences where data is incremented from the value above are represented with a plus +, this can be seen in the number column.

How Does this Differ from RLE?

RLE is a very simple algorithm that is used to compress data by storing the number of times a value is repeated. For example, the following data:

Apple Apple Apple Apple Banana Banana Apple Apple Apple

can be compressed to:

4 Apple 2 Banana 3 Apple

The difference between this and my implementation is that my implementation only stores a value once until it is overidden, and thus only works with data represented vertically. ‘Rotating’ the data to the vertical and compressing it with my implementation would result in:

Apple
/
/
/
Banana
/
Apple
/
/

Which as you can see, is more complex than the original data! It’s only really when data is stored in a tabular format with multiple columns of data that my algorithm can be used to gain a substantial effect.

Performance

I’ve implemented this technique on a larger dataset of 87 trains with size of 10,430 bytes and the compressed result is only 3,123 bytes, which gives a reduction of 70.3%.

Caveats

This technique is not suitable for all data, and there are a few caveats to be aware of:

  • The data must be grouped in some way, with lots of repetition or consecutive data that can be removed and represented through slashes / or pluses +.
  • The data must be able to be stored in a tabular format (as proven above).
  • Processing the data is slower, as it must be decompressed before it can be used.

This is a very simple technique that can be used to achieve a form of lossless compression for data that is stored in a tabular format. It’s not suitable for all data, but it can be very useful in certain scenarios as shown in the examples.

History

  • 9th June, 2023: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Student
United Kingdom United Kingdom
I like programming and maths

Comments and Discussions

 
QuestionYour example is wrong Pin
Member 1602804012-Jun-23 20:50
Member 1602804012-Jun-23 20:50 
AnswerRe: Your example is wrong Pin
Jack Devey13-Jun-23 1:24
Jack Devey13-Jun-23 1:24 
Agreed, the example is a little misleading using full words rather than characters, but I opted to do this to maintain clarity. The example still gives a clear picture of how RLE functions, with the assumption that the entities Apple and Banana are atomic and cannot be further broken down into characters.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.