Introduction
Handling large text files can be demanding in terms of disk space and bandwidth if the files are frequently shared over a network, especially when the total amount of data exceeds several terabytes.
LZMA compression offers a workaround: it yields very good compression ratios for text files (in one current case, it reduces the files to about 25% of their original size). At the same time, it decompresses relatively fast and is therefore suitable for on-the-fly decompression, which is necessary if the data needs to be accessed frequently.
To access LZMA-compressed files, I use liblzma (XZ Utils, https://tukaani.org/xz/), which ships with most Linux distributions and is also available for Windows.
In this tip, I present a customized std::streambuf which can be used in conjunction with a std::istream to read data from an LZMA-compressed file on the fly. The code was tested on Linux with GCC 6.4.0 and liblzma 5.2.2.
Using the Code
Minimum Example
As an example, the following snippet opens the LZMA-compressed file "test.dat.xz", passes it to the customized LZMAStreamBuf, and writes the decompressed data line by line to STDOUT. If an error occurs while decompressing, the badbit of the istream is set.
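ifstream ifs("test.dat.xz", ios::in | ios::binary);
LZMAStreamBuf lzmaBuf(&ifs);
istream in(&lzmaBuf);
string sLine;
// getline() returns the stream, which evaluates to false on EOF or error,
// so no spurious empty line is printed after the last successful read.
while(getline(in, sLine))
{
    cout << sLine << '\n';
}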
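Since a decoding error only shows up as the badbit, it can be useful to tell a clean end of file apart from a failure once the loop has finished. The following is a minimal sketch of that check; copyLines and demoCopy are hypothetical helper names, not part of the original code, and an in-memory std::istringstream stands in for the LZMA-backed istream.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: copies `in` to `out` line by line.
// Returns true if the input ended cleanly, false if badbit was set.
bool copyLines(std::istream& in, std::ostream& out)
{
    std::string line;
    while(std::getline(in, line))
        out << line << '\n';
    return !in.bad();
}

// Usage with in-memory streams standing in for the LZMA-backed istream.
std::string demoCopy(const std::string& text, bool& ok)
{
    std::istringstream in(text);
    std::ostringstream out;
    ok = copyLines(in, out);
    return out.str();
}
```

Note that this sketch appends a newline to a final line that did not end with one.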
Customized std::streambuf
The functionality of streams in the standard library is generally extended by creating a customized std::streambuf class. This customized class only provides the bare data, while the std::istream instance takes care of the typical "stream behavior" in C++.
Inheriting from std::streambuf is rather straightforward in this example, since I only provide an associated character sequence (https://en.cppreference.com/w/cpp/io/basic_streambuf) that is read strictly sequentially, i.e., the seek pointer cannot be relocated freely. Think of a simple network socket, for example: the data is gone as soon as it has been read from the source.
To implement this behavior, I override the method underflow, which is called on behalf of std::istream once it needs more data. When underflow has made new data available, it publishes the buffer pointers via the setg(...) method and returns the first available byte in the buffer. If no more data is available, the method returns EOF (traits_type::eof()). All the usual stream functionality is then provided solely by std::istream.
#include <lzma.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <limits>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <streambuf>

class LZMAStreamBuf : public std::streambuf
{
public:
    LZMAStreamBuf(std::istream* pIn)
        : m_pIn(pIn)
        , m_nBufLen(10000)
        , m_lzmaStream(LZMA_STREAM_INIT)
    {
        m_pCompressedBuf.reset(new char[m_nBufLen]);
        m_pDecompressedBuf.reset(new char[m_nBufLen]);

        // Start with an empty get area so that the first read calls underflow().
        setg(&m_pDecompressedBuf[0], &m_pDecompressedBuf[1], &m_pDecompressedBuf[1]);

        // No memory limit; also accept concatenated .xz streams.
        lzma_ret ret = lzma_stream_decoder
            (&m_lzmaStream, std::numeric_limits<uint64_t>::max(), LZMA_CONCATENATED);
        if(ret != LZMA_OK)
            throw std::runtime_error("LZMA decoder could not be opened\n");

        m_lzmaStream.avail_in = 0;
    }

    virtual ~LZMAStreamBuf()
    {
        // Release the memory that liblzma allocated for the decoder.
        lzma_end(&m_lzmaStream);
    }

    virtual int underflow() override final
    {
        lzma_action action = LZMA_RUN;
        lzma_ret ret = LZMA_OK;

        // Data still pending in the get area: return the current byte.
        if(this->gptr() < this->egptr())
            return traits_type::to_int_type(*this->gptr());

        while(true)
        {
            m_lzmaStream.next_out =
                reinterpret_cast<unsigned char*>(m_pDecompressedBuf.get());
            m_lzmaStream.avail_out = m_nBufLen;

            // Refill the compressed buffer once the decoder has consumed it.
            if(m_lzmaStream.avail_in == 0)
            {
                m_pIn->read(&m_pCompressedBuf[0], m_nBufLen);
                if(m_pIn->bad())
                    throw std::runtime_error
                        ("LZMAStreamBuf: Error while reading the provided input stream!");

                m_lzmaStream.next_in =
                    reinterpret_cast<unsigned char*>(m_pCompressedBuf.get());
                m_lzmaStream.avail_in = static_cast<size_t>(m_pIn->gcount());
            }

            // No more input: tell the decoder to finish.
            if(m_pIn->eof())
                action = LZMA_FINISH;

            ret = lzma_code(&m_lzmaStream, action);

            // Decompressed data is available: publish it to the streambuf.
            if(m_lzmaStream.avail_out < m_nBufLen)
            {
                const size_t nDataAvailable = m_nBufLen - m_lzmaStream.avail_out;
                setg(&m_pDecompressedBuf[0], &m_pDecompressedBuf[0],
                     &m_pDecompressedBuf[0] + nDataAvailable);
                return traits_type::to_int_type(m_pDecompressedBuf[0]);
            }

            if(ret != LZMA_OK)
            {
                if(ret == LZMA_STREAM_END)
                {
                    assert(action == LZMA_FINISH);
                    assert(m_pIn->eof());
                    assert(m_lzmaStream.avail_out == m_nBufLen);
                    return traits_type::eof();
                }

                // Decoding error: invalidate the get area and report it.
                setg(nullptr, nullptr, nullptr);
                std::stringstream err;
                err << "Error " << ret << " occurred while decoding LZMA file!";
                throw std::runtime_error(err.str());
            }
        }
    }

private:
    std::istream* m_pIn;
    std::unique_ptr<char[]> m_pCompressedBuf, m_pDecompressedBuf;
    const size_t m_nBufLen;
    lzma_stream m_lzmaStream;
};
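To experiment with the underflow/setg contract without liblzma, the same pattern can be reduced to a buffer that serves a fixed string in small chunks. This is only an illustrative sketch: ChunkedStringBuf, refills and readAllLines are hypothetical names that do not appear in the code above, and the refill step simply copies from a string instead of calling a decoder.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration class: serves a fixed string in chunks
// of at most chunkSize bytes, refilled on demand in underflow().
class ChunkedStringBuf : public std::streambuf
{
public:
    ChunkedStringBuf(std::string source, std::size_t chunkSize)
        : m_source(std::move(source)), m_pos(0), m_buf(chunkSize), m_nRefills(0)
    {
        assert(chunkSize > 0);
        // Empty get area: the first read triggers underflow().
        char* p = m_buf.data();
        setg(p, p, p);
    }

    // Number of times the buffer has been refilled so far.
    std::size_t refills() const { return m_nRefills; }

protected:
    int_type underflow() override
    {
        // Data still pending in the get area: return the current byte.
        if(gptr() < egptr())
            return traits_type::to_int_type(*gptr());

        // Source exhausted: signal end of file.
        if(m_pos >= m_source.size())
            return traits_type::eof();

        // Refill the buffer with the next chunk of the source.
        const std::size_t n = std::min(m_buf.size(), m_source.size() - m_pos);
        m_source.copy(m_buf.data(), n, m_pos);
        m_pos += n;
        ++m_nRefills;

        // Publish the new data: beginning, current read position, end.
        setg(m_buf.data(), m_buf.data(), m_buf.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string m_source;
    std::size_t m_pos;
    std::vector<char> m_buf;
    std::size_t m_nRefills;
};

// Reads all lines through a std::istream layered on top of the buffer.
std::vector<std::string> readAllLines(std::streambuf& buf)
{
    std::istream in(&buf);
    std::vector<std::string> lines;
    std::string line;
    while(std::getline(in, line))
        lines.push_back(line);
    return lines;
}
```

Even though the chunks do not line up with the line breaks, std::getline reassembles the lines correctly, because std::istream only sees a continuous character sequence.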
Points of Interest
- In case of an error, a std::runtime_error is thrown. This exception is generally caught by std::istream, which then sets the badbit of the stream. One should therefore test that condition regularly by calling std::istream::bad(). Alternatively, one can make std::istream forward the exceptions directly to the user (who must then catch them) by calling std::istream::exceptions() (https://en.cppreference.com/w/cpp/io/basic_ios/exceptions) beforehand.
- Through the std::streambuf::setg() method (https://en.cppreference.com/w/cpp/io/basic_streambuf/setg), underflow forwards the buffer to the std::streambuf. The first argument marks the beginning of the buffer, the second one the current read position, and the last one the end of the buffer. In case of an error, the buffer should be set to nullptr.
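The two error-handling modes from the first point can be compared with a buffer that always fails. The sketch below assumes a hypothetical FailingBuf class as a stand-in for a decoder error; readSetsBadbit and readRethrows are likewise illustrative names only.

```cpp
#include <ios>
#include <istream>
#include <stdexcept>
#include <streambuf>
#include <string>

// Hypothetical stand-in for a failing decoder: underflow() always throws.
class FailingBuf : public std::streambuf
{
protected:
    int_type underflow() override
    {
        // Invalidate the get area before reporting the error.
        setg(nullptr, nullptr, nullptr);
        throw std::runtime_error("simulated decoding error");
    }
};

// Default behavior: the exception is swallowed and badbit is set.
bool readSetsBadbit()
{
    FailingBuf buf;
    std::istream in(&buf);
    std::string line;
    std::getline(in, line);
    return in.bad();
}

// With exceptions(badbit) enabled, the original exception reaches the caller.
std::string readRethrows()
{
    FailingBuf buf;
    std::istream in(&buf);
    in.exceptions(std::ios::badbit);
    std::string line;
    try
    {
        std::getline(in, line);
    }
    catch(const std::runtime_error& e)
    {
        return e.what();
    }
    return "";
}
```

Enabling exceptions preserves the error message, whereas polling bad() only reveals that something went wrong.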
History
- 17th June, 2019: Initial version