Alchemy: Message Serialization

Paul M Watt

Dec 21, 2014

CPOL

This is an entry in the continuing series of blog entries that document the design and implementation process of a library. This library is called Network Alchemy. Alchemy performs data serialization and is written in C++. It is an Open Source project and can be found on GitHub.

If you have read the previous Alchemy entries, you know that I have now shown the structure of the Message host. I have also demonstrated how the different fields are programmatically processed to convert the byte-order of the message. In the previous Alchemy post, I put together the internal memory management object. All of the pieces are in place to demonstrate the final component of the core of Alchemy: serialization.

Serialization

Serialization is a mundane and error-prone task. Generally, both a read and a write operation are required to provide any value. Serialization can occur on just about any medium, including files, sockets, pipes, and consoles, to name a few. The primary purpose of a serialization task is to convert a locally represented object into a data stream. The data stream can then be stored or transferred to a remote location. Later, the stream is read back in and converted to an implementation-defined object.

It is possible to simply pass the object exactly as you created it, but only in special situations. You must be working on the same machine as the second process. Your system will require the proper security and resource configuration between processes, such as a shared memory buffer. Even then, there are issues with how memory is allocated. Are the two programs developed with the same compiler? A lot of flexibility is lost when raw pointers to objects are shared between processes. In most cases, I would recommend against doing that.

Serialization Types

There are two ways that data can be serialized:

Text Serialization
Text serialization works with basic text and symbols. This scenario often happens when editing a raw text file in Notepad. When the file is saved in Notepad, it writes out the text, in plain text. Configuration and XML files are another example of files that are stored in plain text. This makes it convenient for users to be able to hand edit these files. Again, all data is serialized to a human readable format (usually).
Binary Serialization
Binary serialization is simply that, a stream of binary bytes. As binary is only 1s and 0s, it is not human friendly for reading and manipulating. Furthermore, if your binary serialized data will be used on multiple systems, it is important to make sure the binary formats are compatible. If they are not compatible, adapter software can be used to translate the data into a compatible format for the new system. This is one of the primary reasons Alchemy was created.
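
To make the compatibility concern concrete, here is a minimal sketch of writing one 32-bit value in a fixed (big-endian, or "network") byte order so that every reader sees the same bytes, whatever the byte order of its own machine. The helper names write_be32, read_be32, and byte_order_example are hypothetical and used only for illustration; they are not part of Alchemy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: store a 32-bit value most-significant byte first.
inline void write_be32(std::uint8_t* out, std::uint32_t value)
{
  out[0] = static_cast<std::uint8_t>(value >> 24);
  out[1] = static_cast<std::uint8_t>(value >> 16);
  out[2] = static_cast<std::uint8_t>(value >> 8);
  out[3] = static_cast<std::uint8_t>(value);
}

// Hypothetical helper: rebuild the host value from big-endian bytes.
inline std::uint32_t read_be32(const std::uint8_t* in)
{
  return (static_cast<std::uint32_t>(in[0]) << 24)
       | (static_cast<std::uint32_t>(in[1]) << 16)
       | (static_cast<std::uint32_t>(in[2]) << 8)
       |  static_cast<std::uint32_t>(in[3]);
}

// Usage: round-trip a value through a 4-byte wire buffer.
inline void byte_order_example()
{
  std::uint8_t wire[4] = {0, 0, 0, 0};
  write_be32(wire, 0x11223344u);
  assert(wire[0] == 0x11 && wire[3] == 0x44);  // most significant byte first
  assert(read_be32(wire) == 0x11223344u);      // round trip restores the value
}
```

Because the shifts operate on the value rather than on its memory layout, the same code produces the same bytes on little-endian and big-endian hosts.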

Alchemy and Serialization

Alchemy serializes data in binary formats. The primary component in Alchemy is called Hg (Mercury - Messenger of the Gods). Hg is focused only on the correct transformation and serialization of data. On one end, Hg provides a simple object interface that behaves similarly to a struct. On the other end, the data is serialized and you receive a buffer that is packed according to the format that you have specified for the message. You can send this buffer directly to any transport medium. Hg is also capable of reading input streams and populating an Hg Message object.

Integrating the Message Buffer

The MsgBuffer will remain an internal detail of the Message object that the user interacts with. However, there is one additional definition that will need to be added to the Message template parameters. That is the StoragePolicy chosen by the user. This will allow the same message format implementation to be used to interact with many different types of mediums. Here is a list of potential storage policies that could be integrated with Alchemy:

User-supplied buffer
Alchemy managed
Hardware memory maps

For hardware memory maps, the read/write operations could be customized for reading data on the particular platform. The Hg message format would provide a simple user-friendly interface to the fixed memory on the machine. The additional template parameter, along with some convenience typedefs, is shown below:

template < class MessageT,
           class ByteOrderT = Hg::HostByteOrder,
           class StorageT   = Hg::BufferedStoragePolicy
         >
struct Message
{
  // Define an alias to provide access to this parameterized type.
  typedef MessageT                            format_type;

  typedef StorageT                            storage_type;

  typedef typename 
    storage_type::data_type                   data_type;
  typedef data_type*                          pointer;
  typedef const data_type*                    const_pointer;

  typedef MsgBuffer< storage_type >           buffer_type;
  typedef std::shared_ptr< buffer_type >      buffer_sptr;

  // ... Field declarations
private:
  buffer_type       m_msgBuffer;
};

The Alchemy managed storage policy, Hg::BufferedStoragePolicy, is specified by default. I have also implemented a storage policy that allows the user to supply their own buffer called Hg::StaticStoragePolicy. This is included with the Alchemy source.
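
As a rough illustration of what a storage policy buys you, the sketch below pairs one buffer template with two policies: one that owns its memory and one that borrows memory supplied by the caller. Every name here (OwnedStorage, BorrowedStorage, DemoBuffer, storage_policy_example) is hypothetical, and the interface is an assumption made for this example; the actual Hg::BufferedStoragePolicy and Hg::StaticStoragePolicy in the Alchemy source may be shaped quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical policy: the storage allocates and owns its bytes.
struct OwnedStorage
{
  typedef std::uint8_t data_type;
  explicit OwnedStorage(std::size_t n) : m_bytes(n) {}
  data_type*  data()       { return m_bytes.data(); }
  std::size_t size() const { return m_bytes.size(); }
private:
  std::vector<data_type> m_bytes;
};

// Hypothetical policy: the caller supplies fixed memory.
struct BorrowedStorage
{
  typedef std::uint8_t data_type;
  BorrowedStorage(data_type* bytes, std::size_t n) : m_bytes(bytes), m_size(n) {}
  data_type*  data()       { return m_bytes; }
  std::size_t size() const { return m_size; }
private:
  data_type*  m_bytes;
  std::size_t m_size;
};

// The buffer logic is written once and works with either policy.
template <class StorageT>
class DemoBuffer
{
public:
  typedef typename StorageT::data_type data_type;
  explicit DemoBuffer(const StorageT& storage) : m_storage(storage) {}

  // Copy value into the storage at offset; refuse writes that do not fit.
  template <typename T>
  bool set_data(const T& value, std::size_t offset)
  {
    if (offset > m_storage.size() || sizeof(T) > m_storage.size() - offset)
      return false;
    std::memcpy(m_storage.data() + offset, &value, sizeof(T));
    return true;
  }
private:
  StorageT m_storage;
};

// Usage: the same set_data call writes into caller-owned memory.
inline void storage_policy_example()
{
  std::uint8_t raw[2] = {0, 0};
  DemoBuffer<BorrowedStorage> buffer(BorrowedStorage(raw, sizeof(raw)));
  assert(buffer.set_data(static_cast<std::uint8_t>(0x5A), 1));
  assert(raw[1] == 0x5A);   // the write landed in the caller's array
}
```

The message format never needs to know which policy is in use, which is the point of making storage a template parameter.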

Programmatic Serialization

The solution for serialization is very similar to the byte-order conversion logic that was demonstrated in the post in which I introduced the basic Alchemy: Prototype. Once again, we will use the ForEachType static for-loop that I implemented to serialize the Hg::Messages. This will require a functor to be created for both input and output serialization.

Since I have already presented the details that describe how this static for-loop processing works, I am going to present serialization from top to bottom. We will start with how the user interacts with the Hg::Message, and continue to step deeper into the processing until the programmatic serialization is performed.

User Interaction

// Create typedefs for the message.
// A storage policy is provided by default.
typedef Message< DemoTypeMsg, HostByteOrder >    DemoMsg;
typedef Message< DemoTypeMsg, NetByteOrder >     DemoMsgNet;

// Populate the data in Host order.
DemoMsg msg;

msg.letter = 'A';
msg.count =  sizeof(short);
msg.number = 100;

// The data will be transferred over a network connection.
DemoMsgNet netMsg  = to_network(msg);

// Serialize the data and transfer over our open socket.
// netMsg.data() initiates the serialization, 
// and returns a pointer to the buffer.
send(sock, netMsg.data(), netMsg.size(), 0);

Here is the definition of the user-accessible function, data(). The code first casts the this pointer to a non-const form in order to call a private member-function that initiates the operation. This is required so that the m_msgBuffer field can be modified to store the data. There are a few other options. The first is to remove the const qualifier from this function. This is not a good solution because it would make it impossible to get serialized data from objects declared const. The other option is to declare m_msgBuffer as mutable. However, this form provides the simplest solution, and limits the modification of m_msgBuffer to this function alone.

//  ***********************************************************
  /// Returns a pointer to the memory buffer 
  /// that contains the packed message.
  ///
  const_pointer data() const
  {
    Message *pThis = const_cast< Message* >(this);
    pThis->pack_data();

    return m_msgBuffer.data();
  }
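
For comparison, the mutable alternative mentioned above might look like the sketch below. MutableDemoMsg and mutable_example are hypothetical stand-ins, not Alchemy types, and the single int field is only there to have something to pack. One point worth weighing: in C++, modifying an object that was itself declared const through a const_cast pointer is undefined behavior, while a mutable member may legally change inside a const member function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a message with one field and a cached buffer.
class MutableDemoMsg
{
public:
  MutableDemoMsg() : number(0) {}

  int number;

  // Logically const: callers see the same message; only the cache changes.
  const unsigned char* data() const
  {
    m_cache.resize(sizeof(number));
    std::memcpy(m_cache.data(), &number, sizeof(number));
    return m_cache.data();
  }

  std::size_t size() const { return sizeof(number); }

private:
  mutable std::vector<unsigned char> m_cache;  // may change in const members
};

// Usage: serialize through a const reference without any cast.
inline void mutable_example()
{
  MutableDemoMsg msg;
  msg.number = 100;
  const MutableDemoMsg& view = msg;

  int unpacked = 0;
  std::memcpy(&unpacked, view.data(), view.size());
  assert(unpacked == 100);
}
```

The trade-off is the one described above: mutable opens the buffer to every const member function, whereas the cast keeps the exception local to data().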

In turn, the private member-function calls a utility function that initiates the process:

//  ***********************************************************
  void pack_data()
  {
    m_msgBuffer =  *pack_message < message_type, 
                                   buffer_type
                                 >(values(), size()).get();
  }

Message Packing Details

Now we are behind the curtain where the work begins. Again, you will notice that this first function is a global top-level parameterized function, which calls another function. The reason for this is the generality of the final implementation. When nested fields are introduced, processing will return to this point through a specialized form of this function. This is necessary to allow nested message formats to also be used as independent top-level message formats.

template< class MessageT,
          class BufferT
        >
std::shared_ptr< BufferT >
  pack_message( MessageT& msg_values,
                size_t    size)
{
  return detail::pack_message < MessageT, 
                                BufferT
                              >(msg_values, 
                                size);
}

... And just like the line at The Hollywood Tower Hotel ride at the California Adventure theme park, the ride has started and you weren't even aware. But, there's another sub-routine.

template< typename MessageT,
          typename BufferT
        >
std::shared_ptr< BufferT >
  pack_message( MessageT  &msg_values, 
                size_t          size)
{
  // Allocate a new buffer manager.
  std::shared_ptr< BufferT > spBuffer(new BufferT);
  // Resize the buffer.
  spBuffer->resize(size);
  // Create an instance of the
  // functor for serializing to a buffer.
  detail::PackMessageWorker 
    < 0, 
      Hg::length< typename MessageT::format_type >::value,
      MessageT,
      BufferT
    > pack;     // Note: Pack is the instantiated functor.

  // Call the function operator in pack.
  pack(msg_values, *spBuffer.get());
  return spBuffer;
}

Here is the implementation of the pack function object:

template< size_t    Idx,
          size_t    Count,
          typename  MessageT,
          typename  BufferT
         >
struct PackMessageWorker
{ 
  void operator()(MessageT &message,
                  BufferT  &buffer)
  {
    // Write the current value, then move to 
    // the next value for the message.
    WriteDatum< Idx, MessageT, BufferT > write;
    write(message, buffer);

    PackMessageWorker < Idx+1, Count, MessageT, BufferT> pack;
    pack(message, buffer);
  }
};

This should start to look familiar if you read the Alchemy: Prototype entry. Hopefully repetition does not bother you, because that is what recursion is all about. This function first invokes a functor template called WriteDatum, which performs the serialization of the current data field. Then a new instance of the PackMessageWorker functor is created to perform serialization of the type at the next index. To satisfy your curiosity, here is the implementation of WriteDatum:

template< size_t   IdxT,      
          typename MessageT, 
          typename BufferT
        >
struct WriteDatum
{
  void operator()(MessageT &msg,
                  BufferT  &buffer)
  {
    typedef typename
      Hg::TypeAt
        < IdxT,
          typename MessageT::format_type
        >::type                                   value_type;

    value_type value  = msg.template FieldAt< IdxT >().get();
    size_t     offset = 
                 Hg::OffsetOf< IdxT, typename MessageT::format_type >::value;

    buffer.set_data(value, offset);
  }
};

That is pretty much the top-to-bottom journey for the serialization path in Alchemy. However, something is not quite right. I will give you a moment to see if you notice a difference between how this version works, compared to the byte-order processing in the other method.

Brief intermission for deep reflection on the previous recursive journey...

How Did You Do?

There are two things that you may have noticed.

The ForEachType construct I mentioned was not used.
There is no terminating case in this recursive implementation.

Originally, I had used the ForEachType construct. However, at the point where I am now, with the project hosted on GitHub, I required more flexibility. Therefore, I had to create a more customized solution to work with. The code segments above are adapted from the source on GitHub. The only thing I changed was the removal of types and fields that relate to support for dynamically-sized arrays.

As for the terminating case, I have not shown that yet. Here it is:

template< size_t    Idx,
          typename  MessageT,
          typename  BufferT
         >
struct PackMessageWorker< Idx, // Special case:
                          Idx, // Current Idx == End Idx
                          MessageT, 
                          BufferT
                        >
{ 
  void operator()(MessageT& msg, 
                  BufferT& buffer)
  { }
};

This specialization of the PackMessageWorker template is a more specific fit for the current types. Therefore, the compiler chooses this version. The implementation of the function is empty, which breaks the recursive spiral.
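
To see the whole recursion in one place, here is a self-contained sketch of the same pattern that packs the fields of a std::tuple into a byte vector. It is an illustration under stated assumptions, not Alchemy code: PackedOffset, PackWorker, pack_tuple, and pack_example are hypothetical names, the fields are assumed to be trivially copyable, and the bytes are written in host order with no byte-order conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>
#include <vector>

// Packed offset of field Idx: the summed sizes of all earlier fields.
template <std::size_t Idx, typename TupleT>
struct PackedOffset
{
  static const std::size_t value =
      PackedOffset<Idx - 1, TupleT>::value
    + sizeof(typename std::tuple_element<Idx - 1, TupleT>::type);
};

template <typename TupleT>
struct PackedOffset<0, TupleT>
{
  static const std::size_t value = 0;
};

// General case: write field Idx, then recurse to field Idx + 1.
template <std::size_t Idx, std::size_t Count, typename TupleT>
struct PackWorker
{
  void operator()(const TupleT& msg, std::vector<std::uint8_t>& buffer) const
  {
    typedef typename std::tuple_element<Idx, TupleT>::type value_type;
    const value_type value = std::get<Idx>(msg);
    std::memcpy(buffer.data() + PackedOffset<Idx, TupleT>::value,
                &value, sizeof(value_type));

    PackWorker<Idx + 1, Count, TupleT> next;
    next(msg, buffer);
  }
};

// Terminating case: Idx == Count, so there is nothing left to write.
template <std::size_t Idx, typename TupleT>
struct PackWorker<Idx, Idx, TupleT>
{
  void operator()(const TupleT&, std::vector<std::uint8_t>&) const {}
};

// Entry point: size the buffer, then start the recursion at index 0.
template <typename TupleT>
std::vector<std::uint8_t> pack_tuple(const TupleT& msg)
{
  const std::size_t count = std::tuple_size<TupleT>::value;
  std::vector<std::uint8_t> buffer(PackedOffset<count, TupleT>::value);
  PackWorker<0, count, TupleT> pack;
  pack(msg, buffer);
  return buffer;
}

// Usage: three fields pack into 1 + 2 + 4 bytes with no padding.
inline void pack_example()
{
  typedef std::tuple<char, std::uint16_t, std::uint32_t> DemoFields;
  DemoFields msg('A', std::uint16_t(2), std::uint32_t(100));
  std::vector<std::uint8_t> bytes = pack_tuple(msg);
  assert(bytes.size() == 7);
  assert(bytes[0] == 'A');
}
```

Without the partial specialization, the general template would keep instantiating itself for Idx + 1 and compilation would fail once std::tuple_element ran past the last field.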

Message Unpacking

For the fundamental types, the process looks almost exactly the same. Alchemy verifies the input buffer is large enough to satisfy what the algorithm is expecting. Then it churns away, copying the data from the input stream into the parameters of the Hg::Message.
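
The reverse direction can be sketched the same way. The example below first checks that the input is large enough and then copies each field out of the stream. As before, this is a hedged illustration rather than the Alchemy implementation: PackedSize, UnpackWorker, unpack_tuple, and unpack_example are hypothetical names, the fields are assumed to be trivially copyable, and a running offset argument is used here in place of a compile-time offset lookup to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <tuple>

// Total packed size of fields [Idx, Count).
template <std::size_t Idx, std::size_t Count, typename TupleT>
struct PackedSize
{
  static const std::size_t value =
      sizeof(typename std::tuple_element<Idx, TupleT>::type)
    + PackedSize<Idx + 1, Count, TupleT>::value;
};

template <std::size_t Idx, typename TupleT>
struct PackedSize<Idx, Idx, TupleT>
{
  static const std::size_t value = 0;
};

// General case: read field Idx at offset, then recurse to the next field.
template <std::size_t Idx, std::size_t Count, typename TupleT>
struct UnpackWorker
{
  void operator()(TupleT& msg, const std::uint8_t* input,
                  std::size_t offset) const
  {
    typedef typename std::tuple_element<Idx, TupleT>::type value_type;
    std::memcpy(&std::get<Idx>(msg), input + offset, sizeof(value_type));

    UnpackWorker<Idx + 1, Count, TupleT> next;
    next(msg, input, offset + sizeof(value_type));
  }
};

// Terminating case: every field has been read.
template <std::size_t Idx, typename TupleT>
struct UnpackWorker<Idx, Idx, TupleT>
{
  void operator()(TupleT&, const std::uint8_t*, std::size_t) const {}
};

// Entry point: reject short or missing input before copying anything.
template <typename TupleT>
bool unpack_tuple(TupleT& msg, const std::uint8_t* input, std::size_t length)
{
  const std::size_t count = std::tuple_size<TupleT>::value;
  if (input == nullptr || length < PackedSize<0, count, TupleT>::value)
    return false;

  UnpackWorker<0, count, TupleT> unpack;
  unpack(msg, input, 0);
  return true;
}

// Usage: a 3-byte stream fills a (char, 16-bit) message.
inline void unpack_example()
{
  const std::uint8_t raw[3] = { 'Z', 0x01, 0x01 };
  std::tuple<char, std::uint16_t> msg;
  assert(unpack_tuple(msg, raw, sizeof(raw)));
  assert(std::get<0>(msg) == 'Z');
  assert(!unpack_tuple(msg, raw, 2));   // too short: nothing is copied
}
```

Doing the size check once, up front, means the per-field copies never need to test the bounds again.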

Is All Of That Recursion Necessary?

Yes.

Remember, this is a template meta-programming solution. Recursion is the only loop mechanism available to us at compile-time. For a run-time algorithm, all of these function stack-frames would kill performance. If you run this portion of code compiled with a debug build, you will see that. However, things change once it is compiled for release mode with optimizations enabled.

Most of those function calls work as conditional statements to select the best-fit serializer for each type. After the optimizer gets a hold of the chain of calls, it is able to generate code that is very similar to loop unrolling that would occur in a run-time algorithm where the size of the loop was fixed.

I have just barely started the optimization process of this library as a whole. I am locating the places with unnecessary copies and other actions that kill performance. The library as a whole is performing well and I am happy with the progress. The fundamental fields perform slightly slower than a hand-coded memcpy on the field of the struct. However, the packed-bit type performs 10% faster than the hand-coded version. I have not had time to perform a deep analysis of the code that is generated. I will be posting an entry on the benchmarking process that I went through, and I will post plenty of disassembly samples then.

What's Next?

Up to this point in the Alchemy series, I have demonstrated a full pass through the message management with simple types. This is enough to be able to pack the data buffers for just about any protocol. However, some formats would be very cumbersome to work with, and much of the work is still left to the user. My goal for Alchemy is to encapsulate all of that work within the library itself and help keep the user focused on solving their problem at hand.

Fundamental types are now supported. Here is a list of the additional types that I will add support for, as well as other features that are congruent with this library:

Packed-bit fields
Nested message formats
Arrays
Variable-sized buffers (vector)
Additional StoragePolicy implementations
Simplify the message definitions even further

Original post blogged at Code of the Damned.