Standard C++

Serialization and Unserialization

What’s this “serialization” thing all about?

It lets you take an object or group of objects, put them on a disk or send them through a wire or wireless transport mechanism, then later, perhaps on another computer, reverse the process: resurrect the original object(s). The basic mechanisms are to flatten object(s) into a one-dimensional stream of bits, and to turn that stream of bits back into the original object(s).

Like the Transporter on Star Trek, it’s all about taking something complicated and turning it into a flat sequence of 1s and 0s, then taking that sequence of 1s and 0s (possibly at another place, possibly at another time) and reconstructing the original complicated “something.”

How do I select the best serialization technique?

There are lots and lots (and lots) of if’s, and’s and but’s, and in reality there are a whole continuum of techniques with lots of dimensions. Because I have a finite amount of time (translation: I don’t get paid for any of this), I’ve simplified it to a decision between using human-readable (“text”) or non-human-readable (“binary”) format, followed by a list of five techniques arranged more-or-less in increasing order of sophistication.

You are, of course, not limited to those five techniques. You will probably end up mixing ideas from several techniques. And certainly you can always use a more sophisticated (higher numbered) technique than is actually needed. In fact it might be wise to use a more sophisticated technique than is minimally needed if you believe future changes will require the greater sophistication. So think of this list merely as a good starting point.

There’s a lot here, so get ready!

Decide between human-readable (“text”) and non-human-readable (“binary”) formats. The tradeoffs are non-trivial. Later FAQs show how to write simple types in text format and how to write simple types in binary format.
Use the least sophisticated solution when the objects to be serialized aren’t part of an inheritance hierarchy (that is, when they’re all of the same class) and when they don’t contain pointers to other objects.
Use the second level of sophistication when the objects to be serialized are part of an inheritance hierarchy, but when they don’t contain pointers to other objects.
Use the third level of sophistication when the objects to be serialized contain pointers to other objects, but when those pointers form a tree with no cycles and no joins.
Use the fourth level of sophistication when the objects to be serialized contain pointers to other objects, and when those pointers form a graph with no cycles, and with joins at the leaves only.
Use the most sophisticated solution when the objects to be serialized contain pointers to other objects, and when those pointers form a graph that might have cycles or joins.

Here’s that same information arranged like an algorithm:

The first step is to make an eyes-open decision between text- and binary-formats.
If your objects aren’t part of an inheritance hierarchy and don’t contain pointers, use solution #1.
Else if your objects don’t contain pointers to other objects, use solution #2.
Else if the graph of pointers within your objects contain neither cycles nor joins, use solution #3.
Else if the graph of pointers within your objects don’t contain cycles and if the only joins are to terminal (leaf) nodes, use solution #4.
Else use solution #5.

Remember: feel free to mix and match, to add to the above list, and, if you can justify the added expense, to use a more sophisticated technique than is minimally required.

One more thing: the issues of inheritance and of pointers within the objects are logically unrelated, so there’s no theoretical reason for #2 to be any less sophisticated than #3-5. However in practice it often (not always) works out that way. So please do not think of these categories as somehow sacred — they’re somewhat arbitrary, and you are expected to mix and match the solutions to fit your situation. This whole area of serialization has far more variants and shades of gray than can be covered in a few questions/answers.

How do I decide whether to serialize to human-readable (“text”) or non-human-readable (“binary”) format?

Carefully.

There is no “right” answer to this question; it really depends on your goals. Here are a few of the pros/cons of human-readable (“text”) format vs. non-human-readable (“binary”) format:

Text format is easier to “desk check.” That means you won’t have to write extra tools to debug the input and output; you can open the serialized output with a text editor to see if it looks right.
Binary format typically uses fewer CPU cycles. However that is relevant only if your application is CPU bound and you intend to do serialization and/or unserialization on an inner loop/bottleneck. Remember: 90% of the CPU time is spent in 10% of the code, which means there won’t be any practical performance benefit unless your “CPU meter” is pegged at 100%, and your serialization and/or unserialization code is consuming a healthy portion of that 100%.
Text format lets you ignore programming issues like sizeof and little-endian vs. big-endian.
Binary format lets you ignore separations between adjacent values, since many values have fixed lengths.
Text format can produce smaller results when most numbers are small and when you need to textually encode binary results, e.g., uuencode or Base64.
Binary format can produce smaller results when most numbers are large or when you don’t need to textually encode binary results.

You might think of others to add as well… The important thing to remember is that one size does not fit all — make a careful decision here.

One more thing: no matter which you choose, you might want to start each file / stream with a “magic” tag and a version number. The version number would indicate the format rules. That way if you decide to make a radical change in the format, you hopefully will still be able to read the output produced by the old software.

How do I serialize/unserialize simple types like numbers, characters, strings, etc.?

The answer depends on your decision regarding human-readable (“text”) format vs. non-human-readable (“binary”) format:

The primitives discussed in those FAQs will be needed for most of the other FAQs in this section.

How exactly do I read/write simple types in human-readable (“text”) format?

Before you read this, make sure to evaluate all the tradeoffs between human-readable and non-human-readable formats. The tradeoffs are non-trivial, so you should resist a knee-jerk reaction to do it the way you did it on the last project — one size does not fit all.

After you have made an eyes-open decision to use human-readable (“text”) format, you should remember these keys:

You probably want to use iostream’s >> and << operators rather than its read() and write() methods. The >> and << operators are better for text mode, whereas read() and write() are better for binary mode.
When storing numbers, you’ll probably want to add a separator to prevent items from running together. One simple approach is to always add a space (' ') before each number, that way the number 1 followed by the number 2 won’t run together and look like a 12. Since the leading space will automatically get soaked up by the >> operator, you won’t have to do anything explicit to extract the leading space in the code that reads things back in.
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\n' or '"' or even '\0' if some string might contain those characters. You might want to use C++ source-code escape-sequences, e.g., writing '\' followed by 'n' when you see a newline, etc. After this transformation, you can either make strings go until end-of-line (meaning they are deliminated by '\n') or you can delimit them with '"'.
If you use C++-like escape-sequences for your string data, be sure to always use the same number of hex digits after '\x' and '\u'. I typically use 2 and 4 digits respectively. Reason: if you write a smaller number of hex digits, e.g., if you simply use stream << "\\x" << hex << unsigned(theChar), you’ll get errors when the next character in the string happens to be a hex digit. E.g., if the string contains '\xF' followed by 'A', you should write "\x0FA", not "\xFA".
If you don’t use some sort of escape sequence for characters like '\n', be careful that the operating system doesn’t mess up your string data. In particular, if you open a std::fstream without std::ios::binary, some operating systems translate end-of-line characters.
Another approach for string data is to prefix the string’s data with an integer length, e.g., to write "now is the time" as 15:now is the time. Note that this can make it hard for people to read/write the file, since the value just after that might not have a visible separator, but you still might find it useful.

Please remember that these are primitives that you will need to use in the other FAQs in this section.

How exactly do I read/write simple types in non-human-readable (“binary”) format?

After you have made an eyes-open decision to use non-human-readable (“binary”) format, you should remember these keys:

Make sure you open the input and output streams using std::ios::binary. Do this even if you are on a Unix system since it’s easy to do, it documents your intent, and it’s one less non-portability to locate and change down the road.
You probably want to use iostream’s read() and write() methods instead of its >> and << operators. read() and write() are better for binary mode; >> and << are better for text mode.
If the binary data might get read by a different computer than the one that wrote it, be very careful about endian issues (little-endian vs. big-endian) and sizeof issues. The easiest way to handle this is to anoint one of those two formats as the official “network” format, and to create a header file that contains machine dependencies (I usually call it machine.h). That header should define inline functions like readNetworkInt(std::istream& istr) to read a “network int,” and so forth for reading and writing all the primitive types. You can define the format for these pretty much anyway you want. E.g., you might define a “network int” as exactly 32 bits in little endian format. In any case, the functions in machine.h will do any necessary endian conversions, sizeof conversions, etc. You’ll either end up with a different machine.h on each machine architecture, or you’ll end up with a lot of #ifdefs in your machine.h, but either way, all this ugliness will be buried in a single header, and all the rest of your code will be clean(er). Note: the floating point differences are the most subtle and tricky to handle. It can be done, but you’ll have to be careful with things like NaN, over- and under-flow, #bits in the mantissa or exponent, etc.
When space-cost is an issue, such as when you are storing the serialized form in a small memory device or sending it over a slow link, you can compress the stream and/or you can do some manual tricks. The simplest is to store small numbers in a smaller number of bytes. For example, to store an unsigned integer in a stream that has 8-bit bytes, you can hijack the 8th bit of each byte to indicate whether or not there is another byte. That means you get 7 meaningful bits/byte, so 0…127 fit in 1 byte, 128…16384 fit in 2 bytes, etc. If the average number is smaller than around half a billion, this will use less space than storing every four-byte unsigned number in four 8-bit bytes. There are lots of other variations on this theme, e.g., a sorted array of numbers can store the difference between each number, storing extremely small values in unary format, etc.
String data is tricky because you have to unambiguously know when the string’s body stops. You can’t unambiguously terminate all strings with a '\0' if some string might contain that character; recall that std::string can store '\0'. The easiest solution is to write the integer length just before the string data. Make sure the integer length is written in “network format” to avoid sizeof and endian problems (see the solutions in earlier bullets).

Please remember that these are primitives that you will need to use in the other FAQs in this section.

How do I serialize objects that aren’t part of an inheritance hierarchy and that don’t contain pointers to other objects?

This is the least sophisticated problem, and not surprisingly, it is also the least sophisticated solution:

Every class should handle its own serialization and unserialization. You will typically create a member function that serializes the object to some sink (such as a std::ostream), and another that allocates a new object, or perhaps changes an existing object, setting the member data based on what it reads from some source (such as a std::istream).
If your object physically contains another object, e.g., a Car object might have a member variable of type Engine, the outer object’s serialize() member function should simply call the appropriate function associated with the member object.
Use the primitives described earlier to read/write the simple types in text or binary format.
If a class’s data structure might change someday, the class should write out a version number at the beginning of the object’s serialized output. Unless you are absolutely sure that there is no possible chance of the data structure changing at any point in the future, do yourself a favor and include the version number now. It is much easier to put in a version number from the very beginning than to add one later when you have an unanticipated data structure change. The version number simply represents the serialized format; it should not get incremented simply when the class’s behavior changes. This means the version numbers don’t need to be fancy — they usually don’t need a major and minor number.

How do I serialize objects that are part of an inheritance hierarchy and that don’t contain pointers to other objects?

Suppose you want to serialize a “shape” object, where Shape is an abstract class with derived classes Rectangle, Ellipse, Line, Text, etc. You would declare a pure virtual function serialize(std::ostream&) const within class Shape, and make sure the first thing done by each override is to write out the class’s identity. For example, Ellipse::serialize(std::ostream&) const would write out the identifier Ellipse (perhaps as a simple string, but there are several alternatives discussed below).

Things get a little trickier when unserializing the object. You typically start with a static member function in the base class such as Shape::unserialize(std::istream& istr). This is declared to return a Shape* or perhaps a smart pointer such as Shape::Ptr. It reads the class-name identifier, then uses some sort of creational pattern to create the object. For example, you might have a table that maps from the class name to an object of the class, then use the Virtual Constructor Idiom to create the object.

Here’s a concrete example: Add a pure virtual method create(std::istream&) const within base class Shape, and define each override to a one-liner that uses new to allocate an object of the appropriate derived class. E.g., Ellipse::create(std::istream& istr) const would be { return new Ellipse(istr); }. Add a static std::map<std::string,Shape*> object that maps from the class name to a representative (AKA prototype) object of the appropriate class; e.g., "Ellipse" would map to a new Ellipse(). Function Shape::unserialize(std::istream& istr) would read the class-name, throw an exception if it’s not in the map (if (theMap.count(className) == 0) throw …something…), then look up the associated Shape* and call its create() method: return theMap[className]->create(istr).

The map is typically populated during static initialization. For example, if file Ellipse.cpp contains the code for derived class Ellipse, it would also contain a static object whose ctor adds that class to the map: theMap["Ellipse"] = new Ellipse().

Notes and caveats:

It adds a little flexibility if Shape::unserialize() passes the class name to the create() method. In particular, that would let a derived class be used with two or more names, each with its own “network format.” For example, derived class Ellipse could be used for both "Ellipse" and "Circle", which might be useful to save space in the output stream or perhaps other reasons.
It’s usually easiest to handle errors during unserialization by throwing an exception. You can return NULL if you want, but you will need to move the code that reads the input stream out of the derived class’ ctors into the corresponding create() methods, and ultimately the result is often that your code is more complicated.
You must be careful to avoid the static initialization order fiasco with the map used by Shape::unserialize(). This normally means using the Construct On First Use Idiom for the map itself.
For the map used by Shape::unserialize(), I personally prefer the Named Constructor Idiom over the Virtual Constructor Idiom — it simplifies a few steps. Details: I usually define a typedef within Shape such as typedef Shape* (*Factory)(std::istream&). This means Shape::Factory is a “pointer to a function that takes a std::istream& and returns a Shape*.” I then define the map as std::map<std::string,Factory>. Finally I populate that map using lines like theMap["Ellipse"] = Ellipse::create (where Ellipse::create(std::istream&) is now a static member function of class Ellipse, that is, the Named Constructor Idiom). You’d change the return value in function Shape::unserialize(std::istream& istr) from theMap[className]->create(istr) to theMap[className](istr).
If you might need to serialize a NULL pointer, it’s usually easy since you already write out a class identifier so you can just as easily write out a pseudo class identifier like "NULL". You might need an extra if statement in Shape::unserialize(), but if you chose my preference from the previous bullet, you can eliminate that special case (and generally keep your code clean) by defining static member function Shape* Shape::nullFactory(istream&) { return NULL; }. You add that function to the map as any other: theMap["NULL"] = Shape::nullFactory;.
You can make the serialized form smaller and a little faster if you tokenize the class name identifiers. For example, write a class name only the first time it is seen, and for subsequent uses write only a corresponding integer index. A mapping such as std::map<std::string,unsigned> unique makes this easy: if a class name is already in the map, write unique[className]; otherwise set a variable unsigned n = unique.size(), write n, write the class name, and set unique[className] = n. (Note: be sure to copy it into a separate variable. Do not say unique[className] = unique.size()! You have been warned! Reason: the compiler might evaluate unique[className] before unique.size(), and if so, unique[className] will pre-increment the size.) When unserializing, use std::vector<std::string> unique, read the number n, and if n == unique.size(), read a name and add it to the vector. Either way the name will be unique[n]. You can also pre-populate the first N slots in these tables with the N most common names, that way streams won’t need to contain any of those strings.

How do I serialize objects that contain pointers to other objects, but those pointers form a tree with no cycles and no joins?

Before we even start, you must understand that the word “tree” does not mean that the objects are stored in some sort of tree-like data structure like std::set. It simply means that your objects point to each other, and the “with no cycles” part means if you keep following pointers from one object to the next, you never return to an earlier object. Your objects aren’t “inside” a tree; they are a tree. If you don’t understand that, you really should read the lingo FAQ before continuing with this one.

Second, don’t use this technique if the graph might someday contain cycles or joins.

Graphs with neither cycles nor joins are very common, even with “recursive composition” design patterns like Composite or Decorator. For example, the objects representing an XML document or an HTML document can be represented as a graph without joins or cycles.

The key to serializing these graphs is to ignore a node’s identity and instead to focus only on its contents. A (typically recursive) algorithm dives through the tree and writes the contents as it goes. For example, if the current node happens to have an integer a, a pointer b, a float c, and another pointer d, then you first write the integer a, then recursively dive into the child pointed to by b, then write the float c, and finally recursively dive into the child pointed to by d. (You don’t have to write/read them in the declaration order; the only essential rule is that the reader’s order is consistent with the writer’s order.)

When unserializing, you need a constructor that takes a std::istream&. The constructor for the above object would read an integer and store the result in a, then would allocate an object to be stored in pointer b (and will pass the std::istream to the constructor so it too can read the stream’s contents), read a float into c, and finally will allocate an object to be stored in pointer d. Be sure to use smart pointers within your objects, otherwise you will get a leak if an exception is thrown while reading anything but the first pointed-to object.

It is often convenient to use the Named Constructor Idiom when allocating these objects. This has the advantage that you can enforce the use of smart pointers. To do this in a class Foo, write a static method such as FooPtr Foo::create(std::istream& istr) { return new Foo(istr); } (where FooPtr is a smart pointer to a Foo). The alert reader will note how consistent this is with the technique discussed in the previous FAQ — the two techniques are completely compatible.

If an object can contain a variable number of children, e.g., a std::vector of pointers, then the usual approach is to write the number of children just before recursively diving into the first child. When unserializing, just read the number of children, then use a loop to allocate the appropriate number of child objects.

If a child-pointer might be NULL, be sure to handle that in both the writing and reading. This shouldn’t be a problem if your objects use inheritance; see that solution for details. Otherwise, if the first serialized character in an object has a known range, use something outside that range. E.g., if the first character of a serialized object is always a digit, use a non-digit like 'N' to mean a NULL pointer. Unseralization can use std::istream::peek() to check for the 'N' tag. If the first character doesn’t have a known range, force one; e.g., write 'x' before each object, then use something else like 'y' to mean NULL.

If an object physically contains another object inside itself, as opposed to containing a pointer to the other object, then nothing changes: you still recursively dive into the child-node as if it were via a pointer.

How do I serialize objects that contain pointers to other objects, but those pointers form a tree with no cycles and only “trivial” joins?

As before, the word “tree” does not mean that the objects are stored in some sort of tree-like data structure like std::set. It simply means your objects have pointers to each other, and the “no cycles” part means you can follow the pointers from one object to the next and never return to an earlier object. The objects are not “inside” a tree; they are a tree. If that doesn’t make sense, you really should read the lingo FAQ before continuing with this one.

Use this solution if the graph contains joins at the leaf nodes, but those joins can be easily reconstructed via a simple look-up table. For example, the parse-tree of an arithmetic expression like (3*(a+b) - 1/a) might have joins since a variable-name (like a) can show up more than once. If you want the graph to use the same exact node-object to represent both occurrences of that variable, then you could use this solution.

Although the above constraints don’t fit with those of the solution without any joins, it’s so close that you can squeeze things into that solution. Here are the differences:

During serialization, ignore the join completely.
During unserializing, create a look-up table, like std::map<std::string,Node*>, that maps from the variable name to the associated node.

Caveat: this assumes that all occurrences of variable a should map to the same node object; if it’s more complicated than this, that is, if some occurrences of a should map to one object and some to another, you might need to use a more sophisticated solution.

Caveat: you need to take special care if your objects contain pointers which point to a member of some object, rather than to the object itself. For example, if pointer ‘p’ points to ‘x.y’ and pointer ‘q’ points to ‘x.z’, that is, they point to data members within object ‘x’ rather than to object ‘x’ itself, when you serialize the pointers, you need to recognize both point to the same identical object. You will need to serialize object ‘x’, not just ‘x.y’ and ‘x.z’, and you need to make sure upon unserialization ‘p’ and ‘q’ point again to the same identical object. The serialization process will probably require a two-pass approach, and serialized pointer will probably need to store an offset within the target object as well as the target object’s ID.

How do I serialize objects that contain pointers to other objects, and those pointers form a graph that might have cycles or non-trivial joins?

Caveat: the word “graph” does not mean that the objects are stored in some sort of data structure. Your objects form a graph because they point to each other. They’re not “inside” a graph; they are a graph. If that doesn’t make sense, you really should read the lingo FAQ before continuing with this one.

Use this solution if the graph can contain cycles, or if it can contain more complicated kinds of joins than are allowed by the solution for trivial joins. This solution handles two core issues: it avoids infinite recursion and it writes/reads each node’s identity in addition to its contents.

A node’s identity doesn’t normally have to be consistent between different output streams. For example, if a particular file uses the number 3 to represent node x, a different file can use the number 3 to represent a different node, y.

There are some clever ways to serialize the graph, but the simplest to describe is a two-pass algorithm that uses an object-ID map, e.g., std::map<Node*,unsigned> oidMap. The first pass populates our oidMap, that is, it builds a mapping from object pointer to the integer that represents the object’s identity. It does this by recursively diving through the graph, at each node checking if the node is already in oidMap, and if not, adding the node and a unique integer to oidMap and recursively diving into the new node’s children. The unique integer is often just the initial oidMap.size(), e.g., unsigned n = oidMap.size(); oidMap[nodePtr] = n. (Yes, we did that in two statements. You must also. Do not shorten it to a single statement. You have been warned.)

The second pass iterates through the nodes of oidMap, and at each, writes the node’s identity (the associated integer) followed by its contents. When writing the contents of a node that contains pointers to other nodes, instead of diving into those “child” objects, it simply writes the identity (the associated integer) of the pointer to those nodes. For example, when your node contains Node* child, simply write the integer oidMap[child]. After the second pass, the oidMap can be discarded. In other words, the mapping from Node* to unsigned should not normally survive beyond the end of the serialization of any given graph.

There are also some clever ways to unserialize the graph, but here again the simplest to describe is a two-pass algorithm. The first pass populates a std::vector<Node*> v with objects of the right class, but any child pointers within those objects are all NULL. This means v[3] will point to the object whose oid is 3, but any child pointers inside that object will be NULL. The second pass populates the child pointers inside the objects, e.g., if v[3] has a child pointer called child that is supposed to point to the object whose oid is 5, the second pass changes changes v[3].child from NULL to v[5] (obviously encapsulation might prevent it from directly accessing v[3].child, but ultimately v[3].child gets changed to v[5]). After unserializing a given stream, the vector v can normally be discarded. In other words, the oids (3, 5, etc.) mean nothing when serializing or unserializing a different stream — those numbers are only meaningful within a given stream.

Note: if your objects contain polymorphic pointers, that is, base class pointers that might point at derived class objects, then use the technique described earlier. You’ll also want to read some of the earlier techniques for handling NULL, writing version numbers, etc.

Note: you should seriously consider the Visitor pattern when recursively diving through the graph, since serialization is probably just one of many different reasons to make that recursive dive, and they’ll all need to avoid infinite recursion.

Are there any caveats when serializing / unserializing objects?

One of the things you don’t want to do, except in unusual circumstances, is to make any changes to the node’s data during the traversal. For example, some people feel they can map from Node* to integer by simply adding an integer as a data member in the Node class. They also sometimes add a bool haveVisited flag as another data member within Node objects.

But this causes lots of multi-threading and/or performance problems. Your serialize() method can no longer be const, so even though it is logically just a reader of the node data, and even though it logically makes sense to have multiple threads reading the nodes at the same time, the actual algorithm writes into the nodes. If you fully understand threading and reader/writer conflicts, the best you can hope for is to make your code more complicated and slower (you’ll have to block all reader threads whenever any thread is doing anything with the graph). If you (or those who will maintain the code after you!) don’t fully understand reader/writer conflicts, this technique can create very serious and very subtle errors. Reader/writer and writer/writer conflicts are so subtle they often slip through test into the field. Bad news. Trust me. If you don’t trust me, talk to someone you do trust. But don’t cut corners.

There’s lots more I could say, such as several simplifications and special cases, but I’ve already spent too much time on this. If you want more info, spend some money.

What’s all this about graphs, trees, nodes, cycles, joins, and joins at the leaves vs. internal nodes?

When your objects contain pointers to other objects, you end up with something computer scientists call a graph. Not that your objects are stored inside a tree-like data structure; they are a tree-like structure.

Your objects correspond to the graph’s nodes AKA vertices, and the pointers within your objects correspond to the graph’s edges. The graph is of a special variety called a rooted, directed graph. The root object to be serialized corresponds to the graph’s root node, and the pointers correspond to directed edges.

If object x has a pointer to object y, we say that x is a parent of y and/or that y is a child of x.

A path through a graph corresponds to starting with an object, following a pointer to another object, etc., etc. to an arbitrary depth. If there is a path from x to z we say that x is an ancestor of z and/or that z is a descendent of x.

A join in a graph means there are two or more distinct paths to the same object. For example, if z is a child of both x and y, then the graph has a join, and z is a join node.

A cycle in a graph means there is a path from an object back to itself: if x has a pointer to itself, or to y which points to x, or to y which points to z which points to x, etc. A graph is cyclic if it has one or more cycles; otherwise it is acyclic.

An internal node is a node with children. A leaf node is a node without children.

As used in this section, the word tree means a rooted, directed, acyclic graph. Note that each node within a tree is also a tree.