containers

Save to:
Instapaper Pocket Readability

Container Classes

Why should I use container classes rather than simple arrays?

In terms of time and space, a contiguous array of any kind is just about the optimal construct for accessing a sequence of objects in memory, and if you are serious about performance in any language you will “often” use arrays.

However, there are good arrays (e.g., containers with contiguous storage such as std::array and std::vector) and there are bad arrays (e.g., C [] arrays). Simple C [] arrays are evil because a C array is a very low level data structure with a vast potential for misuse and errors and in essentially all cases there are better alternatives – where “better” means easier to write, easier to read, less error prone, and as fast.

The two fundamental problems with C arrays are that

  • a C array doesn’t know its own size
  • the name of a C array converts to a pointer to its first element at the slightest provocation

Consider some examples:

void f(int a[], int s)
{
    // do something with a; the size of a is s
    for (int i = 0; i<s; ++i) a[i] = i;
}

int arr1[20];
int arr2[10];

void g()
{
    f(arr1,20);
    f(arr2,20);
}

The second call will scribble all over memory that doesn’t belong to arr2. Naturally, a programmer usually gets the size right, but it’s extra work and every so often someone makes a mistake. You should prefer the simpler and cleaner version using the standard library vector or array:

void f(vector<int>& v)
{
    // do something with v
    for (int i = 0; i<v.size(); ++i) v[i] = i;
}

vector<int> v1(20);
vector<int> v2(10);

void g()
{
    f(v1);
    f(v2);
}
template<size_t N> void f(array<int, N>& v)
{
    // do something with v
    for (int i = 0; i<N; ++i) v[i] = i;
}

array<int, 20> v1;
array<int, 10> v2;

void g()
{
    f(v1);
    f(v2);
}

Since a C array doesn’t know its size, there can be no array assignment:

void f(int a[], int b[], int size)
{
    a = b;            // not array assignment
    memcpy(a,b,size); // a = b
    // ...
}

Again, prefer vector or array:

// This one can result in a changing size
void g(vector<int>& a, vector<int>& b)
{
    a = b;  
    // ...
}

// In this one, a and b must be the same size
template<size_t N>
void g(array<int, N>& a, array<int, N>& b)
{
    a = b;  
    // ...
}

Another advantage of vector here is that memcpy() is not going to do the right thing for elements with copy constructors, such as strings.

void f(string a[], string b[], int size)
{
    a = b;            // not array assignment
    memcpy(a,b,size); // disaster
    // ...
}

void g(vector<string>& a, vector<string>& b)
{
    a = b;  
    // ...
}

A normal C array is of a fixed size determined at compile time (ignoring C99 VLAs, which currently have no analog in ISO C++):

const int S = 10;

void f(int s)
{
    int a1[s];    // error
    int a2[S];    // ok

    // if I want to extend a2, I'll have to change to an array
    // allocated on free store using malloc() and use realloc()
    // ...
}

To contrast:

const int S = 10;

void g(int s)
{
    vector<int> v1(s);    // ok
    vector<int> v2(S);    // ok
    v2.resize(v2.size()*2);
    // ...
}

C99 allows variable array bounds for local arrays, but those VLAs have their own problems. The way that array names “decay” into pointers is fundamental to their use in C and C++. However, array decay interacts very badly with inheritance. Consider:

class Base { void fct(); /* ... */ };
class Derived : Base { /* ... */ };

void f(Base* p, int sz)
{
    for (int i=0; i<sz; ++i) p[i].fct();
}

Base ab[20];
Derived ad[20];

void g()
{
    f(ab,20);
    f(ad,20);    // disaster!
}

In the last call, the Derived[] is treated as a Base[] and the subscripting no longer works correctly when sizeof(Derived)!=sizeof(Base) – as will be the case in most cases of interest. If we used vectors instead, the error would be caught at compile time:

void f(vector<Base>& v)
{
    for (int i=0; i<v.size(); ++i) v[i].fct();
}

vector<Base> ab(20);
vector<Derived> ad(20);

void g()
{
    f(ab);
    f(ad);    // error: cannot convert a vector<Derived> to a vector<Base>
}

We find that an astonishing number of novice programming errors in C and C++ relate to (mis)uses of C arrays. Use std::vector or std::array instead.

Let’s assume the best case scenario: you’re an experienced C programmer, which almost by definition means you’re pretty good at working with arrays. You know you can handle the complexity; you’ve done it for years. And you’re smart — the smartest on the team — the smartest in the whole company. But even given all that, please read this entire FAQ and think very carefully about it before you go into “business as usual” mode.

Fundamentally it boils down to this simple fact: C++ is not C. That means (this might be painful for you!!) you’ll need to set aside some of your hard earned wisdom from your vast experience in C. The two languages simply are different. The “best” way to do something in C is not always the same as the “best” way to do it in C++. If you really want to program in C, please do yourself a favor and program in C. But if you want to be really good at C++, then learn the C++ ways of doing things. You may be a C guru, but if you’re just learning C++, you’re just learning C++ — you’re a newbie. (Ouch; I know that had to hurt. Sorry.)

Here’s what you need to realize about containers vs. arrays:

  1. Container classes make programmers more productive. So if you insist on using arrays while those around are willing to use container classes, you’ll probably be less productive than they are (even if you’re smarter and more experienced than they are!).
  2. Container classes let programmers write more robust code. So if you insist on using arrays while those around are willing to use container classes, your code will probably have more bugs than their code (even if you’re smarter and more experienced).
  3. And if you’re so smart and so experienced that you can use arrays as fast and as safe as they can use container classes, someone else will probably end up maintaining your code and they’ll probably introduce bugs. Or worse, you’ll be the only one who can maintain your code so management will yank you from development and move you into a full-time maintenance role — just what you always wanted!

Here are some specific problems with arrays:

  1. Subscripts don’t get checked to see if they are out of bounds. (Note that some container classes, such as std::vector, have methods to access elements with or without bounds checking on subscripts.)
  2. Arrays often require you to allocate memory from the heap (see below for examples), in which case you must manually make sure the allocation is eventually deleted (even when someone throws an exception). When you use container classes, this memory management is handled automatically, but when you use arrays, you have to manually write a bunch of code (and unfortunately that code is often subtle and tricky) to deal with this. For example, in addition to writing the code that destroys all the objects and deletes the memory, arrays often also force you you to write an extra try block with a catch clause that destroys all the objects, deletes the memory, then re-throws the exception. This is a real pain in the neck, as shown here. When using container classes, things are much easier.
  3. You can’t insert an element into the middle of the array, or even add one at the end, unless you allocate the array via the heap, and even then you must allocate a new array and copy the elements.
  4. Container classes give you the choice of passing them by reference or by value, but arrays do not give you that choice: they are always passed by reference. If you want to simulate pass-by-value with an array, you have to manually write code that explicitly copies the array’s elements (possibly allocating from the heap), along with code to clean up the copy when you’re done with it. All this is handled automatically for you if you use a container class.
  5. If your function has a non-static local array (i.e., an “automatic” array), you cannot return that array, whereas the same is not true for objects of container classes.

Here are some things to think about when using containers:

  1. Different C++ containers have different strengths and weaknesses, but for any given job there’s usually one of them that is better — clearer, safer, easier/cheaper to maintain, and often more efficient — than an array. For instance,
    • You might consider a std::map instead of manually writing code for a lookup table.
    • A std::map might also be used for a sparse array or sparse matrix.
    • A std::array is the most array-like of the standard container classes, but it also offers various extra features such as bounds checking via the at() member function, automatic memory management even if someone throws an exception, ability to be passed both by reference and by value, etc.
    • A std::vector is the second-most array-like of the standard container classes, and offers additional extra features over std::array such as insertions/removals of elements.
    • A std::string is almost always better than an array of char (you can think of a std::string as a “container class” for the sake of this discussion).
  2. Container classes aren’t best for everything, and sometimes you may need to use arrays. But that should be very rare, and if/when it happens:
    • Please design your container class’s public interface in such a way that the code that uses the container class is unaware of the fact that there is an array inside.
    • The goal is to “bury” the array inside a container class. In other words, make sure there is a very small number of lines of code that directly touch the array (just your own methods of your container class) so everyone else (the users of your container class) can write code that doesn’t depend on there being an array inside your container class.

To net this out, arrays really are evil. You may not think so if you’re new to C++. But after you write a big pile of code that uses arrays (especially if you make your code leak-proof and exception-safe), you’ll learn — the hard way. Or you’ll learn the easy way by believing those who’ve already done things like that. The choice is yours.

How can I make a perl-like associative array in C++?

Use the standard class template std::map<Key,Val>:

#include <string>
#include <map>
#include <iostream>

int main()
{
  // age is a map from string to int
  std::map<std::string, int, std::less<std::string>>  age;

  age["Fred"] = 42;                     // Fred is 42 years old
  age["Barney"] = 37;                   // Barney is 37

  if (todayIsFredsBirthday())           // On Fred's birthday...
    ++ age["Fred"];                     // ...increment Fred's age

  std::cout << "Fred is " << age["Fred"] << " years old\n";
  // ...
}

Nit: the order of elements in a std::map<Key,Val> are in the sort order based on the key, so from a strict standpoint, that is different from Perl’s associative arrays which are unordered. If you want an unsorted version for a closer match, you can use std::unordered_map<Key,Val> instead.

Is the storage for a std::vector<T> or a std::array<T,N> guaranteed to be contiguous?

Yes.

This means the following technique is safe:

#include <vector>
#include <array>
#include "Foo.h"  /* get class Foo */

// old-style code that wants an array
void f(Foo* array, unsigned numFoos);

void g()
{
  std::vector<Foo> v;
  std::array<Foo, 10> a;
  // ...
  f(v.data(), v.size());  // Safe
  f(a.data(), a.size());  // Safe
}

In general, it means you are guaranteed that &v[0] + n == &v[n], where v is a non-empty std::vector<T> or std::array<T,N> and n is an integer in the range 0 .. v.size()-1.

However v.begin() is not guaranteed to be a T*, which means v.begin() is not guaranteed to be the same as &v[0]:

void g()
{
  std::vector<Foo> v;
  // ...
  f(v.begin(), v.size());  // error, not guaranteed to be the same as &v[0]
    ↑↑↑↑↑↑↑↑↑ // cough, choke, gag; use v.data() instead
}

Also, using &v[0] is undefined behavior if the std::vector or std::array is empty, while it is always safe to use the .data() function.

Note: It’s possible the above code might compile for you today. If your compiler vendor happens to implement std::vector or std::array iterators as T*’s, the above may happen to work on that compiler – and at that, possibly only in release builds, because vendors often supply debug iterators that carry more information than a T*. But even if this code happens to compile for you today, it’s only by chance because of a particular implementation. It’s not portable C++ code. Use .data() for these situations.

Why doesn’t C++ provide heterogeneous containers?

The C++ standard library provides a set of useful, statically type-safe, and efficient containers. Examples are vector, list, and map:

vector<int> vi(10);
vector<Shape*> vs;
list<string> lst;
list<double> l2
map<string,Record*> tbl;
unordered_map<Key,vector<Record*> > t2;

These containers are described in all good C++ textbooks, and should be preferred over arrays and “home cooked” containers unless there is a good reason not to.

These containers are homogeneous; that is, they hold elements of the same type. If you want a container to hold elements of several different types, you must express that either as a union or (usually much better) as a container of pointers to a polymorphic type. The classical example is:

vector<Shape*> vi;  // vector of pointers to Shapes

Here, vi can hold elements of any type derived from Shape. That is, vi is homogeneous in that all its elements are Shapes (to be precise, pointers to Shapes) and heterogeneous in the sense that vi can hold elements of a wide variety of Shapes, such as Circles, Triangles, etc.

So, in a sense all containers (in every language) are homogeneous because to use them there must be a common interface to all elements for users to rely on. Languages that provide containers deemed heterogeneous simply provide containers of elements that all provide a standard interface. For example, Java collections provide containers of (references to) Objects and you use the (common) Object interface to discover the real type of an element.

The C++ standard library provides homogeneous containers because those are the easiest to use in the vast majority of cases, gives the best compile-time error message, and imposes no unnecessary run-time overheads.

If you need a heterogeneous container in C++, define a common interface for all the elements and make a container of those. For example:

class Io_obj { /* ... */ };    // the interface needed to take part in object I/O

vector<Io_obj*> vio;           // if you want to manage the pointers directly
vector<shared_ptr<Io_obj>> v2; // if you want a "smart pointer" to handle the objects

Don’t drop to the lowest level of implementation detail unless you have to:

vector<void*> memory;          // rarely needed

A good indication that you have “gone too low level” is that your code gets littered with casts.

Using an Any class, such as boost::any, can be an alternative in some programs:

vector<any> v = { 5, "xyzzy", 3.14159 };

If all objects you want to store in a container are publicly derived from a common base class, you can then declare/define your container to hold pointers to the base class. You indirectly store a derived class object in a container by storing the object’s address as an element in the container. You can then access objects in the container indirectly through the pointers (enjoying polymorphic behavior). If you need to know the exact type of the object in the container you can use dynamic_cast<> or typeid(). You’ll probably need the Virtual Constructor Idiom to copy a container of disparate object types. The downside of this approach is that it makes memory management a little more problematic (who “owns” the pointed-to objects? if you delete these pointed-to objects when you destroy the container, how can you guarantee that no one else has a copy of one of these pointers? if you don’t delete these pointed-to objects when you destroy the container, how can you be sure that someone else will eventually do the deleteing?). It also makes copying the container more complex (may actually break the container’s copying functions since you don’t want to copy the pointers, at least not when the container “owns” the pointed-to objects). In that case, you can use std::shared_ptr to manage the objects, and the containers will copy correctly.

The second case occurs when the object types are disjoint — they do not share a common base class. The approach here is to use a handle class. The container is a container of handle objects (by value or by pointer, your choice; by value is easier). Each handle object knows how to “hold on to” (i.e., maintain a pointer to) one of the objects you want to put in the container. You can use either a single handle class with several different types of pointers as instance data, or a hierarchy of handle classes that shadow the various types you wish to contain (requires the container be of handle base class pointers). The downside of this approach is that it opens up the handle class(es) to maintenance every time you change the set of types that can be contained. The benefit is that you can use the handle class(es) to encapsulate most of the ugliness of memory management and object lifetime.

How can I build a heterogeneous <favorite container> of objects of different types?

See heterogeneous containers.

Why can’t I assign a vector<Apple*> to a vector<Fruit*>?

Because that would open a hole in the type system. For example:

class Apple : public Fruit { void apple_fct(); /* ... */ };
class Orange : public Fruit { /* ... */ }; // Orange doesn't have apple_fct()

vector<Apple*> v;             // vector of Apples

void f(vector<Fruit*>& vf)    // innocent Fruit manipulating function
{
    vf.push_back(new Orange); // add orange to vector of fruit
}

void h()
{
    f(v);    // error: cannot pass a vector<Apple*> as a vector<Fruit*>
    for (int i=0; i<v.size(); ++i) v[i]->apple_fct();
}

Had the call f(v) been legal, we would have had an Orange pretending to be an Apple.

An alternative language design decision would have been to allow the unsafe conversion, but rely on dynamic checking. That would have required a run-time check for each access to v’s members, and h() would have had to throw an exception upon encountering the last element of v.

How can I insert/access/change elements from a linked list/hashtable/etc?

The most important thing to remember is this: don’t roll your own from scratch unless there is a compelling reason to do so. In other words, instead of creating your own list or hashtable, use one of the standard class templates such as std::vector<T> or std::list<T> or whatever.

Assuming you have a compelling reason to build your own container, here’s how to handle inserting (or accessing, changing, etc.) the elements.

To make the discussion concrete, I’ll discuss how to insert an element into a linked list. This example is just complex enough that it generalizes pretty well to things like vectors, hash tables, binary trees, etc.

A linked list makes it easy to insert an element before the first or after the last element of the list, but limiting ourselves to these would produce a library that is too weak (a weak library is almost worse than no library). This answer will be a lot to swallow for novice C++’ers, so I’ll give a couple of options. The first option is easiest; the second and third are better.

  1. Empower the List with a “current location,” and member functions such as advance(), backup(), atEnd(), atBegin(), getCurrElem(), setCurrElem(Elem), insertElem(Elem), and removeElem(). Although this works in small examples, the notion of a current position makes it difficult to access elements at two or more positions within the list (e.g., “for all pairs x,y do the following…”).
  2. Remove the above member functions from List itself, and move them to a separate class, ListPosition. ListPosition would act as a “current position” within a list. This allows multiple positions within the same list. ListPosition would be a friend of class List, so List can hide its innards from the outside world (else the innards of List would have to be publicized via public member functions in List). Note: ListPosition can use operator overloading for things like advance() and backup(), since operator overloading is syntactic sugar for normal member functions.
  3. Consider the entire iteration as an atomic event, and create a class template that embodies this event. This enhances performance by allowing the public access member functions (which may be virtual functions) to be avoided during the access, and this access often occurs within an inner loop. Unfortunately the class template will increase the size of your object code, since templates gain speed by duplicating code. For more, see [Koenig, “Templates as interfaces,” JOOP, 4, 5 (Sept 91)], and [Stroustrup, “The C++ Programming Language Third Edition,” under “Comparator”].

Can I have a container of smart pointers to my objects?

Yes, and you really want to do this, as smart pointers make your life easier and make your code more robust compared to the alternatives.

Note: forget that std::auto_ptr ever existed. Really. You don’t want to use it, ever, especially in containers. It is broken in too many ways.

Let’s motivate this discussion with an example. This first section shows why you’d want to use smart pointers in the first place - this is what not to do:

#include <vector>

class Foo {
public:
  // ...blah blah...
};

void foo(std::vector<Foo*>& v)  // ← BAD FORM: a vector of dumb pointers to Foo objects
{
  v.push_back(new Foo());
  // ...
  delete v.back();  // you have a leak if this line is skipped
  v.pop_back();     // you have a "dangling pointer" if control-flow doesn't reach this line
}

If control flow doesn’t reach either of the last two lines, either because you don’t have it in your code or you do a return or something throws an exception, you will have a leak or a “dangling pointer”; bad news. The destructor of std::vector cleans up whatever allocations were made by the std::vector object itself, but it will not clean up the allocation that you made when you allocated a Foo object, even though you put a pointer to that allocated Foo object into the std::vector object.

That’s why you’d want to use a smart pointer.

Now let’s talk about how to use a smart pointer. There are lots of smart pointers that can be copied and still maintain shared ownership semantics, such as std::shared_ptr and many others. For this example, we will use std::shared_ptr, though you might choose another based on the semantics and performance trade-offs you desire.

typedef std::shared_ptr<Foo> FooPtr;  // ← GOOD: using a smart-pointer

void foo(std::vector<FooPtr>& v)  // ← GOOD: using a container of smart-pointer
{
  // ...
}

This just works safely with all operations. The object is destroyed when the last shared_ptr to it is destroyed or set to point to something else.

Using a std::unique_ptr in a std::vector is safe, but it has some restrictions. The unique_ptr is a move-only type, it can’t be copied. This move-only restriction then applies to the std::vector containing them as well.

void create_foo(std::vector<std::unique_ptr<Foo>> &v)
{
    v.emplace_back(std::make_unique<Foo>(/* ... */));
}

If you want to put an element from this vector into another vector, you must move it to the other vector, as only one unique_ptr at a time can point to the same Foo object.

There are lots of good articles on this general topic, such as Herb Sutter’s in Dr. Dobbs and many others.

Why are the standard containers so slow?

They aren’t, they’re among the fastest on the planet.

Probably “compared to what?” is a more useful question (and answer). When people complain about standard-library container performance, we usually find one of three genuine problems (or one of the many myths and red herrings):

  • I suffer copy overhead
  • I suffer slow speed for lookup tables
  • My hand-coded (intrusive) lists are much faster than std::list

Before trying to optimize, consider if you have a genuine performance problem. In most of the cases sent to me, the performance problem is theoretical or imaginary: First measure, then optimize only if needed.

Let’s look at those problems in turn. Often, a vector<X> is slower than somebody’s specialized My_container<X> because My_container<X> is implemented as a container of pointers to X (brief spoiler: if you want that, you have it too: vector<X*> – more on this in a moment). A vector<X> (no *) holds copies of values, and copies a value when you put it into the container. This is essentially unbeatable for small values, but can be quite unsuitable for huge objects:

vector<int> vi;
vector<Image> vim;
// ...
int i = 7;
Image im("portrait.jpg");    // initialize image from file
// ...
vi.push_back(i);             // put (a copy of) i into vi
vim.push_back(im);           // put (a copy of) im into vim

Now, if portrait.jpg is a couple of megabytes and Image has value semantics (i.e., copy assignment and copy construction make copies) then vim.push_back(im) will indeed be expensive. But – as the saying goes – if it hurts so much, just don’t do it.

Move semantics and in-place construction can negate many of these costs if the vector is going to own the object, and you don’t need copies of it elsewhere.

vector<Image> vim;
vim.emplace_back("portrait.jpg"); // create image from file in place in the vector

Alternatively, either use a container of handles or a container of pointers. For example, if Image had reference semantics, the code above would incur only the cost of a copy constructor call, which would be trivial compared to most image manipulation operators. If some class, say Image again, does have copy semantics for good reasons, a container of pointers is often a reasonable solution:

vector<int> vi;
vector<Image*> vim;
// ...
Image im("portrait.jpg");    // initialize image from file
// ...
vi.push_back(7);             // put (a copy of) 7 into vi
vim.push_back(&im);          // put (a copy of) &im into vim

Naturally, if you use pointers, you have to think about resource management, but containers of pointers can themselves be effective and cheap resource handles (often, you need a container with a destructor for deleting the “owned” objects), or you can simply use a container of smart pointers.

The second frequently occurring genuine performance problem is the use of a map<string,X> for a large number of (string,X) pairs. Maps are fine for relatively small containers (say a few hundred or few thousand elements – access to an element of a map of 10000 elements costs about 9 comparisons), where less-than is cheap, and where no good hash-function can be constructed. If you have lots of strings and a good hash function, use an unordered_map.

Sometimes, you can speed up things by using (const char*,X) pairs rather than (string,X) pairs, but remember that < doesn’t do lexicographical comparison for C-style strings. Also, if X is large, you may have the copy problem also (solve it in one of the usual ways).

Intrusive lists can be really fast. However, consider whether you need a list at all: A vector is more compact and is therefore smaller and faster in many cases – even when you do inserts and erases. For example, if you logically have a list of a few integer elements, a vector is significantly faster than a list (any list). Also, intrusive lists cannot hold built-in types directly (an int does not have a link member). So, assume that you really need a list and that you can supply a link field for every element type. The standard-library list by default performs an allocation followed by a copy for each operation inserting an element (and a deallocation for each operation removing an element). For std::list with the default allocator, this can be significant. For small elements where the copy overhead is not significant, consider using an optimized allocator. Use a hand-crafted intrusive lists only where a list and the last ounce of performance is needed.

People sometimes worry about the cost of std::vector growing incrementally. Many C++ programmers used to worry about that and used reserve() to optimize the growth. After measuring their code and repeatedly having trouble finding the performance benefits of reserve() in real programs, they stopped using it except where it is needed to avoid iterator invalidation (a rare case in most code). Again: measure before you optimize.

The cost of std::vector growing incrementally in C++11 can be a lot less than it was in C++98/03 when you are using move-aware types, such as std::string or even std::vector<T>, as when the vector is reallocated, the objects are moved into the new storage instead of copied.