On vector<bool>

by Howard Hinnant

vector<bool> has taken a lot of heat over the past decade, and not without reason. However, I believe it is way past time to draw back some of the criticism and explore this area with a dispassionate scrutiny of detail.

There are really two issues here:

  1. Is the data structure of an array of bits a good data structure?
  2. Should the aforementioned data structure be named vector<bool>?

I have strong opinions on both of these questions. And to get this out of the way up front:

  1. Yes.
  2. No.

The array of bits data structure is a wonderful data structure. It is often both a space and a speed optimization over the array of bools data structure, if properly implemented. However, it does not behave exactly like an array of bools, and so should not pretend to be one.

First, what's wrong with vector<bool>?

Because vector<bool> holds bits instead of bools, it can't return a bool& from its indexing operator or iterator dereference. This can play havoc with quite innocent-looking generic code. For example:

template <class T>
void
process(T& t)
{
    // do something with t
}

template <class T, class A>
void
test(std::vector<T, A>& v)
{
    for (auto& t : v)
        process(t);
}

The above code works for all T except bool. When instantiated with bool, you will receive a compile-time error along the lines of:

error: non-const lvalue reference to type 'std::__bit_reference<std::vector<bool, std::allocator<bool>>, true>' cannot bind to
      a temporary of type 'reference' (aka 'std::__bit_reference<std::vector<bool, std::allocator<bool>>, true>')
    for (auto& t : v)
               ^ ~
note: in instantiation of function template specialization 'test<bool, std::allocator<bool>>' requested here
    test(v);
    ^
vector:2124:14: note: selected 'begin' function
      with iterator type 'iterator' (aka '__bit_iterator<std::vector<bool, std::allocator<bool>>, false>')
    iterator begin()
             ^
1 error generated.

This is not a great error message. But it is about the best the compiler can do. The user is confronted with implementation details of vector, and the message in a nutshell says that vector is not working with a perfectly valid range-based for statement. The conclusion the client comes to here is that the implementation of vector is broken. And he would be at least partially correct.
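
To see the proxy behavior outside of a template, here is a small self-contained illustration (the comments describe what happens on typical bit-packed implementations):

#include <cassert>
#include <vector>

int main()
{
    std::vector<bool> v(8);
    auto x = v[0];    // x is vector<bool>::reference, a proxy object, not bool&
    v[0] = true;
    assert(bool(x));  // the proxy still refers into v, so it observes the write
    bool b = v[0];    // converting to bool takes a snapshot
    v[0] = false;
    assert(b);        // b keeps its old value
}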

But consider if, instead of vector<bool> being a specialization, there existed a separate class template std::bit_vector<A = std::allocator<bool>>, and the coder had written:

template <class A>
void
test(bit_vector<A>& v)
{
    for (auto& t : v)
        process(t);
}

Now one gets a similar error message:

error: non-const lvalue reference to type 'std::__bit_reference<std::bit_vector<std::allocator<bool>>, true>' cannot bind to
      a temporary of type 'reference' (aka 'std::__bit_reference<std::bit_vector<std::allocator<bool>>, true>')
    for (auto& t : v)
               ^ ~
note: in instantiation of function template specialization 'test<std::allocator<bool>>' requested here
    test(v);
    ^
bit_vector:2124:14: note: selected 'begin' function
      with iterator type 'iterator' (aka '__bit_iterator<std::bit_vector<std::allocator<bool>>, false>')
    iterator begin()
             ^
1 error generated.

And although the error message is similar, the coder is far more likely to see that he is using a dynamic array of bits, and to find it understandable that one can't form a reference to a bit.

That is, names are important. And creating a specialization that behaves differently from the primary template, when the primary template would have worked, is poor practice.

But what's right with vector<bool>?

For the rest of this article assume that we did indeed have a std::bit_vector<A = std::allocator<bool>> and that vector was not specialized on bool. bit_vector<> can be much more than simply a space optimization over vector<bool>; it can also be a very significant performance optimization. But to achieve this higher performance, your vendor has to adapt many of the standard algorithms to have specialized code (optimizations) when processing sequences defined by bit_vector<>::iterators.

find

For example consider this code:

template <class C>
typename C::iterator
test()
{
    C c(100000);
    c[95000] = true;
    return std::find(c.begin(), c.end(), true);
}

How long does std::find take in the above example for:

  A. A hypothetical non-specialized vector<bool>?
  B. A hypothetical bit_vector<> using an optimized find?
  C. A hypothetical bit_vector<> using the unoptimized generic find?

I'm testing on an Intel Core i5 in 64-bit mode. I am normalizing all answers such that the speed of A is 1 (smaller is faster):

  A. 1.0
  B. 0.013
  C. 1.6

An array of bits can be a very fast data structure for a sequential search! The optimized find inspects 64 bits at a time. And due to the space optimization, it is much less likely to cause a cache miss. However, if the implementation fails to do this and instead naively checks one bit at a time, then this giant 75X optimization turns into a significant pessimization.
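
To make "inspecting 64 bits at a time" concrete, here is a minimal sketch of such a find. It is not any vendor's actual code: it assumes C++20 <bit>, assumes bits are packed least-significant-bit first, and handles only whole words (a real implementation must also mask the partial words at each end of the range):

#include <bit>
#include <cstddef>
#include <cstdint>

// Return the index of the first true bit, or nwords * 64 if there is none.
std::size_t find_true(const std::uint64_t* words, std::size_t nwords)
{
    for (std::size_t i = 0; i < nwords; ++i)
        if (words[i] != 0)                               // tests 64 bits per iteration
            return i * 64 + std::countr_zero(words[i]);  // locate the bit within the word
    return nwords * 64;
}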

count

std::count can be optimized much like std::find to process a word of bits at a time:

template <class C>
typename C::difference_type
test()
{
    C c(100000);
    c[95000] = true;
    return std::count(c.begin(), c.end(), true);
}

My results are:

  A. 1.0
  B. 0.044
  C. 1.02

Here the results are not quite as dramatic as for the std::find case. However, any time you can speed up your code by a factor of 20, you should do so!
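
The heart of such an optimization is one population count per word. A minimal sketch under the same assumptions as before (C++20 <bit>, whole words only):

#include <bit>
#include <cstddef>
#include <cstdint>

std::size_t count_true(const std::uint64_t* words, std::size_t nwords)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < nwords; ++i)
        n += std::popcount(words[i]);   // counts 64 bits per iteration
    return n;
}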

fill

std::fill is yet another example:

template <class C>
void
test()
{
    C c(100000);
    std::fill(c.begin(), c.end(), true);
}

My results are:

  A. 1.0
  B. 0.40
  C. 38.

The optimized fill is over twice as fast as the non-specialized vector<bool>. But if the vendor neglects to specialize fill for bit-iterators, the results are disastrous! Naturally the results are identical for the closely related fill_n.
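
A word-at-a-time fill is little more than storing an all-ones or all-zeros word down the buffer. A minimal sketch, again ignoring the masked partial words a real implementation must handle at the ends:

#include <cstddef>
#include <cstdint>

void fill_bits(std::uint64_t* words, std::size_t nwords, bool value)
{
    const std::uint64_t w = value ? ~std::uint64_t{0} : std::uint64_t{0};
    for (std::size_t i = 0; i < nwords; ++i)
        words[i] = w;   // stores 64 bools' worth of data at a time
}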

copy

std::copy is yet another example:

template <class C>
void
test()
{
    C c1(100000);
    C c2(100000);
    std::copy(c1.begin(), c1.end(), c2.begin());
}

My results are:

  A. 1.0
  B. 0.36
  C. 34.

The optimized copy approaches three times the speed of the non-specialized vector<bool>. But if the vendor neglects to specialize copy for bit-iterators, the results are not good. And if the copy is not aligned on word boundaries (unlike the above example), then the optimized copy slows down to the same speed as the copy for A. Results for copy_backward, move and move_backward are similar.
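
The reason alignment matters: when the source and destination ranges start at the same bit offset within a word, the copy degenerates to copying whole words; when the offsets differ, every destination word must be assembled from two shifted source words. A sketch of the easy, mutually aligned case only (hypothetical helper, whole words only):

#include <cstddef>
#include <cstdint>
#include <cstring>

void copy_bits_aligned(const std::uint64_t* src, std::uint64_t* dst, std::size_t nwords)
{
    // Both ranges begin at bit 0 of a word, so whole words can be copied verbatim.
    std::memcpy(dst, src, nwords * sizeof(std::uint64_t));
}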

swap_ranges

std::swap_ranges is yet another example:

template <class C>
void
test()
{
    C c1(100000);
    C c2(100000);
    std::swap_ranges(c1.begin(), c1.end(), c2.begin());
}

My results are:

  A. 1.0
  B. 0.065
  C. 4.0

Here bit_vector<> is 15 times faster than an array of bools, and over 60 times as fast as working a bit at a time.
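
In the mutually aligned case, a word-at-a-time swap_ranges is equally simple; a minimal sketch:

#include <cstddef>
#include <cstdint>
#include <utility>

void swap_bits(std::uint64_t* a, std::uint64_t* b, std::size_t nwords)
{
    for (std::size_t i = 0; i < nwords; ++i)
        std::swap(a[i], b[i]);   // exchanges 64 bools per iteration
}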

rotate

std::rotate is yet another example:

template <class C>
void
test()
{
    C c(100000);
    std::rotate(c.begin(), c.begin()+c.size()/4, c.end());
}

My results are:

  A. 1.0
  B. 0.59
  C. 17.9

Yet another example of good results with an optimized algorithm and very poor results without this extra attention.

equal

std::equal is yet another example:

template <class C>
bool
test()
{
    C c1(100000);
    C c2(100000);
    return std::equal(c1.begin(), c1.end(), c2.begin());
}

My results are:

  A. 1.0
  B. 0.016
  C. 3.33

If you're going to compare a bunch of bools, it is much, much faster to pack them into bits and compare a word of bits at a time, rather than compare individual bools, or individual bits!
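
The optimized equal is essentially a word-wise comparison. A minimal sketch of the mutually aligned, whole-word case:

#include <cstddef>
#include <cstdint>

bool equal_bits(const std::uint64_t* a, const std::uint64_t* b, std::size_t nwords)
{
    for (std::size_t i = 0; i < nwords; ++i)
        if (a[i] != b[i])   // one comparison covers 64 elements
            return false;
    return true;
}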

Summary

The dynamic array of bits is a very good data structure if attention is paid to optimizing algorithms that can process up to a word of bits at a time. In this case it becomes not only a space optimization but a very significant speed optimization. If such attention to detail is not given, then the space optimization leads to a very significant speed pessimization.

But it is a shame that the C++ committee gave this excellent data structure the name vector<bool>, and that it gives neither guidance nor encouragement on the critical generic algorithms that need to be optimized for this data structure. Consequently, few std::lib implementations go to this trouble.


Comments (12)

Tianyu Zhu said on Nov 22, 2012 02:58 PM:

Could the following code

for (auto& t : v)
    process(t);

be fixed if we had used auto&& instead?
AaronMcDaid said on Nov 22, 2012 04:52 PM:

Tianyu is right.

> Could the following code for (auto& t : v) process(t); be fixed if we had used auto&& instead?

auto&& works instead.
Martin Ba said on Nov 23, 2012 07:00 AM:

"... few std::lib implementations go to this trouble. ..."

It would certainly be interesting to know which ones do. (And what about boost::dynamic_bitset ...)
gmit said on Nov 23, 2012 05:46 PM:

I remember being not too happy with multithreaded modification of vector<bool> members back in VS2003. What does the C++ standard say about this?
Robert Klarer said on Nov 24, 2012 10:43 AM:

To Tianyu and Aaron:

I think that Howard's point is that this is generic code. It shouldn't be necessary to add a workaround -- such as the one that you suggest -- to the example just so that was can handle the vector<bool> special case.
Robert Klarer said on Nov 24, 2012 10:45 AM:

s:was:we:

Can we get an edit button? grin
AaronMcDaid said on Nov 24, 2012 10:53 AM:

Robert, I agree with you on both counts, including the desirability of an edit button.

Anyway, is there any way to get an unspecialized vector<bool>? As a workaround, perhaps we could define a class Bool, which behaves very similarly to bool, and then use a vector<Bool>?
Glen said on Nov 27, 2012 08:14 PM:

Testing testing 123.
Glen said on Nov 27, 2012 08:31 PM:

So what you're saying about vector<bool> is: It's a bit of a drama...

* It's the sad tail of how a vector of bool wished to be free of the burden of being a bit special. i.e. not.
* Of being a bit_vector, but not quite.
* vector<bool> was not one, and not nothing was going to make it sound like one.
* People still called it vector<bool>, but when you're not one, you're nothing and people booly you all the time and you don't feel right.
* It really just wanted to be what many thought it already was: a dynamic array of bool. But some people couldn't handle the truth.
* If that sounds false, it's not, it's a true story.
* An array of bools just isn't cut out to be a no good two bit vector. Nothing could be clearer!
* And until vector<bool> stopped acting like a bad bit_vector thrown out of alignment, the sorry tail would never come to an easy end.
* So this act just had to end. But when?
* When would a bit_vector reach a standard where it could take over from a vector of bool and give the performance of a lifetime?
* Well I don't know, that remains to be seen. But the moral of this story is clear enough:
* All great performers are in a class of their own. Nothing is false about that.
* If you think it's not true then, nothing will happen and that might be an error.
* You just can't have a happy ending if you remove your bits from one and have nowhere else to put them.

That's my take on the story so far!

I hope Howard gets this show made because his last movey had a similar plot line: A bit of work meets a lot of performance.

Get tickets early for this show! If you do, I see it playing out like this:

* Code will break but for good reason, and it'll be easy to fix without losing the plot.
* Howard has really made two much better stories from a bad variant of one. The script is simply vector<bool> reborn and a bit better sequel.
* Upgrading to bit_vector will take a bit of work, but the performance will be worth it!
* Upgrading a vector of bool to itself will be no effort and if it's to a vector of char it only gets easier to follow. The plot makes sense.
* bit_vector and vector<bool> themselves will also be a lot cleaner than before! Family viewing and happy times for the STL, both of them!
* bit_vector can even then be implemented using vector<x>. Just a spooky observation on how if you split up you can always get back together again even if no one else thinks it's a good idea...
* With clear names for different things, in the future it'll be obvious before you start watching, what you're going to see.
* With clear guides to the performance that can be expected, we know how this story ends.

The real story is... What are we waiting for? Let's make a move on this!

I wish these changes had made it into the C++11 standard before that final curtain fell. These are excellent changes and welcome, yesterday!

I nominate an Academy Award to: Howard Hinnant.
M. Verdone said on Nov 24, 2014 04:55 AM:

Could you remove the gendered language (namely, replace "he" with "they")? It distracts from your argument.
Peter Cordes said on May 17, 2017 07:36 PM:

Working 64 bits at a time is a big win for an array of 1-byte bool, as well as for a bit-vector. (bool is 1 byte on x86 Windows and SysV ABIs).

You make an excellent point that C++ should portably expose highly optimized bit-vector functions, but in most cases we should only expect about 8x speedups (times the difference in bandwidth from fitting in L1/L2/L3 or not), compared to optimized bool-array functions. So about 8x to 16x, for your case where 100k bits fits in L1 but 100k bytes only fits in L2 on an Intel CPU.


There's clearly a lot to be gained from platform-specific optimized library implementations of these functions, especially using SIMD for cases where the compiler doesn't optimally auto-vectorize a uint64_t pure-C++ implementation. All of these algorithms except count() and rotate() are easily/profitably vectorizable for bit-arrays, and even those are easy for byte-arrays.

----


fill() and copy() being only ~2x to ~3x faster than for an array of bool looks like a major missed-optimization.
fill() and copy() should be as fast as memset/memcpy for both byte and bit elements, so we should expect about 8x to 16x, depending on what kind of i5, assuming the compiler gets it right.

find() with byte elements (array of bool) probably wasn't optimized properly. 75x is too much. Perhaps the compiler-generated asm was only checking one byte at a time instead of 8. A search loop like while(*p++ == (!search_key) * 0x0101010101010101ULL); to search for a 64-bit chunk with a zero byte will get you in the neighbourhood.

And of course this can use SIMD. An optimal implementation might AND or OR together a whole cache line of vectors before checking if any of them were 0 or 1 (depending on which we're searching for), to amortize the cost of pcmpeqb / pmovmskb / test/jcc. Like glibc's strlen does for long strings, except it has to use pminub instead of pand because its input isn't bool.

If the allocator used calloc instead of zeroing the memory itself, then after creating a 100k-element vector and touching one element, only one 4k page of the vector would be dirty. The other pages would still be copy-on-write mapped to the same physical page of zeroed memory. (glibc's calloc on Linux knows that mmap returns zeroed pages, and takes advantage of this for large allocations). So reading the whole thing would only touch 8kiB of physical memory, even for the 100k byte-element vector. Since cache is physically tagged, it would all fit in 8kiB of L1d cache, and find()'s loads would hit in L1 until it got to the dirty page where you wrote an element. Since 4kiB is a smaller fraction of the total size for the byte-element case, I'd expect find() to show a speed ratio of less than 8x with optimal implementations for both cases.

(In my experience on Linux, std::vector fails to take advantage of calloc, so memory gets zeroed twice: once by the kernel and then again in user-space, dirtying it.)

---

bit-vector count() needs AVX2 to get much if any speedup over scalar 64-bit popcnt, if the scalar loop is optimal. (SSSE3 pshufb is a byte-shuffle. You can use it to do 16 parallel LUT lookups from a 16-entry table, which you can use to count the bits of the low and high nibbles of each byte. The AVX2 version does this for two 128b lanes in parallel, so a vector shift, vector AND, two shuffles, and two vector adds accumulate the count for 32 bytes. vs. 4 popcnt + 4 scalar add instructions)

count() of an array of 1-byte bool can SIMD very easily. Even SWAR (SIMD-within-a-register) works great. Since we know that only the low bit of each byte can be set, we can load and add in 64-bit chunks without worrying about carry from one byte into the next, up to 255 times. At that point we have to leave the inner loop to unpack and add the bytes. (x86 psadbw against an all-zero vector is perfect for horizontal sums of byte elements). Anyway, an optimal count() for 1-byte bool should process about twice as many bytes per clock cycle as for a bit-vector on a Sandybridge-family CPU, and thus only be about 4x slower since it has 8x as many bytes to process. But this only happens if load bandwidth can keep up.


rotate() with an offset that's a multiple of 8 bits is just a memmove with wrap-around. If you need to shift bits within bytes, then it's a lot of work for scalar or vector. Probably an optimal vector implementation is at least 2x faster than the best scalar asm a compiler (or human) could produce, since variable-count shifts are multiple uops on Intel CPUs. :( Scalar can use SHRD (double-shift) to shift bits from one register into another instead of separate shifts -> OR, but compilers don't like to use SHRD in my experience.

1-byte rotate() is always cheap, but touching more memory sucks.
Peter Cordes said on Oct 8, 2017 05:07 PM:

BTW, libc++ (http://libcxx.llvm.org/) does specialize some of these algorithms for its vector<bool> __bit_iterator. This leads to *very* good code-gen with clang for count (auto-vectorized with AVX2, like clang normally does for __builtin_popcountll in a loop), and several other functions.

The C++ implementations are in the __bit_reference header: https://llvm.org/viewvc/llvm-project/libcxx/trunk/include/__bit_reference?revision=304357&view=markup

The functions use bit-shifts and masking to handle the head and tail not-a-whole-chunk bits, and the middle of the range is handled with a loop over `long` chunks or something, which clang can often auto-vectorize.

There are specializations for find, count, fill, copy (with a special case for when the src has the same alignment as the destination), copy_backward, move/move_backward, swap_ranges (aligned and unaligned), rotate, and equal (aligned and unaligned).

With two different iterators, it's much more efficient when they're aligned *relative to each other*, because then the main loop over 64-bit chunks doesn't have to do any shifting for the bits in each chunk to match with the chunk from the other range.

I put some of these up on Godbolt (https://godbolt.org/g/pYuDHZ) to see what compilers will do. With libstdc++ (the normal default on Linux, unless you use clang -stdlib=libc++), the asm output is pretty abysmal. x86-64 clang -O3 -march=haswell doesn't optimize away the bit-at-a-time stuff in most of the libstdc++ functions. Everything looks pretty good for libc++, except std::rotate. It doesn't look wonderful: a lot of loading/storing before eventually using memmove. Also a *lot* of different branches, but of course any one execution will only take one path. This easy case hopefully takes a pretty simple path, even though it calls a fully-general implementation of the function in the header.