Built-in / Intrinsic / Primitive Data Types
Can sizeof(char) be 2 on some machines? For example, what about double-byte characters?
sizeof(char) is always 1. Always. It is never 2. Never, never, never.
Even if you think of a “character” as a multi-byte thingy,
char is not.
sizeof(char) is always exactly 1. No exceptions, ever.
Look, I know this is going to hurt your head, so please, please just read the next few FAQs in sequence and hopefully the pain will go away by sometime next week.
What are the units of sizeof?
Bytes.
For example, if sizeof(Fred) is 8, the distance between two Fred objects in an array of Freds will be exactly 8 bytes.
As another example, this means
sizeof(char) is one byte. That’s right: one byte. One, one, one,
exactly one byte, always one byte. Never two bytes. No exceptions.
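Here is a minimal sketch of the "units are bytes" rule. Fred and the helper names are made up for illustration; the only thing taken from the FAQ is the claim that neighboring array elements sit exactly sizeof(Fred) bytes apart.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type, used only for illustration.
struct Fred {
    int    id;
    double weight;
};

// sizeof(char) is 1 by definition, on every implementation.
static_assert(sizeof(char) == 1, "sizeof(char) must be 1");

// Returns the distance, in bytes, between two neighboring array elements.
inline std::size_t bytesBetweenNeighbors()
{
    Fred freds[2] = {};
    const char* first  = reinterpret_cast<const char*>(&freds[0]);
    const char* second = reinterpret_cast<const char*>(&freds[1]);
    return static_cast<std::size_t>(second - first);
}

// Usage: the distance always matches sizeof(Fred), whatever that value is.
inline void checkUnits()
{
    assert(bytesBetweenNeighbors() == sizeof(Fred));
}
```

The actual value of sizeof(Fred) depends on your implementation's padding and alignment, so the sketch compares against sizeof(Fred) rather than a hard-coded 8.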
Whoa, but what about machines or compilers that support multibyte characters? Are you saying that a “character” and a char might be different?!?
Yes, that’s right: the thing commonly referred to as a “character” might be different from the thing C++ calls a char.
I’m really sorry if that hurts, but believe me, it’s better to get all the pain over with at once. Take a deep breath
and repeat after me: “character and
char might be different.” There, doesn’t that feel better? No? Well keep reading
— it gets worse.
But, but, but what about machines where a char has more than 8 bits? Surely you’re not saying a C++ byte might have more than 8 bits, are you?!?
Yep, that’s right: a C++ byte might have more than 8 bits.
The C++ language guarantees a byte must always have at least 8 bits. But there are implementations of C++ that have more than 8 bits per byte.
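If you want to see what your own implementation does, the standard header <climits> provides the CHAR_BIT macro. The sketch below is only an illustration; the bitsIn helper is a made-up name, not something from the standard library.

```cpp
#include <climits>
#include <cstddef>

// The language guarantees at least 8 bits per byte; more is allowed.
static_assert(CHAR_BIT >= 8, "a C++ byte has at least 8 bits");

// Hypothetical helper: number of bits occupied by an object of type T
// (this counts any padding bits the type may have).
template <typename T>
constexpr std::size_t bitsIn()
{
    return sizeof(T) * CHAR_BIT;
}

// A char is exactly one byte, so it has exactly CHAR_BIT bits.
static_assert(bitsIn<char>() == CHAR_BIT, "char is one byte wide");
```

On most desktop systems CHAR_BIT is 8, but code that needs the number should ask for it rather than assume it.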
Okay, I could imagine a machine with 9-bit bytes. But surely not 16-bit bytes or 32-bit bytes, right?
I have heard of one implementation of C++ that has 64-bit “bytes.” You read that right: a byte on that implementation has 64 bits. 64 bits per byte. 64. As in 8 times 8.
And yes, you’re right, combining with the above would mean that a
char on that implementation would
have 64 bits.
I’m sooooo confused. Would you please go over the rules about bytes, chars, and characters one more time?
Here are the rules:
- The C++ language gives the programmer the impression that memory is laid out as a sequence of something C++ calls “bytes.”
- Each of these things that the C++ language calls a byte has at least 8 bits, but might have more than 8 bits.
- The C++ language guarantees that a char* (char pointers) can address individual bytes.
- The C++ language guarantees there are no bits between two bytes. This means every bit in memory is part of a
byte. If you grind your way through memory via a
char*, you will be able to see every bit.
- The C++ language guarantees there are no bits that are part of two distinct bytes. This means a change to one byte will never cause a change to a different byte.
- The C++ language gives you a way to find out how many bits are in a byte in your particular implementation: include the header <climits>, then the actual number of bits per byte will be given by the CHAR_BIT macro.
Let’s work an example to illustrate these rules. The PDP-10 has 36-bit words with no hardware facility to address anything within one of those words. That means a pointer can point only at things on a 36-bit boundary: it is not possible for a pointer to point 8 bits to the right of where some other pointer points.
One way to abide by all the above rules is for a PDP-10 C++ compiler to define a “byte” as 36 bits. Another valid
approach would be to define a “byte” as 9 bits, and simulate a
char* by two words of memory: the first could point to
the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add
extra instructions when compiling code using
char* pointers. For example, the code generated for
*p = 'x' might read
the word into a register, then use bit-masks and bit-shifts to change the appropriate 9-bit byte within that word. An
int* could still be implemented as a single hardware pointer, since C++ allows
sizeof(char*) != sizeof(int*).
Using the same logic, it would also be possible to define a PDP-10 C++ “byte” as 12-bits or 18-bits. However the
above technique wouldn’t allow us to define a PDP-10 C++ “byte” as 8-bits, since 8×4 is 32, meaning every 4th byte
we would skip 4 bits. A more complicated approach could be used for those 4 bits, e.g., by packing nine bytes (of
8-bits each) into two adjacent 36-bit words. The important point here is that
memcpy() has to be able to see every
bit of memory: there can’t be any bits between two adjacent bytes.
Note: one of the popular non-C/C++ approaches on the PDP-10 was to pack 5 bytes (of 7-bits each) into each 36-bit
word. However this won’t work in C or C++ since 5×7 is 35, meaning using
char*s to walk through memory would “skip”
a bit every fifth byte (and also because C++ requires bytes to have at least 8 bits).
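The arithmetic in the PDP-10 example can be checked mechanically. This is just a sketch: the function names and the kWordBits constant are made up for illustration, and the checks simply restate the numbers from the paragraphs above.

```cpp
// Width of a PDP-10 word, in bits, as described above.
constexpr int kWordBits = 36;

// True if byteCount bytes of byteBits bits each fill wordCount words
// exactly, leaving no bits between bytes.
constexpr bool fillsExactly(int byteBits, int byteCount, int wordCount)
{
    return byteBits * byteCount == kWordBits * wordCount;
}

// Number of bits left over when byteCount bytes are packed into wordCount words.
constexpr int leftoverBits(int byteBits, int byteCount, int wordCount)
{
    return kWordBits * wordCount - byteBits * byteCount;
}

static_assert(fillsExactly(36, 1, 1), "one 36-bit byte per word");
static_assert(fillsExactly(18, 2, 1), "two 18-bit bytes per word");
static_assert(fillsExactly(12, 3, 1), "three 12-bit bytes per word");
static_assert(fillsExactly(9, 4, 1),  "four 9-bit bytes per word");
static_assert(leftoverBits(8, 4, 1) == 4, "four 8-bit bytes skip 4 bits");
static_assert(fillsExactly(8, 9, 2),  "nine 8-bit bytes fill two words");
static_assert(leftoverBits(7, 5, 1) == 1, "five 7-bit bytes skip 1 bit");
```

Any packing with leftover bits breaks the "no bits between two bytes" rule, which is why the 4-per-word 8-bit scheme and the 5-per-word 7-bit scheme are ruled out.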
What is a “POD type”?
A type that consists of nothing but Plain Old Data.
A POD type is a C++ type that has an equivalent in C, and that uses the same rules as C uses for initialization, copying, layout, and addressing.
As an example, the C declaration
struct Fred x; does not initialize the members of the Fred variable x. To make
this same behavior happen in C++,
Fred would need to not have any constructors. Similarly to make the C++
version of copying the same as the C version, the C++
Fred must not have overloaded the assignment operator. To make
sure the other rules match, the C++ version must not have virtual functions, base classes, non-static members that are private or protected, or a destructor. It can, however, have static data members, static member functions, and
non-static non-virtual member functions.
The actual definition of a POD type is recursive and gets a little gnarly. Here’s a slightly simplified definition of
POD: a POD type’s non-static data members must be
public and can be of any of these types:
bool, any numeric type
including the various
char variants, any enumeration type, any data-pointer type (that is, any type convertible to
void*), any pointer-to-function type, or any POD type, including arrays of any of these. Note: data-pointers and
pointers-to-function are okay, but pointers-to-member are not. Also note that references are
not allowed. In addition, a POD type can’t have constructors, virtual functions, base classes, or an overloaded assignment operator.
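You can ask the compiler whether a type qualifies. Since C++11 the standard splits the POD idea into two properties, "trivial" and "standard-layout," each with its own trait in <type_traits>. The sketch below combines them; PodFred, NotPodFred and IsPodLike are hypothetical names invented for this example, and the simplified definition above is not guaranteed to agree with the traits in every corner case.

```cpp
#include <type_traits>

// Looks like a C struct: public data, no constructors, no virtuals, no bases.
struct PodFred {
    int    id;
    double weight;
    char   tag[4];
    void (*callback)(int);           // pointer-to-function is okay
    int getId() const { return id; } // non-virtual member function is okay
};

// A virtual function (here, the destructor) rules this type out.
struct NotPodFred {
    virtual ~NotPodFred() {}
    int id;
};

// Hypothetical helper: "POD-like" means both trivial and standard-layout.
template <typename T>
struct IsPodLike
    : std::integral_constant<bool,
          std::is_trivial<T>::value && std::is_standard_layout<T>::value> {};

static_assert(IsPodLike<PodFred>::value, "PodFred should qualify");
static_assert(!IsPodLike<NotPodFred>::value, "NotPodFred should not qualify");
```

A static_assert like these is a cheap way to catch someone adding a constructor or virtual function to a struct that other code copies with memcpy().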
When initializing non-static data members of built-in / intrinsic / primitive types, should I use the “initialization list” or assignment?
For symmetry, it is usually best to initialize all non-static data members in the constructor’s “initialization list,” even those that are of a built-in / intrinsic / primitive type. The FAQ shows you why and how.
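As a small illustration, here is a class whose built-in members are set in the initialization list rather than by assignment in the constructor body. Counter and its members are made-up names for this sketch.

```cpp
#include <cassert>

// Hypothetical class, used only for illustration.
class Counter {
public:
    Counter(int start, double step)
        : count_(start)   // initialized here, in declaration order...
        , step_(step)     // ...not assigned later in the body
    { }

    int    count() const { return count_; }
    double step()  const { return step_; }

private:
    int    count_;
    double step_;
};

// Usage: both members have their values as soon as the body starts.
inline void checkCounter()
{
    Counter c(3, 0.5);
    assert(c.count() == 3);
    assert(c.step() == 0.5);
}
```

For an int or double the generated code is typically the same either way; the point is that every member, built-in or not, is handled in one consistent place.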
When initializing static data members of built-in / intrinsic / primitive types, should I worry about the “static initialization order fiasco”?
Yes, if you initialize your built-in / intrinsic / primitive variable by an expression that the compiler doesn’t evaluate solely at compile-time. The FAQ provides several solutions for this (subtle!) problem.
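One widely used way around the problem is to wrap the value in a function with a local static, so it is initialized the first time somebody asks for it. The sketch below assumes a made-up Config class and a made-up computeScale() function whose result is not known at compile time; it is one possible approach, not necessarily the one you should pick.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical initializer that the compiler cannot evaluate at compile time.
inline double computeScale()
{
    return std::sqrt(2.0);
}

// Hypothetical class. Instead of a static data member initialized by
// computeScale(), it exposes the value through a static member function.
class Config {
public:
    static double& scale()
    {
        // Initialized the first time scale() is called, so any caller,
        // even one running during static initialization, sees a valid value.
        static double value = computeScale();
        return value;
    }
};

// Usage: read it, or assign through the returned reference.
inline void checkScale()
{
    assert(Config::scale() > 1.41 && Config::scale() < 1.42);
    Config::scale() = 2.0;
    assert(Config::scale() == 2.0);
}
```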
Can I define an operator overload that works with built-in / intrinsic / primitive types?
No, the C++ language requires that your operator overloads take at least one operand of a “class type” or enumeration type. The C++ language will not let you define an operator all of whose operands / parameters are of primitive types.
For example, you can’t define an operator== that takes two char*s and uses string comparison. That’s good news because if s1 and s2 are of type char*, the expression s1 == s2 already has a well defined meaning: it compares the two pointers, not the two strings pointed to by those pointers. You shouldn’t use pointers anyway. Use std::string instead of char*.
If C++ let you redefine the meaning of operators on built-in types, you wouldn’t ever know what 1 + 1 is: it would depend on which headers got included and whether one of those headers redefined addition to mean, for example, subtraction.
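To make the pointer-versus-string distinction concrete, here is a sketch. All the function names and the Name struct are hypothetical; std::strcmp and std::string's operator== are the real standard facilities.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Built-in ==: compares the two addresses, not the characters.
inline bool samePointer(const char* s1, const char* s2)
{
    return s1 == s2;
}

// Compares the characters the pointers refer to.
inline bool sameText(const char* s1, const char* s2)
{
    return std::strcmp(s1, s2) == 0;
}

// An overload is allowed once at least one operand is a class type.
struct Name {
    std::string text;
};

inline bool operator==(const Name& lhs, const char* rhs)
{
    return lhs.text == rhs;
}

// Usage: two separate buffers holding the same characters.
inline void checkComparisons()
{
    char a[] = "hello";
    char b[] = "hello";
    assert(!samePointer(a, b));                  // different addresses
    assert(sameText(a, b));                      // same characters
    assert(std::string(a) == std::string(b));    // std::string compares text
    Name n{"hello"};
    assert(n == a);                              // user-defined overload
}
```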
When I delete an array of some built-in / intrinsic / primitive type, why can’t I just say delete a instead of delete[] a?
Because you can’t.
Look, please don’t write me an email asking me why C++ is what it is. It just is. If you really want a rationale, buy Bjarne Stroustrup’s excellent book, “Design and Evolution of C++” (Addison-Wesley publishers). But if your real goal is to write some code, don’t waste too much time figuring out why C++ has these rules, and instead just abide by its rules.
So here’s the rule: if a points to an array of thingies that was allocated via new T[n], then you must, must, must delete it via delete[] a. Even if the elements in the array are built-in types.
Even if they’re of type
void*. Even if you don’t understand why.
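A minimal sketch of the rule, with made-up function names. The second function shows std::vector as one way to sidestep the question entirely; that is a suggestion added here, not part of the rule itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Allocates with new int[n], so it must release with delete[].
inline int sumWithRawArray(std::size_t n)
{
    int* a = new int[n];
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<int>(i);
        total += a[i];
    }
    delete[] a;   // array form, matching new int[n]
    return total;
}

// Same computation; the vector releases its storage automatically.
inline int sumWithVector(std::size_t n)
{
    std::vector<int> a(n);
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<int>(i);
        total += a[i];
    }
    return total;
}

// Usage: 0 + 1 + 2 + 3 + 4 is 10 either way.
inline void checkSums()
{
    assert(sumWithRawArray(5) == 10);
    assert(sumWithVector(5) == 10);
}
```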
How can I tell if an integer is a power of two without looping?
inline bool isPowerOf2(int i)
{
  return i > 0 && (i & (i - 1)) == 0;
}
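Why this works: a power of two has exactly one bit set, and subtracting 1 clears that bit while setting every bit below it, so the bitwise AND of the two is zero. The sketch below repeats the function with that reasoning in comments, plus a hypothetical checkIsPowerOf2() showing a few calls.

```cpp
#include <cassert>

inline bool isPowerOf2(int i)
{
    // Example: i = 8 is 1000 in binary, i - 1 = 7 is 0111, and 1000 & 0111 == 0.
    // Counter-example: i = 12 is 1100, i - 1 = 11 is 1011, and 1100 & 1011 == 1000.
    // The i > 0 test rejects zero and negative values before the bit trick runs.
    return i > 0 && (i & (i - 1)) == 0;
}

// Usage sketch.
inline void checkIsPowerOf2()
{
    assert(isPowerOf2(1));     // 2 to the power 0
    assert(isPowerOf2(64));
    assert(!isPowerOf2(0));
    assert(!isPowerOf2(12));
    assert(!isPowerOf2(-8));
}
```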
What should be returned from a function?
In practice, there are a lot of cases. Here are a few of them in random order:
- void — if you don’t need a return value, don’t return one.
- local by value — it’s the simplest, and with a little care NRVO maximizes performance.
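- local by pointer or reference — NOT! Please don’t do this: the local is destroyed when the function returns, so the caller is left with a dangling pointer or reference.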
- data member by value — excellent choice if the function is a non-static member function, and if the data member
can be copied relatively quickly, e.g.,
int. If the data member is something that is slow to copy, this has a performance penalty if you call this member function in the inner loop of a CPU-bound application.
- data member by pointer — okay, but make sure you don’t want to return it by reference, and make sure you use
Foo const* if you don’t want the caller to modify the data member. Since callers might store the pointer rather than copy the data member, you should warn callers in the member function’s “contract” that they must not use the returned pointer after the this-object dies.
- data member by reference-to-nonconst — okay, but this allows the caller to make changes to your object’s data
member without your class “seeing” the change. If you have a “set” method that changes this data member, use either a
reference-to-const or by-value instead. Another thing: since callers might store the reference rather than copy
the data member, you should warn callers in the member function’s “contract” that they must not use the returned
reference after the this-object dies.
- data member by reference-to-const — okay, but it does allow your users to see the data type of your member
variables. That means if you ever need to change the type of your member variables, the change might break the code that uses your class, and preventing that kind of breakage is one of the main points of encapsulation. You can ameliorate that risk by exposing a typedef for the type of that member variable (and therefore the type of the reference-to-const return value), and by warning your users that they should use the typedef rather than the raw, underlying type. Another reality is that if the caller captures this reference, as opposed to copying the object, then the underlying referent might change “under the caller’s nose,” even though the type is reference-to-const. Because a lot of programmers are surprised by that, it’s smart to warn callers in the member function’s “contract.” You should also warn callers to discard the returned reference once the this-object has died.
- shared_ptr to a member that was allocated via new — this has tradeoffs that are very similar to those of returning a member by pointer or by reference; see those bullets for the tradeoffs. The advantage is that callers can legitimately hold onto and use the returned pointer after the this-object dies.
- shared_ptr to a freestore-allocated copy of the datum. This is useful for polymorphic objects, since it lets you have the effect of return-by-value yet without the “slicing” problem. The performance needs to be evaluated on a case-by-case basis.
- others — this list is by way of example and not by way of exclusion. In other words, this is just a starting point, not an ending point.
Murphy’s Law basically guarantees that your particular needs will fall under the last bullet, rather than any of the earlier bullets.
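A few of the choices from the list above can be sketched in one small class. Employee, its members, and makeGreeting() are all hypothetical names invented for this illustration; treat it as a starting point, not a recommendation for your design.

```cpp
#include <cassert>
#include <memory>
#include <string>

class Employee {
public:
    Employee(int id, const std::string& name)
        : id_(id), name_(name), notes_(std::make_shared<std::string>())
    { }

    // Data member by value: cheap to copy, nothing for callers to misuse.
    int id() const { return id_; }

    // Data member by reference-to-const: no copy, but the caller sees later
    // changes and must not use the reference after this object dies.
    const std::string& name() const { return name_; }

    // shared_ptr to a member allocated on the freestore: the caller may
    // keep using it after this object dies.
    std::shared_ptr<std::string> notes() const { return notes_; }

    void rename(const std::string& newName) { name_ = newName; }

private:
    int id_;
    std::string name_;
    std::shared_ptr<std::string> notes_;
};

// Local by value: simple, and the compiler can usually elide the copy.
inline std::string makeGreeting(const Employee& e)
{
    std::string greeting = "Hello, " + e.name();
    return greeting;
}

// Usage: the captured reference changes "under the caller's nose."
inline void checkReturns()
{
    Employee e(7, "Ann");
    const std::string& seen = e.name();
    e.rename("Bob");
    assert(seen == "Bob");
    assert(makeGreeting(e) == "Hello, Bob");
}
```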