Built-in / Intrinsic / Primitive Data Types

Can sizeof(char) be 2 on some machines? For example, what about double-byte characters?

No, sizeof(char) is always 1. Always. It is never 2. Never, never, never.

Even if you think of a “character” as a multi-byte thingy, char is not. sizeof(char) is always exactly 1. No exceptions, ever.

Look, I know this is going to hurt your head, so please, please just read the next few FAQs in sequence and hopefully the pain will go away by sometime next week.

What are the units of sizeof?

Bytes.

For example, if sizeof(Fred) is 8, the distance between two Fred objects in an array of Freds will be exactly 8 bytes.

As another example, this means sizeof(char) is one byte. That’s right: one byte. One, one, one, exactly one byte, always one byte. Never two bytes. No exceptions.
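Both claims can be checked in code. The following is a minimal sketch, assuming a hypothetical Fred struct and a hypothetical helper named bytesBetweenAdjacentFreds(); the actual value of sizeof(Fred) depends on your implementation, but the relationships hold everywhere.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example type; its exact size is implementation-defined.
struct Fred {
  double weight;
  int    id;
};

// sizeof(char) is 1 by definition on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is always 1");

// Returns the distance, in bytes, between two adjacent elements of an
// array of Freds, measured by viewing their addresses as char pointers.
inline std::size_t bytesBetweenAdjacentFreds()
{
  Fred a[2] = {};
  const char* first  = reinterpret_cast<const char*>(&a[0]);
  const char* second = reinterpret_cast<const char*>(&a[1]);
  return static_cast<std::size_t>(second - first);
}
```

Whatever number sizeof(Fred) turns out to be, bytesBetweenAdjacentFreds() should return that same number.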

Whoa, but what about machines or compilers that support multibyte characters? Are you saying that a “character” and a char might be different?!?

Yes, that’s right: the thing commonly referred to as a “character” might be different from the thing C++ calls a char.

I’m really sorry if that hurts, but believe me, it’s better to get all the pain over with at once. Take a deep breath and repeat after me: “character and char might be different.” There, doesn’t that feel better? No? Well keep reading — it gets worse.

But, but, but what about machines where a char has more than 8 bits? Surely you’re not saying a C++ byte might have more than 8 bits, are you?!?

Yep, that’s right: a C++ byte might have more than 8 bits.

The C++ language guarantees a byte must always have at least 8 bits. But there are implementations of C++ that have more than 8 bits per byte.

Okay, I could imagine a machine with 9-bit bytes. But surely not 16-bit bytes or 32-bit bytes, right?

Wrong.

I have heard of one implementation of C++ that has 64-bit “bytes.” You read that right: a byte on that implementation has 64 bits. 64 bits per byte. 64. As in 8 times 8.

And yes, you’re right, combining with the above would mean that a char on that implementation would have 64 bits.

I’m sooooo confused. Would you please go over the rules about bytes, chars, and characters one more time?

Here are the rules:

  • The C++ language gives the programmer the impression that memory is laid out as a sequence of something C++ calls “bytes.”
  • Each of these things that the C++ language calls a byte has at least 8 bits, but might have more than 8 bits.
  • The C++ language guarantees that a char* (a char pointer) can address individual bytes.
  • The C++ language guarantees there are no bits between two bytes. This means every bit in memory is part of a byte. If you grind your way through memory via a char*, you will be able to see every bit.
  • The C++ language guarantees there are no bits that are part of two distinct bytes. This means a change to one byte will never cause a change to a different byte.
  • The C++ language gives you a way to find out how many bits are in a byte in your particular implementation: include the header <climits>, then the actual number of bits per byte will be given by the CHAR_BIT macro.
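The rules above can be sketched in a few lines. This is only an illustration: the helper names bitsIn() and copyByteByByte() are made up for this example, and the byte-wise copy is meant for trivially copyable types only.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <type_traits>

// The language guarantees at least 8 bits per byte; CHAR_BIT gives the
// actual number for this implementation.
static_assert(CHAR_BIT >= 8, "a C++ byte has at least 8 bits");

// Number of bits occupied by an object of type T (padding included).
template <typename T>
constexpr std::size_t bitsIn()
{
  return sizeof(T) * CHAR_BIT;
}

// Copies an object one byte at a time through unsigned char pointers,
// which is how memcpy() views memory. Because there are no bits between
// bytes, walking sizeof(T) bytes visits every bit of the object.
template <typename T>
T copyByteByByte(const T& source)
{
  static_assert(std::is_trivially_copyable<T>::value,
                "byte-wise copying is only valid for trivially copyable types");
  T result;
  const unsigned char* from = reinterpret_cast<const unsigned char*>(&source);
  unsigned char* to = reinterpret_cast<unsigned char*>(&result);
  for (std::size_t i = 0; i < sizeof(T); ++i)
    to[i] = from[i];
  return result;
}
```

On a typical desktop implementation bitsIn<char>() is 8, but the code above makes no such assumption.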

Let’s work an example to illustrate these rules. The PDP-10 has 36-bit words with no hardware facility to address anything within one of those words. That means a pointer can point only at things on a 36-bit boundary: it is not possible for a pointer to point 8 bits to the right of where some other pointer points.

One way to abide by all the above rules is for a PDP-10 C++ compiler to define a “byte” as 36 bits. Another valid approach would be to define a “byte” as 9 bits, and simulate a char* by two words of memory: the first could point to the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add extra instructions when compiling code using char* pointers. For example, the code generated for *p = 'x' might read the word into a register, then use bit-masks and bit-shifts to change the appropriate 9-bit byte within that word. An int* could still be implemented as a single hardware pointer, since C++ allows sizeof(char*) != sizeof(int*).

Using the same logic, it would also be possible to define a PDP-10 C++ “byte” as 12 bits or 18 bits. However, the above technique wouldn’t allow us to define a PDP-10 C++ “byte” as 8 bits, since 8×4 is 32, meaning we would skip 4 bits after every 4th byte. A more complicated approach could be used for those 4 bits, e.g., by packing nine bytes (of 8 bits each) into two adjacent 36-bit words. The important point here is that memcpy() has to be able to see every bit of memory: there can’t be any bits between two adjacent bytes.
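To make the 9-bit approach concrete, here is a rough simulation of it on an ordinary machine. Everything in it is an assumption made for illustration: each 36-bit word is modeled in the low 36 bits of a std::uint64_t, and the names Word, BytePtr, load(), store(), and next() are invented. It is not how any real PDP-10 compiler was written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: a 36-bit word lives in the low 36 bits of a uint64_t.
using Word = std::uint64_t;

constexpr unsigned kBitsPerWord = 36;
constexpr unsigned kBitsPerByte = 9;   // 4 x 9 = 36, so no bits are skipped
constexpr Word     kByteMask    = (Word(1) << kBitsPerByte) - 1;

// Simulated char*: which word, plus a bit-offset within that word
// (0, 9, 18, or 27, counted from the left end of the word).
struct BytePtr {
  std::size_t word;
  unsigned    bitOffset;
};

// How far the addressed byte must be shifted to reach the low 9 bits.
inline unsigned shiftFor(BytePtr p)
{
  return kBitsPerWord - kBitsPerByte - p.bitOffset;
}

// Simulates reading *p: fetch the word, shift, and mask.
inline unsigned load(const std::vector<Word>& memory, BytePtr p)
{
  return static_cast<unsigned>((memory[p.word] >> shiftFor(p)) & kByteMask);
}

// Simulates *p = value: clear the old 9 bits, then merge in the new ones.
// The other three bytes in the word are left untouched.
inline void store(std::vector<Word>& memory, BytePtr p, unsigned value)
{
  Word& w = memory[p.word];
  const unsigned shift = shiftFor(p);
  w &= ~(kByteMask << shift);
  w |= (Word(value) & kByteMask) << shift;
}

// Simulates ++p: advance 9 bits, moving to the next word after the 4th byte.
inline BytePtr next(BytePtr p)
{
  p.bitOffset += kBitsPerByte;
  if (p.bitOffset == kBitsPerWord) {
    p.bitOffset = 0;
    ++p.word;
  }
  return p;
}
```

Stepping a BytePtr with next() visits every bit of every word exactly once, which is the property the rules require.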

Note: one of the popular non-C/C++ approaches on the PDP-10 was to pack 5 bytes (of 7 bits each) into each 36-bit word. However, this won’t work in C or C++ since 5×7 is 35, meaning that using char*s to walk through memory would “skip” a bit after every fifth byte (and also because C++ requires bytes to have at least 8 bits).

What is a “POD type”?

A type that consists of nothing but Plain Old Data.

A POD type is a C++ type that has an equivalent in C, and that uses the same rules as C uses for initialization, copying, layout, and addressing.

As an example, the C declaration struct Fred x; does not initialize the members of the Fred variable x. To make this same behavior happen in C++, Fred would need to not have any constructors. Similarly to make the C++ version of copying the same as the C version, the C++ Fred must not have overloaded the assignment operator. To make sure the other rules match, the C++ version must not have virtual functions, base classes, non-static members that are private or protected, or a destructor. It can, however, have static data members, static member functions, and non-static non-virtual member functions.

The actual definition of a POD type is recursive and gets a little gnarly. Here’s a slightly simplified definition of POD: a POD type’s non-static data members must be public and can be of any of these types: bool, any numeric type including the various char variants, any enumeration type, any data-pointer type (that is, any type convertible to void*), any pointer-to-function type, or any POD type, including arrays of any of these. Note: data-pointers and pointers-to-function are okay, but pointers-to-member are not. Also note that references are not allowed. In addition, a POD type can’t have constructors, virtual functions, base classes, or an overloaded assignment operator.
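A compiler can do most of this checking for you. The sketch below assumes C++11 or later, where the standard library describes POD-ness as "trivial" plus "standard-layout"; those rules are close to, but not identical to, the simplified definition above. The helper looksLikePod() and the example structs are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Approximates "is T a POD type?" using standard type traits.
template <typename T>
constexpr bool looksLikePod()
{
  return std::is_trivially_default_constructible<T>::value
      && std::is_trivially_copyable<T>::value
      && std::is_standard_layout<T>::value;
}

// Public data of built-in types, plus a non-static non-virtual member
// function: still plain old data.
struct PlainFred {
  int         id;
  double      weight;
  const char* name;
  int getId() const { return id; }
};

// A user-provided constructor means C-style "no initialization" no longer applies.
struct ConstructedFred {
  ConstructedFred() : id(0) {}
  int id;
};

// A virtual function changes the object's layout.
struct VirtualFred {
  virtual ~VirtualFred() {}
  int id;
};

// References are not allowed as non-static data members of a POD type.
struct ReferenceFred {
  int& ref;
};
```

With these definitions, looksLikePod<PlainFred>() is true and the other three are false.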

When initializing non-static data members of built-in / intrinsic / primitive types, should I use the “initialization list” or assignment?

For symmetry, it is usually best to initialize all non-static data members in the constructor’s “initialization list,” even those that are of a built-in / intrinsic / primitive type. The FAQ shows you why and how.
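As a small illustration, here is a hypothetical Fred class whose constructor initializes every non-static data member in the initialization list, the built-in ones included. The member names are made up for this sketch.

```cpp
#include <cassert>
#include <string>

class Fred {
public:
  // Every member, built-in or not, is initialized in the initialization
  // list, in the same order in which the members are declared.
  Fred(int id, double weight, const std::string& name)
    : id_(id)
    , weight_(weight)
    , name_(name)
  { }

  int id() const                  { return id_; }
  double weight() const           { return weight_; }
  const std::string& name() const { return name_; }

private:
  int         id_;
  double      weight_;
  std::string name_;
};
```

The alternative, assigning id_ = id; inside the constructor body, would work for the int and the double, but it would treat them differently from name_, which is the asymmetry the advice above avoids.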

When initializing static data members of built-in / intrinsic / primitive types, should I worry about the “static initialization order fiasco”?

Yes, if you initialize your built-in / intrinsic / primitive variable by an expression that the compiler doesn’t evaluate solely at compile-time. The FAQ provides several solutions for this (subtle!) problem.
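One commonly used way around the problem, sketched here under assumptions, is to wrap the variable in a function so that it is initialized the first time it is used. The names computeStartingValue() and startingValue() are hypothetical, and this is not necessarily the solution the FAQ itself prefers.

```cpp
#include <cassert>

// Hypothetical stand-in for an initializer that the compiler does not
// evaluate solely at compile-time (imagine it reads a file or another global).
inline int computeStartingValue()
{
  return 6 * 7;
}

// Instead of a namespace-scope variable such as
//   int startingValue = computeStartingValue();
// whose initialization order relative to other translation units is
// unspecified, the variable is a local static inside a function.
// It is initialized the first time the function is called.
inline int& startingValue()
{
  static int value = computeStartingValue();
  return value;
}

// By contrast, a variable initialized by a constant expression is set up
// before any dynamic initialization runs, so the fiasco does not apply.
constexpr int kBufferSize = 1024;
```

Callers write startingValue() wherever they would have written the variable’s name.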

Can I define an operator overload that works with built-in / intrinsic / primitive types?

No, the C++ language requires that your operator overloads take at least one operand of a “class type” or enumeration type. The C++ language will not let you define an operator all of whose operands / parameters are of primitive types.

For example, you can’t define an operator== that takes two char*s and uses string comparison. That’s good news because if s1 and s2 are of type char*, the expression s1 == s2 already has a well defined meaning: it compares the two pointers, not the two strings pointed to by those pointers. You shouldn’t use pointers anyway. Use std::string instead of char*.

If C++ let you redefine the meaning of operators on built-in types, you wouldn’t ever know what 1 + 1 is: it would depend on which headers got included and whether one of those headers redefined addition to mean, for example, subtraction.
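Here is a short sketch of what is and isn’t allowed. The Direction enumeration and the sameText() helper are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstring>
#include <string>

enum class Direction { north, east, south, west };

// Allowed: at least one operand is of an enumeration type.
// Rotates a direction clockwise by the given number of quarter turns.
inline Direction operator+(Direction d, int quarterTurns)
{
  int n = (static_cast<int>(d) + quarterTurns) % 4;
  if (n < 0)
    n += 4;
  return static_cast<Direction>(n);
}

// Not allowed (the compiler rejects it): every operand is a built-in type.
//   bool operator==(const char* a, const char* b);

// An ordinary named function can compare the text two char pointers point to.
inline bool sameText(const char* a, const char* b)
{
  return std::strcmp(a, b) == 0;
}
```

With two separate char arrays that both hold "hello", comparing the pointers with == yields false because the addresses differ, whereas sameText() and std::string’s == both compare the characters and yield true.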

When I delete an array of some built-in / intrinsic / primitive type, why can’t I just say delete a instead of delete[] a?

Because you can’t.

Look, please don’t write me an email asking me why C++ is what it is. It just is. If you really want a rationale, buy Bjarne Stroustrup’s excellent book, “Design and Evolution of C++” (Addison-Wesley publishers). But if your real goal is to write some code, don’t waste too much time figuring out why C++ has these rules, and instead just abide by its rules.

So here’s the rule: if a points to an array of thingies that was allocated via new T[n], then you must, must, must delete it via delete[] a. Even if the elements in the array are built-in types. Even if they’re of type char or int or void*. Even if you don’t understand why.
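In code, the rule looks like this. The two function names are hypothetical; the second function shows that std::vector sidesteps the question entirely, because it releases its own storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums 0 + 1 + ... + (n-1) using an array allocated via new int[n].
inline int sumWithRawArray(std::size_t n)
{
  int* a = new int[n];
  for (std::size_t i = 0; i < n; ++i)
    a[i] = static_cast<int>(i);

  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];

  delete[] a;   // allocated with new[], so released with delete[]
  return sum;
}

// Same computation with std::vector: no new[], no delete[], nothing to forget.
inline int sumWithVector(std::size_t n)
{
  std::vector<int> a(n);
  for (std::size_t i = 0; i < n; ++i)
    a[i] = static_cast<int>(i);

  int sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i];
  return sum;
}
```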

How can I tell if an integer is a power of two without looping?

inline bool isPowerOf2(int i)
{
  return i > 0 && (i & (i - 1)) == 0;
}
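The trick works because subtracting 1 clears the lowest set bit and sets every bit below it. A power of two has exactly one set bit, so the bitwise AND comes out zero: 8 is 1000 in binary, 7 is 0111, and 1000 & 0111 is 0000. For 12 (1100) the AND with 11 (1011) is 1000, which is nonzero. The sketch below repeats the function so it stands alone and adds a hypothetical helper, countPowersOf2UpTo(), purely to exercise it.

```cpp
#include <cassert>

// True if i is a positive power of two (1, 2, 4, 8, ...).
inline bool isPowerOf2(int i)
{
  return i > 0 && (i & (i - 1)) == 0;
}

// Hypothetical helper: counts the powers of two in the range [1, limit].
inline int countPowersOf2UpTo(int limit)
{
  int count = 0;
  for (int i = 1; i <= limit; ++i)
    if (isPowerOf2(i))
      ++count;
  return count;
}
```

The i > 0 test matters: without it, 0 would be reported as a power of two, since 0 & anything is 0.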

What should be returned from a function?

In practice, there are a lot of cases. Here are a few of them in random order:

  • void — if you don’t need a return value, don’t return one.
  • local by value — it’s the simplest, and with a little care NRVO maximizes performance.
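  • local by pointer or reference — NOT! Please don’t do this: the local is destroyed when the function returns, so the caller ends up with a dangling pointer or reference.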
  • data member by value — excellent choice if the function is a non-static member function, and if the data member can be copied relatively quickly, e.g., int. If the data member is something that is slow to copy, this has a performance penalty if you call this member function in the inner loop of a CPU-bound application.
  • data member by pointer — okay, but make sure you don’t want to return it by reference, and make sure you use const Foo* or Foo const* if you don’t want the caller to modify the data member. Since callers might store the pointer rather than copy the data member, you should warn callers in the member function’s “contract” that they must not use the returned pointer after the this-object dies.
  • data member by reference-to-nonconst — okay, but this allows the caller to make changes to your object’s data member without your class “seeing” the change. If you have a “set” method that changes this data member, use either a reference-to-const or by-value instead. Another thing: since callers might store the reference rather than copy the data member, you should warn callers in the member function’s “contract” that they must not use the returned reference after the this-object dies.
  • data member by reference-to-const — okay, but it does allow your users to see the data type of your member variables. That means if you ever need to change the type of your member variables, the change might break the code that uses your class, and preventing that kind of breakage is one of the main points of encapsulation. You can ameliorate that risk by exposing a public typedef for the type of that member variable (and therefore the type of the reference-to-const return value), and by warning your users that they should use the typedef rather than the raw, underlying type. Another reality is that if the caller captures this reference, as opposed to copying the object, then the underlying referent might change “under the caller’s nose,” even though the type is reference-to-const. Because a lot of programmers are surprised by that, it’s smart to warn callers in the member function’s “contract.” You should also warn callers to discard the returned reference once the this-object has died.
  • shared_ptr to a member that was allocated via new — this has tradeoffs that are very similar to those of returning a member by pointer or by reference; see those bullets for the tradeoffs. The advantage is that callers can legitimately hold onto and use the returned pointer after the this-object dies.
  • local unique_ptr or shared_ptr to freestore-allocated copy of the datum. This is useful for polymorphic objects, since it lets you have the effect of return-by-value yet without the “slicing” problem. The performance needs to be evaluated on a case-by-case basis.
  • others — this list is by way of example and not by way of exclusion. In other words, this is just a starting point, not an ending point.

Murphy’s Law basically guarantees that your particular needs will fall under the last bullet, rather than any of the earlier bullets :-)
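A few of the bullets above can be sketched side by side. All the names here (Shape, Circle, Fred, makeGreeting(), makeShape()) are hypothetical, and the sketch assumes C++11 or later for unique_ptr.

```cpp
#include <cassert>
#include <memory>
#include <string>

class Shape {
public:
  virtual ~Shape() {}
  virtual std::string name() const = 0;
};

class Circle : public Shape {
public:
  std::string name() const override { return "Circle"; }
};

class Fred {
public:
  Fred(int id, const std::string& name) : id_(id), name_(name) {}

  // Data member by value: cheap to copy, so just copy it.
  int id() const { return id_; }

  // Data member by reference-to-const: avoids copying the string.
  // Contract: the reference is valid only while this Fred object is alive.
  const std::string& name() const { return name_; }

private:
  int         id_;
  std::string name_;
};

// Local by value: simple, and eligible for NRVO.
inline std::string makeGreeting(const std::string& who)
{
  std::string result = "Hello, " + who;
  return result;
}

// Local unique_ptr to a freestore-allocated object: gives the effect of
// return-by-value for a polymorphic object without slicing it.
inline std::unique_ptr<Shape> makeShape()
{
  return std::unique_ptr<Shape>(new Circle);
}
```

The caller of makeShape() owns the returned object, and it is destroyed automatically when the unique_ptr goes out of scope.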