Standard C++

Newbie Questions & Answers

What is this “newbie section” all about?

It’s a randomly ordered collection containing a few questions newbies might ask.

This section doesn’t pretend to be organized. Think of it as random. In truth, think of it as a hurried, initial cut by a busy guy.
This section doesn’t pretend to be complete. Think of it as offering a little help to a few people. It won’t help everyone and it might not help you.

Hopefully someday we’ll be able to improve this section, but for now, it is incomplete and unorganized. If that bothers you, my suggestion is to click that little x on the extreme upper right of your browser window :-).

Where do I start?

Read the FAQ, especially the section on learning C++, and read books plural.

But if everything still seems too hard, if you’re feeling bombarded with mysterious terms and concepts, if you’re wondering how you’ll ever grasp anything, do this:

Type in some C++ code from any of the sources listed above.
Get it to compile and run.
Repeat.

That’s it. Just practice and play. Hopefully that will give you a foothold.

Here are some places you can get “sample problems” (in alphabetical order):

How do I read a string from input?

You can read a single, whitespace-terminated word like this:

    #include<iostream>
    #include<string>
    using namespace std;

    int main()
    {
        cout << "Please enter a word:\n";

        string s;
        cin>>s;

        cout << "You entered " << s << '\n';
    }

Note that there is no explicit memory management and no fixed-sized buffer that you could possibly overflow.

If you really need a whole line (and not just a single word) you can do this:

    #include<iostream>
    #include<string>
    using namespace std;

    int main()
    {
        cout << "Please enter a line:\n";

        string s;
        getline(cin,s);

        cout << "You entered " << s << '\n';
    }
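If you want every line rather than just one, the same getline call can drive a loop. What follows is only a minimal sketch: the helper name read_lines is made up for this example, and it takes any std::istream so you can try it with a std::istringstream as well as with std::cin.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library):
// collects every line from `in` until input runs out.
std::vector<std::string> read_lines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string s;
    while (std::getline(in, s))   // the loop ends when getline fails, e.g. at end of file
        lines.push_back(s);
    return lines;
}
```

Calling read_lines(std::cin) reads from the keyboard; passing a std::istringstream is handy for experiments.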

For a brief introduction to standard library facilities, such as iostream and string, see Chapter 3 of TC++PL3 (available online). For a detailed comparison of simple uses of C and C++ I/O, see “Learning Standard C++ as a New Language”, which you can download from Stroustrup’s publications list.

How do I write this very simple program?

Often, especially at the start of semesters, there is a small flood of questions about how to write very simple programs. Typically, the problem to be solved is to read in a few numbers, do something with them, and write out an answer. Here is a sample program that does that:

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main()
{
    vector<double> v;

    double d;
    while(cin>>d) v.push_back(d);   // read elements
    if (!cin.eof()) {               // check if input failed
        cerr << "format error\n";
        return 1;                   // error return
    }

    cout << "read " << v.size() << " elements\n";

    reverse(v.begin(),v.end());
    cout << "elements in reverse order:\n";
    for (int i = 0; i<v.size(); ++i) cout << v[i] << '\n';

    return 0;                       // success return
}

Here are a few observations about this program:

This is a Standard ISO C++ program using the standard library. Standard library facilities are declared in namespace std in headers without a .h suffix.
If you want to compile this on a Windows machine, you need to compile it as a “console application”. Remember to give your source file the .cpp suffix or the compiler might think that it is C (not C++) source.
Yes, main() returns an int.
Reading into a standard vector guarantees that you don’t overflow some arbitrary buffer. Reading into an array without making a “silly error” is beyond the ability of complete novices – by the time you get that right, you are no longer a complete novice. If you doubt this claim, read Stroustrup’s paper “Learning Standard C++ as a New Language”, which you can download here.
The !cin.eof() is a test of the stream’s format. Specifically, it tests whether the loop ended by finding end-of-file (if not, you didn’t get input of the expected type/format). For more information, look up “stream state” in your C++ textbook.
A vector knows its size, so I don’t have to count elements.
Yes, you could declare i to be a vector<double>::size_type rather than plain int to quiet warnings from some hyper-suspicious compilers, but in this case, I consider that too pedantic and distracting.
This program contains no explicit memory management, and it does not leak memory. A vector keeps track of the memory it uses to store its elements. When a vector needs more memory for elements, it allocates more; when a vector goes out of scope, it frees that memory. Therefore, the user need not be concerned with the allocation and deallocation of memory for vector elements.
For reading in strings, see How do I read a string from input?.
The program ends reading input when it sees “end of file”. If you run the program from the keyboard on a Unix machine “end of file” is Ctrl-D. If you are on a Windows machine that because of a bug doesn’t recognize an end-of-file character, you might prefer this slightly more complicated version of the program that terminates input with the word “end”:

#include <iostream>
#include <vector>
#include <algorithm>
#include <string>
using namespace std;

int main()
{
    vector<double> v;

    double d;
    while(cin>>d) v.push_back(d);   // read elements
    if (!cin.eof()) {               // check if input failed
        cin.clear();                // clear error state
        string s;
        cin >> s;                   // look for terminator string
        if (s != "end") {
            cerr << "format error\n";
            return 1;               // error return
        }
    }

    cout << "read " << v.size() << " elements\n";

    reverse(v.begin(),v.end());
    cout << "elements in reverse order:\n";
    for (int i = 0; i<v.size(); ++i) cout << v[i] << '\n';

    return 0;                       // success return
}

For more examples of how to use the standard library to do simple things simply, see Parts 3 and 4 of the Tour of C++.

How do I convert an integer to a string?

Call to_string. This is new in C++11, widely available, and as of this writing widely not-noticed. :)

int i = 127;
string s = to_string(i);
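As a slightly fuller sketch, here is to_string used inside a small function. The helper name make_label is made up for illustration; the only library facility involved is std::to_string from <string>.

```cpp
#include <string>

// Hypothetical helper: builds a label such as "item 127".
std::string make_label(int i)
{
    return "item " + std::to_string(i);   // std::to_string is declared in <string>
}
```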

How do I convert a string to an integer?

Call stoi:

string s = "127";
int i = stoi(s);

The related functions stol and stoll will convert a string to a long or a long long, respectively.
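These functions report bad input by throwing: std::invalid_argument if no conversion could be performed, and std::out_of_range if the value does not fit. The sketch below wraps stoi in a helper that returns false instead; the name parse_int and the decision to reject trailing characters are assumptions made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and stores the value in `out`
// only if the whole string converts to an int.
bool parse_int(const std::string& s, int& out)
{
    try {
        std::size_t pos = 0;
        int value = std::stoi(s, &pos);     // pos receives the number of characters consumed
        if (pos != s.size()) return false;  // trailing junk such as "127abc"
        out = value;
        return true;
    } catch (const std::invalid_argument&) {
        return false;                       // nothing to convert, e.g. "abc"
    } catch (const std::out_of_range&) {
        return false;                       // value does not fit in an int
    }
}
```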

Should I use `void main()` or `int main()`?

int main()

main() must return int. Some compilers accept void main(), but that is non-standard and shouldn’t be used. Instead use int main(). As to the specific return value, if you don’t know what else to return just say return 0;

The definition

    void main() { /* ... */ }

is not and never has been C++, nor has it even been C. See the ISO C++ standard 3.6.1[2] or the ISO C standard 5.1.2.2.1. A conforming implementation accepts

    int main() { /* ... */ }

and

    int main(int argc, char* argv[]) { /* ... */ }

A conforming implementation may provide more versions of main(), but they must all have return type int. The int returned by main() is a way for a program to return a value to “the system” that invokes it. On systems that don’t provide such a facility the return value is ignored, but that doesn’t make void main() legal C++ or legal C. Even if your compiler accepts void main(), avoid it, or risk being considered ignorant by C and C++ programmers.

In C++, main() need not contain an explicit return statement. In that case, the value returned is 0, meaning successful execution. For example:

    #include<iostream>

    int main()
    {
        std::cout << "This program returns the integer value 0\n";
    }

Note also that neither ISO C++ nor C99 allows you to leave the type out of a declaration. That is, in contrast to C89 and ARM C++, int is not assumed where a type is missing in a declaration. Consequently:

    #include<iostream>

    main() { /* ... */ }

is an error because the return type of main() is missing.

Should I use `f(void)` or `f()`?

f()

C programmers often use f(void) when declaring a function that takes no parameters; however, in C++ that is considered bad style. In fact, the f(void) style has been called an “abomination” by Bjarne Stroustrup, the creator of C++, Dennis Ritchie, the co-creator of C, and Doug McIlroy, head of the research department where Unix was born.

If you’re writing C++ code, you should use f(). The f(void) style is legal in C++, but only to make it easier to compile C code.

This C++ code shows the best way to declare a function that takes no parameters:

void f();      // declares (not defines) a function that takes no parameters

This C++ code both declares and defines a function that takes no parameters:

void f()       // declares and defines a function that takes no parameters
{
  // ...
}

The following C++ code also declares a function that takes no parameters, but it uses the less desirable (some would say “abomination”) style, f(void):

void f(void);  // undesirable style for C++; use void f() instead

Actually this f() thing is all you need to know about C++. That and using those new fangled // comments. Once you know those two things, you can claim to be a C++ expert. Go for it: type those magical “++” marks on your resumé. Who cares about all that OO stuff — why should you bother changing the way you think? After all, the really important thing isn’t thinking; it’s typing in function declarations and comments. (Sigh; I wish nobody actually thought that way.)

What are the criteria for choosing between `short` / `int` / `long` data types?

Other related questions: If a short int is the same size as an int on my particular implementation, why choose one or the other? If I start taking the actual size in bytes of the variables into account, won’t I be making my code unportable (since the size in bytes may differ from implementation to implementation)? Or should I simply go with sizes much larger than I actually need, as a sort of safety buffer?

Answer: It’s usually a good idea to write code that can be ported to a different operating system and/or compiler. After all, if you’re successful at what you do, someone else might want to use it somewhere else. This can be a little tricky with built-in types like int and short, since C++ doesn’t give guaranteed sizes. However, C++ gives you two things that might help: guaranteed minimum sizes, which will usually be all you need to know, and a standard C header that provides typedefs for sized integers.

C++ guarantees a char is exactly one byte which is at least 8 bits, short is at least 16 bits, int is at least 16 bits, and long is at least 32 bits. It also guarantees the unsigned version of each of these is the same size as the original, for example, sizeof(unsigned short) == sizeof(short).

When writing portable code, you shouldn’t make additional assumptions about these sizes. For example, don’t assume int has 32 bits. If you have an integral variable that needs at least 32 bits, use a long or unsigned long even if sizeof(int) == 4 on your particular implementation. On the other hand, if you have an integral variable quantity that will always fit within 16 bits and if you want to minimize the use of data memory, use a short or unsigned short even if you know sizeof(int) == 2 on your particular implementation.
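The guarantees above can be written down as compile-time checks. This is just a sketch: the static_asserts restate the minimum sizes using CHAR_BIT from <climits>, and the function seconds_per_year is a made-up example of a value that needs more than 16 bits and therefore uses long.

```cpp
#include <climits>

// These checks pass on every conforming implementation.
static_assert(sizeof(char) == 1, "char is one byte by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(short) * CHAR_BIT >= 16, "short has at least 16 bits");
static_assert(sizeof(int) * CHAR_BIT >= 16, "int has at least 16 bits");
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");
static_assert(sizeof(unsigned short) == sizeof(short), "unsigned has the same size");

// Hypothetical example: the result needs about 25 bits, so it is a long
// no matter how wide int happens to be on the local machine.
long seconds_per_year()
{
    return 365L * 24L * 60L * 60L;
}
```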

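The other option is to use the following header. It came from the C standard as <stdint.h> and has been part of the C++ standard as <cstdint> since C++11 (a pre-C++11 compiler may or may not provide it):

#include <cstdint>  // standard since C++11; older compilers may offer only <stdint.h>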

That header defines typedefs for things like int32_t and uint16_t, which are a signed 32-bit integer and an unsigned 16-bit integer, respectively. There are other goodies in there, as well. My recommendation is that you use these “sized” integral types only where they are actually needed. Some people worship consistency, and they are sorely tempted to use these sized integers everywhere simply because they were needed somewhere. Consistency is good, but it is not the greatest good, and using these typedefs everywhere can cause some headaches and even possible performance issues. Better to use common sense, which often leads you to use the normal keywords, e.g., int, unsigned, etc. where you can, and use the explicitly sized integer types, e.g., int32_t, etc. where you must.

Note that there are some subtle tradeoffs here. In some cases, your computer might be able to manipulate smaller things faster than bigger things, but in other cases it is exactly the opposite: int arithmetic might be faster than short arithmetic on some implementations. Another tradeoff is data-space against code-space: int arithmetic might generate less binary code than short arithmetic on some implementations. Don’t make simplistic assumptions. Just because a particular variable can be declared as short doesn’t necessarily mean it should, even if you’re trying to save space.

Note that the C standard doesn’t guarantee that <stdint.h> defines intn_t and uintn_t specifically for n = 8, 16, 32 or 64. However if the underlying implementation provides integers with any of those sizes, <stdint.h> is required to contain the corresponding typedefs. Furthermore you are guaranteed to have typedefs for sizes n = 8, 16 and 32 if your implementation is POSIX compliant. Put all that together and it’s fair to say that the vast majority of implementations, though not all implementations, will have typedefs for those typical sizes.
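Here is a small sketch of a place where sized integers are actually needed: packing two 16-bit fields into one 32-bit word. It uses <cstdint>, the C++11 counterpart of <stdint.h>, and assumes your implementation provides the 16-bit and 32-bit types. The function names pack and high_half are made up for this example.

```cpp
#include <cstdint>

// Hypothetical example: combine two 16-bit fields into one 32-bit word.
std::uint32_t pack(std::uint16_t high, std::uint16_t low)
{
    return (static_cast<std::uint32_t>(high) << 16) | low;
}

// Recover the upper 16 bits of a packed word.
std::uint16_t high_half(std::uint32_t word)
{
    return static_cast<std::uint16_t>(word >> 16);
}
```

The exact widths matter here, so plain int or long would not express the intent as well.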

What the heck is a `const` variable? Isn’t that a contradiction in terms?

If it bothers you, call it a “const identifier” instead.

The main issue is to figure out what it is; we can figure out what to call it later. For example, consider the symbol max in the following function:

void f()
{
  const int max = 107;
  // ...
  float array[max];
  // ...
}

It doesn’t matter whether you call max a const variable or a const identifier. What matters is that you realize it is like a normal variable in some ways (e.g., you can take its address or pass it by const-reference), but it is unlike a normal variable in that you can’t change its value.
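A short sketch of those properties follows. The helper functions are made up for illustration; the point is only that a const int can be passed by address or by const-reference like any other int, while assignment to it is rejected by the compiler.

```cpp
// Hypothetical helpers that accept the constant by address and by const-reference.
int read_through_pointer(const int* p)   { return *p; }
int read_through_reference(const int& r) { return r; }

int demo()
{
    const int max = 107;
    int a = read_through_pointer(&max);    // taking its address is fine
    int b = read_through_reference(max);   // so is binding it to a const-reference
    // max = 108;                          // would not compile: max is const
    return a + b;
}
```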

Here is another even more common example:

class Fred {
public:
  // ...
private:
  static const int max_ = 107;
  // ...
};

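In this example, you would need to add the line const int Fred::max_; (with no initializer) in exactly one .cpp file, typically in Fred.cpp, at least if the program takes the address of max_ or binds it to a reference.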

It is generally considered good programming practice to give each “magic number” (like 107) a symbolic name and use that name rather than the raw magic number.

Why would I use a `const` variable / `const` identifier as opposed to `#define`?

const identifiers are often better than #define because:

they obey the language’s scoping rules
you can see them in the debugger
you can take their address if you need to
you can pass them by const-reference if you need to
they don’t create new “keywords” in your program.

In short, const identifiers act like they’re part of the language because they are part of the language. The preprocessor can be thought of as a language layered on top of C++. You can imagine that the preprocessor runs as a separate pass through your code, which would mean your original source code would be seen only by the preprocessor, not by the C++ compiler itself. In other words, you can imagine the preprocessor sees your original source code and replaces all #define symbols with their values, then the C++ compiler proper sees the modified source code after the original symbols got replaced by the preprocessor.

There are cases where #define is needed, but you should generally avoid it when you have the choice. You should evaluate whether to use const vs. #define based on business value: time, money, risk. In other words, one size does not fit all. Most of the time you’ll use const rather than #define for constants, but sometimes you’ll use #define. But please remember to wash your hands afterwards.
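The scoping point is probably the easiest one to see in code. In this sketch (the namespace and function names are made up), three different constants share the name limit without getting in each other’s way, which a #define cannot do.

```cpp
namespace audio { const int limit = 8; }    // one constant named limit
namespace video { const int limit = 16; }   // a different one; no clash

int local_limit()
{
    const int limit = 3;   // a third one, visible only inside this function
    return limit;
}

// If "#define limit 8" appeared above this code, each declaration would be
// rewritten to something like "const int 8 = 8;" and fail to compile.
```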

Are you saying that the preprocessor is evil?

Yes, that’s exactly what I’m saying: the preprocessor is evil.

Every #define macro effectively creates a new keyword in every source file and every scope until that symbol is #undefd. The preprocessor lets you create a #define symbol that is always replaced independent of the {...} scope where that symbol appears.

Sometimes we need the preprocessor, such as the #ifndef/#define wrapper within each header file, but it should be avoided when you can. “Evil” doesn’t mean “never use.” You will use evil things sometimes, particularly when they are “the lesser of two evils.” But they’re still evil :-)

What is the “standard library”? What is included / excluded from it?

Most (not all) implementations have a “standard include” directory, sometimes directories plural. If your implementation is like that, the headers in the standard library are probably a subset of the files in those directories. For example, iostream and string are part of the standard library, as are cstring and cstdio. There are a bunch of .h files that are also part of the standard library, but not every .h file in those directories is part of the standard library. For example, stdio.h is but windows.h is not.

You include headers from the standard library like this:

#include <iostream>

int main()
{
  std::cout << "Hello world!\n";
  // ...
}

How should I lay out my code? When should I use spaces, tabs, and/or newlines in my code?

The short answer is: Just like the rest of your team. In other words, the team should use a consistent approach to whitespace, but otherwise please don’t waste a lot of time worrying about it.

Here are a few details:

There is no universally accepted coding standard when it comes to whitespace. There are a few popular whitespace standards, such as the “one true brace” style, but there is a lot of contention over certain aspects of any given coding standard.

Most whitespace standards agree on a few points, such as putting a space around infix operators like x * y or a - b. Most (not all) whitespace standards do not put spaces around the [ or ] in a[i], and similar comments for ( and ) in f(x). However there is a great deal of contention over vertical whitespace, particularly when it comes to { and }. For example, here are a few of the many ways to lay out if (foo()) { bar(); baz(); }:

if (foo()) {
  bar();
  baz();
}

if (foo())
{
  bar();
  baz();
}

if (foo())
  {
    bar();
    baz();
  }

if (foo())
  {
  bar();
  baz();
  }

if (foo()) {
  bar();
  baz();
  }

…and others…

IMPORTANT: Do NOT email me with reasons your whitespace approach is better than the others. I don’t care. Plus I won’t believe you. There is no objective standard of “better” when it comes to whitespace so your opinion is just that: your opinion. If you write me an email in spite of this paragraph, I will consider you to be a hopeless geek who focuses on nits. Don’t waste your time worrying about whitespace: as long as your team uses a consistent whitespace style, get on with your life and worry about more important things.

For example, things you should be worried about include design issues like when ABCs should be used, whether inheritance should be an implementation or specification technique, what testing and inspection strategies should be used, whether interfaces should uniformly have a get() and/or set() member function for each data member, whether interfaces should be designed from the outside-in or the inside-out, whether errors should be handled by try/catch/throw or by return codes, etc. Read the FAQ for some opinions on those important questions, but please don’t waste your time arguing over whitespace. As long as the team is using a consistent whitespace strategy, drop it.

Is it okay if a lot of numbers appear in my code?

Probably not.

In many (not all) cases, it’s best to name your numbers so each number appears only once in your code. That way, when the number changes there will only be one place in the code that has to change.

For example, suppose your program is working with shipping crates. The weight of an empty crate is 5.7. The expression 5.7 + contentsWeight probably means the weight of the crate including its contents, meaning the number 5.7 probably appears many times in the software. All these occurrences of the number 5.7 will be difficult to find and change when (not if) somebody changes the style of crates used in this application. The solution is to make sure the value 5.7 appears exactly once, usually as the initializer for a const identifier. Typically this will be something like const double crateWeight = 5.7;. After that, 5.7 + contentsWeight would be replaced by crateWeight + contentsWeight.
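In code, that might look like the sketch below. The function name loadedWeight is made up for this example; crateWeight and contentsWeight are the names used above.

```cpp
const double crateWeight = 5.7;   // the one place the empty-crate weight appears

// Hypothetical helper: weight of a crate including its contents.
double loadedWeight(double contentsWeight)
{
    return crateWeight + contentsWeight;
}
```

When the crate style changes, only the initializer of crateWeight has to change.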

Now that’s the general rule of thumb. But unfortunately there is some fine print.

Some people believe one should never have numeric literals scattered in the code. They believe all numeric values should be named in a manner similar to that described above. That rule, however noble in intent, just doesn’t work very well in practice. It is too tedious for people to follow, and ultimately it costs companies more than it saves them. Remember: the goal of all programming rules is to reduce time, cost and risk. If a rule actually makes things worse, it is a bad rule, period.

A more practical rule is to focus on those values that are likely to change. For example, if a numeric literal is likely to change, it should appear only once in the software, usually as the initializer of a const identifier. This rule lets unchanging values, such as some occurrences of 0, 1, -1, etc., get coded directly in the software so programmers don’t have to search for the one true definition of one or zero. In other words, if a programmer wants to loop over the indices of a vector, he can simply write for (int i = 0; i < v.size(); ++i). The “extremist” rule described earlier would require the programmer to poke around asking if anybody else has defined a const identifier initialized to 0, and if not, to define his own const int zero = 0; then replace the loop with for (int i = zero; i < v.size(); ++i). This is all a waste of time since the loop will always start with 0. It adds cost without adding any value to compensate for that cost.

Obviously people might argue over exactly which values are “likely to change,” but that kind of judgment is why you get paid the big bucks: do your job and make a decision. Some people are so afraid of making a wrong decision that they’ll adopt a one-size-fits-all rule such as “give a name to every number.” But if you adopt rules like that, you’re guaranteed to have made the wrong decision: those rules cost your company more than they save. They are bad rules.

The choice is simple: use a flexible rule even though you might make a wrong decision, or use a one-size-fits-all rule and be guaranteed to make a wrong decision.

There is one more piece of fine print: where the const identifier should be defined. There are three typical cases:

If the const identifier is used only within a single function, it can be local to that function.
If the const identifier is used throughout a class and nowhere else, it can be static within the private part of that class.
If the const identifier is used in numerous classes, it can be static within the public part of the most appropriate class, or perhaps private in that class with a public static access method.

As a last resort, make it static within a namespace or perhaps put it in the unnamed namespace. Try very hard to avoid using #define since the preprocessor is evil. If you need to use #define anyway, wash your hands when you’re done. And please ask some friends if they know of a better alternative.

(As used throughout the FAQ, “evil” doesn’t mean “never use it.” There are times when you will use something that is “evil” since it will be, in those particular cases, the lesser of two evils.)

What’s the point of the `L`, `U` and `f` suffixes on numeric literals?

You should use these suffixes when you need to force the compiler to treat the numeric literal as if it were the specified type. For example, if x is of type float, the expression x + 5.7 is of type double: it first promotes the value of x to a double, then performs the arithmetic using double-precision instructions. If that is what you want, fine; but if you really wanted it to do the arithmetic using single-precision instructions, you can change that code to x + 5.7f. Note: it is even better to “name” your numeric literals, particularly those that are likely to change. That would require you to say x + crateWeight where crateWeight is a const float that is initialized to 5.7f.
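You can ask the compiler which type it gives each expression. The sketch below uses decltype and std::is_same from <type_traits> (both C++11); the variable x is there only so the expressions have something to operate on.

```cpp
#include <type_traits>

float x = 1.0f;

// decltype reports the type the compiler assigns to each expression.
static_assert(std::is_same<decltype(x + 5.7),  double>::value,   "float + 5.7 is double");
static_assert(std::is_same<decltype(x + 5.7f), float>::value,    "float + 5.7f is float");
static_assert(std::is_same<decltype(1 + 2L),   long>::value,     "int + 2L is long");
static_assert(std::is_same<decltype(1 + 2U),   unsigned>::value, "int + 2U is unsigned");
```

If any of these claims were wrong, the file would not compile.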

The U suffix is similar. It’s probably a good idea to use unsigned integers for variables that are always >= 0. For example, if a variable represents an index into an array, that variable would typically be declared as an unsigned. The main reason for this is it requires less code, at least if you are careful to check your ranges. For example, to check if a variable is both >= 0 and < max requires two tests if everything is signed: if (n >= 0 && n < max), but can be done with a single comparison if everything is unsigned: if (n < max).

If you end up using unsigned variables, it is generally a good idea to force your numeric literals to also be unsigned. That makes it easier to see that the compiler will generate “unsigned arithmetic” instructions. For example: if (n < 256U) or if ((n & 255u) < 32u). Mixing signed and unsigned values in a single arithmetic expression is often confusing for programmers — the compiler doesn’t always do what you expect it should do.
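Here is a sketch of both points; the function names are made up. The first two functions show the one-test versus two-test range check, and the third spells out what the compiler does when an int meets an unsigned in a comparison.

```cpp
// Signed index: two comparisons are needed.
bool in_range_signed(int n, int max)
{
    return n >= 0 && n < max;
}

// Unsigned index: a negative value cannot occur, so one comparison suffices.
bool in_range_unsigned(unsigned n, unsigned max)
{
    return n < max;
}

// Mixing the two is where the surprises come from: in "i < u" the int is
// converted to unsigned first, so -1 turns into a very large value.
bool minus_one_less_than_one()
{
    int i = -1;
    unsigned u = 1U;
    return static_cast<unsigned>(i) < u;   // what "i < u" means; the result is false
}
```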

The L suffix is not as common, but it is occasionally used for similar reasons as above: to make it obvious that the compiler is using long arithmetic.

The bottom line is this: it is a good discipline for programmers to force all numeric operands to be of the right type, as opposed to relying on the C++ rules for promoting/demoting numeric expressions. For example, if x is of type int and y is of type unsigned, it is a good idea to change x + y so the next programmer knows whether you intended to use unsigned arithmetic, e.g., unsigned(x) + y, or signed arithmetic: x + int(y). The other possibility is long arithmetic: long(x) + long(y). By using those casts, the code is more explicit and that’s good in this case, since a lot of programmers don’t know all the rules for implicit promotions.
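The three readings really do give different answers. In this sketch (function names made up), try x = -2 and y = 1: the unsigned version wraps around to the largest unsigned value, while the signed and long versions both give -1.

```cpp
// The three explicit readings of "x + y" for an int x and an unsigned y.
unsigned add_unsigned(int x, unsigned y) { return unsigned(x) + y; }     // unsigned arithmetic
int      add_signed(int x, unsigned y)   { return x + int(y); }          // signed arithmetic
long     add_long(int x, unsigned y)     { return long(x) + long(y); }   // long arithmetic
```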

I can understand the and (`&&`) and or (`||`) operators, but what’s the purpose of the not (`!`) operator?

Some people are confused about the ! operator. For example, they think that !true is the same as false, or that !(a < b) is the same as a >= b, so in both cases the ! operator doesn’t seem to add anything.

Answer: The ! operator is useful in boolean expressions, such as occur in an if or while statement. For example, let’s assume A and B are boolean expressions, perhaps simple method-calls that return a bool. There are all sorts of ways to combine these two expressions:

if ( A &&  B) /*...*/ ;
if (!A &&  B) /*...*/ ;
if ( A && !B) /*...*/ ;
if (!A && !B) /*...*/ ;
if (!( A &&  B)) /*...*/ ;
if (!(!A &&  B)) /*...*/ ;
if (!( A && !B)) /*...*/ ;
if (!(!A && !B)) /*...*/ ;

Along with a similar group formed using the || operator.

Note: boolean algebra can be used to transform each of the &&-versions into an equivalent ||-version, so from a truth-table standpoint there are only 8 logically distinct if statements. However, since readability is so important in software, programmers should consider both the &&-version and the logically equivalent ||-version. For example, programmers should choose between !A && !B and !(A || B) based on which one is more obvious to whoever will be maintaining the code. In that sense there really are 16 different choices.
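If you don’t trust the boolean algebra, you can have the computer check it. This sketch (the function name is made up) tries every combination of A and B and confirms that each &&-form matches its ||-form.

```cpp
// Checks both De Morgan identities for every combination of A and B.
bool de_morgan_holds()
{
    const bool values[] = { false, true };
    for (bool A : values) {
        for (bool B : values) {
            if ((!A && !B) != !(A || B)) return false;
            if ((!A || !B) != !(A && B)) return false;
        }
    }
    return true;
}
```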

The point of all this is simple: the ! operator is quite useful in boolean expressions. Sometimes it is used for readability, and sometimes it is used because expressions like !(a < b) actually are not equivalent to a >= b in spite of what your grade school math teacher told you.

Is `!(a < b)` logically the same as `a >= b`?

No!

Despite what your grade school math teacher taught you, these equivalences don’t always work in software, especially with floating point expressions or user-defined types.

Example: if a is a floating point NaN, then both a < b and a >= b will be false. That means !(a < b) will be true and a >= b will be false.

Example: if a is an object of class Foo that has overloaded operator< and operator>=, then it is up to the creator of class Foo if these operators will have opposite semantics. They probably should have opposite semantics, but that’s up to whoever wrote class Foo.
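The floating point case is easy to try for yourself. This sketch assumes your platform’s double supports a quiet NaN (true for IEEE 754 hardware); the function names are made up. For ordinary values the two functions agree, but with a NaN argument they give opposite answers.

```cpp
#include <limits>

// Assumes double has a quiet NaN, as on IEEE 754 platforms.
double make_nan() { return std::numeric_limits<double>::quiet_NaN(); }

bool not_less(double a, double b)         { return !(a < b); }
bool greater_or_equal(double a, double b) { return a >= b; }
```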

What is this NaN thing?

NaN means “not a number,” and is used for floating point operations.

There are lots of floating point operations that don’t make sense, such as dividing by zero, taking the log of zero or a negative number, taking the square root of a negative number, etc. Depending on your compiler, some of these operations may produce special floating point values such as infinity (with distinct values for positive vs. negative infinity) and the not a number value, NaN.

If your compiler produces a NaN, it has the unusual property that it is not equal to any value, including itself. For example, if a is NaN, then a == a is false. In fact, if a is NaN, then a will be neither less than, equal to, nor greater than any value including itself. In other words, regardless of the value of b, a < b, a <= b, a > b, a >= b, and a == b will all return false.

Here’s how to check if a value is NaN:

#include <cmath>

void funct(double x)
{
  if (std::isnan(x)) {   // Though see caveat below
    // x is NaN
    // ...
  } else {
    // x is a normal value
    // ...
  }
}

Note: although isnan() has been part of the C standard library since C99, and std::isnan() has been declared in <cmath> since C++11, an older C++ compiler might not supply it. For example, Microsoft Visual C++.NET does not supply isnan() (though it does supply _isnan() defined in <float.h>). If your vendor does not supply any variant of isnan(), define this function:

inline bool my_isnan(double x)
{
  return x != x;
}

In any case, DO NOT WRITE ME just to say that your compiler does/does not support isnan().

Why is floating point so inaccurate? Why doesn’t this print 0.43?

#include <iostream>

int main()
{
  float a = 1000.43;
  float b = 1000.0;
  std::cout << a - b << '\n';
  // ...
}

(On one C++ implementation, this prints 0.429993)

Disclaimer: Frustration with rounding/truncation/approximation isn’t really a C++ issue; it’s a computer science issue. However, people keep asking about it on comp.lang.c++, so what follows is a nominal answer.

Answer: Floating point is an approximation. The IEEE standard for 32-bit float supports 1 bit of sign, 8 bits of exponent, and 23 bits of mantissa. Since a normalized binary-point mantissa always has the form 1.xxxxx… the leading 1 is dropped and you get effectively 24 bits of mantissa. The number 1000.43 (and many, many others, including some really common ones like 0.1) is not exactly representable in float or double format. 1000.43 is actually represented as the following bit pattern (the “s” shows the position of the sign bit, the “e”s show the positions of the exponent bits, and the “m”s show the positions of the mantissa bits):

    seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
    01000100011110100001101110000101

The shifted mantissa is 1111101000.01101110000101 or 1000 + 7045/16384. The fractional part is 0.429992675781. With 24 bits of mantissa you only get about 1 part in 16M of precision for float. The double type provides more precision (53 bits of mantissa).
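You can look at the bits yourself. The following sketch assumes float is a 32-bit IEEE 754 type, as it is on most current machines; the function name float_bits is made up. It copies the bytes of a float into a 32-bit integer and formats them with std::bitset.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Returns the 32 bits of a float as text, most significant bit first.
// Assumes float is a 32-bit IEEE 754 type.
std::string float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t raw = 0;
    std::memcpy(&raw, &f, sizeof raw);   // copy the bytes rather than cast pointers
    return std::bitset<32>(raw).to_string();
}
```

Under that assumption, float_bits(1000.43f) gives the bit pattern shown above.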

Why doesn’t my floating-point comparison work?

Because floating point arithmetic is different from real number arithmetic.

Bottom line: Never use == to compare two floating point numbers.

Here’s a simple example:

double x = 1.0 / 10.0;
double y = x * 10.0;
if (y != 1.0)
  std::cout << "surprise: " << y << " != 1\n";

The above “surprise” message will appear on some (but not all) compilers/machines. But even if your particular compiler/machine doesn’t cause the above “surprise” message (and if you write me telling me whether it does, you’ll show you’ve missed the whole point of this FAQ), floating point will surprise you at some point. So read this FAQ and you’ll know what to do.

The reason floating point will surprise you is that float and double values are normally represented using a finite precision binary format. In other words, floating point numbers are not real numbers. For example, in your machine’s floating point format it might be impossible to exactly represent the number 0.1. By way of analogy, it’s impossible to exactly represent the number one third in decimal format (unless you use an infinite number of digits).

To dig a little deeper, let’s examine what the decimal number 0.625 means. This number has a 6 in the “tenths” place, a 2 in the “hundredths” place, and a 5 in the “thousandths” place. In other words, we have a digit for each power of 10. But in binary, we might, depending on the details of your machine’s floating point format, have a bit for each power of 2. So the fractional part might have a “halves” place, a “quarters” place, an “eighths” place, “sixteenths” place, etc., and each of these places has a bit.

Let’s pretend your machine represents the fractional part of floating point numbers using the above scheme (it’s normally more complicated than that, but if you already know exactly how floating point numbers are stored, chances are you don’t need this FAQ to begin with, so look at this as a good starting point). On that pretend machine, the bits of the fractional part of 0.625 would be 101: 1 in the ½-place, 0 in the ¼-place, and 1 in the ⅛-place. In other words, 0.625 is ½ + ⅛.

But on this pretend machine, 0.1 cannot be represented exactly since it cannot be formed as a sum of a finite number of powers of 2. You can get close but you can’t represent it exactly. In particular you’d have a 0 in the ½-place, a 0 in the ¼-place, a 0 in the ⅛-place, and finally a 1 in the “sixteenths” place, leaving a remainder of 1/10 - 1/16 = 3/80. Figuring out the other bits is left as an exercise (hint: look for a repeating bit-pattern, analogous to trying to represent 1/3 or 1/7 in decimal format).

The message is that some floating point numbers cannot always be represented exactly, so comparisons don’t always do what you’d like them to do. In other words, if the computer actually multiplies 10.0 by 1.0/10.0, it might not exactly get 1.0 back.

That’s the problem. Now here’s the solution: be very careful when comparing floating point numbers for equality (or when doing other things with floating point numbers; e.g., finding the average of two floating point numbers seems simple but to do it right requires an if/else with at least three cases).

Here’s the wrong way to do it:

void dubious(double x, double y)
{
  // ...
  if (x == y)  // Dubious!
    foo();
  // ...
}

If what you really want is to make sure they’re “very close” to each other (e.g., if variable a contains the value 1.0 / 10.0 and you want to see if (10*a == 1)), you’ll probably want to do something fancier than the above:

void smarter(double x, double y)
{
  // ...
  if (isEqual(x, y))  // Smarter!
    foo();
  // ...
}

There are many ways to define the isEqual() function, including:

#include <cmath>  /* for std::abs(double) */

inline bool isEqual(double x, double y)
{
  const double epsilon = 1e-5;  // some small number; pick one that suits your data
  return std::abs(x - y) <= epsilon * std::abs(x);
  // see Knuth section 4.2.2 pages 217-218
}

Note: the above solution is not completely symmetric, meaning it is possible for isEqual(x,y) != isEqual(y,x). From a practical standpoint, this does not usually occur when the magnitudes of x and y are significantly larger than epsilon, but your mileage may vary.

For other useful functions, check out the following (listed alphabetically):

Isaacson, E. and Keller, H., Analysis of Numerical Methods, Dover.
Kahan, W., http.cs.berkeley.edu/~wkahan/.
Knuth, Donald E., The Art of Computer Programming, Volume II: Seminumerical Algorithms, Addison-Wesley, 1969.
LAPACK: Linear Algebra Subroutine Library, www.siam.org
NETLIB: the collected algorithms from ACM Transactions on Mathematical Software, which have all been refereed, plus a great many other algorithms that have withstood somewhat less formal scrutiny from peers, www.netlib.org
Numerical Recipes, by Press et al. Although note some negative reviews, such as amath.colorado.edu/computing/Fortran/numrec.html
Ralston and Rabinowitz, A First Course in Numerical Analysis: Second Edition, Dover.
Stoer, J. and Bulirsch, R., Introduction to Numerical Analysis, Springer Verlag, in German.

Double-check your assumptions, including “obvious” things like how to compute averages, how to solve quadratic equations, etc., etc. Do not assume the formulas you learned in High School will work with floating point numbers!

For insights on the underlying ideas and issues of floating point computation, start with David Goldberg’s paper, What Every Computer Scientist Should Know About Floating-Point Arithmetic (also available in PDF format). You might also want to read the supplement by Doug Priest; the combined paper + supplement is also available.

Why is `cos(x) != cos(y)` even though `x == y`? (Or sine or tangent or log or just about any other floating point computation)

I know it’s hard to accept, but floating point arithmetic simply does not work like most people expect. Worse, some of the differences are dependent on the details of your particular computer’s floating point hardware and/or the optimization settings you use on your particular compiler. You might not like that, but it’s the way it is. The only way to “get it” is to set aside your assumptions about how things ought to behave and accept things as they actually do behave.

Let’s work a simple example. Turns out that on some installations, cos(x) != cos(y) even though x == y. That’s not a typo; read it again if you’re not shocked: the cosine of something can be unequal to the cosine of the same thing. (Or the sine, or the tangent, or the log, or just about any other floating point computation.)

#include <iostream>
#include <cmath>

void foo(double x, double y)
{
  if (std::cos(x) != std::cos(y)) {
    std::cout << "Huh?!?\n";  // You might end up here when x == y!!
  }
}

int main()
{
  foo(1.0, 1.0);
  return 0;
}

On many (not all) computers, you will end up in the if block even when x == y. If that doesn’t shock you, you’re asleep; read it again. If you want, try it on your particular computer. Some of you will end up in the if block, some will not, and for some it will depend on the details of your particular compiler or options or hardware or the phase of the moon.

Why, you ask, can that happen? Good question; thanks for asking. Here’s the answer (with emphasis on the word “often”; the behavior depends on your hardware, compiler, etc.): floating point calculations and comparisons are often performed by special hardware that often contains special registers, and those registers often have more bits than a double. That means that intermediate floating point computations often have more bits than a double, and when a floating point value is written to RAM, it often gets truncated, often losing some bits of precision.

Said another way, intermediate calculations are often more precise (have more bits) than when those same values get stored into RAM. Think of it this way: storing a floating point result into RAM requires some bits to get discarded, so comparing a (truncated) value in RAM with an (untruncated) value within a floating-point register might not do what you expect. Suppose your code computes cos(x), then truncates that result and stores it into a temporary variable, say tmp. It might then compute cos(y), and (drum roll please) compare the untruncated result of cos(y) with tmp, that is, with the truncated result of cos(x). Expressed in an imaginary assembly language, the expression cos(x) != cos(y) might get compiled into this:

// Imaginary assembly language
fp_load x     // load a floating-point register with the value of parameter x
call _cos     // call cos(double), using the floating point register for param and result
fp_store tmp  // truncate the floating-point result and store into temporary local var, tmp

fp_load y     // load a floating-point register with the value of parameter y
call _cos     // call cos(double), using the floating point register for param and result
fp_cmp tmp    // compare the untruncated result (in the register) with the truncated value in tmp
// ...

Did you catch that? Your particular installation might store the result of one of the cos() calls out into RAM, truncating it in the process, then later compare that truncated value with the untruncated result of the second cos() call. Depending on lots of details, those two values might not be equal.

It gets worse; better sit down. Turns out that the behavior can depend on how many instructions are between the cos() calls and the != comparison. In other words, if you put cos(x) and cos(y) into locals, then later compare those variables, the result of the comparison can depend on exactly what, if anything, your code does between storing the results into locals and comparing the variables. Gulp.

void foo(double x, double y)
{
  double cos_x = std::cos(x);
  double cos_y = std::cos(y);
  // ... the behavior might depend on what's in here
  if (cos_x != cos_y) {
    std::cout << "Huh?!?\n";  // You might end up here when x == y!!
  }
}

Your mouth should be hanging open by now. If not, you either learned pretty quickly from the above or you are still asleep. Read it again. When x == y, you can still end up in the if block depending on, among other things, how much code is in the ... line. Wow.

Reason: if the compiler can prove that you’re not messing with any floating point registers in the ... line, it might not actually store cos(y) into cos_y, instead leaving it in the register and comparing the untruncated register with the truncated variable cos_x. In this case, you might end up in the if block. But if you call a function between the two lines, such as printing one or both variables, or if you do something else that messes with the floating point registers, the compiler will (might) need to store the result of cos(y) into variable cos_y, after which it will be comparing two truncated values. In that case you won’t end up in the if block.

If you didn’t hear anything else in this whole discussion, just remember this: floating point comparisons are tricky and subtle and fraught with danger. Be careful. The way floating point actually works is different from the way most programmers tend to think it ought to work. If you intend to use floating point, you need to learn how it actually works.

What is the type of an enumeration such as `enum Color`? Is it of type `int`?

An enumeration such as enum Color { red, white, blue }; is its own type. It is not of type int.

When you create an object of an enumeration type, e.g., Color x;, we say that the object x is of type Color. Object x isn’t of type “enumeration,” and it’s not of type int.

An expression of an enumeration type can be converted to a temporary int. An analogy may help here. An expression of type float can be converted to a temporary double, but that doesn’t mean float is a subtype of double. For example, after the declaration float y;, we say that y is of type float, and the expression y can be converted to a temporary double. When that happens, a brand new, temporary double is created by copying something out of y. In the same way, a Color object such as x can be converted to a temporary int, in which case a brand new, temporary int is created by copying something out of x. (Note: the only purpose of the float / double analogy in this paragraph is to help explain how expressions of an enumeration type can be converted to temporary ints; do not try to use that analogy to imply any other behavior!)

The above conversion is very different from a subtype relationship, such as the relationship between derived class Car and its base class Vehicle. For example, an object of class Car, such as Car z;, actually is an object of class Vehicle, therefore you can bind a Vehicle& to that object, e.g., Vehicle& v = z;. Unlike the previous paragraph, the object z is not copied to a temporary; reference v binds to z itself. So we say an object of class Car is a Vehicle, but an object of class “Color” simply can be copied/converted into a temporary int. Big difference.

Final note, especially for C programmers: the C++ compiler will not automatically convert an int expression to a temporary Color. Since that sort of conversion is unsafe, it requires a cast, e.g., Color x = Color(2);. But be sure your integer is a valid enumeration value. If you provide an illegal value, you might end up with something other than what you expect. The compiler doesn’t do the check for you; you must do it yourself.

If an enumeration type is distinct from any other type, what good is it? What can you do with it?

Let’s consider this enumeration type: enum Color { red, white, blue };.

The best way to look at this (C programmers: hang on to your seats!!) is that the values of this type are red, white, and blue, as opposed to merely thinking of those names as constant int values. The C++ compiler provides an automatic conversion from Color to int, and the converted values will be, in this case, 0, 1, and 2 respectively. But you shouldn’t think of blue as a fancy name for 2. blue is of type Color and there is an automatic conversion from blue to 2, but the inverse conversion, from int to Color, is not provided automatically by the C++ compiler.

Here is an example that illustrates the conversion from Color to int:

enum Color { red, white, blue };

void f()
{
  int n;
  n = red;    // change n to 0
  n = white;  // change n to 1
  n = blue;   // change n to 2
}

The following example also demonstrates the conversion from Color to int:

void f()
{
  Color x = red;
  Color y = white;
  Color z = blue;

  int n;
  n = x;   // change n to 0
  n = y;   // change n to 1
  n = z;   // change n to 2
}

However the inverse conversion, from int to Color, is not automatically provided by the C++ compiler:

void f()
{
  Color x;
  x = blue;  // change x to blue
  x = 2;     // compile-time error: can't convert int to Color
}

The last line above shows that enumeration types are not ints in disguise. You can think of them as int types if you want to, but if you do, you must remember that the C++ compiler will not implicitly convert an int to a Color. If you really want that, you can use a cast:

void f()
{
  Color x;
  x = red;      // change x to red
  x = Color(1); // change x to white
  x = Color(2); // change x to blue
  x = 2;        // compile-time error: can't convert int to Color
}

There are other ways that enumeration types are unlike int. For example, enumeration types don’t have a ++ operator:

void f()
{
  int n = red;    // change n to 0
  Color x = red;  // change x to red

  n++;   // change n to 1
  x++;   // compile-time error: can't ++ an enumeration (though see caveat below)
}

Caveat on the last line: it is legal to provide an overloaded operator that would make that line legal, such as defining the postfix operator++(Color& x, int).

What other “newbie” guides are there for me?

An excellent place to start is this site’s Get Started page.