Document number:   D2488R0
Date:   2021-11-11
Audience:   EWG
Reply-to:   Andrzej Krzemieński <akrzemi1 at gmail dot com>

Attribute [[for_overwrite]]

This paper proposes to add a standard attribute [[for_overwrite]], which may appertain to a variable or a non-static data member declaration, indicating that the value set upon initialization (if any) is intended never to be read: instead, another modification of the value is expected before the value is read for the first time.

The goal is to allow the programmers to clearly express their intent in a way that is consumable by both humans and machines. This is to help static analyzers detect programmer bugs.

Tony Table
Suppose that functions poorly_fill_i() and poorly_fill_s() are documented to assign a value to an object of type int and std::string respectively, but fail to do so due to a bug.
Before:
  int i;
  poorly_fill_i(i); // BUG
  read(i);
  // :-)  tools can diagnose the bug
  // :-(  UB when run
After:
  [[for_overwrite]] int i;
  poorly_fill_i(i); // BUG
  read(i);
  // :-)  tools can diagnose the bug
  // :-(  UB when run

Before:
  int i = -1;
  poorly_fill_i(i); // BUG
  read(i);
  // :-(  tools cannot diagnose the bug
  // :-)  no UB when run
After:
  [[for_overwrite]] int i = -1;
  poorly_fill_i(i); // BUG
  read(i);
  // :-)  tools can diagnose the bug
  // :-)  no UB when run

Before:
  std::string s;
  poorly_fill_s(s); // BUG
  read(s);
  // :-(  tools cannot diagnose the bug
After:
  [[for_overwrite]] std::string s;
  poorly_fill_s(s); // BUG
  read(s);
  // :-)  tools can diagnose the bug

Disclaimer: this paper is not motivated by any performance considerations. Its goal is not to increase (or preserve) performance. The goal is to maximize the chances of detecting bugs in program code, and to allow the programmers to communicate their intent. If the reader gets the impression that this paper suggests or helps striving for performance, this is either because they misunderstood the intent, or because the author failed to communicate the intent clearly.

1. Motivation {mot}

In C++, much like in C, you can declare a scalar variable without providing its initial value:

int main()
{
  int i;         // no initial value
  cin >> i;      // value assigned
  cout << i + 1; // value read
}

Because of this, it is possible to create a program where a scalar variable is read before it is assigned a value. This results in undefined behavior. No one would write a program like this on purpose, but it may occur as a result of a bug somewhere in the code.

int main()
{
  int i, j;
  
  cin >> j;
  cout << i + 1; // typo: intended: `j + 1`
    
  cin >> i;  
  cout << i + 1;
}

This is because separating the creation of an object from the assignment of its initial value is bug-prone. For this reason programmers are advised to move the creation of a variable as close as possible to the place where there is a meaningful value to assign to it: ideally this value should be used in the object's initialization. In the above example the declaration of the variable i could be moved further down, which would immediately uncover the bug:

int main()
{
  int j;
  cin >> j;
  cout << i + 1; // compiler error: no `i` in scope
    
  int i;  
  cin >> i;  
  cout << i + 1;
}

To be more specific: the above refactoring does not remove the bug, but it turns it into a compiler-detectable one. The programmer now has to fix the typo.

There are more rules like this that help turn programmer bugs into diagnosable compiler errors. For instance, declare every object of scalar or POD type const, even if it means using "immediately invoked lambda expressions" to cover more complex initialization:

int main()
{
  const int j = [] {
    int j;
    cin >> j;
    return j;
  }();
  cout << j + 1;
    
  const int i = [] {
    int i;
    cin >> i;
    return i;
  }();  
  cout << i + 1;
}

The above rule has the potential to turn inadvertent writes (e.g., caused by typos) into compiler errors.

The above two pieces of advice are unquestionably good. They can only help and can never cause any harm. There is no good reason not to follow them if one can afford to. However, there are situations where these good pieces of advice cannot be applied, and as a consequence we cannot turn all programmer bugs into compile-time errors. One example of such a situation is the interface of IOStreams: you have to have a living object before you can read the value from the user input into it.

In response to these difficult cases, people give a not-so-good piece of advice:

If there is no way to assign the desired value upon initialization of a scalar object, assign any value in the initialization, even if you know that you are going to overwrite the value before the first read.

This renders code like the following:

int main()
{
  int i = 0;     // initial value
  cin >> i;      // initial value overwritten
  cout << i + 1; // value read
}

This "addition" no longer has any potential to turn programmer bugs into compiler errors. Its goal is different; it is:

Should the variable's proper value assignment be skipped, due to a programmer bug, we do not get undefined behavior upon the first read. Instead, we continue with an unintended value, which is not good either, but at least does not immediately cause undefined behavior (even if it can trigger undefined behavior later in the program).

In other words, it is an attempt to narrow down the scope of unpredictable behavior of a program containing a bug.
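A minimal sketch of this effect, assuming a hypothetical function poorly_fill() that is documented to always assign its argument but skips the assignment on one path. The names and the dummy value -1 are illustrative only.

```cpp
#include <string>

// Hypothetical function: documented to always assign `out`,
// but the bug makes it skip the assignment for empty input.
void poorly_fill(int& out, const std::string& input)
{
  if (input.empty())
    return;                      // BUG: `out` is never assigned on this path
  out = static_cast<int>(input.size());
}

// Follows the "assign any value" advice with a dummy initial value.
int read_length(const std::string& input)
{
  int n = -1;                    // dummy value, intended to be overwritten
  poorly_fill(n, input);
  return n;                      // no UB, but the dummy value may leak out
}
```

Calling read_length("") returns the dummy value -1 instead of triggering undefined behavior. The bug is still present; it now shows up as an unintended value.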

However, following this advice also comes with a cost: it may prevent the detection of the programmer bug by means of symbolic evaluation of the program (one form of static analysis). Static analyzers cannot understand the programmer's intentions, so they cannot diagnose programmer bugs directly; but they can diagnose situations of which we can say with a high level of certainty that they could never have been the programmer's intention. One such situation is:

It is never the programmer's intention to read the value of a scalar object before any value has been assigned to the object.

This rule doesn't say which part of the situation is the source of the bug:

  1. The read of the value in an inappropriate place.
  2. Failure to perform a write in an appropriate place.

This rule can only indirectly observe the symptom of a bug. We can be sure that the bug is somewhere close to the illegal read, but we do not know where. The static analyzer can now report the symptom, and the programmer has a chance to look for the bug.

In order for this type of static analysis to be successful, one component is required from the programmer: the potential to trigger a suspicious event upon a programmer bug. In this case the suspicious event is undefined behavior. Initializing a scalar variable with no value is not undefined behavior. It is the read from a variable without a preceding write that is undefined behavior. An uninitialized variable creates a potential for undefined behavior, but this will happen only when there is a programmer bug interacting with it. To visualize this:

potential_for_UB + bug + run = UB. (1)

On the other hand we have a similar equation:

potential_for_UB + bug + static_analyzer = static_bug_report. (2)

The advice "assign value (any value) upon initializing a scalar" removes term 'potential_for_UB' from both equations. In consequence, there is no UB if we run the buggy code, but at the same time we may have prevented the static analysis from finding the bug. In contrast, if we do not remove 'potential_for_UB' and if we are lucky enough that the static analyzer finds the bug for us, we can eliminate term 'bug' from both equations. This also prevents undefined behavior when we run the program, and additionally we have eliminated the bug.

Of course, removing 'potential_for_UB' is cheaper than removing 'bug', and also running the static analyzer does not guarantee that all bugs leading to UB will be found. On the other hand, removing 'UB' from the running program while leaving 'bug' in place is not a guarantee that we have prevented all security vulnerabilities caused by 'bug'.

So there is a technical trade-off to be made here, without an ideal solution, before we can advise with a clear conscience whether or not to initialize all scalar variables with to-be-overwritten values. We could call it a trade-off between striving for correctness and striving for bug-tolerance. This is why we call it not-so-good advice: you gain something (bug-tolerance), but you also lose something: the chance to detect a potential bug.

This proposal offers a solution where we can get both benefits, static bug detection and no undefined behavior, without making a trade-off.

2. The proposal{pro}

We propose to add a standard attribute [[for_overwrite]], which may appertain to a variable or a non-static data member declaration.

int main()
{
  [[for_overwrite]] int i = 0;  // with initial value
  [[for_overwrite]] int j;      // without initial value
  
  cin >> i >> j;
  cout << i << ' ' << j;
}

Its purpose is to communicate the programmer's intention, which is:

I still intend to perform a write to this variable, after the initialization, before it is read from for the first time, on any execution path. If there is an execution path that leads to the read from the variable before a write — other than the write in the initialization — has been performed, this unambiguously indicates a bug in the program.

You can use the attribute and still provide the initial value: this is the case for variable i in the example. This is the case where you get the best of both worlds: no undefined behavior in the presence of bugs and the potential to generate a suspicious enough event that can be reported by the static analyzer in the presence of a bug.

The second case, for variable j, doesn't use the initial value. It still has the potential to cause undefined behavior in the presence of a bug. One can use it, if one wishes to skip the redundant initialization for other purposes.
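The intent behind the first case can be emulated at run time with a small wrapper. The sketch below is not part of the proposal and is no substitute for static analysis; the class name overwrite_checked and its members are hypothetical. It stores a dummy value, as for variable i, and records whether a later write happened before the first read.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical run-time stand-in for `[[for_overwrite]] T x = dummy;`.
template <typename T>
class overwrite_checked
{
public:
  explicit overwrite_checked(T dummy) : value_(std::move(dummy)) {}

  // A write after initialization: the declared intent is fulfilled.
  overwrite_checked& operator=(T v)
  {
    value_ = std::move(v);
    written_ = true;
    return *this;
  }

  // True if the value was overwritten after initialization.
  bool written() const { return written_; }

  // A read: reports read-before-write instead of silently
  // returning the dummy value.
  const T& get() const
  {
    if (!written_)
      throw std::logic_error("read before overwrite");
    return value_;
  }

private:
  T value_;
  bool written_ = false;
};
```

The dummy value keeps the object in a defined state, while the flag plays the role of the "suspicious event" that an analyzer would look for. A static analyzer would report the same condition without running the program.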

With this attribute being present, one can introduce a new programming guideline:

If you make a conscious decision to initialize a scalar variable without providing the initial value, because you have considered all other alternatives and concluded that this is the best solution, use attribute [[for_overwrite]] to indicate this decision.

A style checker that observes at least one [[for_overwrite]] in a translation unit can now safely assume that the programmer uses [[for_overwrite]] effectively, and warn about all places where a scalar variable is initialized without the initial value and no [[for_overwrite]] has been used.

Another programming guideline enabled by this proposal:

If you have to initialize a variable, but you do not have the initial value yet, and you decided to assign a dummy value because you are striving for bug tolerance: use attribute [[for_overwrite]] to indicate that this is a dummy value.

This will still enable the static analyzer to detect read-before-write bugs during symbolic analysis. It will also be a hint to the compiler that it is worth investing more resources into the dead-write elimination pass. (However, as stated earlier, optimizations are not the primary goal of this feature; they are just a side effect of communicating the programmer's intent.)

The attribute can also be applied to non-scalar types with a user-provided (default or other) constructor:

std::string f(bool cond1, bool cond2)
{
  [[for_overwrite]] std::string s;  // I do not need this value
  
  if (cond1)
    lib::fill_1(s);
  
  if (!cond1 && cond2)
    lib::fill_2(s);
    
  return s; // static analyzer has reason to emit a warning
}

A static analyzer may be confused by this code and bail out, but if it is not, it can report that there is a path that never attempts to modify the original value of s. On the other hand, if the value assigned in function lib::fill_1() (in the non-buggy path) is an empty string again, this still counts as a second write and does not cause a static analyzer message.
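To see which path such a warning would point at, the control flow of f can be modeled with the two conditions as inputs. This is a sketch only: path_writes_s() and paths_without_write() are hypothetical helpers that mirror the branches of f and assume that lib::fill_1() and lib::fill_2() always modify their argument.

```cpp
#include <initializer_list>
#include <utility>
#include <vector>

// Mirrors the branches of f(): returns true if some fill function
// gets the chance to overwrite `s` on the path selected by the conditions.
bool path_writes_s(bool cond1, bool cond2)
{
  bool written = false;
  if (cond1)
    written = true;              // lib::fill_1(s) would run here
  if (!cond1 && cond2)
    written = true;              // lib::fill_2(s) would run here
  return written;
}

// Collects every (cond1, cond2) combination on which `s` is returned
// while still holding its initial, to-be-overwritten value.
std::vector<std::pair<bool, bool>> paths_without_write()
{
  std::vector<std::pair<bool, bool>> result;
  for (bool cond1 : {false, true})
    for (bool cond2 : {false, true})
      if (!path_writes_s(cond1, cond2))
        result.emplace_back(cond1, cond2);
  return result;
}
```

The only combination collected is cond1 == false with cond2 == false: on that path neither fill function is called before the return.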

These semantics make the attribute usable in the template context, where you do not know if your type is scalar or not:

template <typename T>
T f(bool cond1, bool cond2)
{
  [[for_overwrite]] T v;
  
  if (cond1)
    lib::fill_1(v);
  
  if (!cond1 && cond2)
    lib::fill_2(v);
    
  return v;
}

3. Alternatives{alt}

Another solution to the same problem of reducing the tension between striving for static bug detection and striving for bug-tolerance has been implemented in Clang. A programmer can set the compiler flag -ftrivial-auto-var-init, which assigns a predictable value to every stack-based scalar variable, even if there is no initializer present in the source code. As a result, the program avoids unpredictable behavior, and because there are no initializers visible in the code, static analyzers can still warn based on read-before-write. This feature is described in the corresponding Git commit: https://reviews.llvm.org/rG14daa20be1ad89639ec209d969232d19cf698845.

The differences from this proposal are:

  1. For the case of a missing initializer on scalar types, it doesn't require anything in the source code: it will just work even if programmers forget to initialize their variables.
  2. It is not "portable": once you move to a different compiler, this feature is gone.
  3. It does not address the problem of tracking the writes to types with user-provided (default) constructors.
  4. Sometimes there is a need to assign an initial to-be-overwritten value specific to a given context. The Clang solution cannot do this.
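Point 4 can be illustrated with a value that only makes sense in its context. In the sketch below, poorly_open() is a hypothetical function that is supposed to assign a descriptor but has a bug; the sentinel -1 is chosen by the programmer so that the rest of the code treats a forgotten write as a failure. A compiler-chosen fill value carries no such meaning. The proposed attribute appears only in a comment, since it is not implemented.

```cpp
// Hypothetical function: documented to assign a valid descriptor to `fd`,
// but the bug skips the assignment when `path` is null.
void poorly_open(int& fd, const char* path)
{
  if (path == nullptr)
    return;                      // BUG: `fd` is left untouched
  fd = 3;                        // stand-in for a real descriptor
}

// Returns true only if a usable descriptor was obtained.
bool open_succeeded(const char* path)
{
  int fd = -1;                   // with the proposal: [[for_overwrite]] int fd = -1;
  poorly_open(fd, path);
  return fd >= 0;                // the sentinel reads as "not open"
}
```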

4. The choice of name{nam}

The name "for_overwrite" is modeled after the Standard Library function std::make_shared_for_overwrite(), which reflects a similar purpose: we may start with the initial value or not, but in either case the programmer's intention is to overwrite the value before it is read. The rationale for that name is provided in [P2042R0] and [P1973R1].
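For comparison, the library facility can be used as sketched below. The function name iota_buffer is hypothetical. The example assumes a C++20 standard library; where the feature-test macro reports that std::make_shared_for_overwrite() is unavailable, it falls back to a plain new-expression, which also default-initializes the elements.

```cpp
#include <cstddef>
#include <memory>

// Returns a buffer holding 0, 1, ..., n-1.
std::shared_ptr<int[]> iota_buffer(std::size_t n)
{
  // The elements are default-initialized, so their values are
  // indeterminate until written. The function name documents the
  // intent to overwrite every element before reading it.
#if defined(__cpp_lib_smart_ptr_for_overwrite)
  auto buf = std::make_shared_for_overwrite<int[]>(n);
#else
  std::shared_ptr<int[]> buf(new int[n]);  // pre-C++20 fallback
#endif

  for (std::size_t k = 0; k < n; ++k)
    buf[k] = static_cast<int>(k);          // overwrite before any read

  return buf;
}
```

As with the proposed attribute, the name states an intent ("this value is going to be overwritten") rather than a property of the initial value.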

We do not choose any name containing "uninitialized", because in technical terms an int without an initial value is still called initialized: it is default-initialized. Moreover, such a name would make no sense when the attribute is used for std::string. An alternative suitable name for this attribute could be [[requires_write]].

5. Acknowledgments{ack}

Ryan McDougall, JF Bastien, Peter Sommerlad, Jens Maurer, Richard Smith, Barry Revzin, Loïc Joly, Gašper Ažman, Balog Pal, Jens Gustedt and Herb Sutter reviewed and helped improve the document.

6. References {ref}