D2180R0
pp-number makes cpp dumber: fixing 0xFE+1 == 0xFF

Draft Proposal,

This version:
wg21.link/P2180
Author:
(MongoDB)
Audience:
CWG
Project:
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++
Latest Version:
Click here
Source:
No real reason to click here

Abstract

pp-number is broken, useless, and should be removed.

1. The Problem

Should this code compile?

static_assert(0xFE+1 == 0xFF);

Hopefully we can agree that it should, even though it doesn’t today, except on MSVC or on clang with -fms-extensions.

2. The Explanation

The issue is that during phase 3 (preprocessing tokenization), integer and floating-point literals, and their user-defined variants, are lumped into a single production called pp-number. It uses its own custom grammar (essentially, the regex \.?[0-9]([:cpp-identifier:] | [eEpP][-+] | '[a-zA-Z0-9_] | \.)*) which is intended to cover the union of the character sequences that later become those numeric literals. Unfortunately it matches more than that, including sequences such as 0xFE+1 that most readers would expect to become three separate tokens (0xFE + 1). The match exists so that floating-point literals such as 1E+1 lex as a single token, even though a literal that starts with 0x cannot use E as an exponent marker (there, E is just a hexadecimal digit). It also matches complete gibberish such as 0xfffe+0x0001e-..afedfjawkufdskfehk.....fnajdsfnuaewkfnjdsaf.
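To make the over-matching concrete, here is a minimal sketch of a matcher for the current pp-number grammar. The function name pp_number_length and its helpers are mine, not anything from the standard, and identifier-nondigit is approximated as ASCII letters and underscore (universal-character-names are ignored).

```cpp
#include <cstddef>
#include <string_view>

// Assumption: identifier-nondigit is approximated as ASCII letters and '_'.
constexpr bool is_digit(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_nondigit(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Returns the length of the longest pp-number at the start of `s`,
// or 0 if `s` does not start with one. (Hypothetical helper.)
constexpr std::size_t pp_number_length(std::string_view s) {
    std::size_t i = 0;
    if (!s.empty() && is_digit(s[0])) {
        i = 1;                                    // digit
    } else if (s.size() >= 2 && s[0] == '.' && is_digit(s[1])) {
        i = 2;                                    // . digit
    } else {
        return 0;
    }
    while (i < s.size()) {
        const char c = s[i];
        const bool has_next = i + 1 < s.size();
        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P') && has_next &&
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            i += 2;                               // pp-number e/E/p/P sign
        } else if (c == '\'' && has_next &&
                   (is_digit(s[i + 1]) || is_nondigit(s[i + 1]))) {
            i += 2;                               // pp-number ' digit, ' nondigit
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            i += 1;                               // digit, identifier-nondigit, .
        } else {
            break;
        }
    }
    return i;
}

static_assert(pp_number_length("0xFE+1") == 6);  // swallowed as one token
static_assert(pp_number_length("0xFF+1") == 4);  // F is not followed by a sign rule
static_assert(pp_number_length("1E+1") == 4);    // the case the sign rule targets
```

The only difference between 0xFE+1 and 0xFF+1 is whether the last hexadecimal digit happens to be E, which is what makes the current behavior so surprising.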

3. The Solution

Luckily pp-number is basically not used, except as an intermediate step that is later converted into one of the "real" numeric literals, so we can easily remove it and lex directly to the numeric literals. I suggest making this change as a DR rather than applying it only to new versions of C++.
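As a rough illustration of what lexing directly to a numeric literal means, the sketch below matches only the digit part of an integer-literal. It is deliberately incomplete: integer suffixes, ud-suffixes and floating-point literals are omitted, and the name integer_literal_length is hypothetical. A real lexer would also try floating-point-literal and take the longest match, so 1E+1 would still be a single token.

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_hex(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}
constexpr bool is_dec(char c) { return c >= '0' && c <= '9'; }
constexpr bool is_oct(char c) { return c >= '0' && c <= '7'; }
constexpr bool is_bin(char c) { return c == '0' || c == '1'; }

// Consumes further digits of one kind, each optionally preceded by a
// single quote (digit separator), starting at position i.
template <typename Pred>
constexpr std::size_t more_digits(std::string_view s, std::size_t i, Pred is_valid) {
    while (i < s.size()) {
        if (is_valid(s[i])) {
            i += 1;
        } else if (s[i] == '\'' && i + 1 < s.size() && is_valid(s[i + 1])) {
            i += 2;
        } else {
            break;
        }
    }
    return i;
}

// Length of the digit part of an integer-literal at the start of `s`, or 0.
// Assumption: suffixes and floating-point forms are out of scope here.
constexpr std::size_t integer_literal_length(std::string_view s) {
    if (s.empty() || !is_dec(s[0])) return 0;
    if (s[0] == '0' && s.size() >= 3) {
        if ((s[1] == 'x' || s[1] == 'X') && is_hex(s[2])) return more_digits(s, 3, is_hex);
        if ((s[1] == 'b' || s[1] == 'B') && is_bin(s[2])) return more_digits(s, 3, is_bin);
    }
    if (s[0] == '0') return more_digits(s, 1, is_oct);  // octal-literal
    return more_digits(s, 1, is_dec);                   // decimal-literal
}

static_assert(integer_literal_length("0xFE+1") == 4);  // 0xFE, then + and 1 follow
```

With this approach the + in 0xFE+1 is simply not part of any integer-literal, so it becomes its own token.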

4. The Breakage

The only hypothetical code that I think this breaks is code that intentionally forms invalid binary or octal literals containing decimal digits that are too high, then concatenates the pp-number with an identifier prefix to form a valid identifier. For example, with the phase 3 output under this proposal shown in the comments:

#define X(arg) x ## arg
int x019 = X(019);   // Phase 3 output: X ( 01 9 )
int x0b12 = X(0b12); // Phase 3 output: X ( 0b1 2 )

This is similar to the case described by [diff.cpp14.lex]/2 that was deemed acceptable in C++17.
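For reference, here is a self-contained variant of the example above that compiles under the current rules. The variable names and values are made up for illustration; the point is only that 019 and 0b12 are each a single pp-number today, so the paste produces one identifier.

```cpp
// Pastes the identifier prefix x onto its argument, as in the example above.
#define X(arg) x ## arg

// Hypothetical variables whose names match the pasted results.
constexpr int x019 = 19;
constexpr int x0b12 = 12;

// Today: X(019) expands to the identifier x019, X(0b12) to x0b12.
// Under the proposal the arguments would lex as 01 9 and 0b1 2, the paste
// would yield x01 9 and x0b1 2, and both lines would become ill-formed.
constexpr int a = X(019);
constexpr int b = X(0b12);

static_assert(a == 19 && b == 12);
```

Code of this shape is the only category I expect to change meaning, and it has to be written deliberately to hit it.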

4.1. The Workaround

My preference would be to accept the breakage, but if that prevents consensus, or would prevent treating this as a DR, there is a workaround. We can introduce a new production that catches exactly those cases, that is, character sequences that cannot be directly turned into a numeric literal.

    invalid-bin-digit: 2 3 4 5 6 7 8 9
    invalid-oct-digit: 8 9
    not-quite-valid-integer-literal:
        binary-literal '_opt invalid-bin-digit '_opt digit-sequence_opt
        octal-literal '_opt invalid-oct-digit '_opt digit-sequence_opt

Note: This does not need to cover cases like 0bad since that lexes as 0 with the user-defined-literal suffix bad.
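A minimal sketch of a matcher for the workaround production follows. The function names are hypothetical, and I make one simplifying assumption: a single quote is only consumed when a digit follows it, so a lone trailing quote is left for the next token.

```cpp
#include <cstddef>
#include <string_view>

constexpr bool in_range(char c, char lo, char hi) { return c >= lo && c <= hi; }

// Consumes zero or more digits in [lo, hi], each optionally preceded by a
// single quote, starting at position i. Returns the new position.
constexpr std::size_t skip_digits(std::string_view s, std::size_t i, char lo, char hi) {
    while (i < s.size()) {
        if (in_range(s[i], lo, hi)) {
            i += 1;
        } else if (s[i] == '\'' && i + 1 < s.size() && in_range(s[i + 1], lo, hi)) {
            i += 2;
        } else {
            break;
        }
    }
    return i;
}

// Length of a not-quite-valid-integer-literal at the start of `s`,
// or 0 if there is none (for example because the literal is valid).
constexpr std::size_t not_quite_valid_length(std::string_view s) {
    if (s.empty() || s[0] != '0') return 0;
    std::size_t i = 0;
    char first_invalid = '8';                  // invalid-oct-digit: 8 9
    if (s.size() >= 3 && (s[1] == 'b' || s[1] == 'B') && in_range(s[2], '0', '1')) {
        i = skip_digits(s, 3, '0', '1');       // binary-literal
        first_invalid = '2';                   // invalid-bin-digit: 2 .. 9
    } else {
        i = skip_digits(s, 1, '0', '7');       // octal-literal
    }
    // Optional single quote, then the mandatory invalid digit.
    if (i < s.size() && s[i] == '\'') i += 1;
    if (i >= s.size() || !in_range(s[i], first_invalid, '9')) return 0;
    // Optional single quote and optional digit-sequence.
    return skip_digits(s, i + 1, '0', '9');
}

static_assert(not_quite_valid_length("019") == 3);   // whole sequence is caught
static_assert(not_quite_valid_length("0b12") == 4);
static_assert(not_quite_valid_length("017") == 0);   // valid octal, not caught
```

Sequences such as 0xFE are not matched at all, so the fix for 0xFE+1 is unaffected by the workaround.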

5. Wording

This eliminates the concept of pp-number and just directly tokenizes to the numeric literals.

5.1. Modify [lex.pptoken]:

    preprocessing-token:
        header-name
        import-keyword
        module-keyword
        export-keyword
        identifier
        [-pp-number-]
        {+integer-literal+}
        {+user-defined-integer-literal+}
        {+floating-point-literal+}
        {+user-defined-floating-point-literal+}
        character-literal
        user-defined-character-literal
        string-literal
        user-defined-string-literal
        preprocessing-op-or-punc
        each non-white-space character that cannot be one of the above

Note: an alternative would be to use literal rather than all of the foo-literal productions. I think the main effect of that change would be that true, false, and nullptr would no longer be considered identifiers during preprocessing. I am not proposing doing this now.

Modify paragraph 2:

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, [-preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals)-] {+literals (including user-defined literals)+}, preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments, or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in [cpp], in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.

Modify paragraph 5:

[ Example: [-The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the-] {+The+} program fragment 1E1 is parsed as a [-preprocessing number (one that is a valid-] floating-point-literal token[-)-], whether or not E is a macro name. — end example ]

5.2. Remove [lex.ppnumber]

    pp-number:
	digit
	. digit
	pp-number digit
	pp-number identifier-nondigit
	pp-number ' digit
	pp-number ' nondigit
	pp-number e sign
	pp-number E sign
	pp-number p sign
	pp-number P sign
	pp-number .
  1. Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]).
  2. A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.

5.3. Modify [cpp.cond]/5:

Each has-attribute-expression is replaced by a non-zero [-pp-number matching the form of an-] integer-literal if the implementation supports an attribute with the name specified by interpreting the pp-tokens, after macro expansion, as an attribute-token, and by 0 otherwise. The program is ill-formed if the pp-tokens do not match the form of an attribute-token.

5.4. Modify [cpp.cond]/11:

After all replacements due to macro expansion and evaluations of defined-macro-expressions, has-include-expressions, and has-attribute-expressions have been performed, all remaining identifiers and keywords, except for true and false, are replaced with the [-pp-number-] {+integer-literal+} 0, and then each preprocessing token is converted into a token. [ Note: An alternative token is not an identifier, even when its spelling consists entirely of letters and underscores. Therefore it is not subject to this replacement. — end note ]

5.5. Modify [diff.cpp14.lex]/2

Affected subclause: [lex.ppnumber]
Change: pp-number can contain p sign and P sign.
Rationale: Necessary to enable hexadecimal-floating-point-literals.
Effect on original feature: Valid C++ 2014 code may fail to compile or produce different results in this International Standard. Specifically, character sequences like 0x0p+0 and 0e1_p+0 are three separate tokens each in C++ 2014, but one single token in this International Standard. For example:
#define F(a) b ## a
int b0x0p = F(0x0p+0);  // ill-formed; equivalent to “int b0x0p = b0x0p + 0;” in C++ 2014

5.6. Modify [diff.cpp11.lex]

Affected subclause: [lex.ppnumber]
Change: [-pp-number-] {+Numeric literals+} can contain one or more single quotes.
Rationale: Necessary to enable single quotes as digit separators.
Effect on original feature: Valid C++ 2011 code may fail to compile or may change meaning in this International Standard. For example, the following code is valid both in C++ 2011 and in this International Standard, but the macro invocation produces different outcomes because the single quotes delimit a character-literal in C++ 2011, whereas they are digit separators in this International Standard:
    #define M(x, ...) __VA_ARGS__
    int x[2] = { M(1'2,3'4, 5) };
    // int x[2] = { 5 };      --- C++ 2011
    // int x[2] = { 3'4, 5 }; --- this International Standard

6. Alternative Wording (minimal change)

This just fixes pp-number to precisely cover the literal forms it is intended to.

6.1. Modify [lex.ppnumber]:

    pp-number:
        [-digit-]
        [-. digit-]
        [-pp-number digit-]
        [-pp-number identifier-nondigit-]
        [-pp-number ' digit-]
        [-pp-number ' nondigit-]
        [-pp-number e sign-]
        [-pp-number E sign-]
        [-pp-number p sign-]
        [-pp-number P sign-]
        [-pp-number .-]
        {+integer-literal+}
        {+floating-point-literal+}
        {+user-defined-integer-literal+}
        {+user-defined-floating-point-literal+}
  1. Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]).
  2. A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.

6.2. Modify [diff.cpp14.lex]/2

Same as above.

7. Acknowledgments

Thanks to Michał Dominiak for reviewing and inspiring this paper. Additional thanks to everyone else who has provided feedback.