1. The Problem
Should this code compile?
static_assert(0xFE+1 == 0xFF);
Hopefully we can agree that it should, even though it doesn’t today. Except on MSVC. Or clang with -fms-extensions.
2. The Explanation
The issue is that during phase 3 (preprocessing tokenization), integer and floating-point literals,
and their user-defined variants, are lumped into a single production called pp-number. It uses its
own custom grammar (essentially, the regex \.?[0-9]([0-9a-zA-Z_.]|'[0-9a-zA-Z_]|[eEpP][+-])*),
which is intended to cover the union of the character sequences that later become those numeric
literals. Unfortunately it matches more than that, including sequences such as 0xFE+1 that
most readers would expect to turn into 3 separate tokens (0xFE, +, and 1). This is an attempt to
match floating-point literals such as 1e+1, even though floating-point literals with an e exponent
are not allowed to start with 0x. It also matches complete gibberish that could never become a
valid numeric literal.
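To make the greediness concrete, here is a minimal sketch that approximates the pp-number grammar with std::regex and reports how many leading characters a phase 3 lexer would claim as one pp-number. The function name pp_number_length is hypothetical, and identifier-nondigit is narrowed to ASCII letters and underscore (universal-character-names are ignored), so this is an illustration under those assumptions rather than a conforming lexer.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: length of the pp-number at the start of `text`, or 0 if none.
// Assumption: identifier-nondigit is approximated by [A-Za-z_].
inline std::size_t pp_number_length(const std::string& text) {
    // The exponent alternative ([eEpP][+-]) is tried first so that a sign
    // following e/E/p/P is swallowed, mirroring the "pp-number e sign" rules.
    static const std::regex pp_number(
        R"(\.?[0-9](?:[eEpP][+-]|'[0-9A-Za-z_]|[0-9A-Za-z_.])*)");
    std::smatch m;
    // match_continuous anchors the match at the first character of `text`.
    if (std::regex_search(text, m, pp_number,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(m.length(0));
    }
    return 0;
}
```

With this approximation, "0xFE+1" is consumed whole as a single pp-number, whereas "0xFE + 1" stops after "0xFE", which is why adding spaces around the operator makes the motivating example compile today.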
3. The Solution
Luckily pp-number is basically not used, except to convert into one of the "real" numeric literals, so we can easily remove it, and just lex directly to the numeric literals. I suggest making this change as a DR rather than simply applying to new versions of C++.
4. The Breakage
The only hypothetical code that I think this breaks is code that intentionally forms invalid binary or octal literals containing decimal digits that are too high, then concatenates the pp-number with an identifier prefix to form a valid identifier. For example:
#define X(arg) x ## arg
int x019 = X(019);   // Phase 3 output: X ( 01 9 )
int x0b12 = X(0b12); // Phase 3 output: X ( 0b1 2 )
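For comparison, the following sketch shows what such code does under the current rules, where 019 and 0b12 are each a single pp-number and can therefore be pasted onto an identifier prefix. The variable values and the function read_both are hypothetical, chosen only so the behaviour can be checked; code of this shape is exactly what would stop compiling if the breakage is accepted.

```cpp
// Pastes the identifier prefix x onto the single preprocessing token `arg`.
#define X(arg) x ## arg

// Neither 019 nor 0b12 is a valid integer literal, but today each is one
// pp-number, so X(019) and X(0b12) form the identifiers x019 and x0b12.
int x019 = 19;
int x0b12 = 12;

// Hypothetical helper that reads both variables through the macro.
inline int read_both() { return X(019) + X(0b12); }
```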
This is similar to the case described by [diff.cpp14.lex]/2 that was deemed acceptable in C++17.
4.1. The Workaround
My preference would be to accept the breakage, but if that prevents consensus, or would prevent treating this as a DR, there is a workaround. We can introduce a new production to catch such cases, which could not be directly turned into a numeric literal.
invalid-bin-digit: one of
    2 3 4 5 6 7 8 9
invalid-oct-digit: one of
    8 9
not-quite-valid-integer-literal:
    binary-literal 'opt invalid-bin-digit 'opt digit-sequence opt
    octal-literal 'opt invalid-oct-digit 'opt digit-sequence opt
Note: This does not need to cover cases where a valid literal is immediately followed by identifier characters, since that lexes as the literal with a user-defined-literal suffix.
5. Wording
This eliminates the concept of pp-number and just directly tokenizes to the numeric literals.
5.1. Modify [lex.pptoken]:
preprocessing-token:
    header-name
    import-keyword
    module-keyword
    export-keyword
    identifier
    [-pp-number-]
    {+integer-literal+}
    {+user-defined-integer-literal+}
    {+floating-point-literal+}
    {+user-defined-floating-point-literal+}
    character-literal
    user-defined-character-literal
    string-literal
    user-defined-string-literal
    preprocessing-op-or-punc
    each non-white-space character that cannot be one of the above
Note: an alternative would be to use literal rather than all of the foo-literal productions. I
think the main effect of that change would be that true, false, and nullptr would no longer be
considered identifiers during preprocessing. I am not proposing doing this now.
Modify paragraph 2:
A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, [-preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals)-]{+literals (including user-defined literals)+}, preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments, or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in [cpp], in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
Modify paragraph 5:
[ Example: [-The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid integer-literal or floating-point-literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the-]{+The+} program fragment 1E1 is parsed as a [-preprocessing number (one that is a valid-] floating-point-literal token[-)-], whether or not E is a macro name. — end example ]
5.2. Remove [lex.ppnumber]
pp-number:
    digit
    . digit
    pp-number digit
    pp-number identifier-nondigit
    pp-number ' digit
    pp-number ' nondigit
    pp-number e sign
    pp-number E sign
    pp-number p sign
    pp-number P sign
    pp-number .
Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]). A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.
5.3. Modify [cpp.cond]/5:
Each has-attribute-expression is replaced by a non-zero [-pp-number matching the form of an-] integer-literal if the implementation supports an attribute with the name specified by interpreting the pp-tokens, after macro expansion, as an attribute-token, and by 0 otherwise. The program is ill-formed if the pp-tokens do not match the form of an attribute-token.
5.4. Modify [cpp.cond]/11:
After all replacements due to macro expansion and evaluations of defined-macro-expressions, has-include-expressions, and has-attribute-expressions have been performed, all remaining identifiers and keywords, except for true and false, are replaced with the [-pp-number-]{+integer-literal+} 0, and then each preprocessing token is converted into a token. [ Note: An alternative token is not an identifier, even when its spelling consists entirely of letters and underscores. Therefore it is not subject to this replacement. — end note ]
5.5. Modify [diff.cpp14.lex]/2
Affected subclause: [lex.ppnumber]
Change: pp-number can contain p sign and P sign.
Rationale: Necessary to enable hexadecimal-floating-point-literals.
Effect on original feature: Valid C++ 2014 code may fail to compile or produce different results in this International Standard. Specifically, character sequences like 0x0p+0 and 0e1_p+0 are three separate tokens each in C++ 2014, but one single token in this International Standard. For example:
#define F(a) b ## a
int b0x0p = F(0x0p+0); // ill-formed; equivalent to “int b0x0p = b0x0p + 0;” in C++ 2014
5.6. Modify [diff.cpp11.lex]
Affected subclause: [lex.ppnumber]
Change: [-pp-number-]{+Numeric literals+} can contain one or more single quotes.
Rationale: Necessary to enable single quotes as digit separators.
Effect on original feature: Valid C++ 2011 code may fail to compile or may change meaning in this International Standard. For example, the following code is valid both in C++ 2011 and in this International Standard, but the macro invocation produces different outcomes because the single quotes delimit a character-literal in C++ 2011, whereas they are digit separators in this International Standard:
#define M(x, ...) __VA_ARGS__
int x[2] = { M(1'2, 3'4, 5) };
// int x[2] = { 5 };      --- C++ 2011
// int x[2] = { 3'4, 5 }; --- this International Standard
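The digit-separator behaviour described above can be checked directly on a C++14 or later compiler. This is only a sketch: the array name values and the static_assert message are hypothetical, and the C++ 2011 outcome cannot be shown in the same translation unit.

```cpp
// Discards its first argument and expands to the remaining ones.
#define M(x, ...) __VA_ARGS__

// In C++14 and later, 1'2 and 3'4 are each a single numeric token, so the
// macro receives three arguments, drops 1'2, and keeps 3'4 and 5.
constexpr int values[2] = { M(1'2, 3'4, 5) };

static_assert(values[0] == 34 && values[1] == 5,
              "single quotes act as digit separators");
```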
6. Alternative Wording (minimal change)
This just fixes pp-number to precisely cover the literal forms it is intended to.
6.1. Modify [lex.ppnumber]:
pp-number:
    [-digit-]
    [-. digit-]
    [-pp-number digit-]
    [-pp-number identifier-nondigit-]
    [-pp-number ' digit-]
    [-pp-number ' nondigit-]
    [-pp-number e sign-]
    [-pp-number E sign-]
    [-pp-number p sign-]
    [-pp-number P sign-]
    [-pp-number .-]
    {+integer-literal+}
    {+floating-point-literal+}
    {+user-defined-integer-literal+}
    {+user-defined-floating-point-literal+}
Preprocessing number tokens lexically include all integer-literal tokens ([lex.icon]) and all floating-point-literal tokens ([lex.fcon]). A preprocessing number does not have a type or a value; it acquires both after a successful conversion to an integer-literal token or a floating-point-literal token.
6.2. Modify [diff.cpp14.lex]/2
Same as above.
7. Acknowledgments
Thanks to Michał Dominiak for reviewing and inspiring this paper. Additional thanks to everyone else who has provided feedback.