Sequential hexadecimal digits
- Document number:
- D4039R0
- Date:
- 2026-02-28
- Audience:
- SG16
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <janschultke@gmail.com>
- GitHub Issue:
- wg21.link/P4039/github
- Source:
- github.com/eisenwave/cpp-proposals/blob/master/src/sequential-hex-digits.cow
In C2y, the letters a..f and A..F are contiguous.
C++ should inherit the same guarantee.
1. Introduction
Among other guarantees, [lex.charset] paragraph 5 ensures the following:
The code unit value of each decimal digit character after the digit 0
(U+0030) is one greater than the value of the previous.
The guarantee is useful because it allows
- testing whether a character c is an ASCII digit using the expression c >= '0' && c <= '9', and
- computing the integer digit value using c - '0'.
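As a minimal sketch of the two idioms above (the helper names is_decimal_digit and decimal_digit_value are hypothetical and chosen only for illustration), both can be written as constant expressions that hold in any conforming literal encoding:

```cpp
// Hypothetical helpers; the names are illustrative, not part of any library.

// True if c is one of '0'..'9'.
// Portable because [lex.charset] guarantees that the digits are contiguous.
constexpr bool is_decimal_digit(char c) {
    return c >= '0' && c <= '9';
}

// Integer value of a decimal digit.
// Precondition: is_decimal_digit(c) is true.
constexpr int decimal_digit_value(char c) {
    return c - '0';
}

static_assert(is_decimal_digit('7'), "'7' is a decimal digit");
static_assert(!is_decimal_digit('x'), "'x' is not a decimal digit");
static_assert(decimal_digit_value('7') == 7, "digit value of '7' is 7");
```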
Unfortunately, no similar guarantee is provided for other characters,
which makes it non-portable to e.g. test whether a character c
is a lower-case letter using c >= 'a' && c <= 'z'.
This test only works for ASCII-compatible encodings such as UTF-8,
and C++ supports encodings such as
EBCDIC,
where letters are not contiguous.
However, [N3192] Sequential Hexdigits
observed that even in EBCDIC,
there are blocks of 8 or 9 contiguous letters in the invariant subset.
That is, a..i,
j..r, and
s..z are contiguous blocks;
this is analogous for upper-case letters.
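The block structure can be checked mechanically. The following sketch hard-codes the EBCDIC code unit values of the lower-case letters (a..i at 0x81..0x89, j..r at 0x91..0x99, s..z at 0xA2..0xA9) so that it compiles on ASCII-based platforms as well; the names ebcdic_lower and is_contiguous are hypothetical.

```cpp
// EBCDIC code unit values of the lower-case letters a..z,
// hard-coded so that the example does not depend on the host encoding.
constexpr unsigned char ebcdic_lower[26] = {
    0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, // a..i
    0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, // j..r
    0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9        // s..z
};

// True if the letters with alphabet indices first..last (0 = a, 25 = z)
// have consecutive code unit values in the table above.
constexpr bool is_contiguous(int first, int last) {
    for (int i = first; i < last; ++i) {
        if (ebcdic_lower[i + 1] != ebcdic_lower[i] + 1) {
            return false;
        }
    }
    return true;
}

static_assert(is_contiguous(0, 8), "a..i is contiguous");
static_assert(is_contiguous(9, 17), "j..r is contiguous");
static_assert(is_contiguous(18, 25), "s..z is contiguous");
static_assert(is_contiguous(0, 5), "a..f lies within the first block");
static_assert(!is_contiguous(0, 25), "a..z as a whole is not contiguous");
```

Since a..f lies entirely within the first block, a guarantee limited to the hexadecimal digit letters is satisfied by EBCDIC.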
[N3192] has been merged into the C2y draft,
providing the guarantee that a..f
is a contiguous block,
which is not quite as strong as EBCDIC would allow.
At least the C2y guarantee should be provided in C++ as well.
2. Motivation
Having the guarantee that the letters a..f
form a contiguous block
is mainly useful for working with hexadecimal digits.
There is likely a large amount of C++ code which already relies on the contiguity of that range, written by authors who are possibly unaware of or uninterested in the lack of portability. Hexadecimal digit letters are uniquely interesting because of how frequently they are used and how obviously useful contiguity is.
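A typical example of such code is a hexadecimal digit parser. The sketch below uses a hypothetical function name, hex_digit_value; the two letter branches are the parts that are currently not guaranteed to be portable.

```cpp
// Hypothetical helper: returns the value of the hexadecimal digit c,
// or -1 if c is not a hexadecimal digit.
constexpr int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') {
        return c - '0';      // already guaranteed by [lex.charset]
    }
    if (c >= 'a' && c <= 'f') {
        return c - 'a' + 10; // relies on a..f being contiguous
    }
    if (c >= 'A' && c <= 'F') {
        return c - 'A' + 10; // relies on A..F being contiguous
    }
    return -1;
}

static_assert(hex_digit_value('9') == 9, "decimal digit");
static_assert(hex_digit_value('c') == 12, "lower-case letter");
static_assert(hex_digit_value('F') == 15, "upper-case letter");
static_assert(hex_digit_value('g') == -1, "not a hexadecimal digit");
```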
There is substantially less motivation for the other two EBCDIC letter blocks.
While a guarantee for these blocks would allow implementing a test for whether a character c
is a lowercase letter using three range checks,
such an implementation would likely perform worse than a bitset lookup anyway.
Locale-specific character tests should either be done using standard library functions,
or the user should ensure that their encoding is ASCII-based.
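One way to express the second option at compile time is sketched below. It assumes that spot-checking a handful of characters is acceptable evidence of an ASCII-based ordinary literal encoding; it says nothing about the locale-specific encoding used at run time. The name is_ascii_lower is hypothetical.

```cpp
// Assumption: checking a few representative characters is considered
// sufficient to conclude that the ordinary literal encoding is ASCII-based.
static_assert('0' == 0x30 && 'A' == 0x41 && 'Z' == 0x5A &&
              'a' == 0x61 && 'z' == 0x7A,
              "this code requires an ASCII-based ordinary literal encoding");

// Hypothetical helper; only meaningful once the assertion above has passed.
constexpr bool is_ascii_lower(char c) {
    return c >= 'a' && c <= 'z';
}
```

On an EBCDIC platform the assertion fails and the translation unit is rejected, so the range check is never used where it would be wrong.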
3. Design
The proposed change is to guarantee
that the blocks of letters
a..f and A..F are contiguous,
similar to 0..9.
Due to the lack of motivation mentioned above, and out of caution not to provide more guarantees than C2y, no guarantee for other blocks of letters is proposed, despite EBCDIC seemingly allowing for a stronger guarantee.
4. Impact
To my understanding and to the understanding of WG14,
there is no change in behavior to existing code,
nor is any implementation affected.
The proposed change exists, for all intents and purposes, on paper only.
However, the proposed change makes it impossible to
create a hypothetical future C++ implementation where a..f
is not contiguous.
This seems like an acceptable sacrifice,
especially considering that such a C++ implementation
would be incompatible with C.
5. Wording
The changes are relative to [N5032].
Change [lex.charset] paragraph 5 as follows:
A literal encoding or a locale-specific encoding of one of the execution character sets ([character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element.
[Note: A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. — end note]
The U+0000 NULL character is encoded as the value 0.
No other element of the translation character set is encoded
with a code unit of value 0.
The code unit values
of each decimal digit character after the digit
of characters in any of the ranges
0 (U+0030)
is one greater than the value of the previous.