More named universal character escapes
- Document number:
- P3733R0
- Date:
- 2025-06-06
- Audience:
- LEWG, SG16
- Project:
- ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to:
- Jan Schultke <[email protected]>
- GitHub Issue:
- wg21.link/P3733/github
- Source:
- github.com/Eisenwave/cpp-proposals/blob/master/src/more-unicode-escapes.cow
Contents
Introduction
History
Inconsistency with other languages
Motivation
Abbreviations
Figments
Proposed change
Impact on implementations
Wording
References
1. Introduction
1.1. History
[P2071R2] introduced named universal character escapes of the form \N{...}.
Such escape sequences provide much needed clarity as compared to escapes that spell out the code point numerically, such as \u00A0.
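Some code points additionally or exclusively have aliases.
For example, U+FEFF ZERO WIDTH NO-BREAK SPACE additionally has the alias BYTE ORDER MARK, and U+0000 has no name of its own but has the alias NULL.
SG16 voted unanimously to support aliases within a named universal character escape: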
Match name aliases?
SF | F | N | A | SA |
---|---|---|---|---|
8 | 2 | 1 | 0 | 0 |
EWG reaffirmed that decision at the same meeting:
This [named universal character escapes] should further support aliases
SF | F | N | A | SA |
---|---|---|---|---|
18 | 2 | 1 | 0 | 0 |
However, some categories of aliases are disallowed in C++, as explained in [P2071R2] §8.2 Name sources:
Unicode aliases provide another critical service. As mentioned above, once assigned, names are immutable. Corrections are only offered by providing an alias. Aliases, according to the NamedAliases tables in the Unicode Character Database, come in five varieties:
- correction: Aliases for cases where an incorrect assigned name was published. For example, U+FE18 has an assigned name of PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET and a correction alias of PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (note the typo correction).
- control: Aliases for various control characters. For example, NULL for U+0000.
- alternate: Aliases for widely used alternate names. For example, BYTE ORDER MARK for U+FEFF.
- figment: Aliases for names that were documented, but never accepted in a standard. For example, HIGH OCTET PRESET for U+0081.
- abbreviation: Aliases for common abbreviations. For example, NBSP for U+00A0.
The intent is to use the aliases classified as correction, control, and alternate as recognized names.
The paper does not make it obvious why aliases classified as figment and abbreviation are excluded.
1.2. Inconsistency with other languages
Many design choices of [P2071R2] are ultimately motivated by §8.5 Existing practice.
For example, the \N{...} syntax in C++ is identical to that of Python and Perl.
While C++ shares the syntax, it does not permit the same categories of aliases:
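Alias category | Example | C++ | Python | Perl |
---|---|---|---|---|
correction | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET (U+FE18) | ✅ | ✅ | ✅ |
control | NULL (U+0000) | ✅ | ✅ | ✅ |
alternate | BYTE ORDER MARK (U+FEFF) | ✅ | ✅ | ✅ |
figment | HIGH OCTET PRESET (U+0081) | ❌ | ✅ | ✅ |
abbreviation | NBSP (U+00A0) | ❌ | ✅ | ✅ |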
While this may have made historical sense, it now feels like an arbitrary restriction.
2. Motivation
2.1. Abbreviations
While the usefulness of some abbreviations is debatable, others significantly shorten the spelling of commonly used code points.
All that is to say that some abbreviations are well-motivated. Furthermore, allowing abbreviations would establish consistency with Python and Perl.
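To put a number on the savings, the following sketch compares the length of a few \N{...} escapes spelled with the assigned name against the same escapes spelled with an abbreviation alias. The name and abbreviation pairs are taken from [UnicodeNameAliases]; the type and function names are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string_view>

// Pairs an assigned character name with one of its aliases of type
// "abbreviation". The struct is hypothetical; the data is from NameAliases.txt.
struct abbreviated_name {
    char32_t code_point;
    std::string_view name;
    std::string_view abbreviation;
};

inline constexpr abbreviated_name abbreviated_names[] = {
    {0x00A0, "NO-BREAK SPACE", "NBSP"},
    {0x00AD, "SOFT HYPHEN", "SHY"},
    {0x200B, "ZERO WIDTH SPACE", "ZWSP"},
    {0x200D, "ZERO WIDTH JOINER", "ZWJ"},
};

// Length of an escape of the form \N{...}: two characters for the
// backslash and N, two for the braces, plus the name or alias itself.
constexpr std::size_t escape_length(std::string_view name_or_alias) {
    return 4 + name_or_alias.size();
}

// Number of source characters saved by using the abbreviation.
constexpr std::size_t characters_saved(const abbreviated_name& entry) {
    return escape_length(entry.name) - escape_length(entry.abbreviation);
}

// \N{NO-BREAK SPACE} is 18 characters long, whereas \N{NBSP} would be 8.
static_assert(escape_length(abbreviated_names[0].name) == 18);
static_assert(escape_length(abbreviated_names[0].abbreviation) == 8);
```

Under the current rules, only the longer spelling is well-formed in C++, even though the shorter one is accepted by Python and Perl.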
2.2. Figments
There are currently only three aliases classified as figment:
- U+0080 PADDING CHARACTER
- U+0081 HIGH OCTET PRESET
- U+0099 SINGLE GRAPHIC CHARACTER INTRODUCER
While there also exist
3. Proposed change
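The proposed change is to also permit aliases of type “abbreviation” and “figment” in named universal character escapes.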
4. Impact on implementations
Permitting abbreviations and figments is essentially trivial. [UnicodeNameAliases] contains a list of all aliases, with 354 abbreviations and 3 figments. This is a drop in the ocean compared to the existing set of names.
Furthermore, the same guarantees of uniqueness (an alias will never conflict with other names) and immutability (an alias will never change) are provided for aliases of all types.
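As a rough illustration of how little data handling is involved, the sketch below parses a single data line of [UnicodeNameAliases], whose fields are separated by semicolons in the order code point, alias, type. All type and function names are hypothetical. A real implementation would more likely generate its lookup tables ahead of time, so this is not meant to reflect how any particular compiler works.

```cpp
#include <optional>
#include <string_view>

// One entry of NameAliases.txt. The struct is hypothetical.
struct name_alias {
    char32_t code_point;
    std::string_view alias;
    std::string_view type; // e.g. "control", "figment", "abbreviation"
};

// Converts up to six uppercase hexadecimal digits to a code point value.
constexpr std::optional<char32_t> parse_hex(std::string_view digits) {
    if (digits.empty() || digits.size() > 6) {
        return std::nullopt;
    }
    char32_t result = 0;
    for (const char c : digits) {
        char32_t digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<char32_t>(c - '0');
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<char32_t>(c - 'A' + 10);
        } else {
            return std::nullopt;
        }
        result = result * 16 + digit;
    }
    return result;
}

// Parses a data line such as "00A0;NBSP;abbreviation".
// Blank lines, comment lines (starting with '#'), and malformed lines
// yield an empty optional.
constexpr std::optional<name_alias> parse_alias_line(std::string_view line) {
    if (line.empty() || line.front() == '#') {
        return std::nullopt;
    }
    const auto first = line.find(';');
    if (first == std::string_view::npos) {
        return std::nullopt;
    }
    const auto second = line.find(';', first + 1);
    if (second == std::string_view::npos) {
        return std::nullopt;
    }
    const auto code_point = parse_hex(line.substr(0, first));
    if (!code_point) {
        return std::nullopt;
    }
    return name_alias{*code_point,
                      line.substr(first + 1, second - first - 1),
                      line.substr(second + 1)};
}

static_assert(parse_alias_line("00A0;NBSP;abbreviation")->alias == "NBSP");
```

Because every alias line has the same shape regardless of its type, admitting two more types does not require any new parsing or lookup logic in a sketch like this one, only the removal of a filter on the third field.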
5. Wording
The following change is relative to [N5008].
Change [lex.universal.char] paragraph 3 as follows:
A of type “control”, “correction”, or “alternate”;
otherwise, the program is ill-formed.
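The effect of the restriction in the wording can be sketched as a predicate over alias types. The first function mirrors the three types named in the current wording; the second assumes that the proposal simply permits all five types, which is an interpretation of the intent stated in this paper and not normative text. All identifiers are hypothetical, and the sample aliases are the ones listed in the quotation from [P2071R2].

```cpp
#include <string_view>

enum class alias_type { correction, control, alternate, figment, abbreviation };

// Current rule: only these three alias types are recognized.
constexpr bool is_permitted_currently(alias_type type) {
    return type == alias_type::control
        || type == alias_type::correction
        || type == alias_type::alternate;
}

// Assumed rule if this proposal is adopted: every alias type is recognized.
constexpr bool is_permitted_if_adopted(alias_type) {
    return true;
}

// Hypothetical sample entry, one per alias type.
struct alias_example {
    char32_t code_point;
    std::string_view alias;
    alias_type type;
};

inline constexpr alias_example alias_examples[] = {
    {0xFE18, "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET",
     alias_type::correction},
    {0x0000, "NULL", alias_type::control},
    {0xFEFF, "BYTE ORDER MARK", alias_type::alternate},
    {0x0081, "HIGH OCTET PRESET", alias_type::figment},
    {0x00A0, "NBSP", alias_type::abbreviation},
};

// Counts how many of the sample aliases a given rule accepts.
constexpr int count_permitted(bool (*rule)(alias_type)) {
    int count = 0;
    for (const alias_example& example : alias_examples) {
        if (rule(example.type)) {
            ++count;
        }
    }
    return count;
}

static_assert(count_permitted(is_permitted_currently) == 3);
static_assert(count_permitted(is_permitted_if_adopted) == 5);
```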