More named universal character escapes

Document number: P3733R0
Date: 2025-06-06
Audience: LEWG, SG16
Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Reply-to: Jan Schultke <[email protected]>
GitHub Issue: wg21.link/P3733/github
Source: github.com/Eisenwave/cpp-proposals/blob/master/src/more-unicode-escapes.cow

C++23 permits the use of "correction", "control", and "alternate" aliases for character names, but not "figment" or "abbreviation". Following P2736R2, this restriction is no longer necessary because "figment" and "abbreviation" are normatively specified in the Unicode standard.

Contents

1 Introduction
1.1 History
1.2 Inconsistency with other languages
2 Motivation
2.1 Abbreviations
2.2 Figments
3 Proposed change
4 Impact on implementations
5 Wording
6 References

1. Introduction

1.1. History

[P2071R2] introduced named-universal-character escapes into C++23, which make it possible to write, say, \N{NO-BREAK SPACE}. Such escape sequences are much clearer than \u00A0. Some code points additionally or exclusively have aliases. For example, DELETE (control alias) and DEL (abbreviation alias) correspond to U+007F within the Unicode standard. There is no name for U+007F that is not categorized as an alias.
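As a minimal sketch of the status quo, the following compares named escapes with their numeric spellings. It assumes a compiler that advertises C++23 named escapes through the `__cpp_named_character_escapes` feature-test macro; the helper name `named_escapes_agree` is hypothetical and only serves the illustration.

```cpp
#include <string_view>

// Hypothetical helper: checks that named and numeric spellings
// designate the same code points.
constexpr bool named_escapes_agree() {
    using namespace std::string_view_literals;
#if defined(__cpp_named_character_escapes) && __cpp_named_character_escapes >= 202207L
    // NO-BREAK SPACE is a character name; DELETE is a "control" alias.
    return U"\N{NO-BREAK SPACE}"sv == U"\u00A0"sv
        && U"\N{DELETE}"sv == U"\x7F"sv;
#else
    // Fallback for compilers without named escapes: numeric spellings only.
    return U"\u00A0"sv == U"\xA0"sv;
#endif
}

static_assert(named_escapes_agree());
```

Note that \N{DEL} does not appear here: as an abbreviation alias, it is not accepted under the current rules.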

SG16 voted unanimously to support aliases within a named-universal-character at Prague 2020:

Match name aliases?

| SF | F | N | A | SA |
|---|---|---|---|---|
| 8 | 2 | 1 | 0 | 0 |

EWG reaffirmed that decision at the same meeting:

This [named universal character escapes] should further support aliases

| SF | F | N | A | SA |
|---|---|---|---|---|
| 18 | 2 | 1 | 0 | 0 |

However, some categories of aliases are disallowed in C++, as explained in [P2071R2] §8.2 Name sources:

Unicode aliases provide another critical service. As mentioned above, once assigned, names are immutable. Corrections are only offered by providing an alias. Aliases, according to the NamedAliases tables in the Unicode Character Database, come in five varieties:

correction
control
alternate
figment
abbreviation

The intent is to use the aliases classified as correction, control, and alternate as recognized names.

While the paper does not make it obvious why figment and abbreviation are excluded, the underlying reason is that the C++ standard referenced ISO/IEC 10646 at the time, where figment aliases are not included whatsoever, and where only a subset of the abbreviation aliases in the Unicode standard is included. Following [P2736R2], the C++ standard references the Unicode standard, and such a restriction is no longer motivated.

1.2. Inconsistency with other languages

Many design choices of [P2071R2] are ultimately motivated by §8.5 Existing practice. For example, the \N{...} syntax in C++ is identical to Python and Perl. While C++ shares a syntax, it does not permit the same categories of aliases:

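| Alias category | Example | C++ | Python | Perl |
|---|---|---|---|---|
| correction | \N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET} | yes | yes | yes |
| control | \N{NULL} | yes | yes | yes |
| alternate | \N{BYTE ORDER MARK} | yes | yes | yes |
| figment | \N{HIGH OCTET PRESET} | no | yes | yes |
| abbreviation | \N{NBSP} | no | yes | yes |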

While this may have made historical sense, it now feels like an arbitrary restriction.

2. Motivation

2.1. Abbreviations

While the usefulness of some abbreviations is debatable, some of them significantly shorten the names of commonly used code points.

Multi-part emoji are constructed using U+200D ZERO WIDTH JOINER, which is a rather long name:

// Without abbreviations, we can form a "family: woman, woman, girl" 👩👩👧 emoji as follows:
u8"\N{WOMAN}\N{ZERO WIDTH JOINER}\N{WOMAN}\N{ZERO WIDTH JOINER}\N{GIRL}"
// With abbreviations:
u8"\N{WOMAN}\N{ZWJ}\N{WOMAN}\N{ZWJ}\N{GIRL}"
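To make concrete what the sequence encodes, here is a small sketch that counts its UTF-8 code units. It relies only on spellings that are valid today (full character names, or numeric escapes as a fallback); the function name `family_sequence_length` is hypothetical, and the `__cpp_named_character_escapes` check is an assumption about the compiler in use.

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-8 code units in the
// WOMAN, ZWJ, WOMAN, ZWJ, GIRL sequence, excluding the null terminator.
constexpr std::size_t family_sequence_length() {
#if defined(__cpp_named_character_escapes) && __cpp_named_character_escapes >= 202207L
    return sizeof(u8"\N{WOMAN}\N{ZERO WIDTH JOINER}\N{WOMAN}\N{ZERO WIDTH JOINER}\N{GIRL}") - 1;
#else
    // Same code points spelled numerically: U+1F469, U+200D, U+1F469, U+200D, U+1F467.
    return sizeof(u8"\U0001F469\u200D\U0001F469\u200D\U0001F467") - 1;
#endif
}

// Three emoji at 4 code units each, plus two joiners at 3 code units each.
static_assert(family_sequence_length() == 18);
```

Under the proposal, replacing ZERO WIDTH JOINER with ZWJ in the named spelling would yield the same 18 code units.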

If we log messages into a UTF-8 text file, it is quite plausible that we would occasionally want to use U+00A0 NO-BREAK SPACE or U+00AD SOFT HYPHEN code points:

// Without abbreviations:
u8"INFO: Auto\N{SOFT HYPHEN}reconnect triggered due to network\N{NO-BREAK SPACE}timeout."
// With abbreviations:
u8"INFO: Auto\N{SHY}reconnect triggered due to network\N{NBSP}timeout."

All that is to say that some abbreviations are well-motivated. Furthermore, allowing abbreviations would establish consistency with Python and Perl.

2.2. Figments

There are currently only three aliases classified as figment:

U+0080 PADDING CHARACTER
U+0081 HIGH OCTET PRESET
U+0099 SINGLE GRAPHIC CHARACTER INTRODUCER

While there also exist PAD, HOP, and SGC abbreviations for these characters, these are rather obscure and a user may prefer to use the figment names for additional clarity. Therefore, they should also be supported by C++. This would also establish consistency with Python and Perl.

An alias being considered a figment is largely inconsequential. It just means that the name was not standardized in ISO/IEC 10646, which is no longer referenced by the C++ standard anyway.

3. Proposed change

The abbreviation and figment categories should also be permitted within a named-universal-character.

4. Impact on implementations

Permitting abbreviations and figments is essentially trivial. [UnicodeNameAliases] contains a list of all aliases, with 354 abbreviations and 3 figments. This is a drop in the ocean compared to the existing set of names.

Furthermore, the same guarantees of uniqueness (will never conflict with other names) and immutability (will never change) are provided for figment and abbreviation as for the control, alternate, and correction categories. This is generally the case for any names listed in [UnicodeNameAliases]. See [UnicodeAliasStability]. Therefore, upwards compatibility is not threatened.
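The following sketch illustrates why the change could be small for an implementation that already stores aliases with their type. It is not taken from any real compiler: the names `AliasType`, `AliasEntry`, `alias_excerpt`, `permitted_in_cpp23`, and `lookup_alias` are all hypothetical, and the table is a hand-picked excerpt rather than a table generated from NameAliases.txt.

```cpp
#include <array>
#include <optional>
#include <string_view>

// Alias types as classified in NameAliases.txt.
enum class AliasType { correction, control, alternate, figment, abbreviation };

struct AliasEntry {
    std::string_view name;
    char32_t code_point;
    AliasType type;
};

// Hand-picked excerpt for illustration only.
inline constexpr std::array<AliasEntry, 6> alias_excerpt{{
    {"NULL", 0x0000, AliasType::control},
    {"DELETE", 0x007F, AliasType::control},
    {"BYTE ORDER MARK", 0xFEFF, AliasType::alternate},
    {"HIGH OCTET PRESET", 0x0081, AliasType::figment},
    {"NBSP", 0x00A0, AliasType::abbreviation},
    {"ZWJ", 0x200D, AliasType::abbreviation},
}};

// The type filter that corresponds to the C++23 rule.
constexpr bool permitted_in_cpp23(AliasType type) {
    return type == AliasType::correction
        || type == AliasType::control
        || type == AliasType::alternate;
}

// Looks up an alias; allow_all_types models the proposed rule.
constexpr std::optional<char32_t> lookup_alias(std::string_view name,
                                               bool allow_all_types) {
    for (const AliasEntry& entry : alias_excerpt) {
        if (entry.name == name
            && (allow_all_types || permitted_in_cpp23(entry.type))) {
            return entry.code_point;
        }
    }
    return std::nullopt;
}

static_assert(!lookup_alias("NBSP", false).has_value());
static_assert(*lookup_alias("NBSP", true) == 0x00A0);
static_assert(*lookup_alias("NULL", false) == 0x0000);
```

In a design like this one, adopting the proposal amounts to dropping the type filter; how an actual implementation stores its name tables may of course differ.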

5. Wording

The following change is relative to [N5008].

Change [lex.universal.char] paragraph 3 as follows:

A universal-character-name that is a named-universal-character designates the corresponding character in the Unicode Standard (chapter 4.8 Name) if the n-char-sequence is equal to its character name or to one of its character name aliases of type “control”, “correction”, or “alternate”; otherwise, the program is ill-formed.

[Note: These aliases are listed in the Unicode Character Database's NameAliases.txt. None of these names or aliases have leading or trailing spaces. — end note]

6. References

[N5008] Thomas Köppe. Working Draft, Programming Languages — C++ 2025-03-15 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5008.pdf
[P2071R2] Tom Honermann et al. Named universal character escapes 2022-03-25 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2071r2.html
[P2736R2] Corentin Jabot. Referencing The Unicode Standard 2023-02-09 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2736r2.pdf
[UnicodeAliasStability] Unicode® Character Encoding Stability Policies — Formal Name Alias Stability https://www.unicode.org/policies/stability_policy.html#Formal_Name_Alias