Doc. no.: P0417R1
Date: 2016-11-25
Reply to: Beman Dawes <bdawes at acm dot org>
Audience: Core, Library

C++17 should refer to ISO/IEC 10646 2014 instead of 1994 (R1)

ISO standards are supposed to have normative references only to the latest version of other ISO standards, yet the C++17 CD still refers to ISO/IEC 10646-1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane.

This paper proposes updating the C++ standard to refer to ISO/IEC 10646:2014 and replacing the terms UCS2 and UCS4 with UTF-16 and UTF-32. National Body comment GB 4 requests updating the reference. NB comments US 64 and CA 9 implicitly support updating the reference, but explicitly request that UCS2 be retained.

Background 

There have been three revisions and numerous amendments to ISO/IEC 10646 since 1994. The changes that impact the C++17 CD include:

See http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html for a copy of ISO/IEC 10646:2014.

Discussion of the UCS2 to UTF-16 change

The term 'UCS2' is used only twice, both times in the specification of the C++11 header <codecvt> facets in [locale.stdcvt].

Rationale for the change to UTF-16:

UCS-2. UCS-2 stands for “Universal Character Set coded in 2 octets” and is also known as “the two-octet BMP form.” It was documented in earlier editions of 10646 as the two-octet (16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual Plane. This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions, and the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.
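The practical difference between the two forms can be sketched in a few lines of C++. This is an illustration only, not proposed wording: fits_ucs2, encode_utf16 and check_examples are hypothetical names introduced here. It assumes the usual UTF-16 rule that code points above U+FFFF are encoded as a surrogate pair, whereas UCS-2 has no way to represent them.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: true if the code point can be stored as a single
// 16-bit unit, as UCS-2 required (plane zero only, surrogates excluded).
bool fits_ucs2(char32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: encode one Unicode scalar value as UTF-16 code units.
// Returns an empty vector if cp is not a valid scalar value.
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return {};
    if (cp <= 0xFFFF) return {static_cast<char16_t>(cp)};
    const char32_t v = cp - 0x10000;  // 20 significant bits remain
    return {static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // low surrogate
}

// Usage: a BMP code point takes one unit in either form; a code point
// outside the BMP needs two UTF-16 units and has no UCS-2 representation.
void check_examples() {
    assert(fits_ucs2(0x00E9));
    assert(encode_utf16(0x00E9).size() == 1);
    assert(!fits_ucs2(0x1F600));
    assert(encode_utf16(0x1F600).size() == 2);
}
```

For code points in the BMP the two forms produce the same 16-bit value, which is why the distinction only becomes visible for supplementary-plane characters.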

Revision history

R1 - 2016 Post-Issaquah mailing

R0 - 2016 Post-Oulu mailing

Acknowledgements

Thanks to Richard Smith for encouraging me to write this paper.

Thanks to Tom Honermann for standardese discussions that led me to realize how out-of-date the ISO/IEC 10646-1:1993 reference was.

Proposed changes

Strike the wording marked [strike: ...] and add the wording marked [add: ...].

1.2 Normative references [intro.refs]

— ISO/IEC 10646[strike: -1:1993, Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane][add: :2014, Information technology — Universal Coded Character Set (UCS)]

22.5 Standard code conversion facets [locale.stdcvt]

For the facet codecvt_utf8:

— The facet shall convert between UTF-8 multibyte sequences and [strike: UCS2][add: UTF-16] or [strike: UCS4][add: UTF-32] (depending on the size of Elem) within the program.

...

For the facet codecvt_utf16:

— The facet shall convert between UTF-16 multibyte sequences and [strike: UCS2][add: UTF-16] or [strike: UCS4][add: UTF-32] (depending on the size of Elem) within the program.
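As an illustration of the 32-bit Elem case, the following sketch decodes UTF-8 bytes with codecvt_utf8<char32_t> through std::wstring_convert. The names decode_utf8 and check_decode are hypothetical and exist only for this example. Note that the <codecvt> facets and std::wstring_convert were deprecated in C++17, so a compiler may emit deprecation warnings, and later standard library modes may not provide them at all.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into 32-bit code units
// using the standard codecvt_utf8 facet with Elem = char32_t.
// std::wstring_convert::from_bytes throws std::range_error on malformed input.
std::u32string decode_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Usage: a two-byte and a four-byte UTF-8 sequence each decode to a
// single 32-bit element.
void check_decode() {
    assert(decode_utf8("\xC3\xA9") == std::u32string(1, char32_t(0x00E9)));
    assert(decode_utf8("\xF0\x9F\x98\x80") == std::u32string(1, char32_t(0x1F600)));
}
```

With a 32-bit Elem every code point occupies exactly one element, so this case is unaffected by the question of whether the 16-bit case is described as UCS2 or UTF-16.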