Document #: | P2728R6 |
Date: | 2023-07-11 |
Project: | Programming Language C++ |
Audience: | SG-16 Unicode, LEWG |
Reply-to: | Zach Laine <[email protected]> |
project_view
char32_t.
charN_t.
utfN_view to the types of the from-range, instead of the types of the transcoding iterators used to implement the view.
as_utfN() functions with the as_utfN view adaptors that should have been there all along.
transcoding_error_handler concept.
unpack_iterator_and_sentinel into a CPO.
null_sentinel_t to a non-Unicode-specific facility.
utf{8,16,32}_view with a single utf_view.
noexcept where appropriate.
code_unit concept, and added as_charN_t adaptors.
replacement_character.
utf_iterator slightly.
null_sentinel_t back to being Unicode-specific.
unpacking_owning_view with unpacking_view, and use it to do unpacking, rather than sometimes doing the unpacking in the adaptor.
const and non-const overloads for begin and end in all views.
null_sentinel_t to std, remove its base member function, and make it useful for more than just pointers, based on SG-9 guidance.
null_sentinel_t.
ranges::project_view, and implement charN_views in terms of that.
utfN_views to aliases, rather than individual classes.
Unicode is important to many, many users in everyday software. It is not exotic or weird. Well, it’s weird, but it’s not weird to see it used. C and C++ are the only major production languages with essentially no support for Unicode.
Let’s fix.
To fix, first we start with the most basic representations of strings in Unicode: UTF. You might get a UTF string from anywhere; on Windows you often get them from the OS, in UTF-16. In web-adjacent applications, strings are most commonly in UTF-8. In ASCII-only applications, everything is in UTF-8, by its definition as a superset of ASCII.
Often, an application needs to switch between UTFs: 8 -> 16, 32 -> 16, etc. In SG-16 we’ve taken to calling such UTF-N -> UTF-M operations “transcoding”.
I’m proposing interfaces to do transcoding that meet certain design requirements that I think are important; I hope you’ll agree:
[P1629R1] from JeanHeyd Meneide is a much more ambitious proposal that aims to standardize a general-purpose text encoding conversion mechanism. This proposal is not at odds with P1629; the two proposals have largely orthogonal aims. This proposal only concerns itself with UTF interconversions, which is all that is required for Unicode support. P1629 is concerned with those conversions, plus a lot more. Accepting both proposals would not cause problems; in fact, the APIs proposed here could be used to implement parts of the P1629 design.
There are some differences between the way that the transcode views and iterators from [P1629R1] work and the way the transcoding view and iterators from this paper work. First, std::text::transcode_view has no direct support for null-terminated strings. Second, it does not do the unpacking described in this paper. Third, it is not printable and streamable.
There are multiple encoding types defined in Unicode: UTF-8, UTF-16, and UTF-32.
A code unit is the lowest-level datum-type in your Unicode data. Examples are a char8_t in UTF-8 and a char32_t in UTF-32.
A code point is a 32-bit integral value that represents a single Unicode value. Examples are U+0041 “A” “LATIN CAPITAL LETTER A” and U+0308 “¨” “COMBINING DIAERESIS”.
A code point may consist of multiple code units. For instance, 3 UTF-8 code units in sequence may encode a particular code point.
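To make the code point versus code unit distinction concrete, here is a minimal sketch that computes how many UTF-8 code units a single code point needs. The name utf8_code_unit_count is hypothetical and not part of this proposal, and the function assumes its argument is a valid Unicode scalar value.

```cpp
#include <cassert>

// Hypothetical helper (not part of this proposal): the number of UTF-8 code
// units needed to encode the code point cp. Assumes cp is a valid scalar value.
constexpr int utf8_code_unit_count(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII range
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes, up to U+10FFFF
}

static_assert(utf8_code_unit_count(U'A') == 1);     // U+0041
static_assert(utf8_code_unit_count(0x0308) == 2);   // U+0308 COMBINING DIAERESIS
static_assert(utf8_code_unit_count(0x20AC) == 3);   // U+20AC EURO SIGN
static_assert(utf8_code_unit_count(0x1F600) == 4);  // outside the BMP
```

The same code point always occupies exactly one code unit in UTF-32, and one or two in UTF-16.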
In this case, we have a generic range interface to transcode into, so we use a transcoding view.
// A generic function that accepts sequences of UTF-16.
template<std::uc::utf16_range R>
void process_input(R r);
void process_input_again(std::uc::utf_view<std::uc::format::utf16, std::ranges::ref_view<std::string>> r);

std::u8string input = get_utf8_input();
auto input_utf16 = input | std::uc::as_utf16;

process_input(input_utf16);
process_input_again(input_utf16);
This time, we have a generic iterator interface we want to transcode into, so we want to use the transcoding iterators.
// A generic function that accepts sequences of UTF-16.
template<std::uc::utf16_iter I>
void process_input(I first, I last);

std::u8string input = get_utf8_input();

process_input(
    std::uc::utf_iterator<std::uc::format::utf8, std::uc::format::utf16, std::u8string::iterator>(
        input.begin(), input.begin(), input.end()),
    std::uc::utf_iterator<std::uc::format::utf8, std::uc::format::utf16, std::u8string::iterator>(
        input.begin(), input.end(), input.end()));

// Even more conveniently:
auto const utf16_view = input | std::uc::as_utf16;
process_input(utf16_view.begin(), utf16_view.end());
Let’s say that we want to take code points that we got from ICU, and transcode them to UTF-8. The problem is that ICU’s code point type is int. Since int is not a character type, it’s not deduced by as_utf8 to be UTF-32 data.
// A generic function that accepts sequences of UTF-8.
template<std::uc::utf8_range R>
void process_input(R r);

std::vector<int> input = get_icu_code_points();
// This is ill-formed without the as_char32_t adaptation.
auto input_utf8 = input | std::uc::as_char32_t | std::uc::as_utf8;

process_input(input_utf8);
Text processing is pretty useless without I/O. All of the Unicode algorithms operate on code points, and so the output of any of those algorithms will be in code points/UTF-32. It should be easy to print the results to a std::ostream, to a std::wostream on Windows, or using std::format and std::print. utf_view is therefore printable and streamable.
void double_print(char32_t const * str)
{
    auto utf8 = str | std::uc::as_utf8;
    std::print("{}", utf8);
    std::cerr << utf8;
}
This proposal depends on the existence of P2727 “std::iterator_interface”.
The macro CODE_UNIT_CONCEPT_OPTION_2 is used below to indicate the two options for how to define code_unit. See below for a description of the two options.
namespace std::uc {
  enum class format { utf8 = 1, utf16 = 2, utf32 = 4 };

  inline constexpr format wchar-t-format = see below; // exposition only

  template<class T, format F>
    concept code_unit = (same_as<T, char8_t> && F == format::utf8) ||
                        (same_as<T, char16_t> && F == format::utf16) ||
                        (same_as<T, char32_t> && F == format::utf32)
#if CODE_UNIT_CONCEPT_OPTION_2
                        || (same_as<T, char> && F == format::utf8)
                        || (same_as<T, wchar_t> && F == wchar-t-format)
#endif
      ;

  template<class T>
    concept utf8_code_unit = code_unit<T, format::utf8>;
  template<class T>
    concept utf16_code_unit = code_unit<T, format::utf16>;
  template<class T>
    concept utf32_code_unit = code_unit<T, format::utf32>;
  template<class T>
    concept utf_code_unit = utf8_code_unit<T> || utf16_code_unit<T> || utf32_code_unit<T>;

  template<class T, format F>
    concept code_unit_iter =
      input_iterator<T> && code_unit<iter_value_t<T>, F>;
  template<class T, format F>
    concept code_unit_pointer =
      is_pointer_v<T> && code_unit<iter_value_t<T>, F>;
  template<class T, format F>
    concept code_unit_range = ranges::input_range<T> &&
      code_unit<ranges::range_value_t<T>, F>;

  template<class T>
    concept utf8_iter = code_unit_iter<T, format::utf8>;
  template<class T>
    concept utf8_pointer = code_unit_pointer<T, format::utf8>;
  template<class T>
    concept utf8_range = code_unit_range<T, format::utf8>;

  template<class T>
    concept utf16_iter = code_unit_iter<T, format::utf16>;
  template<class T>
    concept utf16_pointer = code_unit_pointer<T, format::utf16>;
  template<class T>
    concept utf16_range = code_unit_range<T, format::utf16>;

  template<class T>
    concept utf32_iter = code_unit_iter<T, format::utf32>;
  template<class T>
    concept utf32_pointer = code_unit_pointer<T, format::utf32>;
  template<class T>
    concept utf32_range = code_unit_range<T, format::utf32>;

  template<class T>
    concept utf_iter = utf8_iter<T> || utf16_iter<T> || utf32_iter<T>;
  template<class T>
    concept utf_pointer = utf8_pointer<T> || utf16_pointer<T> || utf32_pointer<T>;
  template<class T>
    concept utf_range = utf8_range<T> || utf16_range<T> || utf32_range<T>;

  template<class T>
    concept utf_range_like =
      utf_range<remove_reference_t<T>> || utf_pointer<remove_reference_t<T>>;

  template<class T>
    concept utf8_input_range_like =
      (ranges::input_range<remove_reference_t<T>> && utf8_code_unit<iter_value_t<T>>) ||
      utf8_pointer<remove_reference_t<T>>;
  template<class T>
    concept utf16_input_range_like =
      (ranges::input_range<remove_reference_t<T>> && utf16_code_unit<iter_value_t<T>>) ||
      utf16_pointer<remove_reference_t<T>>;
  template<class T>
    concept utf32_input_range_like =
      (ranges::input_range<remove_reference_t<T>> && utf32_code_unit<iter_value_t<T>>) ||
      utf32_pointer<remove_reference_t<T>>;

  template<class T>
    concept utf_input_range_like =
      utf8_input_range_like<T> || utf16_input_range_like<T> || utf32_input_range_like<T>;

  template<class T>
    concept transcoding_error_handler =
      requires (T t, string_view msg) { { t(msg) } -> same_as<char32_t>; };
}
There are two options for how the code_unit concept is defined.
Option 1 is represented by CODE_UNIT_CONCEPT_OPTION_2 == 0 in the code above. In this option, a code unit must be one of char8_t, char16_t, and char32_t.
Option 2 is represented by CODE_UNIT_CONCEPT_OPTION_2 == 1 in the code above. In this option, a code unit must be a character type. This includes the charN_t character types from Option 1, plus char and wchar_t. The value of wchar-t-format is implementation defined, but must be uc::format::utf16 or uc::format::utf32.
Here are some examples of the differences between Options 1 and 2. The as_utfN and as_charN_t adaptors are discussed later in this paper.
The as_utfN adaptors produce utfN_views, which do transcoding.
The as_charN_t adaptors produce charN_views that are each very similar to a transform_view that casts each element of the adapted range to a charN_t value. A charN_view differs from the equivalent transform in that it may be a borrowed range, and that the utfN_view views know about the charN_views, and can optimize away the work that would be done by the charN_view. This turns charN_view into a no-op when nested within a utfN_view.
Note the use of charN_t below with std::wstring. That’s there because whether you write as_char16_t or as_char32_t is implementation-dependent.
Option 1 | Option 2 |
---|---|
| |
In short, Option 1 forces you to write “| as_char8_t” everywhere you want to use a std::string with the interfaces proposed in this paper.
Option 1 is supported by most of SG-16. Here is the relevant SG-16 poll:
UTF transcoding interfaces provided by the C++ standard library should operate on charN_t types, with support for other types provided by adapters, possibly with a special case for char and wchar_t when their associated literal encodings are UTF.
SF | F | N | A | SA |
---|---|---|---|---|
6 | 1 | 0 | 0 | 1 |
(I have chosen to ignore the “possibly with a special case for char and wchar_t when their associated literal encodings are UTF” part. Making the evaluation of a concept change based on the literal encoding seems like a flaky move to me; the literal encoding can change TU to TU.)
The feeling in SG-16 is that the charN_t types are designed to represent UTF encodings, and char is not. A char const * string could be in any one of dozens (hundreds?) of encodings. The addition of “| as_char8_t” to adapt ranges of char is meant to act as a lexical indicator of user intent.
I believe this decision is a mistake. I would very, very much not like to standardize Unicode interfaces that do not easily interoperate with std::string.
This is my reasoning:
First, char and char8_t maintain exactly the same set of invariants – the empty set. Note that this is true even for string literals. The encoding of u8"text" is not necessarily UTF-8! It depends on the flags you pass to your compiler. Those flags are allowed to vary TU by TU. I have been bitten by the “u8 does not necessarily mean UTF-8” oddity of MSVC before.
Second, “| as_char8_t” is a no-op when used with utfN_view/utf_view. It does not actually do anything to help you get your program’s text into UTF-8 encoding, nor to detect that you have non-UTF-8 encoded text in your program.
Third, people use std::string a lot. They use char string literals a lot. They use std::u8string and char8_t string literals almost not at all. Using GitHub Code Search, I found 15.3M references to std::string and 6.7k references to std::u8string. Even if everyone were to switch from std::string to std::u8string today, we would still have to deal with lots and lots of char const * strings for C API compatibility.
Finally, whether a given range of code units is properly UTF encoded may be a precondition of a given API that the user writes, but it is not a precondition of any API proposed in this paper, nor is it a precondition of any API I’m proposing in the papers that will follow this one.
In short, I think "text" | std::uc::as_utf32 should “just work”. Making users write "text" | std::uc::as_char8_t | std::uc::as_utf32, when that does not increase correctness or efficiency – and produces no different object code – seems wrongheaded to me. Users that want the extra explicitness can still write the longer version under both options. Users that do not want this explicitness should not be forced to write it.
namespace std {
  struct null_sentinel_t {
    template<input_iterator I>
      requires default_initializable<iter_value_t<I>> &&
               equality_comparable_with<iter_reference_t<I>, iter_value_t<I>>
    friend constexpr auto operator==(I it, null_sentinel_t) { return *it == iter_value_t<I>{}; }
  };

  inline constexpr null_sentinel_t null_sentinel;
}
This sentinel type matches any iterator position it at which *it is equal to a default-constructed object of type iter_value_t<I>. This works for null-terminated strings, but can also serve as the sentinel for any forward range terminated by a default-constructed value.
Because this type is potentially useful for lots of ranges unrelated to Unicode or text, it is in the std namespace, not std::uc.
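As an illustration of the idea, here is a simplified stand-in for such a sentinel. It is a sketch only: my_null_sentinel_t and length_until_null are hypothetical names, and the comparison is restricted to pointers so that the example does not depend on the iterator concepts used in the proposed definition.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the proposed sentinel (hypothetical name), limited
// to pointers. It compares equal to any position holding a value-initialized
// element.
struct my_null_sentinel_t {
    template<class T>
    friend constexpr bool operator==(const T* p, my_null_sentinel_t) { return *p == T{}; }
    template<class T>
    friend constexpr bool operator!=(const T* p, my_null_sentinel_t s) { return !(p == s); }
};
inline constexpr my_null_sentinel_t my_null_sentinel{};

// Counts the elements that precede the terminating value-initialized element.
template<class T>
constexpr std::size_t length_until_null(const T* first) {
    std::size_t n = 0;
    for (; first != my_null_sentinel; ++first)
        ++n;
    return n;
}
```

Nothing here is specific to text: the same loop works for a zero-terminated array of int, which is the reason given above for placing the type outside std::uc.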
If you’re wondering why ITER_CONCEPT is used instead of directly requiring forward_iterator<I>, it’s because the latter causes recursion in a check of equality_comparable within forward_iterator.
I’m using P2727’s iterator_interface here for simplicity.
First, the synopsis:
namespace std::uc {
  inline constexpr char32_t replacement_character = 0xfffd;

  struct use_replacement_character {
    constexpr char32_t operator()(string_view error_msg) const noexcept;
  };

  template<format Format>
  constexpr auto format-to-type() { // exposition only
    if constexpr (Format == format::utf8) {
      return char8_t{};
    } else if constexpr (Format == format::utf16) {
      return char16_t{};
    } else {
      return char32_t{};
    }
  }

  template<format Format>
  using format-to-type-t = decltype(format-to-type<Format>()); // exposition only

  template<
    format FromFormat,
    format ToFormat,
    input_iterator I,
    sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
    requires convertible_to<iter_value_t<I>, format-to-type-t<FromFormat>>
  class utf_iterator;
}
Then the definitions:
namespace std::uc {
  template<class I>
  constexpr auto bidirectional-at-most() { // exposition only
    if constexpr (bidirectional_iterator<I>) {
      return bidirectional_iterator_tag{};
    } else if constexpr (forward_iterator<I>) {
      return forward_iterator_tag{};
    } else if constexpr (input_iterator<I>) {
      return input_iterator_tag{};
    }
  }

  template<class I>
  using bidirectional-at-most-t = decltype(bidirectional-at-most<I>()); // exposition only

  template<typename I, bool SupportReverse = bidirectional_iterator<I>>
  struct first-and-curr { // exposition only
    first-and-curr() = default;
    first-and-curr(I curr) : curr{curr} {}
    template<class I2>
      requires convertible_to<I2, I>
    first-and-curr(const first-and-curr<I2>& other) : curr{other.curr} {}

    I curr;
  };
  template<typename I>
  struct first-and-curr<I, true> { // exposition only
    first-and-curr() = default;
    first-and-curr(I first, I curr) : first{first}, curr{curr} {}
    template<class I2>
      requires convertible_to<I2, I>
    first-and-curr(const first-and-curr<I2>& other) : first{other.first}, curr{other.curr} {}

    I first;
    I curr;
  };

  struct use_replacement_character {
    constexpr char32_t operator()(string_view) const noexcept { return replacement_character; }
  };
  template<
    format FromFormat,
    format ToFormat,
    input_iterator I,
    sentinel_for<I> S,
    transcoding_error_handler ErrorHandler>
    requires convertible_to<iter_value_t<I>, format-to-type-t<FromFormat>>
  class utf_iterator : public iterator_interface<
    bidirectional-at-most-t<I>,
    format-to-type-t<ToFormat>,
    format-to-type-t<ToFormat>> {
  public:
    using value_type = format-to-type-t<ToFormat>;

    constexpr utf_iterator() = default;

    constexpr utf_iterator(I first, I it, S last) requires bidirectional_iterator<I>
      : first_and_curr_{first, it}, last_(last) {
      if (curr() != last_)
        read();
    }
    constexpr utf_iterator(I it, S last) requires (!bidirectional_iterator<I>)
      : first_and_curr_{it}, last_(last) {
      if (curr() != last_)
        read();
    }

    template<class I2, class S2>
      requires convertible_to<I2, I> && convertible_to<S2, S>
    constexpr utf_iterator(const utf_iterator<FromFormat, ToFormat, I2, S2, ErrorHandler>& other) :
      buf_(other.buf_),
      first_and_curr_(other.first_and_curr_),
      buf_index_(other.buf_index_),
      buf_last_(other.buf_last_),
      last_(other.last_)
    {}

    constexpr I begin() const requires bidirectional_iterator<I> { return first(); }
    constexpr S end() const { return last_; }

    constexpr I base() const requires forward_iterator<I> { return curr(); }

    constexpr value_type operator*() const { return buf_[buf_index_]; }

    constexpr utf_iterator& operator++() {
      if (buf_index_ + 1 == buf_last_ && curr() != last_) {
        if constexpr (forward_iterator<I>) {
          advance(curr(), to_increment_);
        }
        if (curr() == last_)
          buf_index_ = 0;
        else
          read();
      } else if (buf_index_ + 1 <= buf_last_) {
        ++buf_index_;
      }
      return *this;
    }

    constexpr utf_iterator& operator--() requires bidirectional_iterator<I> {
      if (!buf_index_ && curr() != first())
        read_reverse();
      else if (buf_index_)
        --buf_index_;
      return *this;
    }

    friend constexpr bool operator==(utf_iterator lhs, utf_iterator rhs)
      requires forward_iterator<I> || requires (I i) { i != i; } {
      if constexpr (forward_iterator<I>) {
        return lhs.curr() == rhs.curr() && lhs.buf_index_ == rhs.buf_index_;
      } else {
        if (lhs.curr() != rhs.curr())
          return false;

        if (lhs.buf_index_ == rhs.buf_index_ &&
            lhs.buf_last_ == rhs.buf_last_) {
          return true;
        }

        return lhs.buf_index_ == lhs.buf_last_ &&
               rhs.buf_index_ == rhs.buf_last_;
      }
    }

    friend constexpr bool operator==(utf_iterator lhs, S rhs) {
      if constexpr (forward_iterator<I>) {
        return lhs.curr() == rhs;
      } else {
        return lhs.curr() == rhs && lhs.buf_index_ == lhs.buf_last_;
      }
    }

    using base_type = // exposition only
      iterator_interface<bidirectional-at-most-t<I>, value_type, value_type>;
    using base_type::operator++;
    using base_type::operator--;

  private:
    constexpr void read(); // exposition only
    constexpr void read_reverse(); // exposition only

    constexpr I first() const requires bidirectional_iterator<I> // exposition only
      { return first_and_curr_.first; }
    constexpr I& curr() { return first_and_curr_.curr; } // exposition only
    constexpr I curr() const { return first_and_curr_.curr; } // exposition only

    array<value_type, 4 / static_cast<int>(ToFormat)> buf_; // exposition only

    first-and-curr<I> first_and_curr_; // exposition only

    uint8_t buf_index_ = 0; // exposition only
    uint8_t buf_last_ = 0; // exposition only
    uint8_t to_increment_ = 0; // exposition only

    [[no_unique_address]] S last_; // exposition only

    template<
      format FromFormat2,
      format ToFormat2,
      code_unit_iter<FromFormat2> I2,
      sentinel_for<I2> S2,
      transcoding_error_handler ErrorHandler2>
    friend class utf_iterator;
  };
}
use_replacement_character is an error handler type that can be used with utf_iterator. It accepts a string_view error message, and returns the replacement character. The user can substitute their own type here, which may throw, abort, log, etc.
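As a sketch of what a substituted handler could look like, the block below defines two callables with the shape the transcoding_error_handler concept asks for: invocable with a string_view and returning char32_t. The names replace_on_error, throw_on_error, and report_ill_formed are hypothetical; report_ill_formed only stands in for the point at which a transcoder would report ill-formed input.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Same shape as the default handler described above: ignore the message and
// substitute U+FFFD.
struct replace_on_error {
    constexpr char32_t operator()(std::string_view) const noexcept { return U'\uFFFD'; }
};

// Hypothetical alternative: treat ill-formed input as a hard error.
struct throw_on_error {
    char32_t operator()(std::string_view msg) const {
        throw std::invalid_argument(std::string(msg));
    }
};

// Stand-in for the place where a transcoder reports ill-formed input.
template<class ErrorHandler>
char32_t report_ill_formed(ErrorHandler handler) {
    return handler("ill-formed UTF-8 sequence");
}
```

With the first handler the caller simply receives U+FFFD and carries on; with the second, the exception propagates out of whatever operation triggered the decode.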
utf_iterator is an iterator that transcodes from UTF-N to UTF-M, where N and M are each one of 8, 16, or 32. N may equal M. UTF-N to UTF-N operation invokes the error handler as appropriate, but does not change format.
utf_iterator does its work by adapting an underlying range of code units. Each code point c to be transcoded is decoded from FromFormat in the underlying range. c is then encoded to ToFormat into an internal buffer. If ill-formed UTF is encountered during the decoding step, c is whatever invoking the error handler returns; using the default error handler, this is replacement_character.
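The encode half of that process can be sketched as follows for a UTF-16 target. This is not the proposed implementation: utf16_buffer and encode_utf16 are hypothetical names, and substituting U+FFFD for a surrogate or out-of-range value is an assumption made here to mirror the behavior of the default error handler.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the internal buffer when the target is UTF-16.
// Two code units are always enough to hold one code point.
struct utf16_buffer {
    std::array<char16_t, 2> units{};
    std::uint8_t size = 0;  // number of valid code units in `units`
};

constexpr utf16_buffer encode_utf16(char32_t c) {
    // Assumption: values that are not Unicode scalar values are replaced by
    // U+FFFD before encoding.
    if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF)
        c = 0xFFFD;
    utf16_buffer buf{};
    if (c < 0x10000) {
        buf.units[0] = static_cast<char16_t>(c);
        buf.size = 1;
    } else {
        c -= 0x10000;
        buf.units[0] = static_cast<char16_t>(0xD800 + (c >> 10));    // high surrogate
        buf.units[1] = static_cast<char16_t>(0xDC00 + (c & 0x3FF));  // low surrogate
        buf.size = 2;
    }
    return buf;
}
```

An iterator built on such a buffer hands out units[0] and then, if size is 2, units[1], before decoding the next code point from the underlying range.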
utf_iterator maintains certain invariants; the invariants differ based on whether utf_iterator is an input iterator.
For input iterators the invariant is: if *this is at the end of the range being adapted, then curr() == last_; otherwise, the position of curr() is always at the end of the current code point c within the range being adapted, and buf_ contains the code units in ToFormat that comprise c.
For forward and bidirectional iterators, the invariant is: if *this is at the end of the range being adapted, then curr() == last_; otherwise, the position of curr() is always at the beginning of the current code point c within the range being adapted, and buf_ contains the code units in ToFormat that comprise c.
When ill-formed UTF is encountered in the range being adapted, utf_iterator calls ErrorHandler{}.operator() to produce a character to represent the ill-formed sequence. The number and position of error handler invocations within the transcoded output is the same, whether the range being adapted is traversed forward or backward. The number and position of the error handler invocations should use the “substitution of maximal subparts” approach described in Chapter 3 of the Unicode standard.
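To show what that policy means in practice for forward traversal, here is a minimal eager UTF-8 decoder that emits one U+FFFD per maximal ill-formed subpart. It is a sketch under stated assumptions, not the proposed design: decode_utf8_lossy is a hypothetical name, the replacement is hard-coded instead of going through an error handler, and the function is eager where utf_iterator is lazy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decodes UTF-8 to UTF-32, replacing each maximal
// ill-formed subpart with a single U+FFFD.
inline std::u32string decode_utf8_lossy(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i++]);
        if (b0 < 0x80) { out.push_back(b0); continue; }  // ASCII
        int needed = 0;                      // continuation bytes still expected
        char32_t cp = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // valid range for the next byte
        if (b0 >= 0xC2 && b0 <= 0xDF) { needed = 1; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) {
            needed = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;       // rejects overlong forms
            if (b0 == 0xED) hi = 0x9F;       // rejects surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            needed = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;       // rejects overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // rejects values above U+10FFFF
        } else { out.push_back(0xFFFD); continue; }  // byte can never start a sequence
        bool ok = true;
        for (int k = 0; k < needed; ++k) {
            if (i >= in.size()) { ok = false; break; }   // truncated input
            const unsigned char b = static_cast<unsigned char>(in[i]);
            if (b < lo || b > hi) { ok = false; break; } // leave b for the next round
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;            // later bytes use the full range
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}
```

The key property is that a byte which breaks a sequence is not consumed by the failed sequence; it is examined again as the possible start of the next one. That is what keeps the number of replacement characters fixed for a given input.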
Besides the constructors, no member function of utf_iterator has preconditions. As long as a utf_iterator i is constructed with proper arguments, all subsequent operations on i are memory safe. This includes decrementing a utf_iterator at the beginning of the range being adapted, and incrementing or dereferencing a utf_iterator at the end of the range being adapted.
If FromFormat and ToFormat are not each one of format::utf8, format::utf16, or format::utf32, the program is ill-formed.
If input_iterator<I> is true, noexcept(ErrorHandler{}("")) must be true as well; otherwise, the program is ill-formed.
The exposition-only member function read decodes the code point c as FromFormat starting from position curr() in the range being adapted (c may be replacement_character); sets to_increment_ to the number of code units read while decoding c; encodes c as ToFormat into buf_; sets buf_index_ to 0; and sets buf_last_ to the number of code units encoded into buf_. If forward_iterator<I> is true, curr() is set to the position it had before read was called. If an exception is thrown during a call to read, the call to read has no effect.
The exposition-only member function read_reverse decodes the code point c as FromFormat ending at position curr() in the range being adapted (c may be replacement_character); sets to_increment_ to the number of code units read while decoding c; encodes c as ToFormat into buf_; sets buf_last_ to the number of code units encoded into buf_; and sets buf_index_ to buf_last_ - 1. If an exception is thrown during a call to read_reverse, the call to read_reverse has no effect.
Why utf_iterator is constrained the way it is
The template parameter I to utf_iterator is not constrained with code_unit_iter<FromFormat> as it was in earlier revisions of this paper. Instead, I must be an input_iterator whose value type is convertible to format-to-type-t<FromFormat>. This allows two uses of utf_iterator that the previous constraint would not.
First, utf_iterator can be used to adapt an iterator whose value type is some non-character type. This is useful in general, since lots of existing Unicode-aware user code uses uint32_t for UTF-32, or short for UTF-16, and so on. It is useful in particular because ICU uses int for its UTF-32/code point type.
Second, because of the first point, adaptations of ranges of non-character types can be made more efficient. Consider:
std::vector<int> code_points_from_icu = /* ... */;
auto v = code_points_from_icu | std::uc::as_char32_t | std::uc::as_utf8;
auto first = v.begin();
The type of first is:
std::uc::utf_iterator<std::uc::format::utf32, std::uc::format::utf8, std::vector<int>::iterator>
That is, the adapting iterator that as_char32_t uses is gone. This makes using as_char32_t more efficient, when used in conjunction with as_utfN. If utf_iterator’s I were required to be a utf_iter, this optimization would not work.
Why utf_iterator is not a nested type within utf_view
Most users will use views most of the time. However, it can be useful to use iterators some of the time. For example, say I wanted to track some user-visible cursor within some bit of text. If I wanted to represent that cursor independently from the view within which it is found, it can be awkward to do so without an independent iterator template.
// This is the easy case. We have the View right there, and can use
// ranges::iterator_t to get its iterator type.
template<typename View>
struct my_state_type
{
    View all_text_;
    std::ranges::iterator_t<View> current_position_;
    // other state ...
};

// This one, not so much. Since we don't have the View type, we have to make
// the type of current_position_ a template parameter, even if there's only one
// type ever in use for a given view.
template<typename Iterator>
struct my_other_state_type
{
    Iterator current_position_;
    // other state ...
};
Using utf_iterator allows us to write more specific code. Sometimes, generic code is more desirable; sometimes nongeneric code is more desirable.
struct my_other_state_type
{
    std::uc::utf_iterator<format::utf8, format::utf32, char const*> current_position_;
    // other state ...
};
Further, utf_iterator has configurability options that do not work for utfN_view, like the ErrorHandler template parameter. This will not be used often, but some users will want it sometimes. I don’t think such alternate uses are going to be common enough to justify complicating utfN_view; those uses belong in a lower-level interface like utf_iterator.
utf_iterator specializations
namespace std::uc {
  template<
    utf8_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_8_to_16_iterator =
    utf_iterator<format::utf8, format::utf16, I, S, ErrorHandler>;
  template<
    utf16_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_16_to_8_iterator =
    utf_iterator<format::utf16, format::utf8, I, S, ErrorHandler>;

  template<
    utf8_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_8_to_32_iterator =
    utf_iterator<format::utf8, format::utf32, I, S, ErrorHandler>;
  template<
    utf32_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_32_to_8_iterator =
    utf_iterator<format::utf32, format::utf8, I, S, ErrorHandler>;

  template<
    utf16_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_16_to_32_iterator =
    utf_iterator<format::utf16, format::utf32, I, S, ErrorHandler>;
  template<
    utf32_iter I,
    std::sentinel_for<I> S = I,
    transcoding_error_handler ErrorHandler = use_replacement_character>
  using utf_32_to_16_iterator =
    utf_iterator<format::utf32, format::utf16, I, S, ErrorHandler>;
}
These aliases make it easier to spell utf_iterators. Consider utf_8_to_32_iterator<char const *> versus utf_iterator<format::utf8, format::utf32, char const *>. More importantly, they allow CTAD to work, as in utf_8_to_32_iterator(first, it, last).
These aliases are completely optional, of course. Let us poll.
unpack_iterator_and_sentinel CPO for iterator “unpacking”
struct no_op_repacker {
  template<class T> T operator()(T x) const { return x; }
};

template<format FormatTag, utf_iter I, sentinel_for<I> S, class Repack>
struct unpack_result {
  static constexpr format format_tag = FormatTag;

  I first;
  [[no_unique_address]] S last;
  [[no_unique_address]] Repack repack;
};

// CPO equivalent to:
template<utf_iter I, sentinel_for<I> S, class Repack = no_op_repacker>
constexpr auto unpack_iterator_and_sentinel(I first, S last, Repack repack = Repack());
Any utf_iterator ti contains two iterators and a sentinel. If one were to adapt ti in another transcoding iterator ti2, one quickly encounters a problem – since for example utf_iterator<format::utf32, format::utf16, utf_iterator<format::utf8, format::utf32, char const *>> would be the size of 9 pointers! Further, such an iterator would do a UTF-8 to UTF-32 to UTF-16 conversion, when it could have done a direct UTF-8 to UTF-16 conversion instead.
One would obviously never write a type like the monstrosity above. However, it is quite possible to accidentally construct one in generic code. Consider:
using namespace std::uc;

template<format IterFormat, typename Iter>
void f(Iter it, null_sentinel_t) {
#if _MSC_VER
    // On Windows, do something with 'it' that requires UTF-16.
    utf_iterator<IterFormat, format::utf16, Iter, null_sentinel_t> it16;
    windows_function(it16, null_sentinel);
#endif

    // ... etc.
}

int main(int argc, char const * argv[]) {
    utf_iterator<format::utf8, format::utf32, char const *, null_sentinel_t> it(argv[1], null_sentinel);

    f<format::utf32>(it, null_sentinel);

    // ... etc.
}
This example is a bit contrived, since users will not create iterators directly like this very often. Users are much more likely to use the utfN_view views and as_utfN view adaptors being proposed below. The view adaptors are defined in such a way that they avoid this problem altogether. They do this by unpacking the view they are adapting before adapting it. For instance:
std::u8string str = u8"some text";

auto utf16_str = str | std::uc::as_utf16;

static_assert(std::same_as<
    decltype(utf16_str.begin()),
    std::uc::utf_iterator<std::uc::format::utf8, std::uc::format::utf16, std::u8string::iterator>
>);

auto utf32_str = utf16_str | std::uc::as_utf32;

// Poof! The utf_iterator<format::utf8, format::utf16> iterator disappeared!
static_assert(std::same_as<
    decltype(utf32_str.begin()),
    std::uc::utf_iterator<std::uc::format::utf8, std::uc::format::utf32, std::u8string::iterator>
>);
The unpacking logic is used in the view adaptors, as shown above. This allows you to write r | std::uc::as_utf32 in a generic context, without caring whether r is a range of UTF-8, UTF-16, or UTF-32. You also do not need to care about whether r is a common range or not. You can also ignore whether r consists of raw pointers, some other kind of iterator, or transcoding iterators.
This becomes especially useful in the APIs proposed in later papers that depend on this paper. In particular, APIs in subsequent papers accept any UTF-N iterator, and then transcode internally to UTF-32. However, this creates a minor problem for some algorithms. Consider this algorithm (not proposed) as an example.
template<input_iterator I, sentinel_for<I> S, output_iterator<char32_t> O>
    requires (utf8_code_unit<iter_value_t<I>> || utf16_code_unit<iter_value_t<I>>)
transcode_result<I, O> transcode_to_utf32(I first, S last, O out);
Such a transcoding algorithm is pretty similar to std::ranges::copy, in that you should return both the output iterator and the final position of the input iterator (transcode_result is an alias for in_out_result). For such interfaces, it can be difficult in the general case to form an iterator of type I to return to the user:
template<input_iterator I, sentinel_for<I> S, output_iterator<char32_t> O>
    requires (utf8_code_unit<iter_value_t<I>> || utf16_code_unit<iter_value_t<I>>)
transcode_result<I, O> transcode_to_utf32(I first, S last, O out) {
    // Get the input as UTF-32. This may involve unpacking, so possibly decltype(r.begin()) != I.
    auto r = ranges::subrange(first, last) | uc::as_utf32;

    // Do transcoding.
    auto copy_result = ranges::copy(r, out);

    // Return an in_out_result.
    return transcode_result<I, O>{/* ??? */, copy_result.out};
}
What should we write for
/* ??? */
? That is, how do we
get back from the UTF-32 iterator
r.begin()
to an
I
iterator? It’s harder than it
first seems; consider the case where
I
is std::uc::utf_16_to_32_iterator<std::uc::utf_8_to_16_iterator<std::string::iterator>>
.
The solution is for the unpacking algorithm to remember the structure of
whatever iterator it unpacks, and then rebuild the structure when
returning the result. To demonstrate, here is the implementation of
transcode_to_utf32
from
Boost.Text:
template<std::input_iterator I, std::sentinel_for<I> S, std::output_iterator<char32_t> O>
    requires (utf8_code_unit<std::iter_value_t<I>> || utf16_code_unit<std::iter_value_t<I>>)
transcode_result<I, O> transcode_to_utf32(I first, S last, O out)
{
    auto const r = boost::text::unpack_iterator_and_sentinel(first, last);
    auto unpacked = detail::transcode_to_32<false>(
        detail::tag_t<r.format_tag>, r.first, r.last, -1, out);
    return {r.repack(unpacked.in), unpacked.out};
}
Note the call to r.repack
.
This is an invocable created by the unpacking process itself.
If this all sounds way too complicated, it’s not bad at all. Here’s the unpacking/repacking implementation from Boost.Text: unpack.hpp.
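To make the "remember the structure, then rebuild it" idea concrete, here is a minimal sketch. It is not the Boost.Text implementation and not the proposed CPO; `wrapper`, `unpack_result`, `identity_repack`, `unpack`, and `unpack_example` are hypothetical names used only for illustration, and `wrapper` stands in for a transcoding iterator that merely holds another iterator.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a transcoding iterator; it only wraps another iterator.
template<class I>
struct wrapper {
    I it;
    I base() const { return it; }
};

// What unpacking yields: the innermost iterator, plus a callable that rebuilds
// the original wrapper structure around any innermost iterator.
template<class I, class Repack>
struct unpack_result {
    I first;
    Repack repack;
};

// Repacker for iterators that were never wrapped: hands the iterator back.
struct identity_repack {
    template<class T>
    T operator()(T x) const { return x; }
};

// Base case: a plain iterator is already unpacked.
template<class I>
auto unpack(I it) {
    return unpack_result<I, identity_repack>{it, identity_repack{}};
}

// Recursive case: peel one wrapper layer and remember how to put it back.
template<class I>
auto unpack(wrapper<I> w) {
    auto inner = unpack(w.base());
    auto repack = [r = inner.repack](auto raw) { return wrapper<I>{r(raw)}; };
    return unpack_result<decltype(inner.first), decltype(repack)>{inner.first, repack};
}

inline void unpack_example() {
    std::string s = "abc";
    using inner_t = wrapper<std::string::iterator>;
    using outer_t = wrapper<inner_t>;
    outer_t it{inner_t{s.begin()}};

    auto r = unpack(it);  // r.first is the innermost std::string::iterator
    static_assert(std::is_same_v<decltype(r.first), std::string::iterator>);

    auto rebuilt = r.repack(r.first + 2);  // both wrapper layers are restored
    static_assert(std::is_same_v<decltype(rebuilt), outer_t>);
    assert(*rebuilt.base().base() == 'c');
}
```

An algorithm written this way does its work on `r.first` and only calls `r.repack` once, on the iterator it wants to hand back to the caller.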
unpack_iterator_and_sentinel
is a CPO. It is intended to work with UDTs that provide their own
unpacking implementation. It returns an
unpack_result
.
In telecon review, some concerns were voiced about the name
uc::unpack_iterator_and_sentinel
.
Some people felt that the name should include some mention of “UTF” or
“transcoding” or “Unicode”. I think that it’s fine as-is, since it’s in
namespace std::uc
, but a poll on
renaming might be in order. I suggest uc::unpack_utf_iterator_and_sentinel
as a possible alternative.
Input iterators are messed up. They barely resemble the other
iterators. For one thing, they are single-pass. This means that when a
utf_iterator
adapting an input
iterator reads the next code point from the range it is adapting, it
must leave the iterator at a location that is just after the current
code point. It has no choice, since it cannot backtrack.
It is possible to unpack an input iterator in an entirely different way than other iterators. The unpack operation for input iterators could be to produce the underlying code unit iterator (the adapted input iterator itself), plus the current code point that the input iterator was just used to read.
However, this is not very much help. Consider a case in which we need
to unpack a UTF-8 to UTF-32 transcoding iterator so we can form a UTF-8
to UTF-16 iterator instead. The unpack operation will produce an
unpacked input transcoding iterator – the moral equivalent of
std::pair<I, char32_t>
.
What can you do with this? Well, you can try to construct a utf_iterator<format::utf8, format::utf16, I>
from it. That would mean adding a constructor that takes an input
iterator and a char32_t
. This
would also mean that any user transcoding iterator types that are usable
with the
unpack_iterator_and_sentinel
CPO
would also need to unpack their input iterator into an iterator/code
point pair, and that those user types would also need to add this odd
constructor.
This is all weird. It’s also a pretty small use case. People don’t use input iterators that often. Since this can always be added later, it is not being proposed right now.
project_view
This template is a
std::ranges
view and adaptor
that makes the implementation of the code unit views and adaptors nearly
trivial. It is being added based on input from SG-9. No one in the SG-9
telecon could think of a name everyone liked; suggestions are
welcome.
namespace std::ranges {
template<input_range V, auto F>
requires view<V> &&
regular_invocable<decltype(F)&, range_reference_t<V>> &&
can-reference<invoke_result_t<decltype(F)&, range_reference_t<V>>>
class project_view : public view_interface<project_view<V, F>>
{
V base_ = V(); // exposition only
template<bool Const>
class iterator; // exposition only
template<bool Const>
class sentinel; // exposition only
public:
constexpr project_view() requires default_initializable<V> = default;
constexpr explicit project_view(V base) : base_(std::move(base)) {}
constexpr V& base() & { return base_; }
constexpr const V& base() const& { return base_; }
constexpr V base() && { return std::move(base_); }
constexpr iterator<false> begin() { return iterator<false>{ranges::begin(base_)}; }
constexpr iterator<true> begin() const requires range<const V>
{ return iterator<true>{ranges::begin(base_)}; }
constexpr sentinel<false> end() { return sentinel<false>{ranges::end(base_)}; }
constexpr iterator<false> end() requires common_range<V> { return iterator<false>{ranges::end(base_)}; }
constexpr sentinel<true> end() const requires range<const V> { return sentinel<true>{ranges::end(base_)}; }
constexpr iterator<true> end() const requires common_range<const V>
{ return iterator<true>{ranges::end(base_)}; }
constexpr auto size() requires sized_range<V> { return ranges::size(base_); }
constexpr auto size() const requires sized_range<const V> { return ranges::size(base_); }
};
template<input_range V, auto F>
requires view<V> &&
regular_invocable<decltype(F)&, range_reference_t<V>> &&
can-reference<invoke_result_t<decltype(F)&, range_reference_t<V>>>
template<bool Const>
class project_view<V, F>::iterator
: public std::proxy_iterator_interface<
iterator_to_tag_t<iterator_t<maybe-const<Const, V>>>,
invoke_result_t<decltype(F)&, range_reference_t<V>>>
{
public:
using reference_type = invoke_result_t<decltype(F)&, range_reference_t<V>>;
private:
using iterator_type = iterator_t<maybe-const<Const, V>>; // exposition only
friend std::iterator_interface_access;
iterator_type & base_reference() noexcept { return it_; } // exposition only
iterator_type base_reference() const { return it_; } // exposition only

iterator_type it_ = iterator_type(); // exposition only
friend project_view<V, F>::sentinel<Const>;
public:
constexpr iterator() = default;
constexpr iterator(iterator_type it) : it_(std::move(it)) {}
constexpr reference_type operator*() const { return F(*it_); }
};
template<input_range V, auto F>
requires view<V> &&
regular_invocable<decltype(F)&, range_reference_t<V>> &&
can-reference<invoke_result_t<decltype(F)&, range_reference_t<V>>>
template<bool Const>
class project_view<V, F>::sentinel
{
using Base = maybe-const<Const, V>; // exposition only
using sentinel_type = sentinel_t<Base>; // exposition only
sentinel_type end_ = sentinel_type(); // exposition only
public:
constexpr sentinel() = default;
constexpr explicit sentinel(sentinel_type end) : end_(std::move(end)) {}
constexpr sentinel(sentinel<!Const> i) requires Const
&& convertible_to<sentinel_t<V>, sentinel_t<Base>>;
constexpr sentinel_type base() const { return end_; }
template<bool OtherConst>
requires sentinel_for<sentinel_type, iterator_t<maybe-const<OtherConst, V>>>
friend constexpr bool operator==(const iterator<OtherConst> & x, const sentinel & y)
{ return x.it_ == y.end_; }
template<bool OtherConst>
requires sized_sentinel_for<sentinel_type, iterator_t<maybe-const<OtherConst, V>>>
friend constexpr range_difference_t<maybe-const<OtherConst, V>>
operator-(const iterator<OtherConst> & x, const sentinel & y)
{ return x.it_ - y.end_; }
template<bool OtherConst>
requires sized_sentinel_for<sentinel_type, iterator_t<maybe-const<OtherConst, V>>>
friend constexpr range_difference_t<maybe-const<OtherConst, V>>
operator-(const sentinel & y, const iterator<OtherConst> & x)
{ return y.end_ - x.it_; }
};
template<class R, auto F>
project_view(R &&) -> project_view<views::all_t<R>, F>;
}
project_view
presents a view
of an underlying sequence after applying a transformation function to
each element.
The name views::project
denotes a range adaptor object ([range.adaptor.object]). Given
subexpression E
and the non-type
template parameter F
, let
A
be:
template<class R>
using A = project_view<R, F>;
views::project<F>(E)
is
expression-equivalent to
A(E)
.
[Example 1:
vector<int> is{ 0, 1, 2, 3, 4 };
struct f {
    static int operator()(int i) { return i * i; }
};
auto squares = views::project<f{}>(is);
for (int i : squares)
    cout << i << ' '; // prints 0 1 4 9 16
— end example]
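The core of such a view is an iterator that applies `F` on dereference and otherwise forwards to the underlying iterator. The following is only a sketch of that idea, assuming nothing beyond C++17; `projecting_iterator`, `to_char32`, and `collect_as_char32` are hypothetical names, and none of the `view_interface` or sentinel machinery specified above is reproduced.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical minimal iterator: applies the callable F on dereference and
// forwards increment and comparison to the underlying iterator.
template<class I, auto F>
struct projecting_iterator {
    I it;

    auto operator*() const { return F(*it); }
    projecting_iterator& operator++() { ++it; return *this; }

    friend bool operator==(const projecting_iterator& a, const projecting_iterator& b) {
        return a.it == b.it;
    }
    friend bool operator!=(const projecting_iterator& a, const projecting_iterator& b) {
        return !(a == b);
    }
};

// Illustrative projection: reinterpret an int as a UTF-32 code unit.
constexpr char32_t to_char32(int i) { return static_cast<char32_t>(i); }

// Walks the projected sequence lazily and collects the results.
inline std::u32string collect_as_char32(const std::vector<int>& v) {
    using iter = projecting_iterator<std::vector<int>::const_iterator, to_char32>;
    std::u32string out;
    for (iter i{v.begin()}, last{v.end()}; i != last; ++i)
        out.push_back(*i);
    return out;
}
```

Because `F` is a template parameter rather than a data member, the iterator stores nothing but the underlying iterator.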
namespace std::uc {
template<class I>
constexpr auto iterator-to-tag() { // exposition only
if constexpr (random_access_iterator<I>) {
return random_access_iterator_tag{};
} else if constexpr (bidirectional_iterator<I>) {
return bidirectional_iterator_tag{};
} else if constexpr (forward_iterator<I>) {
return forward_iterator_tag{};
} else {
return input_iterator_tag{};
}
}
template<class I>
using iterator-to-tag_t = decltype(iterator-to-tag<I>()); // exposition only
template<class Char>
struct cast-to-charn { // exposition only
static constexpr Char operator()(Char c) { return c; }
};
template<class V>
using char8_view = project_view<V, cast-to-charn<char8_t>{}>;
template<class V>
using char16_view = project_view<V, cast-to-charn<char16_t>{}>;
template<class V>
using char32_view = project_view<V, cast-to-charn<char32_t>{}>;
inline constexpr unspecified as_char8_t;
inline constexpr unspecified as_char16_t;
inline constexpr unspecified as_char32_t;
}
char8_view
produces a view of
char8_t
elements from another
view. char16_view
produces a
view of char16_t
elements from
another view. char32_view
produces a view of char32_t
elements from another view. Let
charN_view
denote any one of the
views char8_view
,
char16_view
, and
char32_view
.
The names as_char8_t
,
as_char16_t
, and
as_char32_t
denote range adaptor
objects ([range.adaptor.object]).
as_char8_t
produces
char8_view
s,
as_char16_t
produces
char16_view
s, and
as_char32_t
produces
char32_view
s. Let
as_charN_t
denote any one of
as_char8_t
,
as_char16_t
, and
as_char32_t
, and let
V
denote the
charN_view
associated with that
object. Let E
be an expression
and let T
be remove_cvref_t<decltype((E))>
.
Let F
be the
format
enumerator associated
with as_charN_t
. If
decltype((E))
does not model
utf_pointer<T>
and if
charN_view(E)
is ill-formed,
as_charN_t(E)
is ill-formed. The
expression as_charN_t(E)
is
expression-equivalent to:
If T
is a specialization
of empty_view
([range.empty.view]), then empty_view<format-to-type-t<F>>{}
.
Otherwise, if
is_pointer_v<T>
is
true
, then V(ranges::subrange(E, null_sentinel))
.
Otherwise, V(E)
.
[Example 1:
vector<int> v = {'U', 'n', 'i', 'c', 'o', 'd', 'e'};
for (auto c : v | uc::as_char8_t) {
    static_assert(same_as<decltype(c), char8_t>);
    cout << (char)c << ' '; // prints U n i c o d e
}
— end example]
as_charN_t
requires
utf_pointer
It may seem odd that
foo | as_charN_t
is well formed
if decltype(foo)
is
std::vector<int>
, but
ill-formed if decltype(foo)
is
int *
. However, this is
intentional.
If you write std::vector<int>{/* ... */} | as_char32_t
,
the result is always a view whose value type is
char32_t
. If you write:
int * ptr = /* ... */;
auto v = ptr | as_char32_t;
v
may be a view of
char32_t
values that ends in a
null terminator, or it may be an error that results in UB.
Null-terminated strings are very common, but null-terminated strings of
a non-character type are rare. It seems far safer and more idiomatic to
restrict the pointer-adaptation case only to
utf_pointer
s.
The macro
CODE_UNIT_CONCEPT_OPTION_2
is
used below to indicate the two options for how to define
format-of
, based on the
definition of code_unit
.
namespace std::uc {
template<typename T>
constexpr format format-of() { // exposition only
if constexpr (same_as<T, char8_t>) {
return format::utf8;
} else if constexpr (same_as<T, char16_t>) {
return format::utf16;
} else if constexpr (same_as<T, char32_t>) {
return format::utf32;
#if CODE_UNIT_CONCEPT_OPTION_2
} else if constexpr (same_as<T, char>) {
return format::utf8;
} else if constexpr (same_as<T, wchar_t>) {
return wchar-t-format;
#endif
}
}
template<utf_range V>
requires ranges::view<V> && ranges::forward_range<V>
class unpacking_view : public ranges::view_interface<unpacking_view<V>> {
V base_ = V(); // exposition only
public:
constexpr unpacking_view() requires default_initializable<V> = default;
constexpr unpacking_view(V base) : base_(std::move(base)) {}
constexpr V base() const & requires copy_constructible<V> { return base_; }
constexpr V base() && { return std::move(base_); }
constexpr auto code_units() const noexcept {
auto unpacked = uc::unpack_iterator_and_sentinel(ranges::begin(base_), ranges::end(base_));
return ranges::subrange(unpacked.first, unpacked.last);
}
constexpr auto begin() { return ranges::begin(code_units()); }
constexpr auto begin() const { return ranges::begin(code_units()); }
constexpr auto end() { return ranges::end(code_units()); }
constexpr auto end() const { return ranges::end(code_units()); }
};
template<class R>
unpacking_view(R &&) -> unpacking_view<views::all_t<R>>;

template<class T>
constexpr bool is-charn-view = false; // exposition only
template<class V>
constexpr bool is-charn-view<char8_view<V>> = true; // exposition only
template<class V>
constexpr bool is-charn-view<char16_view<V>> = true; // exposition only
template<class V>
constexpr bool is-charn-view<char32_view<V>> = true; // exposition only
template<format Format, utf_range V>
requires ranges::view<V>
class utf_view : public ranges::view_interface<utf_view<Format, V>> {
V base_ = V(); // exposition only
template<format FromFormat, class I, class S>
static constexpr auto make_begin(I first, S last) { // exposition only
if constexpr (bidirectional_iterator<I>) {
return utf_iterator<FromFormat, Format, I, S>{first, first, last};
} else {
return utf_iterator<FromFormat, Format, I, S>{first, last};
}
}
template<format FromFormat, class I, class S>
static constexpr auto make_end(I first, S last) { // exposition only
if constexpr (!same_as<I, S>) {
return last;
} else if constexpr (bidirectional_iterator<I>) {
return utf_iterator<FromFormat, Format, I, S>{first, last, last};
} else {
return utf_iterator<FromFormat, Format, I, S>{last, last};
}
}
public:
constexpr utf_view() requires default_initializable<V> = default;
constexpr utf_view(V base) : base_{std::move(base)} {}
constexpr V base() const & requires copy_constructible<V> { return base_; }
constexpr V base() && { return std::move(base_); }
constexpr auto begin() {
constexpr format from_format = format-of<ranges::range_value_t<V>>();
if constexpr(is-charn-view<V>) {
return make_begin<from_format>(ranges::begin(base_.base()), ranges::end(base_.base()));
} else {
return make_begin<from_format>(ranges::begin(base_), ranges::end(base_));
}
}
constexpr auto begin() const {
constexpr format from_format = format-of<ranges::range_value_t<const V>>();
if constexpr(is-charn-view<V>) {
return make_begin<from_format>(ranges::begin(base_.base()), ranges::end(base_.base()));
} else {
return make_begin<from_format>(ranges::begin(base_), ranges::end(base_));
}
}
constexpr auto end() {
constexpr format from_format = format-of<ranges::range_value_t<V>>();
if constexpr(is-charn-view<V>) {
return make_end<from_format>(ranges::begin(base_.base()), ranges::end(base_.base()));
} else {
return make_end<from_format>(ranges::begin(base_), ranges::end(base_));
}
}
constexpr auto end() const {
constexpr format from_format = format-of<ranges::range_value_t<const V>>();
if constexpr(is-charn-view<V>) {
return make_end<from_format>(ranges::begin(base_.base()), ranges::end(base_.base()));
} else {
return make_end<from_format>(ranges::begin(base_), ranges::end(base_));
}
}
friend ostream & operator<<(ostream & os, utf_view v);
friend wostream & operator<<(wostream & os, utf_view v);
};
template<format Format, class R>
utf_view(R &&) -> utf_view<Format, views::all_t<R>>;

template<class V>
using utf8_view = utf_view<format::utf8, V>;
template<class V>
using utf16_view = utf_view<format::utf16, V>;
template<class V>
using utf32_view = utf_view<format::utf32, V>;
}
namespace std::ranges {
template<class V>
inline constexpr bool enable_borrowed_range<uc::unpacking_view<V>> = enable_borrowed_range<V>;
template<uc::format Format, class V>
inline constexpr bool enable_borrowed_range<uc::utf_view<Format, V>> = enable_borrowed_range<V>;
}
namespace std::uc {
template<class R>
constexpr decltype(auto) unpack-range(R && r) {
using T = remove_cvref_t<R>;
if constexpr (ranges::forward_range<T>) {
auto unpacked = uc::unpack_iterator_and_sentinel(ranges::begin(r), ranges::end(r));
if constexpr (is_bounded_array_v<T>) {
constexpr auto n = extent_v<T>;
if (n && !r[n - 1])
--unpacked.last;
return ranges::subrange(unpacked.first, unpacked.last);
} else if constexpr (!same_as<decltype(unpacked.first), ranges::iterator_t<R>> ||
!same_as<decltype(unpacked.last), ranges::sentinel_t<R>>) {
return unpacking_view(forward<R>(r));
} else {
return forward<R>(r);
}
} else {
return forward<R>(r);
}
}
inline constexpr unspecified as_utf8;
inline constexpr unspecified as_utf16;
inline constexpr unspecified as_utf32;
}
unpacking_view
knows how to
unpack a range into code unit iterators using
unpack_iterator_and_sentinel
.
utf_view
produces a view in
UTF format Format
of the
elements from another UTF view.
utf8_view
produces a UTF-8 view
of the elements from another UTF view.
utf16_view
produces a UTF-16
view of the elements from another UTF view.
utf32_view
produces a UTF-32
view of the elements from another UTF view. Let
utfN_view
denote any one of the
views utf8_view
,
utf16_view
, and
utf32_view
.
The names as_utf8
,
as_utf16
, and
as_utf32
denote range adaptor
objects ([range.adaptor.object]).
as_utf8
produces
utf8_view
s,
as_utf16
produces
utf16_view
s, and
as_utf32
produces
utf32_view
s. Let
as_utfN
denote any one of
as_utf8
,
as_utf16
, and
as_utf32
, and let
V
denote the
utfN_view
associated with that
object. Let charN_view
denote
any one of char8_view
,
char16_view
, and
char32_view
. Let
E
be an expression and let
T
be remove_cvref_t<decltype((E))>
.
Let F
be the
format
enumerator associated
with as_utfN
. If
decltype((E))
does not model
utf_range_like
,
as_utfN(E)
is ill-formed. The
expression as_utfN(E)
is
expression-equivalent to:
If T
is a specialization
of empty_view
([range.empty.view]), then empty_view<format-to-type-t<F>>{}
.
Otherwise, if T
is a
specialization of utfN_view
,
then V(E.base())
.
Otherwise, if T
is a
specialization of charN_view
,
then V(E)
.
Otherwise, if
is_pointer_v<T>
is
true
, then V(ranges::subrange(E, null_sentinel))
.
Otherwise,
V(unpack-range(E))
.
[Example 1:
std::u32string s = U"Unicode";
for (char8_t c : s | std::uc::as_utf8)
    cout << (char)c << ' '; // prints U n i c o d e
— end example]
[Example 2:
auto * s = L"is weird.";
for (char8_t c : s | std::uc::as_utf8)
    cout << (char)c << ' '; // prints i s w e i r d .
— end example]
The ostream
and
wostream
stream operators
transcode the utf_view
to UTF-8
and UTF-16, respectively (if transcoding is needed). The
wostream
overload is only
defined on Windows.
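For readers who want to see what "transcode to UTF-8" amounts to at the code unit level, here is a small eager sketch using only the standard library. It is not the proposed stream operator; `append_utf8` and `to_utf8` are hypothetical helper names, and the choice to map surrogates and out-of-range values to U+FFFD is an assumption that mirrors the replacement-character policy described elsewhere in this paper.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of one code point to `out`.
// Surrogates and values above U+10FFFF are encoded as U+FFFD instead.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Eagerly transcodes a whole UTF-32 string to UTF-8 bytes held in a std::string.
inline std::string to_utf8(std::u32string_view in) {
    std::string out;
    for (char32_t cp : in)
        append_utf8(out, cp);
    return out;
}
```

The proposed views do the same conversion lazily, one code unit at a time, so no intermediate string is needed before writing to the stream.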
utfN_view
s views plus
utf_view
The views in std::ranges
are
constrained to accept only
std::ranges::view
template
parameters. However, they accept
std::ranges::viewable_range
s in
practice, because they each have a deduction guide that looks like
this:
template<class R>
utf8_view(R &&) -> utf8_view<views::all_t<R>>;
It’s not possible to make this work for
utf_view
, since to use it you
must supply a format
NTTP. So,
we need the utfN_view
s. It might
be possible to make utf_view
an
exposition-only implementation detail, but I think some users might find
use for it, especially in generic contexts. For instance:
template<std::uc::format F, typename V>
auto f(std::uc::utf_view<F, V> const & view) {
// Use F, V, and view here....
}
unpacking_view
For a particular V
being
adapted by as_utfN
, there are
two cases: 1) V
is unpackable
(that is, unpacking produces different iterator/sentinel types than what
you had before unpacking), and 2)
V
is not unpackable. The second
case is easy; since V
is already
unpacked, you just construct a
utfN_view
from
V
, and you’re done. The first
case is a little harder. For that case, we either need to let
utfN_view
know statically that
it must do some unpacking, or we must introduce yet another view that
does it for us. Introducing a view is the right answer, because
introducing an NTTP to utfN_view
would be unergonomic. For instance:
template<typename V>
void f(std::uc::utf32_view<V, /* ??? */> const & utf32) {
// ...
}
What do we write for the
/* ??? */
– the NTTP that
indicates whether V
is already
unpacked or not? We have to do a nontrivial amount of work involving
V
to know what to write
there.
So, we have unpacking_view
instead. When V
is unpackable,
as_utfN
returns a utfN_view<unpacking_view<V>>
.
In the previous revision of this paper, the
as_utfN
adaptor unpacked the
adapted range most of the time, except for the one case where it could
not. That case was when r
in
r | as_utfN
is an rvalue whose
begin()
and
end()
are unpackable. In that
case, we needed a special-case type called
unpacking_owning_view
that would
store r
and unpack
r.begin()
and
r.end()
. This is not ideal,
because doing the unpacking in the adaptor loses information. It loses
information because the unpacked view used to construct
utfN_view
is a
ranges::subrange
, not the
original range. For example, if you start with an lvalue
vector
, then keeping unpacking_view<ref_view<vector<T>>>
means that you can get access to the
vector
itself with a chain of
base()
calls. You lose that if
it’s a subrange<typename vector<T>::iterator, typename vector<T>::iterator>
.
struct my_text_type
{
    my_text_type() = default;
    my_text_type(std::u8string utf8) : utf8_(std::move(utf8)) {}

    auto begin() const {
        return std::uc::utf_8_to_32_iterator(
            utf8_.begin(), utf8_.begin(), utf8_.end());
    }
    auto end() const {
        return std::uc::utf_8_to_32_iterator(
            utf8_.begin(), utf8_.end(), utf8_.end());
    }

private:
    std::u8string utf8_;
};

static_assert(std::is_same_v<
    decltype(my_text_type(u8"text") | std::uc::as_utf16),
    std::uc::utf16_view<std::uc::unpacking_view<std::ranges::owning_view<my_text_type>>>>);

static_assert(std::is_same_v<
    decltype(u8"text" | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::subrange<const char8_t *>>>);

static_assert(std::is_same_v<
    decltype(std::u8string(u8"text") | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::owning_view<std::u8string>>>);

std::u8string const str = u8"text";

static_assert(std::is_same_v<
    decltype(str | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::ref_view<std::u8string const>>>);

static_assert(std::is_same_v<
    decltype(str.c_str() | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::subrange<const char8_t *, std::uc::null_sentinel_t>>>);

static_assert(std::is_same_v<
    decltype(std::ranges::empty_view<int>{} | std::uc::as_char16_t),
    std::ranges::empty_view<char16_t>>);

std::u16string str2 = u"text";

static_assert(std::is_same_v<
    decltype(str2 | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::ref_view<std::u16string>>>);

static_assert(std::is_same_v<
    decltype(str2.c_str() | std::uc::as_utf16),
    std::uc::utf16_view<std::ranges::subrange<const char16_t *, std::uc::null_sentinel_t>>>);
utf_view
always uses
utf_iterator
, even in UTF-N to
UTF-N cases
You might expect that if r
in
r | as_utfN
is already in UTF-N,
r | as_utfN
might just be
r
. This is not what the
as_utfN
adaptors do, though.
The adaptors each produce a view
utfv
that stores a view of type
V
, where
V
is made from the result of
unpacking r
. Further,
utfv.begin()
is always a
specialization of utf_iterator
.
utfv.end()
is also a
specialization of utf_iterator
(if common_range<V>
), or
otherwise the sentinel value for
V
.
This gives r | as_utfN
some
nice, consistent properties. With the exception of
empty_view<T>{} | as_utfN
,
the following are always true:
r | as_utfN
produces
well-formed UTF. Since the default
ErrorHandler
template parameter
to utf_iterator, use_replacement_character, is
always used, any ill-formed UTF is replaced with
replacement_character
. This is
true even when the input was already UTF-N. Remember, the input could
have been UTF-N but had ill-formed UTF in it.
r | as_utfN
has a
consistent API. If r | as_utfN
were sometimes r
, and since
r
may be a reference to an
array, you’d have to use
std::ranges::begin(r)
and
::end(r)
all the time. However,
you’d probably write r.begin()
and r.end()
, only to one day get
bitten by an array-reference
r
.
r | as_utfN
is
formattable/printable. This means you can adapt anything that can be
UTF-transcoded to do I/O in a consistent way. For example:
auto str0 = std::format("{}", std::u8string{}); // Error: ill-formed!
auto str1 = std::format("{}", std::u8string{} | std::uc::as_utf8); // Ok.
utf_view
specialization of
formatter
These should be added to the list of “the debug-enabled string type
specializations” in [format.formatter.spec]. This allows
utf_view
and
utfN_view
to be used in
std::format()
and
std::print()
. The intention is
that the formatter will transcode to UTF-8 if the formatter’s
CharT
is
char
, or to UTF-16 or UTF-32
(which one is implementation defined) if the formatter’s
CharT
is
wchar_t
– if transcoding is
necessary at all.
namespace std {
template<uc::format Format, class V, class CharT>
struct formatter<uc::utf_view<Format, V>, CharT> {
private:
formatter<basic_string<CharT>, CharT> underlying_; // exposition only

public:
template<class ParseContext>
constexpr typename ParseContext::iterator
parse(ParseContext& ctx);

template<class FormatContext>
typename FormatContext::iterator
format(const uc::utf_view<Format, V>& view, FormatContext& ctx) const;

constexpr void set_debug_format() noexcept;
};
}
template<class ParseContext>
constexpr typename ParseContext::iterator
parse(ParseContext& ctx);
Effects: Equivalent to:
return underlying_.parse(ctx);
template<class FormatContext>
typename FormatContext::iterator
format(const uc::utf_view<Format, V>& view, FormatContext& ctx) const;
Effects: Equivalent to:
auto adaptor = see below;
return underlying_.format(basic_string<CharT>(from_range, view | adaptor), ctx);
adaptor
is
uc::as_utf8
if
CharT
is
char
. Otherwise, it is
implementation defined whether
adaptor
is
uc::as_utf16
or
uc::as_utf32
.
constexpr void set_debug_format() noexcept;
Effects: Equivalent to:
underlying_.set_debug_format();
Add the feature test macro
__cpp_lib_unicode_transcoding
.
None of the proposed interfaces is subject to change in future versions of Unicode; each relates to the guaranteed-stable subset. Just sayin’.
None of the proposed interfaces allocates or throws, unless the user
supplies a throwing ErrorHandler
template parameter to
utf_iterator
.
The proposed interfaces allow users to choose amongst multiple convenience-vs-compatibility tradeoffs. Explicitly, they are:
| as_utfN
adaptor
use, use the transcoding views.
All the transcoding iterators allow you access to the underlying
iterator via .base()
(except
when adapting an input iterator), following the convention of the
iterator adaptors already in the standard.
The transcoding views are lazy, as you’d expect. They also compose
with the standard view adaptors, so just transcoding at most 10 UTF-16
code units out of some UTF can be done with foo | std::uc::as_utf16 | std::ranges::views::take(10)
.
Error handling is explicitly configurable in the transcoding
iterators. This gives control to those who want to do something other
than the default. The default, according to Unicode, is to produce a
replacement character (0xfffd
)
in the output when broken UTF encoding is seen in the input. This is
what all these interfaces do, unless you configure one of the iterators
as mentioned above.
The production of replacement characters as error-handling strategy is good for memory compactness and safety. It allows us to store all our text as UTF-8 (or, less compactly, as UTF-16), and then process code points as transcoding views. If an error occurs, the transcoding views will simply produce a replacement character; there is no danger of UB.
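As an illustration of this strategy, the sketch below decodes UTF-8 into UTF-32 eagerly and emits U+FFFD wherever the input is ill-formed. It is not the proposed `utf_iterator`; `decode_utf8_lossy` is a hypothetical name, and the number of replacement characters produced for a given ill-formed sequence is a simplification that may differ from what the proposed iterators produce. The point is only that every read is bounds-checked and every error becomes data, not UB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decodes UTF-8 bytes into UTF-32, substituting U+FFFD for ill-formed input.
// Never reads past the end of `in`.
inline std::u32string decode_utf8_lossy(std::string_view in) {
    constexpr char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80) {                         // ASCII
            out.push_back(b0);
            ++i;
            continue;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {   // lead byte of a 2-unit sequence
            len = 2; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {   // lead byte of a 3-unit sequence
            len = 3; cp = b0 & 0x0F;
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {   // lead byte of a 4-unit sequence
            len = 4; cp = b0 & 0x07;
        } else {                                 // stray continuation or invalid lead
            out.push_back(replacement);
            ++i;
            continue;
        }
        // Consume continuation bytes, stopping at the end of input or at the
        // first byte that is not a continuation byte.
        std::size_t j = 1;
        for (; j < len && i + j < in.size(); ++j) {
            const unsigned char b = static_cast<unsigned char>(in[i + j]);
            if ((b & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (b & 0x3F);
        }
        const bool ok = j == len
            && !(len == 3 && cp < 0x800)                          // overlong
            && !(len == 4 && (cp < 0x10000 || cp > 0x10FFFF))     // overlong or too large
            && !(cp >= 0xD800 && cp <= 0xDFFF);                   // surrogate
        out.push_back(ok ? cp : replacement);
        i += j;
    }
    return out;
}
```

With this function, a string that ends in a lone `0xe0` decodes to its valid prefix followed by a single U+FFFD.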
A null-terminated pointer p
to an 8-, 16-, or 32-bit string of code units is considered the implicit
range [p, null_sentinel)
. This
makes user code much more natural;
"foo" | as_utf16
,
"foo"sv | as_utf16
,
and "foo"s | as_utf16
are roughly equivalent (though the iterator type of the resulting view
may differ).
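The mechanism behind the implicit range is a sentinel whose comparison dereferences the pointer and checks for a zero code unit. A minimal C++17 sketch of that idea follows; `null_sentinel_sketch` and `count_code_units` are hypothetical names, and this is not the specification of the proposed `null_sentinel_t`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel type: a pointer "equals" it when it points at a zero code unit.
struct null_sentinel_sketch {};

template<class T>
bool operator==(const T* p, null_sentinel_sketch) { return *p == T{}; }

template<class T>
bool operator!=(const T* p, null_sentinel_sketch s) { return !(p == s); }

// Counts the code units in the range [p, null_sentinel_sketch{}).
// Precondition: p points into a null-terminated array.
template<class T>
std::size_t count_code_units(const T* p) {
    std::size_t n = 0;
    for (; p != null_sentinel_sketch{}; ++p)
        ++n;
    return n;
}
```

The same loop shape works for `char`, `char16_t`, and `char32_t` pointers, which is why a single sentinel type suffices for all three UTFs.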
Iterators are constructed from more than one underlying iterator. To
do iteration in many text-handling contexts, you need to know the
beginning and the end of the range you are iterating over, just to be
able to do iteration correctly. Note that this is not a safety issue,
but a correctness one. For example, say we have a string
s
of UTF-8 code units that we
would like to iterate over to produce UTF-32 code points. If the last
code unit in s
is
0xe0
, we should expect two more
code units to follow. They are not present, though, because
0xe0
is the last code unit. Now
consider how you would implement
operator++()
for an iterator
iter
that transcodes from UTF-8
to UTF-32. If you advance far enough to get the next UTF-32 code point
in each call to operator++()
,
you may run off the end of s
when you find 0xe0
and try to
read two more code units. Note that it does not matter that
iter
probably comes from a range
with an end-iterator or sentinel as its mate; inside
iter
’s
operator++()
this is no help.
iter
must therefore have the
end-iterator or sentinel as a data member. The same logic applies to the
other end of the range if iter
is bidirectional — it must also have the iterator to the start of the
underlying range as a data member. This unfortunate reality comes up
over and over in the proposed iterators, not just the ones that are UTF
transcoding iterators. This is why iterators in this proposal (and the
ones to come) usually consist of three underlying iterators.
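The `0xe0` scenario can be reproduced with a very small forward-only cursor. This is a sketch under stated assumptions, not the proposed `utf_iterator`: `announced_length` and `utf8_cursor` are hypothetical names, the cursor only skips code units and does not produce code points, and a bidirectional version would additionally need the start of the range.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units a UTF-8 lead byte announces; 1 for anything else.
constexpr std::size_t announced_length(unsigned char lead) {
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 1;
}

// Forward-only cursor over UTF-8 code units that steps one code point at a time.
struct utf8_cursor {
    const unsigned char* it;
    const unsigned char* last;  // without this member, operator++ could not stop in time

    // Precondition: it != last.
    utf8_cursor& operator++() {
        const std::size_t n = announced_length(*it);
        ++it;
        // Skip the continuation bytes the lead byte promised, but never step
        // past `last` and never skip a byte that is not a continuation byte.
        for (std::size_t k = 1; k < n && it != last && (*it & 0xC0) == 0x80; ++k)
            ++it;
        return *this;
    }
};
```

Given the two code units `'a'` and `0xe0`, two increments leave the cursor exactly at `last`; a version without the `last` member would have tried to read two code units that are not there.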
All the interfaces proposed here have been implemented, and re-implemented, several times over the last 5 years or so. They are part of a proposed (but not yet accepted!) Boost library, Boost.Text.
The library has hundreds of stars, though I’m not sure how many users that equates to. All of the interfaces proposed here are among the best-exercised in the library. There are comprehensive tests for all the proposed entities, and those entities are used as the foundation upon which all the other library entities are composed.
Though there are a lot of individual entities proposed here, at one time or another I have needed each one of them, though maybe not in every UTF-N -> UTF-M permutation. Those transcoding permutations are there mostly for completeness. I have only ever needed UTF-8 <-> UTF-32 in any of my work that uses Unicode. Frequent Windows users will also need to convert to and from UTF-16 sometimes, because that is the UTF that the OS APIs use.