With ASCII, it's very simple to find the next character in a string: you can just increment an index (i++) or char pointer (pch++). But what happens when you have Unicode strings to process?
Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16
by Giovanni Dicanio
From the article:
Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit byte units. On the other hand, UTF-16 is somewhat simpler: In fact, Unicode code points are encoded in UTF-16 using just one or two 16-bit code units.
(...) The functions have the following prototypes:
// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
const std::string& str,
size_t index
);
// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
const std::wstring& input,
size_t index
);

Add a Comment
Comments are closed.