Finding the Next Code Point in Unicode Strings -- Giovanni Dicanio

With ASCII, it's very simple to find the next character in a string: you can just increment an index (i++) or char pointer (pch++). But what happens when you have Unicode strings to process?

Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16

by Giovanni Dicanio

From the article:

Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit byte units. On the other hand, UTF-16 is somewhat simpler: In fact, Unicode code points are encoded in UTF-16 using just one or two 16-bit code units.

(...) The functions have the following prototypes:

// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
    const std::string& str,
    size_t index
);

// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
    const std::wstring& input,
    size_t index
);

 

Add a Comment

Comments are closed.

Comments (0)

There are currently no comments on this entry.