Finding the Next Code Point in Unicode Strings -- Giovanni Dicanio
With ASCII, it's very simple to find the next character in a string: you can just increment an index (i++) or char pointer (pch++). But what happens when you have Unicode strings to process?
Finding the Next Unicode Code Point in Strings: UTF-8 vs. UTF-16
by Giovanni Dicanio
From the article:
Both UTF-16 and UTF-8 are variable-length encodings. In particular, UTF-8 encodes each valid Unicode code point using one to four 8-bit byte units. On the other hand, UTF-16 is somewhat simpler: In fact, Unicode code points are encoded in UTF-16 using just one or two 16-bit code units.
(...) The functions have the following prototypes:
// Returns the next Unicode code point and number of bytes consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-8 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf8(
const std::string& str,
size_t index
);
// Returns the next Unicode code point and the number of UTF-16 code units consumed.
// Throws std::out_of_range if index is out of bounds or string ends prematurely.
// Throws std::invalid_argument on invalid UTF-16 sequence.
[[nodiscard]] std::pair<char32_t, size_t> NextCodePointUtf16(
const std::wstring& input,
size_t index
);

Memory-safety vulnerabilities remain one of the most persistent and costly risks in large-scale C++ systems, even in well-tested production code. This article explores how hardening the C++ Standard Library—specifically LLVM’s libc++—can deliver meaningful security and reliability gains at massive scale with minimal performance overhead.
This post is in response to two claims about coroutines: 1) Their reference function parameters may become dangling too easily, and 2) They are indistinguishable from regular functions from the declaration alone.
xplore how the C++ standard evolved across versions with interactive side-by-side diffs