std::split(): An algorithm for splitting strings
ISO/IEC JTC1 SC22 WG21 N3593 - 2013-03-13
Greg Miller, [email protected]

Splitting strings into substrings is a common task in many applications. When the need arises in C++, programmers must search for an existing solution or write one of their own. A typical solution might look like the following:
std::vector<std::string> my_split(const std::string& text, const std::string& delimiter);
A straightforward implementation of the above function would likely use std::string::find or std::string::find_first_of to identify substrings and move from one to the next, building the vector to return. This is a fine solution for simple needs, but it is deficient in the following ways:
std::set<string>
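For reference, the straightforward implementation described above can be sketched as follows. This is only an illustration of the std::string::find approach, not code from the proposal; the handling of an empty delimiter is an assumption made here so the loop always advances.

```cpp
#include <string>
#include <vector>

// Illustrative implementation of the my_split() signature shown above.
std::vector<std::string> my_split(const std::string& text,
                                  const std::string& delimiter) {
  std::vector<std::string> result;
  if (delimiter.empty()) {  // Assumption: an empty delimiter means "do not split".
    result.push_back(text);
    return result;
  }
  std::string::size_type start = 0;
  for (;;) {
    const std::string::size_type found = text.find(delimiter, start);
    if (found == std::string::npos) {
      result.push_back(text.substr(start));  // Last piece; may be empty.
      return result;
    }
    result.push_back(text.substr(start, found - start));
    start = found + delimiter.size();  // Continue after the delimiter.
  }
}
```

Note that every piece is copied into a new std::string and the result type is fixed to a vector, which is the kind of rigidity the rest of the paper sets out to remove.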
Google developed a flexible and fast string-splitting API to address these deficiencies. The new API has been well received by internal engineers developing serious applications. The rest of this paper describes Google's string splitting API as it might appear as a C++ standard.
This proposal depends on the following proposals:
std::string_view
The first version of this proposal was N3430, which included features such as Predicates and implicit result type conversions. A number of these complicated features were removed in the following proposal, N3510. The following are the major changes in this revision.
Not Found is now indicated by a std::string_view referring to the input text's end() iterator. Alternative options are also listed.
A Delimiter's find() member function now takes a size_t pos argument indicating where to start looking for the next delimiter.
namespace std {
  template <typename Delimiter>
  auto split(std::string_view text, Delimiter d) -> unspecified;
}
The std::split() algorithm takes a std::string_view and a Delimiter as arguments, and it returns a Range of std::string_view objects as output. The std::string_view objects in the returned Range will refer to substrings of the input text. The Delimiter object defines the boundaries between the returned substrings.
The general notion of a delimiter (aka separator) is not new. A delimiter (little d) marks the boundary between two substrings in a larger string. With the std::split() API comes the generalized concept of a Delimiter (big D). A Delimiter is an object with a find() member function that can find the next occurrence of itself in a given std::string_view starting at the given position. Objects that conform to the Delimiter concept represent specific kinds of delimiters. Some examples of Delimiter objects are an object that finds a specific character in a string, an object that finds a substring in a string, or even an object that finds regular expression matches in a given string.
The result of a Delimiter's find() member function must be a std::string_view referring to one of the following:
A substring of find()'s argument text referring to the delimiter/separator that was found.
A zero-length std::string_view referring to find()'s argument's end iterator (e.g., std::string_view(input_text.end(), 0)). This indicates that the delimiter/separator was not found.
[Footnote: An alternative to having a Delimiter's find() function return a std::string_view is to instead have it return a std::pair<size_t, size_t> where the pair's first member is the position of the found delimiter, and the second member is the length of the found delimiter. In this case, Not Found could be represented as std::make_pair(std::string_view::npos, 0).
—end footnote]
The following example shows a simple object that models the Delimiter concept. It has a find() member function that is responsible for finding the next occurrence of the given character in the given text starting at the given position.
struct char_delimiter {
  char c_;
  explicit char_delimiter(char c) : c_(c) {}
  std::string_view find(std::string_view text, size_t pos) const {
    std::string_view substr = text.substr(pos);
    size_t found = substr.find(c_);
    if (found == std::string_view::npos)
      return std::string_view(substr.end(), 0);  // Not found.
    // Return a string_view referring to the c_ that was found in the input string.
    return substr.substr(found, 1);
  }
};
The following shows how the above delimiter could be used to split a string:
std::vector<std::string_view> v{std::split("a-b-c", char_delimiter('-'))};
// v is {"a", "b", "c"}
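To make the interaction between std::split() and a Delimiter concrete, the following is a minimal sketch of a loop that drives find() and collects the pieces. It is written against C++17 std::string_view rather than the proposed API. The names split_sketch and char_delimiter_sketch are hypothetical. The sketch returns a vector eagerly, whereas the proposal returns an unspecified Range, and it assumes Not Found is signalled by a zero-length view positioned at the end of the text.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Same idea as char_delimiter above, using only C++17 string_view members.
struct char_delimiter_sketch {
  char c;
  std::string_view find(std::string_view text, std::size_t pos) const {
    const std::size_t found = text.find(c, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Zero-length view at the end: Not Found.
    return text.substr(found, 1);
  }
};

// Hypothetical eager stand-in for std::split().
template <typename Delimiter>
std::vector<std::string_view> split_sketch(std::string_view text, Delimiter d) {
  std::vector<std::string_view> pieces;
  std::size_t pos = 0;
  for (;;) {
    const std::string_view delim = d.find(text, pos);
    const std::size_t delim_pos =
        static_cast<std::size_t>(delim.data() - text.data());
    if (delim.empty() && delim_pos == text.size()) {  // Not Found.
      pieces.push_back(text.substr(pos));             // Trailing piece.
      return pieces;
    }
    pieces.push_back(text.substr(pos, delim_pos - pos));
    pos = delim_pos + delim.size();  // Resume after the delimiter.
  }
}
```

The loop assumes that each call to find() returns a match that lets pos advance; a Delimiter that repeatedly returned a zero-length match at pos would never terminate in this sketch.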
The following are standard delimiter implementations that will be part of the splitting API.
[Footnote: Here are a few more delimiters that might be worth including by default:
std::fixed_delimiter: this Delimiter breaks the input string at fixed-length intervals.
std::limit_delimiter: this Delimiter template would take another Delimiter and a size_t limiting the given delimiter to matching a maximum number of times. This is similar to the 3rd argument to perl's split() function.
std::regex_delimiter: this Delimiter would take a regex as an argument and would match everywhere the pattern matched in the input string.
—end footnote]
As described so far, std::split() may not work correctly if splitting a std::string_view that refers to a temporary string. In particular, the following will not work:
for (std::string_view s : std::split(GetTemporaryString(), "-")) {
// s now refers to a temporary string that is no longer valid.
}
To address this, std::split() will move ownership of rvalues into the Range object that is returned from std::split().
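One way such ownership transfer could work is sketched below. This is an assumption about a possible implementation, not the proposal's design: owning_split_result and split_owned are hypothetical names, the delimiter is fixed to a single char for brevity, and the string is placed on the heap so that moving the result object does not relocate the characters the views refer to.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical Range type that keeps the moved-in string alive.
class owning_split_result {
 public:
  owning_split_result(std::string&& text, char sep)
      : storage_(std::make_unique<const std::string>(std::move(text))) {
    const std::string_view view(*storage_);
    std::size_t pos = 0;
    for (;;) {
      const std::size_t found = view.find(sep, pos);
      if (found == std::string_view::npos) {
        pieces_.push_back(view.substr(pos));
        break;
      }
      pieces_.push_back(view.substr(pos, found - pos));
      pos = found + 1;
    }
  }
  auto begin() const { return pieces_.begin(); }
  auto end() const { return pieces_.end(); }

 private:
  std::unique_ptr<const std::string> storage_;  // Owns the characters.
  std::vector<std::string_view> pieces_;        // Views into *storage_.
};

// Hypothetical entry point that only accepts rvalue strings.
inline owning_split_result split_owned(std::string&& text, char sep) {
  return owning_split_result(std::move(text), sep);
}
```

With this shape, a range-based for loop over split_owned(GetTemporaryString(), '-') would iterate over views that stay valid for the duration of the loop, because the temporary result object owns the string.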
The function called to split an input string into a range of substrings.
namespace std {
  template <typename Delimiter>
  auto split(std::string_view text, Delimiter d) -> unspecified;
}
text: a std::string_view referring to the input string to be split.
d: an object that models the Delimiter concept. If this argument is a std::string, std::string_view, const char*, or char, then the std::literal_delimiter will be used as a default.
Returns a Range of std::string_view objects, each referring to a split substring within the given input text. The object returned from std::split() will have begin() and end() member functions and will fully model the Range concept.
One question at this point is: why is this constrained to strings/string_views? One could imagine std::split() as an algorithm that transforms an input Range into an output Range of Ranges. This would make the algorithm more generally applicable. However, this generalization may also make std::split() less convenient in the expected common case: that of splitting string data. For example, the logic for detecting when to auto-construct a std::literal_delimiter may be more complicated, and it may not be clear that it is a reasonable default delimiter in the generic case. The current proposal limits std::split() to strings/string_views to keep the function simple to use in the common case of splitting strings.
The second argument to std::split() may be an object that models the Delimiter concept. A Delimiter object must have the following member function:
std::string_view find(std::string_view text, size_t pos);
This function is responsible for finding the next occurrence of the represented delimiter in the given text at or after the given position pos.
text: the full input string that was originally passed to std::split().
pos: the position in text where the search for the represented delimiter should start.
Returns a std::string_view referring to the found delimiter within the given input text, or std::string_view(text.end(), 0) if the delimiter was not found.
A string delimiter. This is the default delimiter used if a string is given as the delimiter argument to std::split().
The delimiter representing the empty string (std::literal_delimiter("")) will be defined to return each individual character in the input string. This matches the behavior of splitting on the empty string "" in perl.
The following is an example of what the std::literal_delimiter might look like.
namespace std {
  class literal_delimiter {
    const string delimiter_;
   public:
    // The constructor must carry the class name, literal_delimiter.
    explicit literal_delimiter(string_view sview)
        : delimiter_(static_cast<string>(sview)) {}
    string_view find(string_view text, size_t pos) const;
  };
}
text is the text to be split.
pos is the position in text to start searching for the delimiter.
Returns a std::string_view referring to the first substring of text that matches delimiter_, or std::string_view(text.end(), 0) if not found.
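A possible body for this find() is sketched below using C++17 std::string_view. The class name literal_delimiter_sketch is hypothetical, and the treatment of the empty delimiter (a zero-length match just after the character at pos) is an assumption chosen so that a driving loop yields one piece per character; the proposal only states the resulting behavior, not how it is achieved.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the proposed std::literal_delimiter.
class literal_delimiter_sketch {
  const std::string delimiter_;

 public:
  explicit literal_delimiter_sketch(std::string_view sview)
      : delimiter_(sview) {}

  std::string_view find(std::string_view text, std::size_t pos) const {
    if (delimiter_.empty()) {
      // Assumed convention: an empty delimiter matches between characters,
      // so report a zero-length match one character past pos.
      if (pos + 1 >= text.size()) return text.substr(text.size());
      return text.substr(pos + 1, 0);
    }
    const std::size_t found = text.find(delimiter_, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Zero-length view at the end: Not Found.
    return text.substr(found, delimiter_.size());
  }
};
```

Here text.substr(text.size()) plays the role of std::string_view(text.end(), 0) from the text above, expressed with member functions that exist in C++17.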
Each character in the given string is a delimiter. A std::any_of_delimiter with a string of length 1 behaves the same as a std::literal_delimiter with the same string of length 1.
namespace std {
  class any_of_delimiter {
    const string delimiters_;
   public:
    explicit any_of_delimiter(string_view sview)
        : delimiters_(static_cast<string>(sview)) {}
    string_view find(string_view text, size_t pos) const;
  };
}
text is the text to be split.
pos is the position in text to start searching for the delimiter.
Returns a std::string_view referring to the first occurrence of any character from delimiters_ that is found in text at or after pos. The length of the returned std::string_view will always be 1. If no match is found, std::string_view(text.end(), 0) is returned.
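As with the literal delimiter, a possible find() body can be sketched with C++17 std::string_view. The class name any_of_delimiter_sketch is hypothetical, and Not Found is again expressed as a zero-length view at the end of text.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the proposed std::any_of_delimiter.
class any_of_delimiter_sketch {
  const std::string delimiters_;

 public:
  explicit any_of_delimiter_sketch(std::string_view sview)
      : delimiters_(sview) {}

  std::string_view find(std::string_view text, std::size_t pos) const {
    // find_first_of locates the first character at or after pos that
    // appears anywhere in delimiters_.
    const std::size_t found = text.find_first_of(delimiters_, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Zero-length view at the end: Not Found.
    return text.substr(found, 1);       // Matches are always one character long.
  }
};
```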
The following using declarations are assumed for brevity:
using std::deque;
using std::list;
using std::set;
using std::string;
using std::string_view;
using std::vector;
When a string is given as the delimiter argument, the default delimiter is std::literal_delimiter. The following two calls to std::split() are equivalent. The first form is provided for convenience.
vector<string_view> v1{std::split("a-b-c", "-")};
vector<string_view> v2{std::split("a-b-c", std::literal_delimiter("-"))};

vector<string_view> v{std::split("a--c", "-")};
assert(v.size() == 3);  // "a", "", "c"

vector<string_view> v{std::split("-a-b-c-", "-")};
assert(v.size() == 5);  // "", "a", "b", "c", ""

vector<string_view> v{std::split("a-b-c", "-")};
deque<string_view> v{std::split("a-b-c", "-")};
set<string_view> s{std::split("a-b-c", "-")};
list<string_view> l{std::split("a-b-c", "-")};

vector<string_view> v{std::split("abc", "")};
assert(v.size() == 3);  // "a", "b", "c"

for (string_view sview : std::split("a-b-c", "-")) {
  // use sview
}

string s = "a-b-c";
auto r = std::split(s, "-");
s += "-d-e-f";  // This invalidates the results r
for (std::string_view token : r) {  // Invalid
  // ...
}

vector<string_view> v{std::split("", any-delimiter)};
assert(v.size() == 1);  // ""
[Footnote: This is logical behavior given that std::split() doesn't skip empty substrings. However, it might be surprising behavior to some users. Would it be better if the result of splitting an empty string resulted in an empty Range?
—end footnote]
Should a Delimiter's find() member function return a std::pair<size_t, size_t> instead of a std::string_view? The signature would be:
std::pair<size_t, size_t> find(std::string_view text, size_t pos)
The returned pair's first and second members would refer to the found position and length, respectively. Not Found would be represented simply as std::make_pair(std::string_view::npos, 0), which is a position of npos and a length of 0. This seems quite natural.
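For comparison, a character Delimiter written against the pair-returning signature might look like the following sketch. The name char_delimiter_pair_sketch is hypothetical and the code uses C++17 std::string_view.

```cpp
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical Delimiter that reports (position, length) instead of a view.
struct char_delimiter_pair_sketch {
  char c;
  std::pair<std::size_t, std::size_t> find(std::string_view text,
                                           std::size_t pos) const {
    const std::size_t found = text.find(c, pos);
    if (found == std::string_view::npos)
      return std::make_pair(std::string_view::npos, std::size_t{0});  // Not Found.
    return std::make_pair(found, std::size_t{1});
  }
};
```

With this form the caller no longer needs pointer arithmetic to recover the match position, at the cost of the result not being directly usable as a substring.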
A few more delimiters might be worth including by default:
std::fixed_delimiter: this Delimiter breaks the input string at fixed-length intervals.
std::limit_delimiter: this Delimiter template would take another Delimiter and a size_t limiting the given delimiter to matching a maximum number of times. This is similar to the 3rd argument to perl's split() function.
std::regex_delimiter: this Delimiter would take a regex as an argument and would match everywhere the pattern matched in the input string.
Should Delimiters use operator() rather than a named find() member function? The Delimiter API requires a member function named find. There is no technical requirement that this function be named. Perhaps it would be better for Delimiters to use operator().
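To illustrate what the operator() alternative would mean for Delimiter authors, here is a sketch of the character delimiter with a call operator in place of find(). The name char_delimiter_call_sketch is hypothetical, and the code uses C++17 std::string_view with Not Found expressed as a zero-length view at the end of text.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical Delimiter that is invoked as d(text, pos) instead of d.find(text, pos).
struct char_delimiter_call_sketch {
  char c;
  std::string_view operator()(std::string_view text, std::size_t pos) const {
    const std::size_t found = text.find(c, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Zero-length view at the end: Not Found.
    return text.substr(found, 1);
  }
};
```

Under this variant any callable with the same parameters and return type, including a lambda, would have the required shape, which may or may not be desirable.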