std::split(): An algorithm for splitting strings

ISO/IEC JTC1 SC22 WG21 N3593 - 2013-03-13

Greg Miller

Introduction

Splitting strings into substrings is a common task in many applications. When the need arises in C++, programmers must search for an existing solution or write one of their own. A typical solution might look like the following:

    std::vector<std::string> my_split(const std::string& text, const std::string& delimiter);
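
As a point of reference, here is a minimal sketch of how such a function might be written. Keeping empty substrings and requiring a non-empty delimiter are assumptions made for this sketch, not requirements stated in this paper.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative only: one possible body for the my_split declaration.
std::vector<std::string> my_split(const std::string& text,
                                  const std::string& delimiter) {
  // Assumption: an empty delimiter is not supported by this sketch.
  assert(!delimiter.empty());
  std::vector<std::string> result;
  std::string::size_type start = 0;
  for (;;) {
    const std::string::size_type found = text.find(delimiter, start);
    if (found == std::string::npos) {
      result.push_back(text.substr(start));  // Last piece, possibly empty.
      return result;
    }
    result.push_back(text.substr(start, found - start));
    start = found + delimiter.size();  // Continue after the delimiter.
  }
}
```

Note that every piece is copied into a new std::string, and the delimiter can only be a literal string.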
    

A straightforward implementation of the above function would likely use std::string::find or std::string::find_first_of to identify substrings and move from one to the next, building the vector to return. This is a fine solution for simple needs, but it is deficient in the following ways:

Google developed a flexible and fast string-splitting API to address these deficiencies. The new API has been well received by internal engineers developing serious applications. The rest of this paper describes Google's string-splitting API as it might appear in the C++ standard.

This proposal depends on the following proposals:

Changes in this revision

The first version of this proposal was N3430, which included features such as Predicates and implicit result type conversions. Several of these more complicated features were removed in the next revision, N3510. The following are the major changes in this revision.

std::split() API

    namespace std {

      template <typename Delimiter>
      auto split(std::string_view text, Delimiter d) -> unspecified;

    }
    

The std::split() algorithm takes a std::string_view and a Delimiter as arguments, and it returns a Range of std::string_view objects as output. The std::string_view objects in the returned Range will refer to substrings of the input text. The Delimiter object defines the boundaries between the returned substrings.

Delimiters

The general notion of a delimiter (aka separator) is not new. A delimiter (little d) marks the boundary between two substrings in a larger string. With the std::split() API comes the generalized concept of a Delimiter (big D). A Delimiter is an object with a find() member function that can find the next occurrence of itself in a given std::string_view starting at the given position. Objects that conform to the Delimiter concept represent specific kinds of delimiters. Some examples of Delimiter objects are an object that finds a specific character in a string, an object that finds a substring in a string, or even an object that finds regular expression matches in a given string.

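The result of a Delimiter's find() member function must be a std::string_view referring to one of the following: either the substring of the input text that is the found delimiter, or, if the delimiter was not found, an empty std::string_view positioned at the end of the input text (std::string_view(text.end(), 0)).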

[Footnote: An alternative to having a Delimiter's find() function return a std::string_view is to instead have it return a std::pair<size_t, size_t> where the pair's first member is the position of the found delimiter, and the second member is the length of the found delimiter. In this case, Not Found could be represented as std::make_pair(std::string_view::npos, 0). end footnote]
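
A minimal sketch of that alternative interface follows. The type char_delimiter_pair and the sketch namespace are hypothetical names chosen for illustration and are not part of this proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

namespace sketch {  // Hypothetical namespace, not proposed wording.

// A single-character delimiter whose find() reports {position, length}
// instead of a std::string_view.
struct char_delimiter_pair {
  char c_;
  explicit char_delimiter_pair(char c) : c_(c) {}

  std::pair<std::size_t, std::size_t> find(std::string_view text,
                                           std::size_t pos) const {
    assert(pos <= text.size());
    const std::size_t found = text.find(c_, pos);
    if (found == std::string_view::npos)
      return {std::string_view::npos, 0};  // Not Found.
    return {found, 1};  // A single character was matched at `found`.
  }
};

}  // namespace sketch
```

With this shape the caller never needs pointer arithmetic to recover the match position, at the cost of a result that is less self-describing than a std::string_view.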

The following example shows a simple object that models the Delimiter concept. It has a find() member function that is responsible for finding the next occurrence of the given character in the given text starting at the given position.

    struct char_delimiter {
      char c_;
      explicit char_delimiter(char c) : c_(c) {}
      std::string_view find(std::string_view text, size_t pos) {
        std::string_view substr = text.substr(pos);
        size_t found = substr.find(c_);
        if (found == std::string_view::npos)
          return std::string_view(substr.end(), 0);  // Not found.
        return substr.substr(found, 1);  // A string_view referring to the c_ that was found in the input string.
      }
    };
    

The following shows how the above delimiter could be used to split a string:

    std::vector<std::string_view> v{std::split("a-b-c", char_delimiter('-'))};
    // v is {"a", "b", "c"}
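
To make the interaction between std::split() and a Delimiter concrete, the following sketch shows one way a splitting loop could drive find(). It is an eager stand-in that returns a std::vector rather than the lazy Range this paper proposes. The sketch namespace and everything in it are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

namespace sketch {  // Hypothetical namespace, not proposed wording.

// Same idea as the char_delimiter example: finds one specific character.
struct char_delimiter {
  char c_;
  explicit char_delimiter(char c) : c_(c) {}
  std::string_view find(std::string_view text, std::size_t pos) const {
    const std::size_t found = text.find(c_, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Empty view at the end: not found.
    return text.substr(found, 1);
  }
};

// Eager stand-in for std::split(): collects the pieces into a vector.
template <typename Delimiter>
std::vector<std::string_view> split(std::string_view text, Delimiter d) {
  std::vector<std::string_view> pieces;
  std::size_t pos = 0;
  for (;;) {
    const std::string_view delim = d.find(text, pos);
    // Offset of the returned view within text (text.size() if not found).
    const std::size_t delim_pos =
        static_cast<std::size_t>(delim.data() - text.data());
    assert(delim_pos >= pos && delim_pos + delim.size() <= text.size());
    pieces.push_back(text.substr(pos, delim_pos - pos));
    if (delim_pos == text.size()) break;  // Not found: that was the last piece.
    pos = delim_pos + delim.size();
  }
  return pieces;
}

}  // namespace sketch
```

With this loop, sketch::split("-a-b-c-", sketch::char_delimiter('-')) yields five pieces, the first and last of which are empty.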
    

The following standard delimiter implementations will be part of the splitting API: std::literal_delimiter and std::any_of_delimiter, both described in the API Synopsis below.

[Footnote: Here are a few more delimiters that might be worth including by default: end footnote]

Rvalue support

As described so far, std::split() may not work correctly if splitting a std::string_view that refers to a temporary string. In particular, the following will not work:

    for (std::string_view s : std::split(GetTemporaryString(), "-")) {
        // s now refers to a temporary string that is no longer valid.
    }
    

To address this, std::split() will move ownership of rvalues into the Range object that is returned from std::split().
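
This paper does not say how the returned Range takes ownership. One possible arrangement is to overload on std::string&& and keep the moved string alive inside the result, as in the sketch here. It assumes an eager result type and a single-character delimiter, both simplifications, and all names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace sketch {  // Hypothetical names throughout, not proposed wording.

class split_result {
 public:
  // The caller keeps the text alive; this object only refers to it.
  split_result(std::string_view text, char delim) { build(text, delim); }

  // Rvalue input: move the string onto the heap so the pieces stay valid
  // even if this object is itself moved.
  split_result(std::string&& text, char delim)
      : owned_(std::make_unique<std::string>(std::move(text))) {
    build(*owned_, delim);
  }

  auto begin() const { return pieces_.begin(); }
  auto end() const { return pieces_.end(); }
  std::size_t size() const { return pieces_.size(); }

 private:
  void build(std::string_view text, char delim) {
    assert(pieces_.empty());  // Called exactly once, from a constructor.
    std::size_t pos = 0;
    for (;;) {
      const std::size_t found = text.find(delim, pos);
      if (found == std::string_view::npos) {
        pieces_.push_back(text.substr(pos));
        return;
      }
      pieces_.push_back(text.substr(pos, found - pos));
      pos = found + 1;
    }
  }

  std::unique_ptr<std::string> owned_;  // Null unless the input was an rvalue.
  std::vector<std::string_view> pieces_;
};

inline split_result split(std::string_view text, char delim) {
  return split_result(text, delim);
}
inline split_result split(std::string&& text, char delim) {
  return split_result(std::move(text), delim);
}
// Keeps string literals from being ambiguous between the two overloads above.
inline split_result split(const char* text, char delim) {
  return split_result(std::string_view(text), delim);
}

}  // namespace sketch
```

In a range-based for loop the temporary split_result lives until the loop ends, so the pieces it hands out remain valid inside the loop body.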

API Synopsis

std::split()

The function called to split an input string into a range of substrings.

    namespace std {

      template <typename Delimiter>
      auto split(std::string_view text, Delimiter d) -> unspecified;

    }
    
Requires:
text — a std::string_view referring to the input string to be split.
Delimiter — an object that implements the Delimiter concept. Or if this argument type is a std::string, std::string_view, const char*, or char, then the std::literal_delimiter will be used as a default.
Returns:
a Range of std::string_view objects, each referring to a substring of the given input text. The object returned from std::split() will have begin() and end() member functions and will fully model the Range concept.
[Footnote:

One question at this point is: why is this constrained to strings/string_views? One could imagine std::split() as an algorithm that transforms an input Range into an output Range of Ranges. This would make the algorithm more generally applicable.

However, this generalization may also make std::split() less convenient in the expected common case: that of splitting string data. For example, the logic for detecting when to auto-construct a std::literal_delimiter may be more complicated, and it may not be clear that that is a reasonable default delimiter in the generic case.

The current proposal limits std::split to strings/string_views to keep the function simple to use in the common case of splitting strings.

end footnote]

Delimiter template parameter

The second argument to std::split() may be an object that models the Delimiter concept. A Delimiter object must have the following member function:

    std::string_view find(std::string_view text, size_t pos);
    

This function is responsible for finding the next occurrence of the represented delimiter in the given text at or after the given position pos.

Requires:
text — the full input string that was originally passed to std::split().
pos — the position in text where the search for the represented delimiter should start.
Returns:
a std::string_view referring to the found delimiter within the given input text, or std::string_view(text.end(), 0) if the delimiter was not found.
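
As an illustration of these requirements, a user-defined Delimiter that matches a regular expression might look like the sketch here. The regex_delimiter type is hypothetical and not part of this proposal. Treating a zero-length match as Not Found is a simplification assumed for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>

namespace sketch {  // Hypothetical namespace, not proposed wording.

class regex_delimiter {
  std::regex re_;

 public:
  explicit regex_delimiter(const std::string& pattern) : re_(pattern) {}

  std::string_view find(std::string_view text, std::size_t pos) const {
    assert(pos <= text.size());
    const char* const first = text.data() + pos;
    const char* const last = text.data() + text.size();
    std::cmatch m;
    // Assumption: an empty match is reported as Not Found.
    if (!std::regex_search(first, last, m, re_) || m.length(0) == 0)
      return text.substr(text.size());
    // m.position(0) is relative to `first`, so add pos back.
    return text.substr(pos + static_cast<std::size_t>(m.position(0)),
                       static_cast<std::size_t>(m.length(0)));
  }
};

}  // namespace sketch
```

Unlike the fixed-width delimiters, the returned std::string_view here can have any non-zero length, which is why find() returns a view rather than a bare position.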

std::literal_delimiter

A string delimiter. This is the default delimiter used if a string is given as the delimiter argument to std::split().

The delimiter representing the empty string (std::literal_delimiter("")) will be defined to return each individual character in the input string. This matches the behavior of splitting on the empty string "" in Perl.

The following is an example of what the std::literal_delimiter might look like.

    namespace std {

      class literal_delimiter {
        const string delimiter_;
       public:
        explicit literal_delimiter(string_view sview)
        : delimiter_(static_cast<string>(sview)) {}
        string_view find(string_view text, size_t pos) const;
      };

    }
    
Requires:
text is the text to be split. pos is the position in text to start searching for the delimiter.
Returns:
A std::string_view referring to the first substring of text that matches delimiter_, or std::string_view(text.end(), 0) if not found.
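
A minimal sketch of how find() could satisfy this specification is shown here, in a hypothetical sketch namespace. The handling of the empty delimiter (a zero-length match one character past pos) is an assumption about how the one-character-per-piece behavior might be achieved, not something this paper specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // Hypothetical namespace, not proposed wording.

class literal_delimiter {
  const std::string delimiter_;

 public:
  explicit literal_delimiter(std::string_view sview) : delimiter_(sview) {}

  std::string_view find(std::string_view text, std::size_t pos) const {
    assert(pos <= text.size());
    const std::string_view not_found = text.substr(text.size());
    if (delimiter_.empty()) {
      // Assumed special case: a zero-length match just past the character
      // at pos, so that each character becomes its own piece.
      return pos + 1 < text.size() ? text.substr(pos + 1, 0) : not_found;
    }
    const std::size_t found = text.find(delimiter_, pos);
    return found == std::string_view::npos
               ? not_found
               : text.substr(found, delimiter_.size());
  }
};

}  // namespace sketch
```

The delimiter is copied into a std::string so that the object stays valid even if the caller's delimiter text goes away.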

std::any_of_delimiter

Each character in the given string is a delimiter. A std::any_of_delimiter constructed from a string of length 1 behaves the same as a std::literal_delimiter constructed from that same string.

    namespace std {

      class any_of_delimiter {
        const string delimiters_;
       public:
        explicit any_of_delimiter(string_view sview)
        : delimiters_(static_cast<string>(sview)) {}
        string_view find(string_view text, size_t pos) const;
      };

    }
    
Requires:
text is the text to be split. pos is the position in text to start searching for the delimiter.
Returns:
A std::string_view referring to the first occurrence of any character from delimiters_ that is found in text at or after pos. The length of the returned std::string_view will always be 1. If no match is found, returns std::string_view(text.end(), 0).
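
A minimal sketch of a find() that meets this description, assuming std::string_view::find_first_of does the character search, is shown here. The sketch namespace is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // Hypothetical namespace, not proposed wording.

class any_of_delimiter {
  const std::string delimiters_;

 public:
  explicit any_of_delimiter(std::string_view sview) : delimiters_(sview) {}

  std::string_view find(std::string_view text, std::size_t pos) const {
    assert(pos <= text.size());
    const std::size_t found = text.find_first_of(delimiters_, pos);
    if (found == std::string_view::npos)
      return text.substr(text.size());  // Empty view at the end: not found.
    return text.substr(found, 1);  // Always exactly one character.
  }
};

}  // namespace sketch
```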

API Usage

The following using declarations are assumed for brevity:

    using std::deque;
    using std::list;
    using std::set;
    using std::string_view;
    using std::vector;
    
  1. When a string is passed as the delimiter argument, std::literal_delimiter is used by default. The following two calls to std::split() are equivalent. The first form is provided for convenience.
        vector<string_view> v1{std::split("a-b-c", "-")};
        vector<string_view> v2{std::split("a-b-c", std::literal_delimiter("-"))};
        
  2. Empty substrings are included in the output.
        vector<string_view> v{std::split("a--c", "-")};
        assert(v.size() == 3);  // "a", "", "c"
        
  3. The previous example showed that empty substrings are included in the output. Leading and trailing delimiters result in leading and trailing empty strings in the output.
        vector<string_view> v{std::split("-a-b-c-", "-")};
        assert(v.size() == 5);  // "", "a", "b", "c", ""
        
  4. Results can be assigned to STL containers that support the Range concept.
        vector<string_view> v{std::split("a-b-c", "-")};
        deque<string_view> d{std::split("a-b-c", "-")};
        set<string_view> s{std::split("a-b-c", "-")};
        list<string_view> l{std::split("a-b-c", "-")};
        
  5. A delimiter of the empty string results in each character in the input string becoming one element in the output collection. This is a special case. It is done to match the behavior of splitting using the empty string in other programming languages (e.g., Perl).
        vector<string_view> v{std::split("abc", "")};
        assert(v.size() == 3);  // "a", "b", "c"
        
  6. Iterating the results of a split in a range-based for loop.
        for (string_view sview : std::split("a-b-c", "-")) {
          // use sview
        }
        
  7. Modifying the input text invalidates the result of a split from that point on.
        string s = "a-b-c";
        auto r = std::split(s, "-");
        s += "-d-e-f";  // This invalidates the results r
        for (std::string_view token : r) {  // Invalid
          // ...
        }
        
  8. Splitting input text that is the empty string results in a collection containing one element that is the empty string.
        vector<string_view> v{std::split("", "-")};  // Any delimiter gives the same result.
        assert(v.size() == 1);  // ""
        
    [Footnote: This is logical behavior given that std::split() doesn't skip empty substrings. However, it might be surprising behavior to some users. Would it be better if the result of splitting an empty string resulted in an empty Range? —end footnote]

Open Questions