Extending std::search to use Additional Searching Algorithms (Version 4)

Document #: N3905

Marshall Clow [email protected]

2014-02-14

Note:

Changes

Overview

std::search is a powerful tool for searching sequences, but there are lots of other search algorithms in the literature. For specialized tasks, some of them perform significantly better than std::search. In general, they do this by precomputing statistics about the pattern to be searched for, betting that this time can be made up during the search.

The basic principle is to break the search operation into two parts; the first part creates a "search object", which is specific to the pattern being searched for, and then the search object is passed, along with the data being searched, to std::search.

This is done by adding an additional overload to std::search, and some additional functions to create the search objects.

Two additional search algorithms are proposed for inclusion into the standard: "Boyer-Moore" and "Boyer-Moore-Horspool". Additionally, the interface for the search objects is documented so that library implementers and end users can create their own search objects and use them with std::search.

Terminology: I will refer to the sequence that is being searched as the "corpus" and the sequence being searched for as the "pattern".

Search Objects

A search object is an object that is passed a pattern to search for in its constructor. Then, the operator () is called to search a sequence for matches.

Example

A search object that uses a very simple search algorithm might be implemented as:

    template <class ForwardIterator1>
    class sample_searcher {
    public:
        sample_searcher ( ForwardIterator1 first, ForwardIterator1 last ) :
            first_ ( first ), last_ ( last ) {}

        template <class ForwardIterator2>
        ForwardIterator2 operator () ( ForwardIterator2 cFirst, ForwardIterator2 cLast ) const {
            if ( first_ == last_ ) return cFirst;   // empty pattern
            if ( cFirst == cLast ) return cLast;    // empty corpus
            while ( cFirst != cLast ) {
            //  Advance to the first match
                while ( *cFirst != *first_ ) {
                    ++cFirst;
                    if ( cFirst == cLast )
                        return cLast;
                }
            //  At this point, we know that the first element matches
                ForwardIterator1 it1 = first_;
                ForwardIterator2 it2 = cFirst;
                while ( it1 != last_ && it2 != cLast && *++it1 == *++it2 )
                    ;
                if ( it1 == last_ ) // matched the whole pattern
                    return cFirst;
                ++cFirst;
                }
            return cLast;   // didn't find anything
            }

    private:
        ForwardIterator1 first_;
        ForwardIterator1 last_;
        };

Algorithms

Default Search and Default Search with Predicate

This is a convienience function that allows users of the new interface to use the existing functionality of std::search.

This searcher requires that both the corpus and the pattern be represented with forward iterators (or better).

Boyer-Moore

The Boyer-Moore string search algorithm is a particularly efficient string searching algorithm, and it has been the standard benchmark for the practical string search literature. The Boyer-Moore algorithm was invented by Bob Boyer and J. Strother Moore, and published in the October 1977 issue of the Communications of the ACM, and a copy of that article is available; another description is available on Wikipedia.

The Boyer-Moore algorithm uses two precomputed tables to give better performance than a naive search. These tables depend on the pattern being searched for, and give the Boyer-Moore algorithm a larger memory footprint and startup costs than a simpler algorithm, but these costs are recovered quickly during the searching process, especially if the pattern is longer than a few elements. Both tables contain only non-negative integers.

In the following discussion, I will refer to pattern_length, the length of the pattern being searched for; in other words, std::distance(pattern_first, pattern_last).

The first table contains one entry for each element in the "alphabet" being searched (i.e, the corpus). For searching (narrow) character sequences, a 256-element array can be used for quick lookup. For searching other types of data, a hash table can be used to save space. The contents of the first table are: For each element in the "alphabet" being processed (i.e, the set of all values contained in the corpus) If the element does not appear in the pattern, then pattern_length, otherwise pattern_length - j, where j is the maximum value for which *(pattern_first + j) == element.

Note: Even though the table may contain one entry for each element that occurs in the corpus, the contents of the tables only depend on the pattern.

The second table contains one entry for each element in the pattern; for example, a std::vector<size_t>(pattern_length). Each entry in the table is basically the amount that the matching window can be moved when a mismatch is found. The Boyer-Moore algorithm works by at each position, comparing an element in the pattern to one in the corpus. If it matches, it advances to the next element in both the pattern and the corpus. If the end of the pattern is reached, then a match has been found, and can be returned. If the elements being compared do not match, then the precomputed tables are consulted to determine where to position the pattern in the corpus, and what position in the pattern to resume the matching.

The Boyer-Moore algorithm requires that both the corpus and the pattern be represented with random-access iterators, and that both iterator types "point to" the same type.

Boyer-Moore-Horspool

The Boyer-Moore-Horspool search algorithm was published by Nigel Horspool in 1980. It is a refinement of the Boyer-Moore algorithm that trades space for time. It uses less space for internal tables than Boyer-Moore, and has poorer worst-case performance.

Like the Boyer-Moore algorithm, it has a table that (logically) contains one entry for each element in the pattern "alphabet". When a mismatch is found in the comparison between the pattern and the corpus, this table and the mismatched character in the corpus are used to decide how far to advance the pattern to start the new comparison.

A reasonable description (with sample code) is available on Wikipedia.

The Boyer-Moore algorithm requires that both the corpus and the pattern be represented with random-access iterators, and that both iterator types "point to" the same type.

Calls to be added to the standard library


    template <class ForwardIterator, class Searcher>
    ForwardIterator search ( ForwardIterator first, ForwardIterator last, const Searcher &searcher );

    template <class ForwardIterator,
              class BinaryPredicate = typename equal_to<>>
    default_searcher<ForwardIterator, BinaryPredicate> 
    make_searcher (
    		ForwardIterator first, ForwardIterator last, 
    		BinaryPredicate pred = BinaryPredicate ());

    template <class RandomAccessIterator, 
              class Hash =            typename hash <typename iterator_traits<RandomAccessIterator>::value_type>,
              class BinaryPredicate = typename equal_to<>>
    boyer_moore_searcher<RandomAccessIterator, Hash, BinaryPredicate> 
    make_boyer_moore_searcher ( 
            RandomAccessIterator first, RandomAccessIterator last, 
            Hash hf = Hash (), BinaryPredicate pred = BinaryPredicate ());

    template <class RandomAccessIterator, 
              class Hash =            typename hash <typename iterator_traits<RandomAccessIterator>::value_type>,
              class BinaryPredicate = typename equal_to<>>
    boyer_moore_horspool_searcher<RandomAccessIterator, Hash, BinaryPredicate> 
    make_boyer_moore_horspool_searcher ( 
            RandomAccessIterator first, RandomAccessIterator last, 
            Hash hf = Hash (), BinaryPredicate pred = BinaryPredicate ());

Example usage

    // existing code
    iter = search ( corpus_first, corpus_last, pattern_first, pattern_last );

    // same code, new interface
    iter = search ( corpus_first, corpus_last, make_default_searcher ( pattern_first, pattern_last ));

    // same code, different algorithm
    iter = search ( corpus_first, corpus_last, make_boyer_moore_searcher ( pattern_first, pattern_last ));

Restrictions on comparison predicates

Boyer-Moore and Boyer-Moore-Horspool require a hash function as well as a comparison function. The two functions must compare in the same way. In particular for any two values A and B, if pred(A,B) then hf(A) == hf(B).

Performance

Using the new interface with the existing search algorithms should fulfill all the performance guarantees for the current interface of std::search. No additional comparisons are necessary. However, the creation of the search object may add some additional overhead. Different algorithms will have different amounts of overhead to create the search object. The default_search objects, for example, should be cheap to create - they will typically be a pair of iterators. The Boyer-Moore search object, on the other hand, will contain a pair of tables, and require a significant amount of computation to create.

In my tests, on medium sized search patterns (about 100 entries), the Boyer-Moore and Boyer-Moore-Horspool were about 8-10x faster than std::search. For longer patterns, the advantage increases. For short patterns, they may actually be slower.

Test timings

These results were run on a MacBookPro (i5) computer, compiled with clang 3.2 -O3, and compared against libc++. The corpus was about 2.8MB of base64-encoded text. All timings are normalized against std::search.

The titles of the test indicate where the pattern is located in the corpus being searched; "At Start", etc. "Not found" is the case where the pattern does not exist in the corpus, i.e, the search will fail.

AlgorithmAt StartMiddleAt EndNot Found
Pattern Length1191054391
std::search100.0100.0100.0100.0
default search107.192.79104.6107
Boyer-Moore110.714.3423.1412.86
Boyer-Moore Horspool82.1411.820.0410.41

Sample Implementation

An implementation of this proposal, available under the Boost Software License can be found on GitHub.

Placement into the standard

We (myself and Daniel) believe that the searcher functions should go into <functional>, and the specialization of search should go into <algorithm>.

It would also be possible to put the searchers into <utility>. We think that since there are not function objects in <algorithm>, only algorithms, the searchers do not belong there.

We are not 100% sure that the new search overload is a real algorithm, and if not, does it belong in <algorithm>.

Wording

The proposed wording changes refer to N3797.

Editorial note: The specification of the class templates boyer_moore_searcher and boyer_moore_horspool_searcher are exactly equal except for their names. The editor may merge them into one sub-clause at his discretion.

  1. Add to [function.objects], header <experimental/functional> synopsis as indicated:

    namespace std { namespace experimental { inline namespace fundamentals_v1 {
    	
      // [searchers], searchers:
      template<class ForwardIterator, class BinaryPredicate = equal_to<>>
        class default_searcher;
    
      template<class RandomAccessIterator,
               class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
               class BinaryPredicate = equal_to<>>
        class boyer_moore_searcher;
    
      template<class RandomAccessIterator,
               class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
               class BinaryPredicate = equal_to<>>
        class boyer_moore_horspool_searcher;
    
    
      template<class ForwardIterator, class BinaryPredicate = equal_to<>>
      default_searcher<ForwardIterator, BinaryPredicate>
      make_default_searcher(ForwardIterator pat_first, ForwardIterator pat_last,
                            BinaryPredicate pred = BinaryPredicate());
    
      template<class RandomAccessIterator,
               class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
               class BinaryPredicate = equal_to<>>
      boyer_moore_searcher<RandomAccessIterator, Hash, BinaryPredicate>
      make_boyer_moore_searcher(RandomAccessIterator pat_first, RandomAccessIterator pat_last, 
                                Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    
      template<class RandomAccessIterator,
               class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
               class BinaryPredicate = equal_to<>>
      boyer_moore_horspool_searcher<RandomAccessIterator, Hash, BinaryPredicate>
      make_boyer_moore_horspool_searcher(RandomAccessIterator pat_first, RandomAccessIterator pat_last, 
                                         Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    }}}
    
  2. Add a new clause to [function.objects] and a series of paragraphs as indicated:

    ?? Searchers [searchers]

    This sub-clause provides function object types ([function.objects]) for operations that search for a sequence [pat_first, pat_last) in another sequence [first, last) that is provided to the object's function call operator. The first sequence (the pattern to be searched for) is provided to the object's constructor, and the second (the sequence to be searched) is provided to the function call operator.

    Each specialization of a class template specified in this sub-clause [searchers] shall meet the CopyConstructible and CopyAssignable requirements. Template parameters named ForwardIterator, ForwardIterator1, ForwardIterator2, RandomAccessIterator, RandomAccessIterator1, RandomAccessIterator2, and BinaryPredicate of templates specified in this sub-clause [searchers] shall meet the same requirements and semantics as specified in [algorithms.general]. Template parameters named Hash shall meet the requirements as specified in [hash.requirements].

    ??.1 Class template default_searcher [searchers.default]

    namespace std { namespace experimental { inline namespace fundamentals_v1 {
      template<class ForwardIterator1, class BinaryPredicate = equal_to<>>
      class default_searcher {
      public:
        default_searcher(ForwardIterator1 pat_first, ForwardIterator1 pat_last, 
                         BinaryPredicate pred = BinaryPredicate());
    
        template<class ForwardIterator2>
        ForwardIterator2 
        operator()(ForwardIterator2 first, ForwardIterator2 last) const;
    
      private:
        ForwardIterator1 pat_first_; // exposition only
        ForwardIterator1 pat_last_;  // exposition only
        BinaryPredicate  pred_;      // exposition only
      };
    }}}
    
    default_searcher(ForwardIterator pat_first, ForwardIterator pat_last, 
                     BinaryPredicate pred = BinaryPredicate());
    

    -?- Effects: Constructs a default_searcher object, initializing pat_first_ with pat_first, pat_last_ with pat_last, and pred_ with pred.

    -?- Throws: Any exception thrown by the copy constructor of BinaryPredicate or ForwardIterator1.

    template<class ForwardIterator2>
    ForwardIterator2 
    operator()(ForwardIterator2 first, ForwardIterator2 last) const;
    

    -?- Effects: Equivalent to std::search(first, last, pat_first_, pat_last_, pred_).

    ??.1.1 default_searcher creation functions [searchers.default.creation]

    template<class ForwardIterator, class BinaryPredicate = equal_to<>>
    default_searcher<ForwardIterator, BinaryPredicate> 
    make_default_searcher(ForwardIterator pat_first, ForwardIterator pat_last, 
                  BinaryPredicate pred = BinaryPredicate());
    

    -?- Effects: Equivalent to default_searcher<ForwardIterator, BinaryPredicate>(pat_first, pat_last, pred).

    ??.2 Class template boyer_moore_searcher [searchers.boyer_moore]

    namespace std { namespace experimental { inline namespace fundamentals_v1 {
      template<class RandomAccessIterator1,
               class Hash = hash<typename iterator_traits<RandomAccessIterator1>::value_type>,
               class BinaryPredicate = equal_to<>>
      class boyer_moore_searcher {
      public:
        boyer_moore_searcher(RandomAccessIterator1 pat_first, RandomAccessIterator1 pat_last, 
                             Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    
        template<class RandomAccessIterator2>
        RandomAccessIterator2 
        operator()(RandomAccessIterator2 first, RandomAccessIterator2 last) const;
    
      private:
        RandomAccessIterator1 pat_first_; // exposition only
        RandomAccessIterator1 pat_last_;  // exposition only
        Hash                  hash_;      // exposition only
        BinaryPredicate       pred_;      // exposition only
      };
    }}}
    
    boyer_moore_searcher(RandomAccessIterator1 pat_first, RandomAccessIterator1 pat_last, 
                         Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    

    -?- Requires: The value type of RandomAccessIterator1 shall meet the DefaultConstructible, CopyConstructible, and CopyAssignable requirements.

    -?- Requires: For any two values A and B of the type iterator_traits<RandomAccessIterator1>::value_type, if pred(A,B)==true, then hf(A)==hf(B) shall be true.

    -?- Effects: Constructs a boyer_moore_searcher object, initializing pat_first_ with pat_first, pat_last_ with pat_last, hash_ with hf, and pred_ with pred.

    -?- Throws: Any exception thrown by the copy constructor of BinaryPredicate or RandomAccessIterator1, or by the default constructor, copy constructor, or the copy assignment operator of the value type of RandomAccessIterator1, or the copy constructor or operator() of Hash. May throw bad_alloc if cannot allocate additional memory for internal data structures needed.

    template<class RandomAccessIterator2>
    RandomAccessIterator2 
    operator()(RandomAccessIterator2 first, RandomAccessIterator2 last) const;
    

    -?- Requires: RandomAccessIterator1 and RandomAccessIterator2 shall have the same value type.

    -?- Effects: Finds a subsequence of equal values in a sequence.

    -?- Returns: The first iterator i in the range [first, last - (pat_last_ - pat_first_)) such that for every non-negative integer n less than pat_last_ - pat_first_ the following condition holds: pred(*(i + n), *(pat_first_ + n)) != false. Returns first if [pat_first_, pat_last_) is empty, otherwise returns last if no such iterator is found.

    -?- Complexity: At most (last - first) * (pat_last_ - pat_first_) applications of the predicate.

    ??.2.1 boyer_moore_searcher creation functions [searchers.boyer_moore.creation]

    template<class RandomAccessIterator, 
             class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
             class BinaryPredicate = equal_to<>>
    boyer_moore_searcher<RandomAccessIterator, Hash, BinaryPredicate> 
    make_boyer_moore_searcher(RandomAccessIterator pat_first, RandomAccessIterator pat_last, 
                              Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    

    -?- Effects: Equivalent to boyer_moore_searcher<RandomAccessIterator, Hash, BinaryPredicate>(pat_first, pat_last, hf, pred).

    ??.3 Class template boyer_moore_horspool_searcher [searchers.boyer_moore_horspool]

    namespace std { namespace experimental { inline namespace fundamentals_v1 {
      template<class RandomAccessIterator1,
               class Hash = hash<typename iterator_traits<RandomAccessIterator1>::value_type>,
               class BinaryPredicate = equal_to<>>
      class boyer_moore_horspool_searcher {
      public:
        boyer_moore_horspool_searcher(RandomAccessIterator1 pat_first, RandomAccessIterator1 pat_last, 
                                      BinaryPredicate pred = BinaryPredicate());
    
        template<class RandomAccessIterator2>
        RandomAccessIterator2 
        operator()(RandomAccessIterator2 first, RandomAccessIterator2 last) const;
    
      private:
        RandomAccessIterator1 pat_first_; // exposition only
        RandomAccessIterator1 pat_last_;  // exposition only
        Hash                  hash_;      // exposition only
        BinaryPredicate       pred_;      // exposition only
      };
    }}}
    
    boyer_moore_horspool_searcher(RandomAccessIterator1 pat_first, RandomAccessIterator1 pat_last, 
                                  Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    

    -?- Requires: The value type of RandomAccessIterator1 shall meet the DefaultConstructible, CopyConstructible, and CopyAssignable requirements.

    -?- Requires: For any two values A and B of the type iterator_traits<RandomAccessIterator1>::value_type, if pred(A,B)==true, then hf(A)==hf(B) shall be true.

    -?- Effects: Constructs a boyer_moore_horspool_searcher object, initializing pat_first_ with pat_first, pat_last_ with pat_last, hash_ with hf, and pred_ with pred.

    -?- Throws: Any exception thrown by the copy constructor of BinaryPredicate or RandomAccessIterator1, or by the default constructor, copy constructor, or the copy assignment operator of the value type of RandomAccessIterator1 or the copy constructor or operator() of Hash. May throw bad_alloc if the system cannot allocate additional memory for internal data structures needed.

    template<class RandomAccessIterator2>
    RandomAccessIterator2 
    operator()(RandomAccessIterator2 first, RandomAccessIterator2 last) const;
    

    -?- Requires: RandomAccessIterator1 and RandomAccessIterator2 shall have the same value type.

    -?- Effects: Finds a subsequence of equal values in a sequence.

    -?- Returns: The first iterator i in the range [first, last - (pat_last_ - pat_first_)) such that for every non-negative integer n less than pat_last_ - pat_first_ the following condition holds: pred(*(i + n), *(pat_first_ + n)) != false. Returns first if [pat_first_, pat_last_) is empty, otherwise returns last if no such iterator is found.

    -?- Complexity: At most (last - first) * (pat_last_ - pat_first_) applications of the predicate.

    ??.3.1 boyer_moore_horspool_searcher creation functions [searchers.boyer_moore_horspool.creation]

    template<class RandomAccessIterator,
             class Hash = hash<typename iterator_traits<RandomAccessIterator>::value_type>,
             class BinaryPredicate = equal_to<>>
    boyer_moore_searcher_horspool<RandomAccessIterator, Hash, BinaryPredicate> 
    make_boyer_moore_horspool_searcher(RandomAccessIterator pat_first, RandomAccessIterator pat_last, 
                                       Hash hf = Hash(), BinaryPredicate pred = BinaryPredicate());
    

    -?- Effects: Equivalent to boyer_moore_horspool_searcher<RandomAccessIterator, Hash, BinaryPredicate>(pat_first, pat_last, hf, pred).

  3. Add to [algorithms.general], header <experimental/algorithm> synopsis as indicated:

    namespace std { namespace experimental { inline namespace fundamentals_v1 {
      template<class ForwardIterator, class Searcher>
      ForwardIterator search(ForwardIterator first, ForwardIterator last, 
        const Searcher& searcher);
      […]
    }}}
    
  4. Add the following sequence of declarations and paragraphs to [alg.search]:

    template<class ForwardIterator, class Searcher>
    ForwardIterator search(ForwardIterator first, ForwardIterator last, 
                           const Searcher& searcher);
    

    -?- Effects: Equivalent to searcher(first, last).

    -?- Remarks: Searcher need not meet the CopyConstructible requirements.

Acknowledgments

Thanks to LWG, which reviewed an earlier version of this document, Matt Austern, who suggested overloading std::search, and especially Daniel Krügler, who wrote most of the wording for the standard, and Stephan T. Lavavej, who reviewed it.