Parsing XML at the Speed of Light--Arseny Kapoulkine

Some high-performance techniques that you can use for more than just parsing, including this week's darling of memory management:

Parsing XML at the Speed of Light

a chapter from "The Performance of Open Source Applications"
by Arseny Kapoulkine

From the chapter:

This chapter describes various performance tricks that allowed the author to write a very high-performing parser in C++: pugixml. While the techniques were used for an XML parser, most of them can be applied to parsers of other formats or even unrelated software (e.g., memory management algorithms are widely applicable beyond parsers). ...

Optimizing software is hard. In order to be successful, optimization efforts almost always involve a combination of low-level micro-optimizations, high-level performance-oriented design decisions, careful algorithm selection and tuning, balancing among memory, performance, implementation complexity, and more. Pugixml is an example of a library that needs all of these approaches to deliver a very fast production-ready XML parser–even though compromises had to be made to achieve this. A lot of the implementation details can be adapted to different projects and tasks, be it another parsing library or something else entirely.

Continue reading...

Comments (6)

Kristian Ivarsson said on Apr 11, 2014 11:02 AM:

pugixml is a brilliant DOM-like lib, but what we need is some plain (static-ish) serialization tool (with encoding support, and ideally built on static reflection) and some simple XPath-like interface/tool for dynamic lookup.

Let the data interface just be some character/byte stream and leave the to_boolean, from_integer, etc. to some other abstraction layer, and pls do not incorporate DOM and/or SAX into standard C++ ... and of course this model should be able to apply to other protocols (e.g. YAML) as well.

That's my birthday wish

squelart said on Apr 11, 2014 04:46 PM:

Lots of seemingly-good optimizations, but not one word about profiling... Can we trust that these tricks really work?

Bjarne Stroustrup said on Apr 12, 2014 04:45 PM:

Very nice article. Can you show us some performance measurements that show the benefits of your optimizations compared to other XML parsers?

matt said on Apr 13, 2014 08:32 AM:

Just a mere submitter (not the library author), but here it is: http://pugixml.org/benchmark/
// Note: the archive containing benchmark sources and data files is linked at the bottom.

Fernando Pelliccioni said on Apr 14, 2014 07:30 PM:

As Alex Stepanov said, "Go back to Fortran".
He means: use a Fortran style of programming, with arrays instead of objects spread non-contiguously in memory (Smalltalk style).

Bjarne Stroustrup said on Apr 14, 2014 09:06 PM:

Thanks for the benchmark data