Performance

Blaze 2.2 released

Blaze, an open-source, high-performance C++ math library for dense and sparse arithmetic, has released a new version.

Blaze 2.2 Released

After a total of five and a half months -- a little late for SC'14, but right on time for Meeting C++ -- we have finally released Blaze 2.2! But the wait was worthwhile! This release comes with several bug fixes and hundreds of improvements, many of them based on your hints, suggestions, and ideas. Thank you very much for your support and help in making the Blaze library even better!

The big new feature of Blaze 2.2 is symmetric matrices. And this is not just any implementation of symmetric matrices, but one of the most complete and powerful implementations available. See the Blaze tutorial to get an idea of how symmetric matrices work and how they can help you prevent some inadvertent pessimizations of your code.
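To give a flavor of the feature, here is a minimal sketch based on Blaze's documented SymmetricMatrix adaptor (the exact spelling of types and headers is best checked against the 2.2 tutorial):

    #include <blaze/Math.h>

    int main()
    {
       using blaze::DynamicMatrix;
       using blaze::SymmetricMatrix;

       // A 3x3 dense symmetric matrix: the adaptor guarantees symmetry,
       // so writing A(0,2) also sets A(2,0).
       SymmetricMatrix< DynamicMatrix<double> > A( 3UL );
       A(0,0) = 1.0;  A(0,1) = 2.0;  A(0,2) = 3.0;
       A(1,1) = 4.0;  A(1,2) = 5.0;
       A(2,2) = 6.0;

       DynamicMatrix<double> B( 3UL, 3UL, 1.0 );

       // Expressions can exploit the symmetry; e.g. trans(A) is effectively free.
       DynamicMatrix<double> C = A * B + trans( A );

       return 0;
    }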


Efficiency with Algorithms, Performance with Data Structures -- Chandler Carruth

At the recent CppCon 2014, Chandler Carruth gave a great talk on using Modern C++ for writing high-performance applications.

Efficiency with Algorithms, Performance with Data Structures

by Chandler Carruth

From the video introduction:

C++ programmers throughout the industry have an insatiable desire for writing high performance code. Unfortunately, even with C++, this can be really challenging. Over the past twenty years processors, memory, software libraries, and even compilers have radically changed what makes C++ code fast. Even measuring the performance of your code can be a daunting task. This talk will dig into how modern processors work, what makes them fast, and how to exploit them effectively with modern C++ code.

New optimizations for X86 in upcoming GCC 5.0 -- Evgeny Stupachenko

Fresh on the Intel Developer Zone blog:

New optimizations for X86 in upcoming GCC 5.0

by Evgeny Stupachenko

From the article:

Part 1. Vectorization of loads/stores group.

GCC 5.0 significantly improves vector code quality for load groups and store groups. By a loads/stores group I mean an iterated, consecutive sequence of loads/stores. For example:

x = a[i], y = a[i + 1], z = a[i + 2] iterated over “i” is a loads group of size 3

...

The most frequent case where loads/stores groups are applicable is an array of structures.
  1. Image conversion (RGB structure to some other format) ...
  2. N-dimensional coordinates (normalize an array of XYZ points) ...
  3. Multiplication of vectors by a constant matrix: ...

... GCC 5.0:

  1. Introduces vectorization of load/store groups of size 3
  2. Improves load groups vectorization for all supported sizes
  3. Maximizes load/store groups performance by generating code that is more optimal for particular x86 CPU...
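To make the "array of structures" case concrete, here is an illustrative loop (the struct and the grayscale weights are made up, not taken from the article): each iteration reads p[i].r, p[i].g, and p[i].b, i.e. a loads group of size 3 of the kind GCC 5.0 can now vectorize.

    #include <cstddef>
    #include <cstdint>

    struct RGB { std::uint8_t r, g, b; };   // array of structures

    // Convert RGB to 8-bit grayscale with integer weights. The three loads
    // p[i].r, p[i].g, p[i].b iterated over i form a loads group of size 3.
    void to_gray(const RGB* p, std::uint8_t* gray, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            gray[i] = static_cast<std::uint8_t>(
                (77 * p[i].r + 150 * p[i].g + 29 * p[i].b) >> 8);
    }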


HPX version 0.9.9 released -- STE||AR Group

The STE||AR Group has released V0.9.9 of HPX -- A general purpose parallel C++ runtime system for applications of any scale.

HPX V0.9.9 Released

The newest version of HPX (V0.9.9) is now available for download! Please see here for the release notes.

HPX now exposes an API fully conforming to the concurrency-related parts of the C++11 and C++14 standards, extended and applied to distributed computing.

From the announcement:

  • We completed the refactoring of hpx::future to be properly C++11 standards conforming.
  • We overhauled our build system to support newer CMake features to make it more robust and more portable.
  • We implemented a large part of the parallel algorithms and other parallel facilities proposed by C++ Technical Specifications N4104, N4088, and N4107.
  • We added many examples such as the 1D Stencil and the Matrix Transpose series.
  • We significantly improved the performance of the library and the existing documentation.
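As a small illustration of the C++11-conforming hpx::future/hpx::async API mentioned above (a sketch only -- header names and details may vary between HPX versions):

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/lcos.hpp>
    #include <iostream>

    int square(int x) { return x * x; }

    int main()
    {
        // hpx::async/hpx::future mirror std::async/std::future from C++11,
        // with .then() continuations in the spirit of the concurrency TS.
        hpx::future<int> f = hpx::async(square, 6);
        hpx::future<int> g = f.then([](hpx::future<int> r) { return r.get() + 1; });

        std::cout << g.get() << '\n';   // prints 37
        return 0;
    }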

C++ and Zombies: a moving question

One of the issues I've been thinking about since C++Now: move and move-destruction.

C++ and Zombies: a moving question

by Jens Weller

From the article:

This has been on my list of things to think about since C++Now. At C++Now, I realized that we might have zombies in the C++ standard, and that there are two factions: one states that it is OK to have well-defined zombies, while the other thinks that you'd better kill them.
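For readers who haven't met the term, the zombies in question are moved-from objects, which the standard library leaves in a "valid but unspecified" state. A minimal illustration:

    #include <iostream>
    #include <string>
    #include <utility>

    int main()
    {
        std::string s = "I'm alive";
        std::string t = std::move(s);   // s is now a "zombie": valid but unspecified

        // Only operations without preconditions are guaranteed safe on s,
        // e.g. assignment or destruction.
        s = "revived";                  // bringing the zombie back to life
        std::cout << s << ' ' << t << '\n';
    }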

Insights into new and C++

I've written down some basic thoughts on new and the new standards:

Insights into new and C++

by Jens Weller

From the article:

Every now and then, I've been thinking about this, so this blog post is also a summary of my thoughts on the topic of dynamic memory allocation and C++. Since I wrote the blog entries on smart pointers, and with C++14 giving us make_unique, raw new and delete seem to be disappearing from C++ in our future code...
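A minimal before/after sketch of that point (Widget is just a placeholder type):

    #include <memory>

    struct Widget { int value = 0; };

    void raw_style()
    {
        Widget* w = new Widget;   // manual lifetime management
        // ... if anything here throws, w leaks ...
        delete w;
    }

    void modern_style()
    {
        // C++14: no naked new, exception-safe, ownership is explicit
        auto w = std::make_unique<Widget>();
        w->value = 42;
    }   // deleted automatically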

The Drawbacks of Implementing Move Assignment in Terms of Swap -- Scott Meyers

Hot off the Meyers press: how would you implement move assignment, and why? Scott Meyers explains two related issues:

The Drawbacks of Implementing Move Assignment in Terms of Swap

by Scott Meyers

From the article:

More and more, I bump into people who, by default, want to implement move assignment in terms of swap. This disturbs me, because (1) it's often a pessimization in a context where optimization is important, and (2) it has some unpleasant behavioral implications as regards resource management.
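A sketch of the two approaches under discussion (Widget and its members are made up for illustration; see the article for Scott's full analysis):

    #include <cstddef>
    #include <utility>

    class Widget
    {
    public:
        ~Widget() { delete[] data_; }

        // (1) Move assignment in terms of swap: the moved-from object ends up
        //     holding the target's old resource, so that resource is released
        //     later than necessary, and the swap touches more data than a
        //     plain transfer.
        // Widget& operator=(Widget&& rhs) noexcept
        // {
        //     std::swap(data_, rhs.data_);
        //     std::swap(size_, rhs.size_);
        //     return *this;
        // }

        // (2) Direct move assignment: release our resource now, then steal rhs's.
        Widget& operator=(Widget&& rhs) noexcept
        {
            delete[] data_;             // old resource freed immediately
            data_ = rhs.data_;
            size_ = rhs.size_;
            rhs.data_ = nullptr;        // rhs is left empty, not holding our old buffer
            rhs.size_ = 0;
            return *this;
        }

    private:
        int*        data_ = nullptr;
        std::size_t size_ = 0;
    };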

Vector of Objects vs Vector of Pointers Updated -- Bartlomiej Filipek

More in the "contiguous enables fast" department:

Vector of Objects vs Vector of Pointers Updated

by Bartlomiej Filipek

From the article:

For 1000 particles we need, on average, 2000 cache-line reads! This is 78% more cache-line reads than in the first case! Additionally, the hardware prefetcher cannot figure out the pattern -- it is random -- so there will be a lot of cache misses and stalls.

In our experiment, the pointer code for 80k particles was more than 266% slower than the contiguous case.
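The contrast in a nutshell (Particle is illustrative, not the article's exact benchmark code):

    #include <memory>
    #include <vector>

    struct Particle { float x, y, z, vx, vy, vz; };

    // Contiguous storage: particles sit next to each other in memory, so the
    // hardware prefetcher can stream cache lines ahead of the loop.
    void update_objects(std::vector<Particle>& ps, float dt)
    {
        for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; p.z += p.vz * dt; }
    }

    // Pointer indirection: each particle lives in its own heap allocation, so
    // every iteration chases a pointer to an effectively random address.
    void update_pointers(std::vector<std::unique_ptr<Particle>>& ps, float dt)
    {
        for (auto& p : ps) { p->x += p->vx * dt; p->y += p->vy * dt; p->z += p->vz * dt; }
    }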

Fast Polymorphic Collections -- Joaquín M López Muñoz

On the theme of "contiguous enables fast":

Fast Polymorphic Collections

by Joaquín M López Muñoz

From the article:

poly_collection behaves excellently and is virtually unaffected by the size of the container. For n < 10^5, the differences in performance between poly_collection and a std::vector of std::unique_ptrs are due to worse virtual-call branch prediction in the latter case; when n > 10^5, massive cache misses are added to the first degrading factor.
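The sketch below shows only the underlying layout idea -- one contiguous vector per concrete type versus a vector of pointers to a base class -- and is not the interface of the article's poly_collection:

    #include <memory>
    #include <vector>

    struct Base   { virtual void update() = 0; virtual ~Base() = default; };
    struct Bullet : Base { void update() override { /* ... */ } };
    struct Rocket : Base { void update() override { /* ... */ } };

    // Baseline: one heap allocation per object, scattered layout,
    // one hard-to-predict virtual call per element.
    void update_all(std::vector<std::unique_ptr<Base>>& v)
    {
        for (auto& p : v) p->update();
    }

    // Layout idea behind a polymorphic collection: keep each concrete type in
    // its own contiguous vector, so same-type objects are visited back to back
    // (better cache locality, better branch prediction for the virtual call).
    struct SimplePolyCollection
    {
        std::vector<Bullet> bullets;
        std::vector<Rocket> rockets;

        void update_all()
        {
            for (auto& b : bullets) b.update();
            for (auto& r : rockets) r.update();
        }
    };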

Parsing XML at the Speed of Light--Arseny Kapoulkine

Some high-performance techniques that you can use for more than just parsing, including this week's darling of memory management:

Parsing XML at the Speed of Light

a chapter from "The Performance of Open Source Applications"
by Arseny Kapoulkine

From the chapter:

This chapter describes various performance tricks that allowed the author to write a very high-performing parser in C++: pugixml. While the techniques were used for an XML parser, most of them can be applied to parsers of other formats or even unrelated software (e.g., memory management algorithms are widely applicable beyond parsers). ...

Optimizing software is hard. In order to be successful, optimization efforts almost always involve a combination of low-level micro-optimizations, high-level performance-oriented design decisions, careful algorithm selection and tuning, balancing among memory, performance, implementation complexity, and more. Pugixml is an example of a library that needs all of these approaches to deliver a very fast production-ready XML parser -- even though compromises had to be made to achieve this. A lot of the implementation details can be adapted to different projects and tasks, be it another parsing library or something else entirely.
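One of the widely applicable memory-management ideas the chapter discusses is carving the parser's many tiny node allocations out of large pools rather than calling the general-purpose allocator each time. A toy version of that idea (not pugixml's actual allocator) might look like this:

    #include <cstddef>
    #include <cstdlib>
    #include <new>
    #include <vector>

    // A toy bump (arena) allocator: carve allocations out of large pages and
    // free everything at once. Assumes each allocation is smaller than a page.
    class Arena
    {
    public:
        void* allocate(std::size_t size)
        {
            size = (size + 7) & ~std::size_t(7);        // keep 8-byte alignment
            if (offset_ + size > page_size_) new_page();
            void* p = page_ + offset_;
            offset_ += size;
            return p;
        }

        ~Arena()
        {
            for (char* p : pages_) std::free(p);        // bulk release
        }

    private:
        void new_page()
        {
            page_ = static_cast<char*>(std::malloc(page_size_));
            if (!page_) throw std::bad_alloc();
            pages_.push_back(page_);
            offset_ = 0;
        }

        static constexpr std::size_t page_size_ = 32 * 1024;
        std::vector<char*> pages_;
        char*       page_   = nullptr;
        std::size_t offset_ = page_size_;   // forces a page on first allocate()
    };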

Continue reading...