As Crocodile Dundee famously said, "That's not a loop, that's a loop." Just programmers having fun:
by Ivica Bogosavljević
From the article:
In our experiments with the memory access pattern, we have seen that good data locality is a key to good software performance. Accessing memory sequentially and splitting the data set into small-sized pieces which are processed individually improves data locality and software speed.
In this post, we will present a few techniques to improve the memory access pattern and increase data locality. The advantage of this approach is that the changes are localized in the algorithm itself, i.e. there is no need to change the data layout or the memory layout. The disadvantage is that the algorithms often become more difficult to understand and modify. ...
[... and deep inside the article ...]
Notice that, because the whole cache line is brought from the memory to the data cache, after accessing a[j][i] we can access a[j][i + 1]cheaply, since these two pieces of data belong to the same cache line. The problem in our case is that if n is large, access to a[j][i + 1] will come much after a[j][i] and by that time a[j][i + 1] will be evicted from the data cache. ...