Improvements in CoMD Cell Method Performance
A previous post showed some performance improvements with the nsquared version of the code. This post will tackle the cell version of the code. In the nsquared version the time-consuming inner loop had no function calls. The cell version does call other functions, which may complicate optimization.
| Language/compiler | Version | Initial time | Final time |
|---|---|---|---|
| C | 4.8.2 | 2.3 | 2.3 |
| Python | 2.7.10 | 1014 | 1014 |
| PyPy | 2.6.1 | 96 | 96 |
| Julia | 0.4.0-rc3 | 87 | 6.1 |
| Cython | 0.23.3 | 729 | 13 |
| Numba | 0.21.0 | 867 | 47 |
Times are all in μs/atom. System size is 4000 atoms.
Hardware is a Xeon E5-2630 v3 @ 2.4 GHz, and the OS is Ubuntu 12.04.
The 'Initial time' column is the result of the minimal code changes needed to get each compiler working.
The 'Final time' column is the time after the tuning described in this post.
Julia
The cell version has the same issue with array operations as the nsquared version: the computation of dd
allocates a temporary array to hold the result every time through the loop.
The Devectorize package can automatically convert array notation
to a loop. If we add the @devec
annotation, the time drops to 43 μs/atom.
Unfortunately, the allocation to hold the result must still be performed, and it remains inside the inner particle loop.
If we manually create the loop and hoist the allocation out of the loop, time is 27 μs/atom.
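Roughly, the transformation looks like the sketch below. The array layout and names are illustrative assumptions, not the actual CoMD-Julia source.

```julia
# Sketch only: the allocation problem and the hoisted-loop fix.
# Assumes `pos` is a 3 x natoms coordinate array; names are made up.

# Vectorized: `pos[:, j] - pos[:, i]` allocates a fresh 3-element array
# on every pass through the inner particle loop.
function sum_r2_vectorized(pos)
    total = 0.0
    for i in 1:size(pos, 2), j in 1:size(pos, 2)
        dd = pos[:, j] - pos[:, i]
        total += dd[1]^2 + dd[2]^2 + dd[3]^2
    end
    return total
end

# Devectorized, with the temporary hoisted out of the loop and reused.
function sum_r2_hoisted(pos)
    total = 0.0
    dd = zeros(3)
    for i in 1:size(pos, 2), j in 1:size(pos, 2)
        for k in 1:3
            dd[k] = pos[k, j] - pos[k, i]
        end
        total += dd[1]^2 + dd[2]^2 + dd[3]^2
    end
    return total
end
```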
The code uses dot
to compute the vector norm. This calls a routine (julia_dot
) to perform the
dot product.
For long vectors, calling an optimized linear algebra routine is beneficial, but for a vector of length 3 it adds overhead.
Replacing dot
with the equivalent loop reduces the time to 23 μs/atom.
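The replacement is along these lines (a sketch; the function name is made up):

```julia
# Sketch: replace dot(dd, dd) with an explicit loop over the 3 components,
# which the compiler can unroll and inline with no call overhead.
function norm_squared(dd::Vector{Float64})
    r2 = 0.0
    for k in 1:3
        r2 += dd[k] * dd[k]
    end
    return r2
end
```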
Looking through the memory allocation output (--track-allocation=user) shows some vector operations
where the force array is zeroed and accumulated, and more in putAtomInBox in linkcell.jl.
These spots are also visible in the profile output, but that is less convenient because it is not shown alongside the source.
The @devec
macro does work here, and the performance is now 7.7 μs/atom. Explicit loops
give a slightly better time of 7.3 μs/atom.
Profiling shows even more opportunities for devectorization in advanceVelocity
and advancePosition
in simflat.jl.
The time is now 6.4 μs/atom.
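Those routines are simple array sweeps, so the same devectorization applies. A rough sketch (the argument names and layout are assumptions, not the actual simflat.jl code):

```julia
# Sketch only: a devectorized velocity update in the spirit of advanceVelocity.
# Assumes `vel` and `force` are 3 x natoms arrays.
function advance_velocity!(vel, force, mass, dt)
    for i in 1:size(vel, 2), k in 1:3
        vel[k, i] += dt * force[k, i] / mass
    end
    return vel
end
```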
The Julia executable has a -O
switch for more time-intensive optimizations (it adds more LLVM optimization passes). This improves the time to 6.2 μs/atom.
The @fastmath
macro improves the time a little more, to 6.1 μs/atom.
The @inbounds
macro to skip the bounds checks did not seem to improve the time.
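Both macros are applied directly to the hot loops, along the lines of this sketch (the loop body is illustrative):

```julia
# Usage sketch: @fastmath relaxes strict IEEE semantics, @inbounds skips
# bounds checks inside the annotated loop.
function separation!(dd::Vector{Float64}, pos::Matrix{Float64}, i::Int, j::Int)
    @fastmath @inbounds for k in 1:3
        dd[k] = pos[k, j] - pos[k, i]
    end
    return dd
end
```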
The final Julia time is now within a factor of 3 of the C time. The code is here. It's not clear where the remaining time overhead comes from.
PyPy
The PyPy approach to JIT compilation is very general, but that also makes it difficult to determine which code
changes might improve performance.
The Jitviewer tool is nice, but not helpful at a cursory glance.
The vmprof profiler solves an important problem by collecting the native code stack along with the Python stack.
In this particular case, it reports at the function level, and the bulk of the time was spent in computeForce
.
I hope to write more about vmprof in a future post, as it could help with integrated profiling of Python + native code (either compiled or JIT-ed).
Cython
The simplest step is to add an initialization line and move some .py
files to .pyx
files. This gives 729 μs/atom.
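The post doesn't show that initialization line; one common minimal approach is pyximport, which compiles .pyx modules on import (the original code may have used a setup.py build instead):

```python
# One possible form of the initialization line (an assumption, not the
# original code): pyximport compiles .pyx files when they are imported.
import pyximport
pyximport.install()

import simflat   # now loaded from simflat.pyx and compiled by Cython
```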
Adding types to the computeForce function, and assigning a few attribute lookups to local variables so those variables can be typed (playing a game of 'remove the yellow' in the Cython annotation output), gives 30 μs/atom.
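The shape of that change is roughly as follows; the signature and field names are illustrative, not the actual CoMD code:

```cython
# Sketch only: typed locals for the hot loop in a computeForce-style routine.
cimport numpy as np
import numpy as np

def compute_force(sim):
    # Pull attribute lookups out of the loop and give them C types so the
    # loop body compiles to plain C (no more yellow in the annotation output).
    cdef np.ndarray[double, ndim=2] pos = sim.atoms.pos
    cdef np.ndarray[double, ndim=2] force = sim.atoms.force
    cdef int natoms = sim.atoms.natoms
    cdef int i, k

    for i in range(natoms):
        for k in range(3):
            force[i, k] = 0.0
    # ... pair loop over neighbor cells goes here ...
```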
Adding types and removing bounds checks in more routines (halo.py, linkcell.py, and simflat.py) gives 13 μs/atom.
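Removing the bounds checks is done with Cython compiler directives, for example per function (a sketch with an assumed signature):

```cython
# Sketch: per-function directives to drop bounds and wraparound checks.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def advance_position(double[:, :] pos, double[:, :] vel, double dt):
    cdef int i, k
    for i in range(pos.shape[0]):
        for k in range(3):
            pos[i, k] += dt * vel[i, k]
```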
Code is here. Further progress needs deeper investigation with profiling tools.
Numba
Adding @numba.jit
decorators to computeForce
and the functions it calls gives the
initial time of 867 μs/atom.
Extracting all the attribute lookups (including the function call to getNeighborBoxes
) gives 722 μs/atom.
We should ensure the call to getNeighborBoxes
is properly JIT-ed. Unfortunately, this requires more involved
code restructuring. Functions need to be split into a wrapper that performs any needed attribute lookups, and
an inner function that gets JIT-ed. Loop lifting automatically performs this transformation on functions
with loops. On functions without loops, however, it needs to be done manually.
Once this is done, the time improves dramatically to 47 μs/atom.
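The restructuring looks roughly like this sketch (the names and arithmetic are illustrative; the real getNeighborBoxes does different work):

```python
# Sketch only: the wrapper/kernel split for a loop-free helper function.
import numba

# Inner kernel: pure arithmetic on scalars, so it compiles in nopython mode
# even though it has no loop for loop-lifting to pick up.
@numba.jit(nopython=True)
def _box_index(x, y, z, nx, ny, nz, lx, ly, lz):
    ix = int(x / lx * nx)
    iy = int(y / ly * ny)
    iz = int(z / lz * nz)
    return ix + nx * (iy + ny * iz)

# Wrapper: performs the attribute lookups in plain Python, then forwards
# bare scalars to the compiled kernel.
def box_index(cell, x, y, z):
    return _box_index(x, y, z,
                      cell.nx, cell.ny, cell.nz,
                      cell.lx, cell.ly, cell.lz)
```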
Hopefully the upcoming "JIT Classes" feature will make this easier, and require less code restructuring.
The code is here.
Summary
Julia is leading in terms of getting the best performance on this example. Many of these projects are improving rapidly, so this is just a snapshot of their current state.
All these projects need better profiling tools to show the user where code is slow and to give feedback on why it is slow. The Cython annotated output is probably the best: it highlights which lines need attention. However, it is not integrated with profiler output, so in a project of any size it's not clear where a user should spend time adding types.
Julia has some useful collection and feedback tools, but they would be much more helpful if combined. The memory allocation output reports bytes allocated. It's useful for finding allocations where none were expected, or for allocations in known hot loops, but it's less clear which other allocations are hurting performance. Ideally this output could be integrated with profiler output and weighted by time spent, to show which allocations actually affect execution time.