Having acquired a shiny new profiler, it's time to dig into the performance of the Numba version some more.
Picking up from the previous optimizations, I can't seem to reproduce the timing (47 μs/atom) in that table. Now I get 40 μs/atom.
Run the profiler (`vmprofrun comd.py`) and display the results in KCacheGrind.
Sorting by self time, we see `getBoxFromCoord` at the top.
A screenshot of the call graph shows that `getBoxFromCoord` gets called from two different places.
To improve performance here, convert `getBoxFromCoord` to a free function and pass all the attribute references as function arguments.
```python
def getBoxFromCoord(self, r):
    ix = int(math.floor(r[0]*self.invBoxSize))
    iy = int(math.floor(r[1]*self.invBoxSize))
    iz = int(math.floor(r[2]*self.invBoxSize))
    return self.getBoxFromTuple(ix, iy, iz)
```

becomes

```python
@numba.njit
def getBoxFromCoordInner(x, y, z, invBoxSize, nLocalBoxes, gs):
    ix = int(math.floor(x*invBoxSize))
    iy = int(math.floor(y*invBoxSize))
    iz = int(math.floor(z*invBoxSize))
    return getBoxFromTupleInner(ix, iy, iz, nLocalBoxes, gs)
```
(Changing the parameter `r` to individual components was not strictly necessary.)
And the call sites change (for example) from

```python
iBox = self.getBoxFromCoord([x, y, z])
```

to

```python
iBox = getBoxFromCoordInner(x, y, z, self.invBoxSize, self.nLocalBoxes, self.gridSize)
```
This improves performance to 20 μs/atom.
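To see the shape of this transformation in isolation, here is a self-contained sketch. The flat row-major box index is a hypothetical stand-in for `getBoxFromTupleInner` (whose real logic isn't shown here), and the `njit` fallback just lets the sketch run even without Numba installed:

```python
import math

try:
    from numba import njit          # compile with Numba when available
except ImportError:
    def njit(f):                    # fallback: run as plain Python
        return f

@njit
def get_box_from_coord(x, y, z, inv_box_size, nx, ny, nz):
    # Every former `self.` attribute is now an explicit argument, so the
    # function carries no object state and Numba can compile it in
    # nopython mode.
    ix = int(math.floor(x * inv_box_size))
    iy = int(math.floor(y * inv_box_size))
    iz = int(math.floor(z * inv_box_size))
    # Hypothetical flat row-major indexing in place of getBoxFromTupleInner.
    return (iz * ny + iy) * nx + ix

box = get_box_from_coord(0.5, 0.5, 0.5, 2.0, 4, 4, 4)  # box index 21
```

The key point is that the argument list is entirely scalars, which is exactly the kind of signature Numba handles well.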
Repeating the same transformation for putAtomInBox gives 18.4 μs/atom.
Run the profiler again. By self time, `loadAtomsBuffer` is at the top. Let's look at that in context.
Sorting by inclusive time, we see that the parts of the call tree starting at `redistributeAtoms` take a significant amount of time.
This part of the code:
- Applies periodic boundary conditions
- Moves atoms to a new cell
- Packs atoms into a buffer at the source
- Unpacks the buffer into the atom data structure at the destination
This packing/unpacking anticipates the parallel version, which transmits the buffer across processors.
A previous attempt at using numpy records did not work well (and ran into a serious performance regression with numpy 1.10). This time I went with two buffers: one for integers and one for floating-point numbers. This works better, and the performance is now 10.2 μs/atom.
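A minimal sketch of the two-buffer idea follows; the field layout here (id and species as integers, position as three floats) is hypothetical, not CoMD's actual atom structure:

```python
import numpy as np

def pack_atoms(atom_ids, species, positions, buf_int, buf_flt):
    # Two homogeneous buffers (one integer, one float) instead of a numpy
    # record array; simple dtypes keep Numba's type inference straightforward
    # and avoid the record-array slowdown mentioned above.
    n = len(atom_ids)
    for i in range(n):
        buf_int[2 * i] = atom_ids[i]             # hypothetical layout: id, species
        buf_int[2 * i + 1] = species[i]
        buf_flt[3 * i:3 * i + 3] = positions[i]  # x, y, z
    return n

def unpack_atoms(n, buf_int, buf_flt):
    atom_ids = buf_int[0:2 * n:2].copy()
    species = buf_int[1:2 * n:2].copy()
    positions = buf_flt[:3 * n].reshape(n, 3).copy()
    return atom_ids, species, positions

# Round-trip two atoms through the buffers.
ids = np.array([7, 9])
species = np.array([1, 2])
pos = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
buf_int = np.empty(4, dtype=np.int64)
buf_flt = np.empty(6, dtype=np.float64)
n = pack_atoms(ids, species, pos, buf_int, buf_flt)
out_ids, out_species, out_pos = unpack_atoms(n, buf_int, buf_flt)
```

In the parallel version, the two flat buffers are exactly what would be handed to the communication layer.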
Continuing to profile and convert routines to be Numba-friendly eventually brought performance to 2.9 μs/atom. (Wow, this is only about 30% slower than C.)
The modified code is here.
The updated performance table is
| Language/compiler | Version | Initial time | Final time |
| --- | --- | --- | --- |
The 'Initial time' column shows the result of the minimal code changes needed to get the compiler working, and 'Final time' is the time after tuning.
Do note that the Cython, Julia, and Numba results reflect the amount of effort put into optimization. The Cython and Julia versions still need a pass with a profiler (or, in Julia's case, a better viewer for existing profile data).
Using a profiler to guide our optimization efforts has been very helpful.
The Numba results are really promising, but, in addition to creating ugly code, it required an amount of work that I would not want to perform on a regular basis. These transformations are fairly regular, so it should be possible to incorporate them into Numba. Alternately, if doing so in a safe manner inside the compiler is difficult, some sort of automated AST transformation of the source should be possible.
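As a toy illustration of what such an automated transformation would have to discover, Python's `ast` module can already find the `self` attribute references that would need to become explicit parameters (the method source below is the `getBoxFromCoord` example from earlier; a real tool would go on to rewrite the function and its call sites):

```python
import ast

SRC = """
def getBoxFromCoord(self, r):
    ix = int(math.floor(r[0] * self.invBoxSize))
    iy = int(math.floor(r[1] * self.invBoxSize))
    iz = int(math.floor(r[2] * self.invBoxSize))
    return self.getBoxFromTuple(ix, iy, iz)
"""

class SelfAttrs(ast.NodeVisitor):
    """Collect every `self.<attr>` reference in a method body."""
    def __init__(self):
        self.attrs = set()

    def visit_Attribute(self, node):
        if isinstance(node.value, ast.Name) and node.value.id == "self":
            self.attrs.add(node.attr)
        self.generic_visit(node)

collector = SelfAttrs()
collector.visit(ast.parse(SRC))
found = sorted(collector.attrs)
print(found)  # ['getBoxFromTuple', 'invBoxSize']
```

`invBoxSize` would become a scalar argument and `getBoxFromTuple` would itself need the same free-function treatment, which is exactly the chain of rewrites performed by hand above.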
As the optimization process proceeds on this code, an increasing amount of time is being spent in the core routine, `computeForce` (as it should be), and we will need to move beyond a function-level profiler to look for further optimization opportunities.