Building A Parallella Cluster

I finally got around to assembling my small pile of Parallella boards into a cluster.

(Photo: alternate side view of the assembled cluster.)

This post documents the choices I made about power distribution, mechanical assembly, cooling, software management, etc. For background on the individual boards, see my previous post: Introduction to Parallella.

The software directions are based on the Linaro 14.04 image. (At some point I should upgrade to the latest Parabuntu release.)

Power

There are several options for powering multiple boards:

  1. Solder a jumper and use the mounting pads with metal standoffs to transmit the power. This option requires the fewest cords.
  2. Use the micro USB connector (requires switching jumper J14). There are multi-port USB power supplies that should work. This is probably the simplest option for a small number of boards.
  3. Use the 5.5mm barrel plug (default settings). There are versions with screw terminals on Amazon (these for example).

I went with option 3 and used an old PC power supply for the power.

Mechanical assembly

I used 20mm nylon standoffs and assembled the boards into two stacks. A small piece of 1/4" plywood was used for the base.

Cooling

For air flow, I used a cooling fan from an old PC and mounted it to the plywood base.

Network

Needing ports for a maximum of 10 boards, plus one port for the external connection, I chose a 16-port Gigabit D-Link switch. Eventually I would like to power the network switch from the PC power supply, but need to get the right plug first.

The nodes get their addresses via DHCP from my home network. A future improvement is to run a DHCP server on one node and use it to supply addresses to the other nodes. This would make the cluster independent of my home network.

MicroSD cards

Follow the directions on the Parallella Create SD Card page. After burning the image, use gparted to resize the partition to the full capacity of the card.

Management and control

Performing software installs and other management on all of the boards individually would be too tedious. There are many solutions for managing clusters. I decided to use Ansible, as it seemed the simplest (no software needed on the nodes) and it runs over ssh.

In addition to controlling operations from a PC, it is useful to designate one node as a 'head node' and install Ansible there as well. For MPI, it's easier to run MPI programs from the head node than from the PC. For setting up configuration files, it can be useful to create or edit the file and make sure it works on one node, and then copy the file to all the other nodes.

Ansible and MPI (and general convenience) require setting up passwordless ssh login.
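If there is no key pair on the controlling machine yet, generate one first (the default file location is fine; an empty passphrase keeps logins passwordless):

ssh-keygen -t rsa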

Once the keys are set up locally, you can use ssh-copy-id to copy the credentials.

ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145

I keep a small script (named ssh_copy_id) that puts this line in the history:

history -s "ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145"

Run source ssh_copy_id to put the command on the history list, then use the bash history and line-editing features to recall the command and update it with a new address.

Rather than create a new user, I'm using the default 'parallella' user on the nodes. The SSH config (in '.ssh/config') can be set up to switch users upon login.

Host 10.0.0.127
    User parallella

You might wish to set up an apt-cache server on a local machine to save on download bandwidth when installing software to all the nodes.
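As a sketch of what that looks like with apt-cacher-ng running on a local machine (the address below is just an example), the 02proxy file that later gets copied to every node contains a single line:

Acquire::http::Proxy "http://10.0.0.100:3142";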

Using Ansible

See the intro docs.

The inventory file is a list of IP addresses (one per line). It can be specified on the command line with '-i'. To see if all the nodes respond, run

ansible -i cluster.ini all -m ping
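The cluster.ini inventory used here is nothing more than a list of node addresses, one per line (the addresses below are placeholders):

10.0.0.144
10.0.0.145
10.0.0.146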

To copy the apt-cache server configuration to all the nodes, use

ansible -i cluster.ini all --sudo -m copy -a "src=/etc/apt/apt.conf.d/02proxy dest=/etc/apt/apt.conf.d/02proxy"

To shutdown all the nodes, use

ansible -i cluster.ini all --sudo -m shell -a "shutdown -h now"

Compilation

In the introductory post I talked about cross compiling for the board. That gets more complicated with larger software packages. For instance, one of my project dependencies, HDF, doesn't cross-compile easily (or at all).

Since the nodes use a regular Ubuntu distribution, native compilation is easy, but slow.

One solution is to use distcc.

The particular software package I'm working with (QMCPACK, which does simulations on atoms, molecules, and solids) uses CMake for configuration and build, and builds fine with distcc.

Install on all nodes with

ansible -i cluster.ini all --sudo  -a "apt-get install -y distcc"

Set the DISTCC_HOSTS variable to the set of systems to use

export DISTCC_HOSTS="localhost @10.0.0.144/2 @10.0.0.145/2"

This example shows three hosts. The initial '@' means to use ssh (no daemon required on remote) and the '/2' on the end means to use two threads.

Now set the C and C++ compilers to distcc <compiler> and run the build.

For CMake, building a project with MPI, this is

export CC="distcc mpicc"
export CXX="distcc mpic++"
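With those variables exported, configure as usual; CMake picks up CC and CXX at configure time (the source path is a placeholder):

cmake /path/to/source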

Then

make -j 8

You might see a warning message

distccd[<pid>] (tcp_cork_sock) Warning: setsockopt(corked=1) failed: Socket operation on non-socket

This can be ignored. Or follow some directions to silence it.

MPI

One popular method for writing programs that communicate across the boards is MPI (Message Passing Interface).

Install the 'mpi-default-dev' package (which should also install the 'mpi-default-bin' package). This installs the Open MPI implementation (the alternative being MPICH). Note that this MPI is only concerned with communication between the ARM cores on different boards. There is also the Brown Deer Technology version of MPI for programming the Epiphany accelerator.

It's common to use a networked file system so each node has access to the executables and input files. Ansible has file distribution commands that work well enough that a networked file system isn't strictly necessary. (Be aware when copying directories with Ansible that if the directory specified in src does not end with '/', the directory and its contents are copied. If it does end with '/', just the directory contents are copied.)
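For example (the directory names are hypothetical), this copies the contents of a local inputs/ directory to /home/parallella/inputs on every node:

ansible -i cluster.ini all -m copy -a "src=inputs/ dest=/home/parallella/inputs"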

Open MPI uses a tree-based launcher for better scalable start-up performance. Because of this, each node should be able to log into each other node (not just head node to other nodes).

MPI needs a list of machines to run on. One method is to create a host file and pass it to mpirun with the --hostfile option. At its simplest, the host file is one hostname or IP address per line (the same as a simple Ansible inventory file).
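A minimal host file (say, hosts) and launch command might look like the following; the addresses and program name are placeholders:

10.0.0.144
10.0.0.145

mpirun --hostfile hosts -np 8 ./my_mpi_program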

Gallery

(Photos: front/side view, back view, side view, top view, and an alternate side view.)

Integration Callbacks with Sympy and LLVM

This post explores various packages for multi-dimensional integration, along with generating callbacks for the integrands from Sympy using an LLVM JIT.

Problem to integrate

The particular problem is using the variational principle to find the ground state energy for atoms. Some Jupyter notebooks with a description of the problem, along with various integration methods:

Ground state energy of Hydrogen Atom (This yields a 3 dimensional integral.)

Ground state energy of Helium Atom (This yields a 6 dimensional integral.)

The standard solution to these integrals is to use Markov Chain Monte Carlo (the Quantum Monte Carlo method).
However, I'm curious to see how well alternate integration schemes or existing integration packages would work.

Integration libraries

The scipy quadrature routines accept a natively compiled callback for the integrand. (Noticing this in the documentation initiated the idea for using JIT compilation for callback functions.)

Next up is the Cubature integration package, with the Python wrapper for cubature.

Finally, there is the Cuba library, with the PyCuba interface (part of the PyMultiNest package).

There are some other libraries, such as HIntLib, that I would also like to try. There doesn't seem to be a Python interface for HIntLib; let me know if there is one somewhere, or if there are other multidimensional integration packages to try.

Evaluating the integrand

One of my scientific programming goals is to generate efficient code from a symbolic expression. To this end, I've been working on an LLVM JIT converter for Sympy expressions (using the llvmlite wrapper).
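As a minimal sketch of the intended usage (assuming the llvm_callable entry point in sympy.printing.llvmjitcode; the expression is just an example):

import sympy as sp
from sympy.printing.llvmjitcode import llvm_callable

x, y = sp.symbols('x y')
expr = sp.exp(-(x**2 + y**2))

# Compile the expression to a native function via the LLVM JIT
f = llvm_callable([x, y], expr)
print(f(0.5, 0.5))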

For the Sympy code, see these pull requests:

As an aside, one can question whether this is the right approach, compared with

  1. Generate C++ or Fortran and compile using the existing autowrap functionality in Sympy.
  2. Generate Python/Numpy and use Numba.
  3. Use Julia

There is always a tradeoff between a narrow, specialized solution, which is faster to implement and perhaps easier to understand, and a more general solution, which applies in more cases, but is harder and slower to implement.

Using an LLVM JIT is a specialized solution, but it does have an advantage that there is a short path from the expressions to the compiled code. One disadvantage is that it does not leverage existing compilers (Numba or C++), though LLVM compiler optimization passes are available.

Sometimes a solution just needs to be tried to gain experience with the advantages and drawbacks.

Results

For the helium atom, the integration times are reported in the table below

Integrator   Time (seconds)
Cubature 171
Cubature w/CSE 141
Cubature w/CSE and multiple evals 100
Cuba (Vegas) 29
Cuba (Cuhre) 22


Note that scipy.nquad was not used for the 6D integral. It quickly runs out of steam because it consists of iterated one-dimensional integrations, and the glue between the dimensions goes through Python, reducing the effectiveness of a compiled integrand.

The Cubature library does better. Profiling shows that most of the time is spent internal to cubature and allocating memory, so faster integrand evaluation is not going to improve the time. Some other approaches can help. One is Common Subexpression Elimination (CSE), which Sympy can perform on the expression. This extracts duplicate fragments so their value only needs to be computed once.
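As a small illustration of CSE in Sympy (the expression here is made up, not the helium integrand):

import sympy as sp

x, y = sp.symbols('x y')
expr = sp.exp(x + y) + (x + y)**2 + sp.sin(sp.exp(x + y))

# cse returns a list of (symbol, subexpression) replacements and the
# rewritten expression(s) that reference them.
replacements, reduced = sp.cse(expr)
print(replacements)  # e.g. [(x0, x + y), (x1, exp(x0))]
print(reduced)       # e.g. [x0**2 + x1 + sin(x1)]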

The library also allows multiple integrals to be performed at once. This can amortize some of the overhead of the library. In this case, the individual calls to the integrator for the numerator and denominator can be reduced to a single call.

The Cuba library performs even better, as there is apparently less overhead inside the integration library. The Cuhre integrator uses a deterministic grid-based algorithm similar to Cubature. Vegas uses an adaptive Monte Carlo approach.

The results are not shown here, but I also used SIMD vectorization to make the function evaluation even faster, which succeeded for the bare function evaluation. (This was one of the original motivations for compiling straight to LLVM, as it would be easier to get vectorization working.) Unfortunately, it did not speed up the overall integration much (if at all), due to overhead in the libraries.

Conclusions and future work

Using an LLVM JIT to create callbacks for integration works fairly well.

One important question is how to scale the creation of the callbacks to new libraries without explicitly programming them into Sympy.
The last pull request has expanded the CodeSignature class, which seems like a starting point for a more general callback specification.

Notes on CMake

Recently I started working on a project that uses CMake. I've used CMake a little before, but never really had to dive much into it. In particular, I needed to understand the scripting parts of CMake for adding tests for CTest.

Below are some comments on aspects of CMake.

Variables and variable substitution

Variable names are strings. Substitution occurs when the variable is dereferenced with ${}.

SET(var a)
MESSAGE("var = ${var}")

produces

var = a

Nested substitutions are possible

SET(var a)
SET(a b)
MESSAGE("var = ${var}  ${${var}}")

will produce var = a b

Variable names can be composed during substitution

SET(var a)
SET(a_one apple)
MESSAGE("var =  ${${var}_one}")

will produce var = apple

Variables and functions

Variable references act a little like pointers, but without a type system to enforce (and guide) how many indirections should be performed.

Example of using a variable inside a function:

FUNCTION(MY_FUNC arg1)
    MESSAGE("arg1 = ${arg1}")
ENDFUNCTION()

MY_FUNC(hello)
SET(var a)
MY_FUNC(var) # arg1 is set to 'var'
MY_FUNC(${var}) # arg1 is set to 'a' - this is usually what you want

produces

arg1 = hello
arg1 = var
arg1 = a

Return values from functions

There is no built-in notion of a return value from a function. To get values out of a function, write to one of the arguments.

A function creates a new scope - changes to a variable will only affect the variable's value inside the function. To affect the value in the parent, the PARENT_SCOPE modifier should be given to the SET command. (More on variable scopes here)

Another issue is that the variable name for the output value needs to be dereferenced before being set. Otherwise a variable with the name used in the function will be set in the parent, which can work by accident if the variables have the same name.

Example:

FUNCTION(MY_FUNC_WITH_RET ret)
    # The following line works by accident if the name of variable in the parent
    # is the same as in the function
    SET(ret "in function" PARENT_SCOPE)
    # This is the correct way to get the variable name passed to the function
    SET(${ret} "in function" PARENT_SCOPE)
ENDFUNCTION()

SET(ret "before function")
MY_FUNC_WITH_RET(ret)
MESSAGE("output from function = ${ret}")

will produce output from function = in function

Data structures

There is only the List type, with some functions for operations on lists. The documentation on lists says that "Lists ... should not be used for complex data processing tasks", but doesn't say what to use instead.

For associative arrays or maps there are some options:

  • Two parallel lists - one of keys and one of values. Search the key list and use the index to look up value. More awkward for passing to functions.
  • Single list with alternating keys and values. Search the list for the key, and use index+1 to look up the value. Only works if the range of possibilities for keys and values are distinct (e.g., keys are strings and values are always numbers).
  • The environment (ENV{key}) is a built-in associative array. It could be overloaded to store other values, at the risk of polluting the environment.

Test timeout

The default timeout per test is 1500 seconds (25 minutes).

To increase this, adjust the value of DART_TESTING_TIMEOUT. It needs to be set as a cache variable, and it needs to be set before the enable_testing() or include(CTest) is specified.

SET( DART_TESTING_TIMEOUT 3600 CACHE STRING "Maximum time for one test")

See also this Stack Overflow post

Visualizing MD Data

Visualizing atomic positions in 3D is useful when working on a molecular dynamics code.

More generally, being able to visualize the structures and algorithms inside a code can help with understanding and debugging. With all the advances in graphics hardware, it should be possible to quickly create visualizations for various aspects of the code. But this seems harder than it should be. In this particular case, normal plotting packages can sometimes plot atomic positions with 3D scatter plots. But then further algorithm visualization is hard (animation, drawing boxes, etc).

The VisPy project looks promising. It contains three API levels:

  1. gloo - the lowest level API around OpenGL
  2. scene - uses a scene graph description
  3. plot - a standard plotting interface

The 'scene' interface is the appropriate abstraction level for our purposes. Note this API is marked experimental and may change in the future.

Pretty pictures

(Static screenshots and an animation of the running simulation.)

Code

The modified comd.py is here. It should be a drop-in replacement for that file in the nsquared version of the code. The bulk of the visualization additions start around line 154.

The perceived speed of the simulation can vary. Pure Python, even at the smallest system size, is too slow. Using the Numba-accelerated loop makes it much faster. However, for the smallest system, this feels 'too fast'. Increasing the system size will slow it down ('-x 6 -y 6 -z 6' seems to work well on my system). There are much better ways of adjusting the speed, but this is the most expedient.

The -N option specifies the number of steps (defaults to 100). Set it to a larger value to run the animation longer.

During the run, press 's' to take a static screenshot (stored in screenshot.png). Press 'a' to start/stop saving an animated segment (stored in animation.gif). These features require that the imageio package is installed.

The multiprocessing package is used to run the simulation and the visualization event loop in separate processes. Positions are passed from the simulation to the visualization via a Queue. A Timer on the visualization side checks the queue for new positions periodically.
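A stripped-down sketch of that structure (this is not the actual comd.py code; the random-walk 'simulation', update interval, and marker size are placeholder choices):

import multiprocessing as mp
import numpy as np
from vispy import app, scene

def simulate(q, nsteps=200, natoms=500):
    # Stand-in for the MD loop: random-walk some positions and send them out.
    pos = np.random.rand(natoms, 3).astype(np.float32)
    for _ in range(nsteps):
        pos += 0.01 * (np.random.rand(natoms, 3).astype(np.float32) - 0.5)
        q.put(pos.copy())

def main():
    q = mp.Queue()
    proc = mp.Process(target=simulate, args=(q,))
    proc.start()

    canvas = scene.SceneCanvas(keys='interactive', show=True)
    view = canvas.central_widget.add_view()
    view.camera = 'turntable'
    markers = scene.visuals.Markers(pos=np.zeros((1, 3), dtype=np.float32),
                                    parent=view.scene)

    def poll(event):
        # Check the queue for new positions and update the display.
        while not q.empty():
            markers.set_data(q.get(), size=8)

    timer = app.Timer(interval=0.05, connect=poll, start=True)
    app.run()
    proc.join()

if __name__ == '__main__':
    main()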

This code snippet uses the Markers visual element to display the center of each point. The size is the size of the element on the screen, not the size in the scene (that is, elements don't change size when zooming). The current size was chosen to easily see the motion of the atoms, not to accurately represent the atom's size. Displaying a Sphere at each point would be more accurate, but is much slower.

Summary

I'm very pleased with VisPy. It enabled live, interactive animation of atoms from a molecular dynamics code with a small amount of code. I expect extending this to visualize more complex algorithms and data structures should be straightforward.

More Performance With Numba

Having acquired a shiny new profiler, it's time to dig into the performance of the Numba version some more.

Picking up from the previous optimizations, I can't seem to reproduce the timing (47 μs/atom) in that table. Now I get 40 μs/atom.

First step

Run the profiler (vmprofrun comd.py) and display the results in KCachegrind (kcachegrind vmprof-20664.out).

Sorting by self time, we see getBoxFromCoord at the top:

KCachegrind screenshot of functions sorted by self time

There is also a screenshot of the call graph: getBoxFromCoord gets called from two different places, putAtomInBox and updateLinkCells.

KCachegrind screenshot of call graph

To improve performance here, convert getBoxFromCoord to a free function and put all the attribute references into function arguments.

Before:

    def getBoxFromCoord(self, r):
        ix = int(math.floor(r[0]*self.invBoxSize[0]))
        iy = int(math.floor(r[1]*self.invBoxSize[1]))
        iz = int(math.floor(r[2]*self.invBoxSize[2]))
        return self.getBoxFromTuple(ix, iy, iz)

After:

@numba.njit
def getBoxFromCoordInner(x, y, z, invBoxSize, nLocalBoxes, gs):
    ix = int(math.floor(x*invBoxSize[0]))
    iy = int(math.floor(y*invBoxSize[1]))
    iz = int(math.floor(z*invBoxSize[2]))
    return getBoxFromTupleInner(ix, iy, iz, nLocalBoxes, gs)

(Changing the parameter r to individual components was not strictly necessary.)

And the call sites change (for example) from

        iBox = self.getBoxFromCoord([x, y, z])

to

        iBox = getBoxFromCoordInner(x, y, z, self.invBoxSize, self.nLocalBoxes, self.gridSize)

This improves performance to 20 μs/atom.

Repeating the same transformation for putAtomInBox gives 18.4 μs/atom.

Second step

Run the profiler again. By self time, loadAtomsBuffer is at the top. Let's look at that in context. Sort by inclusive time, and we see that the parts of the call tree starting at redistributeAtoms take a significant amount of time.

KCachegrind screenshot of functions sorted by inclusive time

KCachegrind screenshot of call graph

This part of the code:

  • Applies periodic boundary conditions
  • Moves atoms to a new cell
  • Packs atoms into a buffer at the source
  • Unpacks the buffer into the atom data structure at the destination

This packing/unpacking anticipates the parallel version, which transmits the buffer across processors.

A previous attempt at using numpy records did not work well (and ran into a serious performance regression with numpy 1.10). This time I went with two buffers - one for integers, and one for floating point numbers. This works better, and the performance is now 10.2 μs/atom.

More steps

Continuing the process of profiling and converting routines to be Numba-friendly eventually reached a performance of 2.9 μs/atom. (Wow, this is only about 30% slower than C.)

Modified code is here

The updated performance table is

Language/compiler   Version    Initial time   Final time
Python 2.7.10 1014 1014
PyPy 4.0 30 30
Cython 0.23.3 729 13
Julia 0.4.0-rc3 87 6.1
Numba 0.22.1 867 2.9     New result
C (clang) 3.7 2.2 2.2


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.
The 'Initial Time' column results from the minimal amount of code changes to get the compiler working.
The 'Final Time' is the time after tuning.

Do note that the Cython, Julia, and Numba results reflect the amount of effort put into optimization. Cython and Julia still need to be improved with the assistance of a profiler (or in Julia's case, a better viewer for existing profile data).

Summary

Using a profiler to guide our optimization efforts has been very helpful.

The Numba results are really promising, but, in addition to creating ugly code, it required an amount of work that I would not want to perform on a regular basis. These transformations are fairly regular, so it should be possible to incorporate them into Numba. Alternately, if doing so in a safe manner inside the compiler is difficult, some sort of automated AST transformation of the source should be possible.

As the optimization process proceeds on this code, an increasing amount of time is being spent in the core routine, computeForce (as it should be), and we will need to move beyond a function-level profiler to look for optimization opportunities.

Performance Updates with PyPy 4.0

The PyPy team recently released version 4.0. (The jump in version number is to reduce confusion with the Python version supported.) One of the features is improved performance.

But first, an issue with reporting accurate timings with this version of CoMD should be addressed. The initial iteration contains overhead from tracing and JIT compilation (Cython and Numba have the same issue). For this example we are concerned with the steady-state timing, so the time for the first iteration should be excluded. I've added a '--skip' parameter to the CoMD code (default: 1) that skips the first printRate steps (default: 10) in computing the overall average update rate at the end.

Now the table with the most recent performance numbers (from this post), updated with PyPy 4.0 results:

Language/compiler   Version    Time
Python 2.7.10 1014
PyPy 2.6.1 96
Numba 0.21.0 47
PyPy 4.0 30     New result
Cython 0.23.3 13
Julia 0.4.0-rc3 6.1
C 4.8.2 2.3


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.

The new release of PyPy also includes some SIMD vectorization support (--jit vec=1 or --jit vec_all=1). Neither of these provided any improvement in performance on this code. That's not too surprising, given that the vectorization support is new and the code contains conditionals in the inner loop.

PyPy 4.0 gives a very good 34x performance improvement over bare Python, and a 3x improvement over the previous release (2.6.1). PyPy is attractive here in that no modifications were made to the source. (Both Cython and Numba required source changes.)

Prototype for Profiling Python

The last post covered some technical background on using vmprof to profile Python with compiled or JIT'ed extensions. Now I've created a prototype that can convert the output to callgrind format so it can be viewed with KCachegrind.

To install the prototype using the Anaconda distribution:

  1. Create a new environment (if you do not use a new environment, these packages may conflict with an existing Numba install): conda create -n profiling python numpy
  2. Switch to the new environment: source activate profiling
  3. Install prototype versions of Numba and llvmlite: conda install -c https://conda.anaconda.org/mdewing numba-profiling
  4. Install prototype version of vmprof: conda install -c https://conda.anaconda.org/mdewing vmprof-numba
  5. Make sure libunwind is installed. (On Ubuntu apt-get install libunwind8-dev.) (On Ubuntu, it must be the -dev version. If not installed, the error message when trying to run vmprof is ImportError: libunwind.so.8: cannot open shared object file: No such file or directory)
  6. Install KCachegrind (On Ubuntu, apt-get install kcachegrind)

There is a wrapper (vmprofrun) that automates the running and processing steps. To use it, run vmprofrun <target python script> [arguments to the python script]. (No need to specify python - that gets added to the command line automatically.) By default it will output vmprof-<pid>.out, which can be viewed in KCachegrind.

Underneath, the vmprofrun tool saves the vmprof output during the run to out.vmprof. After the run, it automatically copies the /tmp/perf-<pid>.map file to the current directory (if running under Numba). It moves out.vmprof to out-<pid>.vmprof. Finally it runs vmproftocallgrind using these files as input.

Limitations

  1. Only works on 64-bit Linux
  2. Function-level profiles only - no line information (for either python or native code)
  3. Sometimes the profiling hangs during the run - kill the process and try again.
  4. Works with Python 2.7 or 3.4
  5. Not well validated or tested yet
  6. It does not work well yet with the existing vmprof web visualization and CLI tools.

Other notes:

  • The stack dump tool will process the stacks to remove the Python interpreter frames.
  • By default the Numba Dispatcher_call level is removed. Otherwise the call graph in KCachegrind gets tangled by all the call paths running through this function.
  • It should work with C extensions and Cython as well.

Towards Profiling Accelerated Python

One of the conclusions from last post is a need for better profiling tools to show where time is spent in the code. Profiling Python + JIT'ed code requires dealing with a couple of issues.

The first issue is collecting stack information at different language levels. A native profiler collects a stack for the JIT'ed (or compiled extension) code, but eventually the stack enters the implementation of the Python interpreter loop. Unless we are trying to optimize the interpreter loop, this is not useful. We would rather know what Python code is being executed. Python profilers can collect the stack at the Python level, but can't collect native code stacks.

The PyPy developers created a solution in vmprof. It walks the stack like a native profiler, but also hooks the Python interpreter so that it can collect the Python code's file, function, and line number. This solution is general to any type of compiled extension (C extensions, Cython, Numba, etc.) Read the section in the vmprof docs on Why a new profiler? for more information.

The second issue is particular to JIT'ed code - resolving symbol information after the run. For low overhead, native profilers collect a minimum of information at runtime (usually the Instruction Pointer (IP) address at each stack level). These IP addresses need to resolved to symbol information after collection. Normally this information is kept in debug sections that are generated at compile time. However, with JIT compilation, the functions and their address mappings are generated at runtime.

LLVM includes an interface to get symbol information at runtime. The simplest way to keep it for use after the run is to follow the Linux perf standard (documented here), which stores the address, size, and function name in a file /tmp/perf-<pid>.map.
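Each line in that map file has the start address and size (both in hex) followed by the symbol name; a made-up entry would look like:

7f6a2c3b1000 2f0 __main__.computeForce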

To enable Numba with vmprof, I've created a version of llvmlite that is amenable to stack collection, at the perf branch here.

This does two things:

  1. Keep the frame pointer in JIT'ed code, so a backtrace can be taken.1
  2. Output a perf-compatible JIT map file (not on by default - need to call enable_jit_events to turn it on)

To use this, modify Numba to enable JIT events and frame pointers:

  1. In targets/codegen.py, at the end of the _init method of BaseCPUCodegen, add self._engine.enable_jit_events()
  2. And for good measure, turn on frame pointers for Numba code as well (set CFLAGS=-fno-omit-frame-pointer before building it)

The next piece is a modified version of vmprof (in the numba branch). So far all it does is read the perf-compatible output and dump raw stacks. Filtering and aggregating Numba stacks remains to be done (meaning neither the CLI nor the GUI display work yet).

How to use what works, so far:

  1. Run vmprof, using perf-enabled Numba above: python -m vmprof -o vmprof.out <target python>
  2. Copy map file /tmp/perf-<pid>.map to some directory. I usually copy vmprof.out to something like vmprof-<pid>.out to remember which files correlate.
  3. View raw stacks with vmprofdump vmprof-<pid>.out --perf perf-<pid>.map.

  1. With x86_64, it is possible to use DWARF debug information to walk the stack. I couldn't figure out how to output the appropriate debug information. LLVM 3.6 has a promising target option named JITEmitDebugInfo. However, JITEmitDebugInfo is a lie! It's not hooked up to anything, and has been removed in LLVM 3.7. 

Improvements in CoMD Cell Method Performance

A previous post showed some performance improvements with the nsquared version of the code. This post will tackle the cell version of the code. In the nsquared version the time-consuming inner loop had no function calls. The cell version does call other functions, which may complicate optimization.

Language/compiler   Version    Initial time   Final time
C 4.8.2 2.3 2.3
Python 2.7.10 1014 1014
PyPy 2.6.1 96 96
Julia 0.4.0-rc3 87 6.1
Cython 0.23.3 729 13
Numba 0.21.0 867 47


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.
The 'Initial Time' column results from the minimal amount of code changes to get the compiler working.
The 'Final Time' is the time after the tuning in this post.

Julia

The cell version contains the same issue with array operations as the nsquared version - the computation of dd allocates a temporary to hold the results every time through the loop.

The Devectorize package can automatically convert array notation to a loop. If we add the @devec annotation, the time drops to 43 μs/atom. Unfortunately, the allocation to hold the result must still be performed, and it remains inside the inner particle loop. If we manually create the loop and hoist the allocation out of the loop, time is 27 μs/atom.

The code uses dot to compute the vector norm. This calls a routine (julia_dot) to perform the dot product. For long vectors calling an optimized linear algebra routine is beneficial, but for a vector of length 3 this adds overhead. Replacing dot with the equivalent loop reduces the time to 23 μs/atom.

Looking through the memory allocation output (--track-allocation=user) shows some vector operations where the force array is zeroed and accumulated, and also in putAtomInBox in linkcell.jl. These spots are visible in the profile output as well, but the profile output is less convenient because it is not shown alongside the source. The @devec macro does work here, and the performance is now 7.7 μs/atom. Explicit loops give a slightly better time of 7.3 μs/atom.

Profiling shows even more opportunities for devectorization in advanceVelocity and advancePosition in simflat.jl. The time is now 6.4 μs/atom.

The Julia executable has a -O switch for more time-intensive optimizations (it adds more LLVM optimization passes). This improves the time to 6.2 μs/atom.

The @fastmath macro improves the time a little more, to 6.1 μs/atom. The @inbounds macro to skip the bounds checks did not seem to improve the time.

The final Julia time is now within a factor of 3 of the C time. The code is here. It's not clear where the remaining time overhead comes from.

PyPy

The PyPy approach to JIT compilation is very general, but that also makes it difficult to target what code changes might improve performance. The Jitviewer tool is nice, but not helpful at a cursory glance. The vmprof profiler solves an important problem by collecting the native code stack plus the python stack. In this particular case, it reports at the function level, and the bulk of the time was spent in computeForce. I hope to write more about vmprof in a future post, as it could help with integrated profiling of Python + native code (either compiled or JIT-ed).

Cython

The simplest step is to add an initialization line and move some .py files to .pyx files. This gives 729 μs/atom. Adding types to the computeForce function and assigning a few attribute lookups to local variables so the types can be assigned (playing a game of 'remove the yellow' in the Cython annotation output) gives 30 μs/atom.

Adding types and removing bounds checks in more routines (in halo.py, linkcell.py, simflat.py) gives 13 μs/atom.

Code is here. Further progress needs deeper investigation with profiling tools.

Numba

Starting with adding @numba.jit decorators to computeForce and the functions it calls gives the initial time of 867 μs/atom. Extracting all the attribute lookups (including the function call to getNeighborBoxes) gives 722 μs/atom.

We should ensure the call to getNeighborBoxes is properly JIT-ed. Unfortunately, this requires more involved code restructuring. Functions need to be split into a wrapper that performs any needed attribute lookups, and an inner function that gets JIT-ed. Loop lifting automatically performs this transformation on functions with loops. On functions without loops, however, it needs to be done manually. Once this is done, the time improves dramatically to 47 μs/atom.
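A schematic of that split, with hypothetical names rather than the actual CoMD routines: the method keeps the attribute lookups, and the Numba-compiled inner function sees only plain arrays and scalars.

import numba
import numpy as np

@numba.njit
def getNeighborBoxesInner(iBox, neighborTable):
    # Inner function: plain array/scalar arguments, compiled in nopython mode.
    return neighborTable[iBox, :]

class LinkCells(object):
    def __init__(self, nBoxes):
        self.neighborTable = np.zeros((nBoxes, 27), dtype=np.int64)

    def getNeighborBoxes(self, iBox):
        # Wrapper: does the attribute lookup, then calls the JIT'ed function.
        return getNeighborBoxesInner(iBox, self.neighborTable)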

Hopefully the upcoming "JIT Classes" feature will make this easier, and require less code restructuring.

Code is here

Summary

Julia is leading in terms of getting the best performance on this example. Many of these projects are rapidly improving, so this is just a snapshot at their current state.

All these projects need better profiling tools to show the user where code is slow and to give feedback on why the code is slow. The Cython annotated output is probably the best - it highlights which lines need attention. However, it is not integrated with profiler output, so in a project of any size, it's not clear where a user should spend time adding types.

Julia has some useful collection and feedback tools, but they would be much more helpful if combined. The memory allocation output is bytes allocated. It's useful for finding allocations where none were expected, or for allocations in known hot loops, but it's less clear which other allocations are impacting performance. Ideally this could be integrated with profiler output and weighted by time spent to show which allocations are actually affecting execution time.

Two Meanings of Vectorization

The term 'vectorize' as used by programmers has at least two separate uses. Both uses can have implications for performance, which sometimes leads to confusion.

One meaning refers to a language syntax to express operations on multiple values - typically an entire array, or a slice of an array. This can be a very convenient notation for expressing algorithms.

Here is a simple example (in Julia) using loop-oriented (devectorized) notation

a = ones(Float64, 10)
b = ones(Float64, 10)
# allocate space for result
c = zeros(Float64, 10)

for i in 1:10
    c[i] = a[i] + b[i]
end

Now compare with using vectorized (array) notation

a = ones(Float64, 10)
b = ones(Float64, 10)

# space for result automatically allocated

c = a + b

The vectorized notation is more compact. Julia and Python/Numpy programmers usually mean this when referring to 'vectorization'. See more in the Wikipedia entry on Array programming

John Myles White wrote a post discussing the performance implications of vectorized and devectorized code in Julia and R. Note that Python/Numpy operates similar to R as described in the post - good performance usually requires appropriately vectorized code, because that skips the interpreter and calls higher performing C routines underneath.

The other meaning of 'vectorization' refers to generating assembly code to make effective use of fine-grained parallelism in hardware SIMD units. This is what Fortran or C/C++ programmers (and their compilers) mean by 'vectorization'. In Julia, the @simd macro gives hints to the compiler that a given loop can be vectorized.

See more in the Wikipedia entry on Automatic vectorization