Guide to my Github Repositories

This post is a short guide to my Github repositories.

Some of them I use as a wiki for keeping information. Markdown makes adding and editing text easy, and it can also contain code.

Other repositories have more code to explore a particular topic. I should expand on some of these topics in future blog posts. (Especially the contents and use of the "QMC Algorithms" repository. It's used as a base for creating validated unit test data for QMCPACK, and it's a direction I think more projects should move towards.)

Wiki-like Information

GPU Internals - Documentation on how GPUs work and programming models

ML in scientific computation - Machine learning and scientific computation

Programming Ideas

Next Steps in Programming - Ideas on programming scientific applications. Includes a focus on code changes as a first-class/primary object in programming.

Programming Tutorial Maker - Attempt at a tool to make writing incremental tutorials easier.

Scientific Computing

QMC Algorithms - Scripts, Jupyter notebooks, Sympy derivations, and code generation to support validating QMC codes. (Also here)

QMC Kernels - Kernels used in Quantum Monte Carlo in general, and QMCPACK in particular. Also some simple kernels used to explore GPU offload programming models.

Small Molecules - Example input files for quantum chemistry packages and information on various small molecules.

Quadrature - Multidimensional quadrature algorithms. (Technically "cubature"). In addition to notes and links, this has some implementations.

Floating point - Floating point arithmetic. Particularly for exploring low precision. (the 'silver' part of the repository name comes from Github's initial repository name suggestion)

Symbolic Derivatives with Code Generation - Represents scientific computations using Sympy. It can take derivatives symbolically, then generate code for the result in Python, Julia, or C++.

Multitevo - Translation of the CoMD molecular dynamics miniapp to other languages, such as Python and Julia. And JavaScript, once I get it checked in and pushed.

Bash for C programmers

Bash scripting

Why use Bash? It is not my preferred language, but it has two big advantages:

  • Availability - we can safely assume bash is present everywhere. No need to install additional software.
  • Ease of starting and incremental building - a script can start as a small list of commands and grow incrementally.

The rest of this post is not a tutorial, but some ways to think about bash scripting in order to avoid common traps when coming from C/C++ or Python.

Quick Intro

Whitespace is significant

Almost everything in bash is a command with arguments. After all the substitutions and quoting, the basic structure of each line is a command with its arguments, separated by whitespace. This makes whitespace significant in ways that can be confusing when coming from other languages.

One place this shows up is in setting variables - there should be no whitespace on either side of the equal sign. The statement a = 1 will fail with an error about 'a' not found - the shell interprets it as a command. Similarly a= 1 will fail with an error about '1' not found.
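For example:

a=1       # correct: assigns 1 to a
a = 1     # error: bash tries to run a command named 'a' with arguments '=' and '1'
a= 1      # error: bash tries to run a command named '1' (with a set to empty in its environment)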

Another place significant whitespace appears is in if statements. The opening bracket for the test ([) looks like syntax, but it is a command. (By symmetry you might expect ] to be a command, but it's not - it gets passed as an argument to [).

a=0
if [ $a -eq 0 ]; then
  echo "a is 0"
fi

Also note the semicolon after the comparison to terminate the command. As an alternative, the then can be put on the next line:

a=0
if [ $a -eq 0 ]
then
  echo "a is 0"
fi

Bash also has a double bracket form of testing that is a built-in, and not an external command.
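The earlier test written with double brackets:

a=0
if [[ $a -eq 0 ]]; then
  echo "a is 0"
fi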

Exit values

Exit values from commands are 'success' or 'failure'. Tests of the exit value, like the 'if' statement or the control operators (&& and ||) operate on these exit values (and only exit values). Numerically, success maps to 0, whereas failure maps to a non-zero value. This numerical mapping is opposite of many other languages, like C/C++ and Python, where 0 is false and non-zero values are true. Using multiple values for failure can give some indication of what went wrong with the command.

The best way to think about this is not to mentally reverse the values, but to think in terms of 'success' and 'failure', and only think about the numerical mapping if necessary.

Sequencing and control operator idioms:

  • A;B Run A and then B, regardless of success of A
  • A && B Run B if A succeeded
  • A || B Run B if A failed
  • A & Run A in the background

(from https://unix.stackexchange.com/questions/24684/confusing-use-of-and-operators)
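A few examples of these idioms:

mkdir build && cd build    # only cd if mkdir succeeded
cd build || exit 1         # stop if the directory is missing
make; make install         # run make install whether or not make succeeded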

Functions

Functions start with "function name {" or "name () {". The body is a series of commands, and the function ends with }. Once again, whitespace is significant. The function declaration should be on a line by itself, and the closing brace should also be on a line by itself.

A function call looks like a normal command statement, with the usual space-separated arguments.

Inside a function, arguments are referenced using $1, $2, etc. Unset variables are empty.

There is a syntax for optional arguments:

var=${1:-default_string_here}
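Putting these pieces together, a small function with an optional argument (the function name and strings are just for illustration):

function greet {
  name=${1:-world}
  echo "Hello, $name"
}

greet          # prints: Hello, world
greet bash     # prints: Hello, bash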

Variable replacement

Variable replacement occurs before bash breaks the line into a command and arguments based on whitespace. Unexpected behavior can occur when a variable contains whitespace the developer was not expecting, for example when the variable holds a directory name and the script is run on a directory containing a space.

a="hi there"
cmd $a

expands to calling cmd with two arguments, 'hi' and 'there'. Enclose the variable in quotes to keep the spaces from splitting the value into multiple arguments.

This will call cmd with one argument: 'hi there'.

a="hi there"
cmd "$a"

For a gory step-by-step look at how bash processes input, see https://mywiki.wooledge.org/BashParser

Robust bash scripts

See https://www.davidpashley.com/articles/writing-robust-shell-scripts/ for advice on writing robust scripts.

Part of the advice is to use these settings (shown together in the sketch below):

  • set -o nounset to detect the use of uninitialized variables
  • set -o errexit to exit the script if any statement fails
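A minimal script header combining both settings:

#!/bin/bash
set -o nounset    # error on use of an uninitialized variable
set -o errexit    # exit if any statement returns a failure value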

Building A Parallella Cluster

I finally got around to assembling my small pile of Parallella boards into a cluster.

Alternate side view

This post documents the choices I made about power distribution, mechanical assembly, cooling, software management, etc. For background on the individual boards, see my previous post: Introduction to Parallella.

The software directions are based on the Linaro 14.04 image. (At some point I should upgrade to the latest Paraubuntu release.)

Power

There are several options for powering multiple boards:

  1. Solder a jumper and use the mounting pads with metal standoffs to transmit the power. This option requires the fewest cords.
  2. Use the micro USB connector (need to switch jumper J14). There are multi-port USB power supplies that should work. This is probably the simplest option for a small number of boards.
  3. Use the 5.5mm barrel plug (default settings). There are versions with screw terminals on Amazon (these for example).

I went with option 3 and used an old PC power supply for the power.

Mechanical assembly

I used 20mm nylon standoffs and assembled the boards into two stacks. A small piece of 1/4" plywood was used for the base.

Cooling

For air flow, I used a cooling fan from an old PC and mounted it to the plywood base.

Network

Needing ports for a maximum of 10 boards, plus one port for the external connection, I chose a 16-port Gigabit D-Link switch. Eventually I would like to power the network switch from the PC power supply, but need to get the right plug first.

The nodes use DHCP from my home network to get their addresses. A future improvement is to run a DHCP server on one node and use that to supply addresses to the other nodes. This would make the cluster independent of my home network.

MicroSD cards

Follow the directions on the Parallella Create SD Card page. After burning the image, use gparted to resize the partition to the full capacity of the card.

Management and control

Performing software installs and other management on all of the boards individually would be too tedious. There are many solutions for managing clusters. I decided to use Ansible, as it seemed the simplest (no software needed on the nodes) and it runs over ssh.

In addition to controlling operations from a PC, it is useful to designate one node as a 'head node' and install Ansible there as well. For MPI, it's easier to run MPI programs from the head node than from the PC. For setting up configuration files, it can be useful to create or edit the file and make sure it works on one node, and then copy the file to all the other nodes.

Ansible and MPI (and general convenience) require setting up passwordless ssh login.

Once the keys are set up locally, you can use ssh-copy-id to copy the credentials.

ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145

I keep a small script that puts this line in the history (named ssh_copy_id)

history -s "ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145"

Run the command source ssh_copy_id to put the command on the history list. Use the bash history and line editing features to select the command and update to a new address.

Rather than create a new user, I'm using the default 'parallella' user on the nodes. The SSH config (in '.ssh/config') can be set up to switch users upon login.

Host 10.0.0.127
    User parallella

You might wish to set up an apt-cache server on a local machine to save on download bandwidth when installing software to all the nodes.

Using Ansible

See the intro docs.

The inventory file is a list of IP addresses (one per line). It can be specified on the command line with '-i'. To see if all the nodes work:

ansible -i cluster.ini all -m ping
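Here cluster.ini is just the list of node addresses, one per line (these addresses are placeholders):

10.0.0.144
10.0.0.145
10.0.0.146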

To copy the apt-cache server configuration to all the nodes, use

 ansible -i hosts all --sudo -m copy -a "src=/etc/apt/apt.conf.d/02proxy dest=/etc/apt/apt.conf.d/02proxy"

To shutdown all the nodes, use

ansible -i cluster.ini all --sudo -m shell -a "shutdown now"

Compilation

In the introductory post I talked about cross compiling for the board. That gets more complicated with larger software packages. For instance, one of my project dependencies, HDF, doesn't cross-compile easily (or at all).

Since the nodes use a regular Ubuntu distribution, native compilation is easy, but slow.

One solution is to use distcc.

The particular software package I'm working with (QMCPACK, which does simulations on atoms, molecules, and solids) uses CMake for configuration and build, and builds fine with distcc.

Install on all nodes with

ansible -i cluster.ini all --sudo  -a "apt-get install -y distcc"

Set the DISTCC_HOSTS variable to the set of systems to use:

export DISTCC_HOSTS="localhost @10.0.0.144/2 @10.0.0.145/2"

This example shows three hosts. The initial '@' means to use ssh (no daemon required on remote) and the '/2' on the end means to use two threads.

Now set the C and C++ compilers to distcc <compiler> and run the build.

For CMake, when building a project with MPI, this is:

export CC="distcc mpicc"
export CXX="distcc mpic++"

Then

make -j 8

You might see a warning message:

distccd[<pid>] (tcp_cork_sock) Warning: setsockopt(corked=1) failed: Socket operation on non-socket

This can be ignored. Or follow some directions to silence it.

MPI

One popular method for writing programs that communicate across the boards is MPI (Message Passing Interface).

Install the 'mpi-default-dev' package (which should also install the 'mpi-default-bin' package). This installs the Open MPI implementation (the alternative being MPICH). Note that this MPI is only concerned with communication between the ARM cores on different boards. There is also the Brown Deer Technology version of MPI for programming the Epiphany accelerator.

It's common to use a networked file system so each local node has access to the executables and input files. Ansible has file distribution commands that work well enough that a networked file system isn't strictly necessary. (Be aware when copying directories with Ansible that if the directory specified in src does not end with '/', the directory and its contents are copied. If it does end with '/', just the directory contents are copied.)

Open MPI uses a tree-based launcher for better scalable start-up performance. Because of this, each node should be able to log into each other node (not just head node to other nodes).

MPI needs a list of machines to run on. One method is to create a host file and pass it to mpirun with the --hostfile option. The host file, at its simplest, is one hostname or IP address per line (same as a simple Ansible inventory file.)
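For example, assuming a host file named hosts.txt and an executable my_mpi_program (both names are placeholders):

mpirun --hostfile hosts.txt -np 8 ./my_mpi_program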

Gallery

Front/side view   Back view  

Side view   Top view   Alternate side view

Integration Callbacks with Sympy and LLVM

This post explores various packages for multi-dimensional integration, along with generating callbacks for the integrands from Sympy using an LLVM JIT.

Problem to integrate

The particular problem is using the variational principle to find the ground state energy for atoms. Some Jupyter notebooks with a description of the problem, along with various integration methods:

Ground state energy of Hydrogen Atom (This yields a 3 dimensional integral.)

Ground state energy of Helium Atom (This yields a 6 dimensional integral.)

The standard solution to these integrals is to use Markov Chain Monte Carlo (the Quantum Monte Carlo method).
However, I'm curious to see how far alternate integration schemes or existing integration packages would work.

Integration libraries

The scipy quadrature routines accept a natively compiled callback for the integrand. (Noticing this in the documentation initiated the idea for using JIT compilation for callback functions.)

Next up is the Cubature integration package, with the Python wrapper for cubature.

Finally, there is the Cuba library, with the PyCuba interface (part of the PyMultiNest package).

There are some other libraries, such as HIntLib, that I would also like to try. There doesn't seem to be a Python interface for HIntLib. Let me know if one exists somewhere, or if there are other multidimensional integration packages worth trying.

Evaluating the integrand

One of my scientific programming goals is to generate efficient code from a symbolic expression. To this end, I've been working on an LLVM JIT converter for Sympy expressions (using the llvmlite wrapper).
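A sketch of the intended usage (assuming the llvm_callable interface from the pull requests below, with llvmlite installed; the expression is just an example):

import sympy as sp
from sympy.printing.llvmjitcode import llvm_callable

x, y = sp.symbols('x y')
expr = sp.exp(-(x**2 + y**2))

# Compile the expression to a native function through the LLVM JIT
f = llvm_callable([x, y], expr)
print(f(0.5, 1.0))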

For the Sympy code, see these pull requests:

As an aside, one can question whether this is the right approach, compared with:

  1. Generate C++ or Fortran and compile using the existing autowrap functionality in Sympy.
  2. Generate Python/Numpy and use Numba.
  3. Use Julia

There is always a tradeoff between a narrow, specialized solution, which is faster to implement and perhaps easier to understand, and a more general solution, which applies in more cases, but is harder and slower to implement.

Using an LLVM JIT is a specialized solution, but it does have an advantage that there is a short path from the expressions to the compiled code. One disadvantage is that it does not leverage existing compilers (Numba or C++), though LLVM compiler optimization passes are available.

Sometimes a solution just needs to be tried to gain experience with the advantages and drawbacks.

Results

For the helium atom, the integration times are reported in the table below:

| Integrator | Time (seconds) |
|------------|---------------:|
| Cubature | 171 |
| Cubature w/CSE | 141 |
| Cubature w/CSE and multiple evals | 100 |
| Cuba (Vegas) | 29 |
| Cuba (Cuhre) | 22 |


Note that scipy.nquad was not used for the 6D integral. It quickly runs out of steam because it consists of iterated one dimensional integrations, and the glue between the dimensions goes through Python, reducing the effectiveness of a compiled integrand.

The Cubature library does better. Profiling shows that most of the time is spent internal to cubature and allocating memory, so faster integrand evaluation is not going to improve the time. Some other approaches can help. One is Common Subexpression Elimination (CSE), which Sympy can perform on the expression. This extracts duplicate fragments so their value only needs to be computed once.
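For example, Sympy's cse function pulls out the repeated subexpression:

import sympy as sp

x, y = sp.symbols('x y')
expr = sp.exp(x + y) + sp.sin(x + y) + (x + y)**2

replacements, reduced = sp.cse(expr)
# replacements is roughly [(x0, x + y)]
# reduced is roughly [x0**2 + exp(x0) + sin(x0)]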

The library also allows multiple integrals to be performed at once. This can amortize some of the overhead of the library. In this case, the individual calls to the integrator for the numerator and denominator can be reduced to a single call.

The Cuba library performs even better, as there is apparently less overhead inside the integration library. The Cuhre integrator uses a deterministic grid-based algorithm similar to Cubature. Vegas uses an adaptive Monte Carlo approach.

The results are not shown here, but I also used SIMD vectorization to make the function evaluation even faster, which succeeded for the bare function evaluation. (This was one of the original motivations for compiling straight to LLVM, as it would be easier to get vectorization working.) Unfortunately, it did not speed up the overall integration much (if at all), due to overhead in the libraries.

Conclusions and future work

Using an LLVM JIT to create callbacks for integration works fairly well.

One important question is how to scale the creation of the callbacks to new libraries without explicitly programming them into Sympy.
The last pull request has expanded the CodeSignature class, which seems like a starting point for a more general callback specification.

Notes on CMake

Recently I started working on a project that uses CMake. I've used CMake a little before, but never really had to dive much into it. In particular, I needed to understand the scripting parts of CMake for adding tests for CTest.

Below are some comments on aspects of CMake.

Variables and variable substitution

Variables names are strings. Substitution occurs when the variable is dereferenced with ${}.

SET(var a)
MESSAGE("var = ${var}")

produces

var = a

Nested substitutions are possible

SET(var a)
SET(a b)
MESSAGE("var = ${var}  ${${var}}")

will produce var = a b

Variable names can be composed during substitution

SET(var a)
SET(a_one apple)
MESSAGE("var =  ${${var}_one}")

will produce var = apple

Variables and functions

Variable references act a little like pointers, but without a type system to enforce (and guide) how many indirections should be performed.

Example of using a variable inside a function:

FUNCTION(MY_FUNC arg1)
    MESSAGE("arg1 = ${arg1}")
ENDFUNCTION()

MY_FUNC(hello)
SET(var a)
MY_FUNC(var) # arg1 is set to 'var'
MY_FUNC(${var}) # arg1 is set to 'a' - this is usually what you want

produces

arg1 = hello
arg1 = var
arg1 = a

Return values from functions

There is no built-in notion of a return value from a function. To get values out of a function, write to one of the arguments.

A function creates a new scope - changes to a variable will only affect the variable's value inside the function. To affect the value in the parent, the PARENT_SCOPE modifier should be given to the SET command. (More on variable scopes here)

Another issue is that the variable name for the output value needs to be dereferenced before being set. Otherwise a variable with the name used in the function will be set in the parent, which can work by accident if the variables have the same name.

Example:

FUNCTION(MY_FUNC_WITH_RET ret)
    # The following line works by accident if the name of variable in the parent
    # is the same as in the function
    SET(ret "in function" PARENT_SCOPE)
    # This is the correct way to get the variable name passed to the function
    SET(${ret} "in function" PARENT_SCOPE)
ENDFUNCTION()

SET(ret "before function")
MY_FUNC_WITH_RET(ret)
MESSAGE("output from function = ${ret}")

will produce output from function = in function

Data structures

There is only the List type, with some functions for operations on lists. The documentation on lists says that "Lists ... should not be used for complex data processing tasks", but doesn't say what to use instead.

For associative arrays or maps there are some options:

  • Two parallel lists - one of keys and one of values. Search the key list and use the index to look up value. More awkward for passing to functions.
  • Single list with alternating keys and values (see the sketch after this list). Search the list for the key, and use index+1 to look up the value. Only works if the ranges of possible keys and values are distinct (e.g., keys are strings and values are always numbers).
  • The environment (ENV{key}) is a built-in associative array. It could be overloaded to store other values, at the risk of polluting the environment.
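A sketch of the second option (the map contents here are made up):

SET(mymap apple red banana yellow)
LIST(FIND mymap banana key_index)
IF(NOT key_index EQUAL -1)
    MATH(EXPR value_index "${key_index} + 1")
    LIST(GET mymap ${value_index} value)
    MESSAGE("banana -> ${value}")  # prints banana -> yellow
ENDIF()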

Test timeout

The default timeout per test is 1500 seconds (25 minutes).

To increase this, adjust the value of DART_TESTING_TIMEOUT. It needs to be set as a cache variable, and it needs to be set before the enable_testing() or include(CTest) is specified.

SET( DART_TESTING_TIMEOUT 3600 CACHE STRING "Maximum time for one test")

See also this Stack Overflow post

Visualizing MD Data

Visualizing atomic positions in 3D is useful when working on a molecular dynamics code.

More generally, being able to visualize the structures and algorithms inside a code can help with understanding and debugging. With all the advances in graphics hardware, it should be possible to quickly create visualizations for various aspects of the code. But this seems harder than it should be. In this particular case, normal plotting packages can sometimes plot atomic positions with 3D scatter plots. But then further algorithm visualization is hard (animation, drawing boxes, etc).

The VisPy project looks promising. It contains three API levels

  1. gloo - the lowest level API around OpenGL
  2. scene - uses a scene graph description
  3. plot - standard plotting interface

The 'scene' interface is the appropriate abstraction level for our purposes. Note this API is marked experimental and may change in the future.

Pretty pictures

Static screenshots (click for larger version)

Screenshot2 Screenshot3

And an animation (click for larger version)

Screenshot3

Code

The modified comd.py is here. It should be a drop-in replacement for that file in the nsquared version of the code. The bulk of the visualization additions start around line 154.

The perceived speed of the simulation can vary. Pure Python, even at the smallest system size, is too slow. Using the Numba-accelerated loop makes it much faster. However, for the smallest system, this feels 'too fast'. Increasing the system size will slow it down ('-x 6 -y 6 -z 6' seems to work well on my system). There are much better ways of adjusting the speed, but this is the most expedient.

The -N option specifies the number of steps (defaults to 100). Set it to a larger value to run the animation longer.

During the run, press the 's' key to take a static screenshot (stored in screenshot.png). Press the 'a' key to start/stop saving an animated segment (stored in animation.gif). These features require the imageio package to be installed.

The multiprocessing package is used to run the simulation and the visualization event loop in separate processes. Positions are passed from the simulation to the visualization via a Queue. A Timer on the visualization side checks the queue for new positions periodically.
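A stripped-down sketch of this pattern (not the actual comd.py code - the random-walk 'simulation', sizes, and timer interval are stand-ins):

import multiprocessing
import numpy as np
from vispy import app, scene

def run_simulation(queue):
    # Stand-in for the MD loop: a random walk of 100 "atoms"
    pos = np.random.rand(100, 3)
    for step in range(1000):
        pos += 0.01 * (np.random.rand(100, 3) - 0.5)
        queue.put(pos.copy())

def main():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=run_simulation, args=(queue,))
    proc.start()

    canvas = scene.SceneCanvas(keys='interactive', show=True)
    view = canvas.central_widget.add_view()
    view.camera = 'turntable'
    markers = scene.visuals.Markers(parent=view.scene)

    def poll_queue(event):
        # Drain the queue and display only the most recent positions
        pos = None
        while not queue.empty():
            pos = queue.get()
        if pos is not None:
            markers.set_data(pos, size=10)

    timer = app.Timer(interval=0.05, connect=poll_queue, start=True)
    app.run()
    proc.terminate()

if __name__ == '__main__':
    main()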

The code uses the Markers visual element to display the center of each point. The size is the size of the element on the screen, not the size in the scene (that is, elements don't change size when zooming). The current size was chosen to easily see the motion of the atoms, not to accurately represent the atom's size. Displaying a Sphere at each point would be more accurate, but is much slower.

Summary

I'm very pleased with VisPy. It enabled live, interactive animation of atoms from a molecular dynamics code with a small amount of code. I expect extending this to visualize more complex algorithms and data structures should be straightforward.

More Performance With Numba

Having acquired a shiny new profiler, it's time to dig into the performance of the Numba version some more.

Picking up from the previous optimizations, I can't seem to reproduce the timing (47 μs/atom) in that table. Now I get 40 μs/atom.

First step

Run the profiler (vmprofrun comd.py) and display the results in KCacheGrind (kcachegrind vmprof-20664.out)

Sorting by self time, we see getBoxFromCoord at the top:

KCachegrind screenshot of functions sorted by self time

Also a screenshot of the call graph: getBoxFromCoord gets called from two different places, putAtomInBox and updateLinkCells.

KCachegrind screenshot of call graph

To improve performance here, convert getBoxFromCoord to a free function and put all the attribute references into function arguments.

Before:

    def getBoxFromCoord(self, r):
        ix = int(math.floor(r[0]*self.invBoxSize[0]))
        iy = int(math.floor(r[1]*self.invBoxSize[1]))
        iz = int(math.floor(r[2]*self.invBoxSize[2]))
        return self.getBoxFromTuple(ix, iy, iz)

After:

@numba.njit
def getBoxFromCoordInner(x, y, z, invBoxSize, nLocalBoxes, gs):
    ix = int(math.floor(x*invBoxSize[0]))
    iy = int(math.floor(y*invBoxSize[1]))
    iz = int(math.floor(z*invBoxSize[2]))
    return getBoxFromTupleInner(ix, iy, iz, nLocalBoxes, gs)

(Changing the parameter r to individual components was not strictly necessary.)

And the call sites change (for example) from

        iBox = self.getBoxFromCoord([x, y, z])

to

        iBox = getBoxFromCoordInner(x, y, z, self.invBoxSize, self.nLocalBoxes, self.gridSize)

This improves performance to 20 μs/atom.

Repeating the same transformation for putAtomInBox gives 18.4 μs/atom.

Second step

Run the profiler again. By self time, loadAtomsBuffer is at the top. Let's look at that in context. Sort by inclusive time, and we see that the parts of the call tree starting at redistributeAtoms take a significant amount of time.

KCachegrind screenshot of functions sorted by inclusive time

KCachegrind screenshot of call graph

This part of the code:

  • Applies periodic boundary conditions
  • Moves atoms to a new cell
  • Packs atom into a buffer at source
  • Unpacks buffer into atom data structure at destination

This packing/unpacking anticipates the parallel version, which transmits the buffer across processors.

A previous attempt at using numpy records did not work well (and ran into a serious performance regression with numpy 1.10). This time I went with two buffers - one for integers, and one for floating point numbers. This works better, and the performance is now 10.2 μs/atom.

More steps

Continuing the process of profiling, and converting routines to be Numba friendly eventually reached a performance of 2.9 μs/atom. (Wow, this is only about 30% slower than C.)

Modified code is here

The updated performance table is:

| Language/compiler | Version | Initial time | Final time | |
|-------------------|---------|-------------:|-----------:|---|
| Python | 2.7.10 | 1014 | 1014 | |
| PyPy | 4.0 | 30 | 30 | |
| Cython | 0.23.3 | 729 | 13 | |
| Julia | 0.4.0-rc3 | 87 | 6.1 | |
| Numba | 0.22.1 | 867 | 2.9 | New result |
| C (clang) | 3.7 | 2.2 | 2.2 | |


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.
The 'Initial Time' column results from the minimal amount of code changes to get the compiler working.
The 'Final Time' is the time after tuning.


Do note that the Cython, Julia, and Numba results reflect the amount of effort put into optimization. Cython and Julia still need to be improved with the assistance of a profiler (or in Julia's case, a better viewer for existing profile data).

Summary

Using a profiler to guide our optimization efforts has been very helpful.

The Numba results are really promising, but, in addition to creating ugly code, it required an amount of work that I would not want to perform on a regular basis. These transformations are fairly regular, so it should be possible to incorporate them into Numba. Alternately, if doing so in a safe manner inside the compiler is difficult, some sort of automated AST transformation of the source should be possible.

As the optimization process proceeds on this code, an increasing amount of time is being spent in the core routine, computeForce (as it should be), and we will need to move beyond a function-level profiler to look for optimization opportunities.

Performance Updates with PyPy 4.0

The PyPy team recently released version 4.0. (The jump in version number is to reduce confusion with the Python version supported.) One of the features is improved performance.

But first, an issue with reporting accurate timings with this version of CoMD should be addressed. The initial iteration contains overhead from tracing and JIT compilation (Cython and Numba have the same issue). For this example we are concerned with the steady-state timing, so the time for the first iteration should be excluded. I've added a '--skip' parameter to the CoMD code (default: 1) that skips the first printRate steps (default: 10) in computing the overall average update rate at the end.

Now the table with the most recent performance numbers (from this post), updated with PyPy 4.0 results:

| Language/compiler | Version | Time | |
|-------------------|---------|-----:|---|
| Python | 2.7.10 | 1014 | |
| PyPy | 2.6.1 | 96 | |
| Numba | 0.21.0 | 47 | |
| PyPy | 4.0 | 30 | New result |
| Cython | 0.23.3 | 13 | |
| Julia | 0.4.0-rc3 | 6.1 | |
| C | 4.8.2 | 2.3 | |


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.

The new release of PyPy also includes some SIMD vectorization support (--jit vec=1 or --jit vec_all=1). Neither of these provided any improvement in performance on this code. Not too surprising given the vectorization support is new, and the code contains conditionals in the inner loop.

PyPy 4.0 gives a very good 34x performance improvement over bare Python, and a 3x improvement over the previous release (2.6.1). PyPy is attractive here in that no modifications were made to the source. (Both Cython and Numba required source changes.)

Prototype for Profiling Python

The last post covered some technical background on using vmprof to profile Python with compiled or JIT'ed extensions. Now I've created a prototype that can convert the output to callgrind format so it can be viewed with KCachegrind.

To install the prototype using the Anaconda distribution:

  1. Create a new environment (if you do not use a new environment, these packages may conflict with an existing Numba install): conda create -n profiling python numpy
  2. Switch to the new environment: source activate profiling
  3. Install prototype versions of Numba and llvmlite: conda install -c https://conda.anaconda.org/mdewing numba-profiling
  4. Install prototype version of vmprof: conda install -c https://conda.anaconda.org/mdewing vmprof-numba
  5. Make sure libunwind is installed. (On Ubuntu, apt-get install libunwind8-dev - it must be the -dev version.) If it is not installed, the error message when trying to run vmprof is ImportError: libunwind.so.8: cannot open shared object file: No such file or directory
  6. Install KCachegrind (On Ubuntu, apt-get install kcachegrind)

There is a wrapper (vmprofrun) that automates the running and processing steps. To use it, run vmprofrun <target python script> [arguments to the python script]. (No need to specify python - that gets added to the command line automatically.) By default it will output vmprof-<pid>.out, which can be viewed in KCachegrind.

Underneath, the vmprofrun tool saves the vmprof output during the run to out.vmprof. After the run, it automatically copies the /tmp/perf-<pid>.map file to the current directory (if running under Numba). It moves out.vmprof to out-<pid>.vmprof. Finally it runs vmproftocallgrind using these files as input.

Limitations

  1. Only works on 64-bit Linux
  2. Function-level profiles only - no line information (for either python or native code)
  3. Sometimes the profiling hangs during the run - kill the process and try again.
  4. Works with Python 2.7 or 3.4
  5. Not well validated or tested yet
  6. It does not work well yet with the existing vmprof web visualization and CLI tools.

Other notes:

  • The stack dump tool will process the stacks to remove the Python interpreter frames.
  • By default the Numba Dispatcher_call level is removed. Otherwise the call graph in KCachegrind gets tangled by all the call paths running through this function.
  • It should work with C extensions and Cython as well.

Towards Profiling Accelerated Python

One of the conclusions from last post is a need for better profiling tools to show where time is spent in the code. Profiling Python + JIT'ed code requires dealing with a couple of issues.

The first issue is collecting stack information at different language levels. A native profiler collects a stack for the JIT'ed (or compiled extension) code, but eventually the stack enters the implementation of the Python interpreter loop. Unless we are trying to optimize the interpreter loop, this is not useful. We would rather know what Python code is being executed. Python profilers can collect the stack at the Python level, but can't collect native code stacks.

The PyPy developers created a solution in vmprof. It walks the stack like a native profiler, but also hooks the Python interpreter so that it can collect the Python code's file, function, and line number. This solution is general to any type of compiled extension (C extensions, Cython, Numba, etc.) Read the section in the vmprof docs on Why a new profiler? for more information.

The second issue is particular to JIT'ed code - resolving symbol information after the run. For low overhead, native profilers collect a minimum of information at runtime (usually the Instruction Pointer (IP) address at each stack level). These IP addresses need to resolved to symbol information after collection. Normally this information is kept in debug sections that are generated at compile time. However, with JIT compilation, the functions and their address mappings are generated at runtime.

LLVM includes an interface to get symbol information at runtime. The simplest way to keep it for use after the run is to follow the Linux perf standard (documented here), which stores the address, size, and function name in a file /tmp/perf-<pid>.map.
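Each line in that file is a start address, a size, and a symbol name (the addresses and names below are made up for illustration):

7f3a5c001000 1a0 numba_computeForce
7f3a5c0011a0 c8 numba_putAtomInBox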

To enable Numba with vmprof, I've created a version of llvmlite that is amenable to stack collection, at the perf branch here.

This does two things:

  1. Keep the frame pointer in JIT'ed code, so a backtrace can be taken (see footnote 1).
  2. Output a perf-compatible JIT map file (not on by default - need to call enable_jit_events to turn it on)

To use this, modify Numba to enable JIT events and frame pointers:

  1. In targets/codegen.py, at the end of the _init method of BaseCPUCodegen, add self._engine.enable_jit_events()
  2. And for good measure, turn on frame pointers for Numba code as well (set CFLAGS=-fno-omit-frame-pointer before building it)

The next piece is a modified version of vmprof (in branch numba). So far all it does is read the perf-compatible output and dump raw stacks. Filtering and aggregating Numba stacks remains to be done (meaning neither the CLI nor the GUI display work yet).

How to use what works, so far:

  1. Run vmprof, using perf-enabled Numba above: python -m vmprof -o vmprof.out <target python>
  2. Copy map file /tmp/perf-<pid>.map to some directory. I usually copy vmprof.out to something like vmprof-<pid>.out to remember which files correlate.
  3. View raw stacks with vmprofdump vmprof-<pid>.out --perf perf-<pid>.map.

  1. With x86_64, it is possible to use DWARF debug information to walk the stack. I couldn't figure out how to output the appropriate debug information. LLVM 3.6 has a promising target option named JITEmitDebugInfo. However, JITEmitDebugInfo is a lie! It's not hooked up to anything, and has been removed in LLVM 3.7.