Introduction to Chrome Tracing

The Chrome trace viewer and the improved Perfetto viewer were written for Chrome and Android system profiling, but they make great general-purpose timeline viewers for your own trace data.

This post is about the trace format and some minimal C++ code to generate the trace output in your own projects.

The viewer built in to Chrome can be opened by entering "chrome://tracing" in the URL bar, or use the improved Perfetto viewer at ui.perfetto.dev.

The Chrome Trace Format is documented here.

There are a number of event types, but to start we need the duration events ('B' and 'E' pairs) and/or complete events ('X', which combines a begin/end pair into a single event entry).

To start, we can experiment by writing a file by hand. The pid (process id), tid (thread id), ph (phase, basically the event type), and ts (timestamp, in microseconds) fields are required. The name field is technically not required, but you'll most likely want to use it. The args field holds arbitrary extra information that can be displayed when the event is selected.

Note that duration events have some ordering requirements - they must be ordered by time within a single thread. The complete events have no such requirement, and so are a little easier to work with.

The following file has two events (cleverly named "event1" and "event2"):

[
 {"name":"event1","ph":"B","ts":"0",  "pid":"1","tid":"1"},
 {"name":"event1","ph":"E","ts":"100","pid":"1","tid":"1"},
 {"name":"event2","ph":"B","ts":"200","pid":"1","tid":"2"},
 {"name":"event2","ph":"E","ts":"400","pid":"1","tid":"2",
   "args":{"iteration":1    }}
]

Load that into Perfetto and we get this:

Perfetto timeline viewer with two events

In the screenshot, "event2" is selected and the information in the args field is shown in the lower right.

Here is the same file using complete events. A complete event adds a 'dur' field holding the duration of the event. The unit of time is microseconds, but the times and durations can be floating-point numbers for more precision.

[
 {"name":"event1","ph":"X","ts":"0",  "dur":"100","pid":1,"tid":1},
 {"name":"event2","ph":"X","ts":"200","dur":"200","pid":1,"tid":2,
   "args":{"iteration": 1    } }
]

Simple Tracing in C++

One possible interface could use the following class.

#include <chrono>
#include <string>

class Timer {
public:
    using TimePoint = std::chrono::steady_clock::time_point;

    void start();
    void stop();

    Timer(const std::string& name) : name_(name) {}

private:
    TimePoint start_time_;
    std::string name_;
};

The start time is stored for use in computing the elapsed time. This could be avoided by just outputting 'B' and 'E' events, but for this example we will output 'X' events.

The timer also keeps a name, which is used for the name of the event.

Also declare a function to write the trace at the end of the run.

// Write trace to file in JSON format
void writeTrace(const std::string& fname);

Now, on to the implementation file. It needs a handful of standard headers (plus the header that declares Timer, which the example below assumes is named trace.h):

#include "trace.h"

#include <chrono>
#include <fstream>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

With those in place, create a structure to hold all the information for an event.

struct EventRecord
{
    double timestamp; // in seconds
    char event_type;
    std::string name;
    std::thread::id tid;
    double duration; // in seconds

    EventRecord(double ts, char type, const std::string& name,
                std::thread::id tid1, double dur = 0.0)
        : timestamp(ts), event_type(type), name(name), tid(tid1), duration(dur) {}
};

Then some storage for the events and a mutex for thread safety.

std::vector<EventRecord> events;
std::mutex event_mutex;

It's convenient to store the program start time so all the event start times can be computed relative to the start of the program. (Yes, I'm using global variables. That's good enough for a simple implementation.)

using TimePoint = std::chrono::steady_clock::time_point;
TimePoint program_start = std::chrono::steady_clock::now();

The implementation of the start function is simple - just save the clock time for later.

void Timer::start()
{
    start_time_ = std::chrono::steady_clock::now();
}

The stop function is a bit more involved. It reads the clock, computes the start time (relative to program start) and the duration, and gets the thread id.

void Timer::stop() {
  TimePoint stop_time = std::chrono::steady_clock::now();

  std::chrono::duration<double> from_program_start =
      start_time_ - program_start;
  std::chrono::duration<double> dur = stop_time - start_time_;

  std::thread::id tid = std::this_thread::get_id();

The next part of the stop function saves the record in the event record vector. The mutex is there in case these timers are being called from multiple threads.

  std::lock_guard<std::mutex> lock(event_mutex);
  events.push_back(
      EventRecord(from_program_start.count(), 'X', name_, tid, dur.count()));
}

Finally, the event information is written to a JSON file. It's convenient to map each thread id to a small integer for readability. This code segment also opens the file.

void writeTrace(const std::string &fname) {
  // Map thread id to a small integer
  int small_tid = 0;
  std::map<std::thread::id, int> tid_map;

  std::ofstream f(fname);

Next, output the opening JSON section, then loop over the events, mapping each thread id to a small integer the first time it is seen.

  f << "{\n";
  f << R"("traceEvents": [)" << "\n";
  for (int idx = 0; idx < events.size(); idx++) {
    EventRecord &er = events[idx];
    if (tid_map.find(er.tid) == tid_map.end()) {
      tid_map[er.tid] = small_tid++;
    }

In the next part of the function, the event record is output. Times are converted from seconds in the record to microseconds for the JSON file. At the end of the loop is a check to avoid a comma after the final entry.

    f << R"({"name":")" << er.name << R"(",)" << R"("ph":")" << er.event_type
      << R"(",)" << R"("pid":")" << 1 << R"(",)" << R"("tid":")"
      << tid_map[er.tid] << R"(",)" << R"("ts":)" << (er.timestamp * 1.0e6);

    if (er.event_type == 'X')
      f << R"(,"dur":)" << (er.duration * 1.0e6);
    f << "}";
    if (idx < events.size() - 1) {
      f << ",\n";
    } else {
      f << "\n";
    }
  }
  f << "}" << std::endl;
}

An example of using the trace code

#include "trace.h"
#include <unistd.h> // for sleep

int main()
{
    Timer t1("first");
    t1.start();
    sleep(1);

    Timer t2("second");
    t2.start();
    sleep(1);

    t2.stop();

    sleep(1);

    t1.stop();

    writeTrace("out1.json");
    return 0;
}

Viewing the "out1.json" file in Perfetto, we get Perfetto timeline view from tracing example

These files can be found at this gist.
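One convenient extension (not part of the code above, just a possible sketch built on the Timer interface) is an RAII wrapper so a timer stops automatically when a scope ends:

class ScopedTimer {
public:
    ScopedTimer(const std::string& name) : timer_(name) { timer_.start(); }
    ~ScopedTimer() { timer_.stop(); }

private:
    Timer timer_;
};

void do_work()
{
    ScopedTimer t("do_work"); // the 'X' event is recorded when t goes out of scope
    // ... work to be timed ...
}

With this, instrumenting a function is a single line at the top of its body.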

Final thoughts

This post has looked at using the Chrome Trace Format and the Perfetto viewer and presented a minimal C++ library for collecting and writing out trace data.

A future post will look at the trace information from the OpenMP offload runtime (built-in to the LLVM mainline OpenMP) and how to combine it with application-level trace data.

Testing Scientific Software

A perennial question when writing scientific code is: how do you know the code is producing the right answer?

This post describes the QMC Algorithms repository and testing in QMCPACK development to provide some answers to this question. QMCPACK implements Quantum Monte Carlo (QMC) methods for solving the Schrödinger equation for atoms, molecules, and solids.

The repository focuses on a few areas to contribute to testing and understanding scientific software:

  1. Derived formulas using symbolic math
  2. Code generation from symbolic math
  3. Small problems with simple algorithms
  4. Reproduce and explain papers

Another driving force in this work is my strong conviction that symbolic mathematics is the most appropriate semantic level for describing scientific algorithms. In this repository, Sympy is used for symbolic mathematics.

Deriving formulas

Some parts of scientific code are the final result of a series of derivation steps that are then translated to code. How do we know these steps are correct, especially the final step from math to code? Usually these steps are all carried out by hand, with opportunities for errors to creep in. We can use the computer to perform and check some of these steps.

Computing derivatives for comparisons

A Padé (rational function) form is the simplest functional form for a Jastrow factor, which describes electron-electron or electron-nucleus correlation in a wavefunction. The Pade_Jastrow.ipynb notebook simply expresses the form, computes some derivatives, and evaluates them at a particular value.

The simplest way to use the results is to cut and paste the values from the notebook to the test case. This gets tedious very quickly and is not easy to regenerate, especially when there are more than a few values to copy. It's straightforward to make a little template to output the values in a form that is more directly usable in the test.

For example, the following Python code will output a line suitable for use in a C++ unit test (with Catch2 assertions).

from sympy import symbols

A, B, r = symbols('A B r')
sym_u = A*r/(1.0 + B*r) - A/B
val_u = sym_u.subs({A: -0.25, B: 0.1, r: 1.0})
print(f'CHECK(u == Approx({val_u}));')

Other Derivations

There are a few types of splines used in QMCPACK.

For cubic splines, the notebook sets up the equations based on constraints and boundary conditions. It computes an example and solves the equations using a dense linear solver. (This is an example of using a simpler algorithm for validation - the real code uses a more efficient tridiagonal solver.) This gets incorporated into QMCPACK tests with the script gen_cubic_spline.py, and the output of that script is in test_one_dim_cubic_spline.cpp.

B-splines are used in a one-dimensional form for the Jastrow electron-electron and electron-ion correlation factors, and in a three-dimensional form for representing single-particle periodic orbitals, where they are much faster to evaluate than a plane-wave representation. The Explain_Bspline.ipynb notebook starts from B-splines defined in Sympy and works through how they get used in QMCPACK.

Code generation

Code generation is used for the solver for the coefficients of cubic splines. It starts from the tridiagonal matrix equations from Wikipedia, derives the cubic spline equations (same as previously), combines the two, and outputs the resulting algorithm as C++ code.

The notebook is CubicSplineSolver.ipynb. The corresponding script in QMCPACK is gen_cubic_spline_solver.py. The generated spline solver is located in SplineSolvers.h.

Some simpler code generation in the QMCPACK repository involves the angular part of atomic orbitals.
There is a notebook about Gaussian orbitals. In this case, the starting expression is simple, but computing the various derivatives is tedious; the script in QMCPACK, gen_cartesian_tensor.py, takes care of that part.

Simple algorithms

Some ways to obtain reference values for test cases include special cases that have analytic solutions and small problems that can be solved using an alternative method that is easier to understand and verify than the implemented algorithm.

Computing long-range sums of periodic Coulomb potentials requires some tricks to ensure convergence. The Ewald method splits the sum into two pieces and uses the Fourier transform of the long-range part for faster convergence. It is implemented in a simple Python script, ewald_sum.py, which can be used to verify the more sophisticated splits of the sum used in the code.

For the Schrödinger equation, an exact analytic solution is known for the hydrogen atom. A deliberately incorrect wavefunction is used as a trial wavefunction in Variational_Hydrogen.ipynb to demonstrate the variational principle.

A slightly larger problem is that of a helium atom. The ground state wavefunction for a helium atom does not have an analytic solution and is the simplest example involving electron-electron correlation. The notebook Variational_Helium.ipynb uses symbolic means to find the derivatives of the local energy, and grid-based integration instead of Monte Carlo. (The details of the integration are discussed here.)

Another approach to the derivatives is automatic differentiation. There is a minimal QMC code used for creating reference values, particularly for derivatives of variational parameters and orbital rotation. It uses the autograd package for differentiation. The QMC loop is in run_qmc.py.

The target systems are:

  • Helium atom in gen_rotated_lcao_wf.py
  • Beryllium atom in rot_be_sto_wf.py
  • Beryllium atom with two determinants in rot_multi_be_sto_wf.py
  • Solid beryllium with a pseudopotential in eval_bspline_spo.py

(To improve on the performance of run_qmc.py and related scripts, I wrote a version using the multiprocessing package in Python. For even more performance, there is a port of these scripts to Julia.)

Reproduce and explain papers

Published papers contain derivations that leave intermediate steps as an exercise for the reader, should they want to reproduce or implement the described algorithm.

There is a nice writeup, contributed by Andrew Baczewski, that explains and reproduces the paper "Observations on the statistical iteration of matrices" in the notebook Reproduce Hetherington PRA1984.ipynb. The paper describes a stochastic reconfiguration method for controlling variance and bias in a Monte Carlo simulation.

Another example involves a cusp correction scheme, which adds some modifications to Gaussian-type orbitals commonly used in quantum chemistry to work better for a Quantum Monte Carlo wavefunction. It comes from the paper Scheme for adding electron-nucleus cusps to Gaussian orbitals. The notebook CuspCorrection.ipynb follows through solving some polynomial equations and presents some examples. The script used to generate reference values for tests in QMCPACK is gen_cusp_corr.py.

The process for using notebooks is:

  1. The Jupyter notebook provides exposition of an algorithm or concept.
  2. The code gets copied to a script in the appropriate test directory in the QMCPACK repository, and that script produces the reference values for the QMCPACK tests.

This does result in duplication of code, but at least the scripts and tests in the QMCPACK repository are self-contained.

Final thoughts

This post described several techniques for testing scientific software.

One of the important aspects of learning is working with the material and wrestling with it. At some point, you have to do the derivation yourself in order to learn it. These notebooks are a record of my approach to exploring these topics. Are these useful to anyone else? Could they be made more useful?

The descriptions in many of the notebooks are very short and could be expanded. What would make them more useful? For someone trying to understand the code? Or for someone trying to reproduce and extend the work?

Guide to my Github Repositories

This post is a short guide to my Github repositories.

Some of them I use as a wiki for keeping information. Markdown makes adding and editing text easy, and it can also contain code.

Other repositories have more code to explore a particular topic. I should expand on some of these topics in future blog posts. (Especially the contents and use of the "QMC Algorithms" repository. It's used as a base for creating validated unit test data for QMCPACK, and it's a direction I think more projects should move towards.)

Wiki-like Information

GPU Internals - Documentation on how GPUs work and programming models

ML in scientific computation - Machine learning and scientific computation

Programming Ideas

Next Steps in Programming - Ideas on programming scientific applications. Includes a focus on treating code changes as a first-class/primary object in programming.

Programming Tutorial Maker - Attempt at a tool to make writing incremental tutorials easier.

Scientific Computing

QMC Algorithms - Scripts, Jupyter notebooks, Sympy derivations, and code generation to support validating QMC codes. (Also here)

QMC Kernels - Kernels used in Quantum Monte Carlo in general, and QMCPACK in particular. Also some simple kernels used to explore GPU offload programming models.

Small Molecules - Example input files for quantum chemistry packages and information on various small molecules.

Quadrature - Multidimensional quadrature algorithms. (Technically "cubature"). In addition to notes and links, this has some implementations.

Floating point - Floating point arithmetic. Particularly for exploring low precision. (the 'silver' part of the repository name comes from Github's initial repository name suggestion)

Symbolic Derivatives with Code Generation - Represents scientific computations using Sympy. It can take derivatives symbolically, then generate code for the result in Python, Julia, or C++.

Multitevo - Translation of the CoMD molecular dynamics miniapp to other languages, such as Python and Julia. And JavaScript, should I get it checked in and pushed.

Bash for C programmers

Bash scripting

Why use Bash? It is not my preferred language, but it has two big advantages:

  • Availability - we can safely assume bash is present everywhere. No need to install additional software.
  • Ease of starting and incremental building - a script can start as a small list of commands and grow incrementally.

The rest of this post is not a tutorial, but some ways to think about bash scripting in order to avoid common traps when coming from C/C++ or Python.

Quick Intro

Whitespace is significant

Almost everything in bash is a command with arguments. After all the substitutions and quoting, the basic structure of each line is a command and its arguments, separated by whitespace. This makes whitespace significant in ways that can be confusing when coming from other languages.

One place this shows up is in setting variables - there should be no whitespace on either side of the equal sign. The statement a = 1 will fail with an error about 'a' not found - the shell interprets it as a command. Similarly a= 1 will fail with an error about '1' not found.
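For example:

a=1     # assigns the string "1" to a
a = 1   # error: bash tries to run a command named 'a' with arguments '=' and '1'
a= 1    # error: bash tries to run a command named '1' (with 'a' set to empty for that command)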

Another place significant whitespace appears is in if statements. The opening bracket for the test ([) looks like syntax, but it is a command. (By symmetry you might expect ] to be a command, but it's not - it gets passed as an argument to [).

a=0
if [ $a -eq 0 ]; then
  echo "a is 0"
fi

Also note the semicolon after the comparison to terminate the command. As an alternative, the then can be put on the next line:

a=0
if [ $a -eq 0 ]
then
  echo "a is 0"
fi

Bash also has a double bracket form of testing that is a built-in, and not an external command.
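For example, the double-bracket form supports pattern matching on strings, which the external [ command does not:

a=0
if [[ $a -eq 0 ]]; then
  echo "a is 0"
fi

name="input.yaml"
if [[ $name == *.yaml ]]; then
  echo "looks like a YAML file"
fi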

Exit values

Exit values from commands are 'success' or 'failure'. Tests of the exit value, like the 'if' statement or the control operators (&& and ||) operate on these exit values (and only exit values). Numerically, success maps to 0, whereas failure maps to a non-zero value. This numerical mapping is opposite of many other languages, like C/C++ and Python, where 0 is false and non-zero values are true. Using multiple values for failure can give some indication of what went wrong with the command.

The best way to think about this is not to mentally reverse the values, but to think in terms of 'success' and 'failure', and only think about the numerical mapping if necessary.
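For example, grep succeeds when it finds a match, so a test reads naturally in terms of success and failure (the file name here is just a placeholder):

if grep -q "error" build.log; then
  echo "the build log contains errors"
fi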

Sequencing and control operator idioms:

  • A;B Run A and then B, regardless of success of A
  • A && B Run B if A succeeded
  • A || B Run B if A failed
  • A & Run A in the background

( from https://unix.stackexchange.com/questions/24684/confusing-use-of-and-operators )
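A few concrete examples of these idioms (the command names are placeholders):

mkdir -p build && cd build                    # cd only if mkdir succeeded
command -v distcc || echo "distcc not found"  # echo only if distcc is missing
long_job &                                    # run long_job in the background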

Functions

Functions start with "function name {" or "name () {". The body is a series of commands, and the function ends with }. Once again, whitespace is significant. The function declaration should be on a line by itself, and the closing brace should also be on a line by itself.

A function call looks like a normal command statement, with the usual space-separated arguments.

Inside a function, arguments are referenced using $1, $2, etc. Unset variables are empty.

There is a syntax for optional arguments:

var=${1:-default_string_here}
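Putting these pieces together, here is a small sketch of a function with a required and an optional argument (greet is a made-up name):

greet () {
  local name=$1
  local greeting=${2:-Hello}
  echo "$greeting, $name"
}

greet World              # prints "Hello, World"
greet World "Good day"   # prints "Good day, World"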

Variable replacement

Variable replacement occurs before bash breaks the line into a command and arguments based on whitespace. Unexpected behavior can happen if a variable contains whitespace and the developer was not expecting it. This can happen when the variable holds a directory name, and the script is run using a directory containing a space.

a="hi there"
cmd $a

expands to calling cmd with two arguments 'hi' and 'there'. Enclose the variable in quotes to keep spaces confined.

This will call cmd with one argument: 'hi there'.

a="hi there"
cmd "$a"

For a gory step-by-step look at how bash processes input, see https://mywiki.wooledge.org/BashParser

Robust bash scripts

See https://www.davidpashley.com/articles/writing-robust-shell-scripts/ for advice on writing robust scripts.

Part of the advice is to use these settings

  • set -o nounset to detect the use of uninitialized variables
  • set -o errexit to exit the script if any statement fails
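For example, a script might start like this (a minimal sketch):

#!/bin/bash
set -o nounset   # error on use of an unset variable
set -o errexit   # exit if any command fails

builddir=$1      # with nounset, a missing argument stops the script here
mkdir -p "$builddir"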

Building A Parallella Cluster

I finally got around to assembling my small pile of Parallella boards into a cluster.

Alternate side view

This post documents the choices I made about power distribution, mechanical assembly, cooling, software management, etc. For background on the individual boards, see my previous post: Introduction to Parallella.

The software directions are based on the Linaro 14.04 image. (At some point I should upgrade to the latest Paraubuntu release.)

Power

There are several options for powering multiple boards:

  1. Solder a jumper and use the mounting pads with metal standoffs to transmit the power. This option requires the fewest cords.
  2. Use micro USB connector (need to switch jumper J14). There are multi-port USB power supplies that should work. This is probably the simplest option for a small number of boards.
  3. Use the 5.5mm barrel plug (default settings). There are versions with screw terminals on Amazon (these for example).

I went with option 3 and used an old PC power supply for the power.

Mechanical assembly

I used 20mm nylon standoffs and assembled the boards into two stacks. A small piece of 1/4" plywood was used for the base.

Cooling

For air flow, I used a cooling fan from an old PC and mounted it to the plywood base.

Network

Needing ports for a maximum of 10 boards, plus one port for the external connection, I chose a 16-port Gigabit D-Link switch. Eventually I would like to power the network switch from the PC power supply, but need to get the right plug first.

The nodes use DHCP from my home network to get their addresses. A future improvement is to run a DHCP server on one node and use that to supply addresses to the other nodes. This would make the cluster independent of my home network.

MicroSD cards

Follow the directions on the Parallella Create SD Card page. After burning the image, use gparted to resize the partition to the full capacity of the card.

Management and control

Performing software installs and other management on all of the boards individually would be too tedious. There are many solutions for managing clusters. I decided to use Ansible, as it seemed the simplest (no software needed on the nodes) and it runs over ssh.

In addition to controlling operations from a PC, it is useful to designate one node as a 'head node' and install Ansible there as well. For MPI, it's easier to run MPI programs from the head node than from the PC. For setting up configuration files, it can be useful to create or edit the file and make sure it works on one node, and then copy the file to all the other nodes.

Ansible and MPI (and general convenience) require setting up passwordless ssh login.

Once the keys are set up locally, you can use ssh-copy-id to copy the credentials.

ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145

I keep a small script (named ssh_copy_id) that puts this line in the history:

history -s "ssh-copy-id -i ~/.ssh/id_rsa.pub parallella@10.0.0.145"

Run the command source ssh_copy_id to put the command on the history list. Use the bash history and line editing features to select the command and update to a new address.

Rather than create a new user, I'm using the default 'parallella' user on the nodes. The SSH config (in '.ssh/config') can be set up to switch users upon login.

Host 10.0.0.127
    User parallella

You might wish to set up an apt-cache server on a local machine to save on download bandwidth when installing software to all the nodes.

Using Ansible

See the intro docs.

The inventory file is a list of IP addresses (one per line). It can be specified on the command line with '-i'. To see if all the nodes work:

ansible -i cluster.ini all -m ping
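Here, cluster.ini is simply the list of node addresses, one per line (these particular addresses are placeholders for whatever DHCP assigned):

10.0.0.144
10.0.0.145
10.0.0.146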

To copy the apt-cache server configuration to all the nodes, use

ansible -i cluster.ini all --sudo -m copy -a "src=/etc/apt/apt.conf.d/02proxy dest=/etc/apt/apt.conf.d/02proxy"

To shutdown all the nodes, use

ansible -i cluster.ini all --sudo -m shell -a "shutdown now"

Compilation

In the introductory post I talked about cross compiling for the board. That gets more complicated with larger software packages. For instance, one of my project dependencies, HDF, doesn't cross-compile easily (or at all).

Since the nodes use a regular Ubuntu distribution, native compilation is easy, but slow.

One solution is to use distcc.

The particular software package I'm working with (QMCPACK, which does simulations on atoms, molecules, and solids) uses CMake for configuration and build, and builds fine with distcc.

Install on all nodes with

ansible -i cluster.ini all --sudo  -a "apt-get install -y distcc"

Set the DISTCC_HOSTS variable to the set of systems to use

export DISTCC_HOSTS="localhost @10.0.0.144/2 @10.0.0.145/2"

This example shows three hosts. The initial '@' means to use ssh (no daemon required on remote) and the '/2' on the end means to use two threads.

Now set the C and C++ compilers to distcc <compiler> and run the build.

For CMake, building a project with MPI, this is

export CC="distcc mpicc"
export CXX="distcc mpic++"

Then

make -j 8

You might see a warning message

distccd[<pid>] (tcp_cork_sock) Warning: setsockopt(corked=1) failed: Socket operation on non-socket

This can be ignored. Or follow some directions to silence it.

MPI

One popular method for writing programs that communicate across the boards is MPI (Message Passing Interface).

Install the 'mpi-default-dev' package (which should also install the 'mpi-default-bin' package). This installs the Open MPI implementation (the alternative being MPICH). Note that this MPI is only concerned with communication between the ARM cores on different boards. There is also the Brown Deer Technology version of MPI for programming the Epiphany accelerator.

It's common to use a networked file system so each node has access to the executables and input files. Ansible has file distribution commands that work well enough that a networked file system isn't strictly necessary. (Be aware when copying directories with Ansible that if the directory specified in src does not end with '/', the directory and its contents are copied. If it does end with '/', just the directory contents are copied.)

Open MPI uses a tree-based launcher for better scalable start-up performance. Because of this, each node should be able to log into each other node (not just head node to other nodes).

MPI needs a list of machines to run on. One method is to create a host file and pass it to mpirun with the --hostfile option. The host file, at its simplest, is one hostname or IP address per line (same as a simple Ansible inventory file.)
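For example, with a host file named hosts, a run across the cluster can be launched like this (the process count and executable name are placeholders):

mpirun --hostfile hosts -np 16 ./my_mpi_program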

Gallery

Front/side view   Back view  

Side view   Top view   Alternate side view

Integration Callbacks with Sympy and LLVM

This post explores various packages for multi-dimensional integration, along with generating callbacks for the integrands from Sympy using an LLVM JIT.

Problem to integrate

The particular problem is using the variational principle to find the ground state energy for atoms. Some Jupyter notebooks with a description of the problem, along with various integration methods:

Ground state energy of Hydrogen Atom (This yields a 3 dimensional integral.)

Ground state energy of Helium Atom (This yields a 6 dimensional integral.)

The standard solution to these integrals is to use Markov Chain Monte Carlo (the Quantum Monte Carlo method).
However, I'm curious to see how far alternate integration schemes or existing integration packages would work.

Integration libraries

The scipy quadrature routines accept a natively compiled callback for the integrand. (Noticing this in the documentation initiated the idea for using JIT compilation for callback functions.)

Next up is the Cubature integration package, with the Python wrapper for cubature.

Finally, there is the Cuba library, with the PyCuba interface (part of the PyMultiNest package).

There are some other libraries, such as HIntLib, that I would also like to try. There doesn't seem to be a Python interface for HIntLib - let me know if there is one somewhere, or if there are other multidimensional integration packages worth trying.

Evaluating the integrand

One of my scientific programming goals is to generate efficient code from a symbolic expression. To this end, I've been working on an LLVM JIT converter for Sympy expressions (using the llvmlite wrapper).

For the Sympy code, see these pull requests:

As an aside, one can question whether this is the right approach, compared with alternatives such as:

  1. Generate C++ or Fortran and compile using the existing autowrap functionality in Sympy.
  2. Generate Python/Numpy and use Numba.
  3. Use Julia

There is always a tradeoff between a narrow, specialized solution, which is faster to implement and perhaps easier to understand, and a more general solution, which applies in more cases, but is harder and slower to implement.

Using an LLVM JIT is a specialized solution, but it does have an advantage that there is a short path from the expressions to the compiled code. One disadvantage is that it does not leverage existing compilers (Numba or C++), though LLVM compiler optimization passes are available.

Sometimes a solution just needs to be tried to gain experience with the advantages and drawbacks.

Results

For the helium atom, the integration times are reported in the table below

Integrator                          Time (seconds)
Cubature                            171
Cubature w/CSE                      141
Cubature w/CSE and multiple evals   100
Cuba (Vegas)                        29
Cuba (Cuhre)                        22


Note that scipy.nquad was not used for the 6D integral. It quickly runs out of steam because it consists of iterated one dimensional integrations, and the glue between the dimensions goes through Python, reducing the effectiveness of a compiled integrand.

The Cubature library does better. Profiling shows that most of the time is spent internal to cubature and allocating memory, so faster integrand evaluation is not going to improve the time. Some other approaches can help. One is Common Subexpression Elimination (CSE), which Sympy can perform on the expression. This extracts duplicate fragments so their value only needs to be computed once.
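As a small illustration (not the actual helium integrand), Sympy's cse function pulls repeated subexpressions out into intermediate variables:

from sympy import cse, exp, sqrt, symbols

x, y, z = symbols('x y z')
r = sqrt(x**2 + y**2 + z**2)
expr = exp(-2*r) + r*exp(-2*r)

replacements, reduced = cse(expr)
print(replacements)  # e.g. [(x0, sqrt(x**2 + y**2 + z**2)), (x1, exp(-2*x0))]
print(reduced)       # e.g. [x0*x1 + x1]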

The library also allows multiple integrals to be performed at once, which can amortize some of the library overhead. In this case, the individual calls to the integrator for the numerator and denominator can be reduced to a single call.

The Cuba library performs even better, as there is apparently less overhead inside the integration library. The Cuhre integrator uses a deterministic grid-based algorithm similar to Cubature. Vegas uses an adaptive Monte Carlo approach.

The results are not shown here, but I also used SIMD vectorization to make the function evaluation even faster, which succeeded for the bare function evaluation. (This was one of the original motivations for compiling straight to LLVM, as it would be easier to get vectorization working.) Unfortunately, it did not speed up the overall integration much (if at all), due to overhead in the libraries.

Conclusions and future work

Using an LLVM JIT to create callbacks for integration works fairly well.

One important question is how to scale the creation of callbacks to new libraries without explicitly programming each one into Sympy.
The last pull request expanded the CodeSignature class, which seems like a starting point for a more general callback specification.

Notes on CMake

Recently I started working on a project that uses CMake. I've used CMake a little before, but never really had to dive much into it. In particular, I needed to understand the scripting parts of CMake for adding tests for CTest.

Below are some comments on aspects of CMake.

Variables and variable substitution

Variable names are strings. Substitution occurs when the variable is dereferenced with ${}.

SET(var a)
MESSAGE("var = ${var}")

produces

var = a

Nested substitutions are possible

SET(var a)
SET(a b)
MESSAGE("var = ${var} ${${var}}")

will produce var = a b

Variable names can be composed during substitution

SET(var a)
SET(a_one apple)
MESSAGE("var = ${${var}_one}")

will produce var = apple

Variables and functions

Variable references act a little like pointers, but without a type system to enforce (and guide) how many indirections should be performed.

Example of using a variable inside a function:

FUNCTION(MY_FUNC arg1)
    MESSAGE("arg1 = ${arg1}")
ENDFUNCTION()

MY_FUNC(hello)
SET(var a)
MY_FUNC(var) # arg1 is set to 'var'
MY_FUNC(${var}) # arg1 is set to 'a' - this is usually what you want

produces

arg1 = hello
arg1 = var
arg1 = a

Return values from functions

There is no built-in notion of a return value from a function. To get values out of a function, write to one of the arguments.

A function creates a new scope - changes to a variable will only affect the variable's value inside the function. To affect the value in the parent, the PARENT_SCOPE modifier should be given to the SET command. (More on variable scopes here)

Another issue is that the variable name for the output value needs to be dereferenced before being set. Otherwise, a variable with the literal name used inside the function will be set in the parent, which can work by accident if the caller happens to use the same name.

Example:

FUNCTION(MY_FUNC_WITH_RET ret)
    # The following line works by accident if the name of variable in the parent
    # is the same as in the function
    SET(ret "in function" PARENT_SCOPE)
    # This is the correct way to get the variable name passed to the function
    SET(${ret} "in function" PARENT_SCOPE)
ENDFUNCTION()

SET(ret "before function")
MY_FUNC_WITH_RET(ret)
MESSAGE("output from function = ${ret}")

will produce output from function = in function

Data structures

There is only the List type, with some functions for operations on lists. The documentation on lists says that "Lists ... should not be used for complex data processing tasks", but doesn't say what to use instead.

For associative arrays or maps there are some options:

  • Two parallel lists - one of keys and one of values. Search the key list and use the index to look up value. More awkward for passing to functions.
  • Single list with alternating keys and values. Search the list for the key, and use index+1 to look up the value. Only works if the range of possibilities for keys and values are distinct (e.g., keys are strings and values are always numbers).
  • The environment (ENV{key}) is a built-in associative array. It could be overloaded to store other values, at the risk of polluting the environment.
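A minimal sketch of the second option (the list name and keys here are made up):

SET(fruit_colors apple red banana yellow)
LIST(FIND fruit_colors banana index)
IF(NOT index EQUAL -1)
    MATH(EXPR value_index "${index} + 1")
    LIST(GET fruit_colors ${value_index} value)
    MESSAGE("banana -> ${value}")  # prints "banana -> yellow"
ENDIF()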

Test timeout

The default timeout per test is 1500 seconds (25 minutes).

To increase this, adjust the value of DART_TESTING_TIMEOUT. It needs to be set as a cache variable, and it needs to be set before the enable_testing() or include(CTest) is specified.

SET( DART_TESTING_TIMEOUT 3600 CACHE STRING "Maximum time for one test")

See also this Stack Overflow post

Visualizing MD Data

Visualizing atomic positions in 3D is useful when working on a molecular dynamics code.

More generally, being able to visualize the structures and algorithms inside a code can help with understanding and debugging. With all the advances in graphics hardware, it should be possible to quickly create visualizations for various aspects of the code. But this seems harder than it should be. In this particular case, normal plotting packages can sometimes plot atomic positions with 3D scatter plots. But then further algorithm visualization is hard (animation, drawing boxes, etc).

The VisPy project looks promising. It contains three API levels

  1. gloo - the lowest-level API, around OpenGL
  2. scene - uses a scene graph description
  3. plot - a standard plotting interface

The 'scene' interface is the appropriate abstraction level for our purposes. Note this API is marked experimental and may change in the future.

Pretty pictures

Static screenshots (click for larger version)

Screenshot2 Screenshot3

And an animation (click for larger version)

Screenshot3

Code

The modified comd.py is here. It should be a drop-in replacement for that file in the nsquared version of the code. The bulk of the visualization additions start around line 154.

The perceived speed of the simulation can vary. Pure Python, even at the smallest system size, is too slow. Using the Numba-accelerated loop makes it much faster. However, for the smallest system, this feels 'too fast'. Increasing the system size will slow it down ('-x 6 -y 6 -z 6' seems to work well on my system). There are much better ways of adjusting the speed, but this is the most expedient.

The -N option specifies the number of steps (defaults to 100). Set it to a larger value to run the animation longer.

During the run, press the 's' key to take a static screenshot (stored in screenshot.png). Press the 'a' key to start/stop saving an animated segment (stored in animation.gif). These features require that the imageio package is installed.

The multiprocessing package is used to run the simulation and the visualization event loop in separate processes. Positions are passed from the simulation to the visualization via a Queue. A Timer on the visualization side checks the queue for new positions periodically.

This code snippet uses the Markers visual element to display the center of each point. The size is the size of the element on the screen, not the size in the scene (that is, elements don't change size when zooming). The current size was chosen to make the motion of the atoms easy to see, not to accurately represent the atom's size. Displaying a Sphere at each point would be more accurate, but is much slower.
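A stripped-down sketch of this architecture (not the actual comd.py code - the random-walk "simulation" and function names are made up for illustration, and the VisPy calls should be treated as approximate) might look like this:

import multiprocessing as mp

import numpy as np
from vispy import app, scene


def simulate(queue, n_steps=100, n_atoms=500):
    # Stand-in for the MD loop: push a new set of positions each step
    pos = np.random.normal(size=(n_atoms, 3)).astype(np.float32)
    for _ in range(n_steps):
        pos += 0.01 * np.random.normal(size=pos.shape).astype(np.float32)
        queue.put(pos.copy())


def visualize(queue):
    canvas = scene.SceneCanvas(keys='interactive', show=True)
    view = canvas.central_widget.add_view()
    view.camera = 'turntable'

    markers = scene.visuals.Markers(parent=view.scene)
    markers.set_data(np.zeros((1, 3), dtype=np.float32), size=8)

    def poll(event):
        # Check the queue for new positions from the simulation process
        while not queue.empty():
            markers.set_data(queue.get(), size=8)

    timer = app.Timer(interval=0.05, connect=poll, start=True)
    app.run()


if __name__ == '__main__':
    q = mp.Queue()
    sim = mp.Process(target=simulate, args=(q,))
    sim.start()
    visualize(q)
    sim.join()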

Summary

I'm very pleased with VisPy. It enabled live, interactive animation of atoms from a molecular dynamics code with a small amount of code. I expect extending this to visualize more complex algorithms and data structures should be straightforward.

More Performance With Numba

Having acquired a shiny new profiler, it's time to dig into the performance of the Numba version some more.

Picking up from the previous optimizations, I can't seem to reproduce the timing (47 μs/atom) in that table. Now I get 40 μs/atom.

First step

Run the profiler (vmprofrun comd.py) and display the results in KCacheGrind (kcachegrind vmprof-20664.out)

Sorting by self time, we see getBoxFromCoord at the top:

KCachegrind screenshot of functions sorted by self time

Also a screenshot of the call graph - getBoxFromCoord gets called from two different places: putAtomInBox and updateLinkCells.

KCachegrind screenshot of call graph

To improve performance here, convert getBoxFromCoord to a free function and put all the attribute references into function arguments.

Before:

    def getBoxFromCoord(self, r):
        ix = int(math.floor(r[0]*self.invBoxSize[0]))
        iy = int(math.floor(r[1]*self.invBoxSize[1]))
        iz = int(math.floor(r[2]*self.invBoxSize[2]))
        return self.getBoxFromTuple(ix, iy, iz)

After:

@numba.njit
def getBoxFromCoordInner(x, y, z, invBoxSize, nLocalBoxes, gs):
    ix = int(math.floor(x*invBoxSize[0]))
    iy = int(math.floor(y*invBoxSize[1]))
    iz = int(math.floor(z*invBoxSize[2]))
    return getBoxFromTupleInner(ix, iy, iz, nLocalBoxes, gs)

(Changing the parameter r to individual components was not strictly necessary.)

And the call sites change (for example) from

        iBox = self.getBoxFromCoord([x, y, z])

to

        iBox = getBoxFromCoordInner(x, y, z, self.invBoxSize, self.nLocalBoxes, self.gridSize)

This improves performance to 20 μs/atom.

Repeating the same transformation for putAtomInBox gives 18.4 μs/atom.

Second step

Run the profiler again. By self time, loadAtomsBuffer is at the top. Let's look at that in context. Sort by inclusive time, and we see that the parts of the call tree starting at redistributeAtoms take a significant amount of time.

KCachegrind screenshot of functions sorted by inclusive time

KCachegrind screenshot of call graph

This part of the code:

  • Applies periodic boundary conditions
  • Moves atoms to a new cell
  • Packs atom into a buffer at source
  • Unpacks buffer into atom data structure at destination

This packing/unpacking anticipates the parallel version, which transmits the buffer across processors.

A previous attempt at using numpy records did not work well (and ran into a serious performance regression with numpy 1.10). This time I went with two buffers - one for integers, and one for floating point numbers. This works better, and the performance is now 10.2 μs/atom.

More steps

Continuing the process of profiling, and converting routines to be Numba friendly eventually reached a performance of 2.9 μs/atom. (Wow, this is only about 30% slower than C.)

Modified code is here

The updated performance table is

Language/compiler   Version     Initial time   Final time
Python              2.7.10      1014           1014
PyPy                4.0         30             30
Cython              0.23.3      729            13
Julia               0.4.0-rc3   87             6.1
Numba               0.22.1      867            2.9     New result
C (clang)           3.7         2.2            2.2


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.
The 'Initial Time' column results from the minimal amount of code changes to get the compiler working.
The 'Final Time' is the time after tuning.


Do note that the Cython, Julia, and Numba results reflect the amount of effort put into optimization. Cython and Julia still need to be improved with the assistance of a profiler (or in Julia's case, a better viewer for existing profile data).

Summary

Using a profiler to guide our optimization efforts has been very helpful.

The Numba results are really promising, but, in addition to creating ugly code, it required an amount of work that I would not want to perform on a regular basis. These transformations are fairly regular, so it should be possible to incorporate them into Numba. Alternately, if doing so in a safe manner inside the compiler is difficult, some sort of automated AST transformation of the source should be possible.

As the optimization process proceeds on this code, increasing amount of time is being spent in the core routine, computeForce, (as it should), and we will need to move beyond a function-level profiler to look for optimization opportunities.

Performance Updates with PyPy 4.0

The PyPy team recently released version 4.0 (The jump in version number is to reduce confusion with the Python version supported.) One of the features is improved performance.

But first, an issue with reporting accurate timings with this version of CoMD should be addressed. The initial iteration contains overhead from tracing and JIT compilation (Cython and Numba have the same issue). For this example we are concerned with the steady-state timing, so the time for the first iteration should be excluded. I've added a '--skip' parameter to the CoMD code (default: 1) that skips the first printRate steps (default: 10) in computing the overall average update rate at the end.

Now the table with the most recent performance numbers (from this post), updated with PyPy 4.0 results:

Language/compiler   Version     Time
Python              2.7.10      1014
PyPy                2.6.1       96
Numba               0.21.0      47
PyPy                4.0         30     New result
Cython              0.23.3      13
Julia               0.4.0-rc3   6.1
C                   4.8.2       2.3


Times are all in μs/atom. System size is 4000 atoms. Hardware is Xeon E5-2630 v3 @ 2.4 GHz, OS is Ubuntu 12.04.

The new release of PyPy also includes some SIMD vectorization support (--jit vec=1 or --jit vec_all=1). Neither of these provided any improvement in performance on this code. Not too surprising given the vectorization support is new, and the code contains conditionals in the inner loop.

PyPy 4.0 gives a very good 34x performance improvement over bare Python, and 3x improvement over the previous release (2.6.1). PyPy is attractive here in that no modifications made to the source. (Both Cython and Numba required source changes)