
Basic hygiene for scientific coding

Authors: Tiago P. Peixoto¹
Affiliations: ¹Inverse Complexity Lab
License: CC-BY

This document describes general principles and recipes to keep your scientific programming pipeline sane and efficient, and to facilitate collaboration.

Project organization principles

A scientific project differs in one important way from typical programming tasks: initially, we usually have only a tentative idea of the methods and implementations we want to use, and a substantial amount of experimentation and revision is unavoidable. Instead of being designed from well-defined requirements, scientific code usually evolves over the time scale of a project. This needs to be taken into account during the planning and development phases to minimize the increase of entropy, which tends to create inefficiencies and mistakes, and hence wasted time and frustration.

The general organizational principles to follow are:

  1. Avoid unnecessary complexity.

  2. Avoid code duplication.

  3. Modularize your code. In particular, keep separate:

    1. Boilerplate code, e.g. submitting jobs to the HPC, scanning through parameter ranges, etc.
    2. Ad hoc code, i.e. things you will likely never use again for another project.
    3. General functions that will be useful for other projects.
    4. Core algorithms for your project.
  4. Take the time to refactor your code when entropy increase has caused the above principles to be sufficiently violated!

  5. Use version control (git) to track your changes and share the code with collaborators.

A corollary of the above is that each of these kinds of code should live in its own file or module, as the project template below demonstrates.

A project template: sampling from the Ising model

A simple project template that demonstrates how to implement the above principles is available in the repository under the folder proj_template.
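Its layout reflects the modularization principles above, with each concern kept in a separate file (the listing below summarizes the files discussed in the rest of this section):

proj_template/
├── ising.py          # core, project-specific algorithms
├── run_hpc.py        # boilerplate: parameter ranges and task execution
├── plot_mag.py       # ad hoc analysis and plotting
├── sync.sh           # boilerplate: synchronization with the HPC
├── submit_jobs.sh    # boilerplate: SLURM job submission
├── data_cache/       # general, reusable module (cloned separately, see below)
├── cache/            # persistent results (created at runtime)
└── figs/             # output figures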

This example project consists of sampling states from an Ising model on an Erdős–Rényi network, to characterize the phase transition in the magnetization as a function of the inverse temperature β.

The project-specific code is contained in the file ising.py:

ising.py
#!/usr/bin/env python3

# This should contain project-specific code

from graph_tool.all import *
import numpy as np

def gen_state(N, ak, beta):
    """Create an Ising Glauber state on an Erdős–Rényi network with N nodes,
    average degree ak, and inverse temperature beta."""
    g = random_graph(N, lambda: np.random.poisson(ak), model="erdos", directed=False)
    state = IsingGlauberState(g, beta=beta)
    return state

def mcmc_sweep(state, n=1):
    """Perform n full sweeps of asynchronous Glauber updates."""
    return state.iterate_async(niter=n * state.g.num_vertices())

def get_mag(state):
    """Return the average magnetization of the current spin configuration."""
    return state.s.fa.mean()

This file is fairly minimal in this skeleton project, since it just leverages the IsingGlauberState implementation available in graph-tool. In actual projects it will be considerably larger, and will reflect the actual substance of the study.
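Even at this stage, the functions can be exercised interactively before committing to long runs; for example (a quick local sanity check with a deliberately small system):

from ising import gen_state, mcmc_sweep, get_mag

state = gen_state(N=1000, ak=5, beta=0.5)   # small system for a quick test
mcmc_sweep(state, n=10)                     # 10 sweeps of Glauber dynamics
print(get_mag(state))                       # current average magnetization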

Typically we will want to run our analysis in an HPC environment, where we spread our tasks among many jobs running in parallel on different machines (a.k.a. nodes), which all have access to the same shared filesystem.

An elegant infrastructure that simplifies this kind of synchronization is implemented in the data_cache module. You should clone that to your project’s folder so it can be used:

cd proj_template
git clone gitlab@git.skewed.de:count0/data_cache.git
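The cache maps a name plus a dictionary of parameters to a stored Python object, and provides cooperative locks that work across machines via the shared filesystem. A minimal sketch of the API, as it is used in run_hpc.py below (the parameter values here are illustrative):

from data_cache import FSDataCache

cache = FSDataCache()                  # results are persisted in the cache directory
params = dict(N=100, beta=0.5)

cache.put("stats", params, [1, 2, 3])  # store an object under a name + parameters
print(cache.get("stats", params))      # retrieve it again; raises KeyError if absent

# non-blocking lock: `lock` is None if another process already holds it
with cache.get_lock("stats", params, block=False) as lock:
    if lock is not None:
        pass  # we have exclusive ownership of this parameter combination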

The script run_hpc.py takes care of actually running the individual tasks:

run_hpc.py
#!/usr/bin/env python3
import os.path
import itertools
import time
import numpy as np
from collections import defaultdict

from ising import *     # project-specific code

from data_cache import FSDataCache

cache = FSDataCache()   # this is the persistent cache for all results

# the following sets the parameter ranges we will iterate over

vranges = {
    "N": [10000],
    "ak": [3, 5],
    "beta" : np.linspace(0, 1, 40)
}

def iter_ranges(*args):
    return itertools.product(*[vranges[a] for a in args])

if __name__ == "__main__":

    params_list = []
    nparams = list(vranges.keys())
    for vals in iter_ranges(*nparams):
        params = dict(zip(nparams, vals))
        params_list.append(params)

    print("Parameter combinations:", len(params_list))

    # follow them in random order
    np.random.shuffle(params_list)

    while True:
        for params in params_list:
            print(params)
            with cache.get_lock("state", params, block=False) as lock:
                if lock is None:
                    # if lock is None, another process already claimed this set of
                    # parameters
                    continue

                # at this point, we hold a 'lock' on this particular combination of
                # parameters

                try:
                    # load previous state if available
                    state = cache.get("state", params)
                except KeyError:
                    # initialize our state
                    state = gen_state(**params)

                # keep track of some statistics
                try:
                    stats = cache.get("stats", params)
                except KeyError:
                    stats = defaultdict(list)

                def chkpt():
                    cache.put("state", params, state)
                    cache.put("stats", params, stats)

                # perform some computation:
                last_time = time.time()
                for i in range(100):

                    mcmc_sweep(state, n=10)

                    stats["mag"].append(get_mag(state))

                    print(params, stats["mag"][-1])

                    # if 30 mins (real time) have passed, we save the state to the
                    # filesystem
                    if time.time() - last_time > 30 * 60:  # 30 mins
                        chkpt()
                        last_time = time.time()
                        print("saved...")

                chkpt()

The script above will cycle through all parameter combinations and store the results in the cache directory. Importantly, if we run the same script in parallel, e.g.

./run_hpc.py &
./run_hpc.py &

each job will be assigned a disjoint set of tasks due to the data_cache locking mechanism, even if they are running on different nodes! This means that the level of parallelism is completely decoupled from the task specification and we can tune this later when we submit the jobs to the HPC queue.

After the jobs have completed, the results will be stored in the cache directory. We can then use a separate script called plot_mag.py to plot the results, and perform the local part of the analysis:

plot_mag.py
#!/usr/bin/env python3

import os.path
from pylab import *

import style

from run_hpc import *   # parameter list and cache are imported here

for N in vranges["N"]:
    output = f"figs/N{N}-mag.pdf"
    if not os.path.exists(output):
        fig, ax = subplots(1, 1, figsize=(4, 3))

        for ak in vranges["ak"]:

            betas = []
            mags = []

            for beta in vranges["beta"]:

                params = dict(N=N, ak=ak, beta=beta)

                try:
                    stats = cache.get("stats", params)
                except KeyError:
                    continue

                betas.append(beta)
                mags.append(abs(stats["mag"][-1]))

            ax.plot(betas, mags, "s-", label=rf"$\left<k\right>={ak}$")
            ax.axvline(1/ak, ls="--", color="grey", zorder=-1)  # mean-field estimate of the critical point

        ax.legend(loc="best")
        ax.set_xlabel(r"$\beta$")
        ax.set_ylabel(r"Average magnetization")
        fig.tight_layout(pad=0.05)
        fig.savefig(output, transparent=True, bbox_inches="tight", pad_inches=0.05)
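Note that a figure is only generated if it does not already exist in the figs directory; to force it to be re-generated after new results have been downloaded, simply delete the corresponding PDF file.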

The template also provides a sync.sh script that simplifies pushing and pulling from the HPC environment using rsync. The pipeline then becomes:

  1. Implement the code in ising.py and run_hpc.py.
  2. Test by running run_hpc.py locally.
  3. When satisfied, push the project to the HPC with
    ./sync.sh push
    This will synchronize the entire project directory with the HPC, except for the cache directory, so as not to overwrite any new results obtained there.
  4. Submit the jobs to the HPC queue. A sample SLURM submission script is given in submit_jobs.sh. Note how the number of jobs can be configured there independently of everything else.
  5. When the results are ready, or intermediate results are to be inspected locally, the cache directory can be downloaded from the HPC with
    ./sync.sh cache
  6. The results are plotted with plot_mag.py.
  7. Goto 1.

This pipeline can be easily customized and augmented by modifying the respective scripts, or by creating additional ones.

Your own software stack in the HPC: Apptainer (Singularity)

A typical problem with HPC environments is that they do not contain up-to-date software. On top of this, if you need to use a customized or development version of graph-tool, it needs to be compiled from scratch, making it difficult to use in an HPC environment. A satisfactory solution for this is Apptainer, which, similarly to Docker, allows you to create containers that include all of your desired software. However, unlike Docker, an Apptainer image can be used on an HPC system, since it does not require special user privileges.

See the repository gt-apptainer for instructions on how to build an apptainer image containing a compiled version of graph-tool.
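Once an image has been built, the pipeline scripts can then be run inside the container, e.g. (the image name here is illustrative):

apptainer exec graph-tool.sif ./run_hpc.py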

Exercises

Provide a separate modification of the template for each exercise.

When done, please add below a link to your solution in your own fork of the repository.

Exercise 1

Modify the sample project so that not only the magnetization

$$M = \frac{1}{N}\sum_i s_i,$$

for $s_i \in \{-1,+1\}$, but also the susceptibility is computed and plotted, given by

$$\chi = \left<M^2\right> - \left<M\right>^2 = \frac{1}{N^2} \sum_{ij} \left(\left<s_is_j\right> - \left<s_i\right>\left<s_j\right>\right).$$
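One possible starting point is to estimate χ from the recorded magnetization time series; a minimal sketch (the helper name and the burn-in length are illustrative choices):

import numpy as np

def get_chi(mags, burn_in=50):
    """Estimate chi = <M^2> - <M>^2 from a time series of magnetizations,
    discarding the first `burn_in` entries as equilibration."""
    m = np.asarray(mags[burn_in:])
    return (m ** 2).mean() - m.mean() ** 2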

Solutions:

Exercise 2

Consider the situation where, after an initial analysis, you decide to add a new parameter to your model: a global field θ affecting every node. Modify the scripts so that this is included as an additional parameter, varying within some range.
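A possible starting point is sketched below; it assumes that IsingGlauberState accepts a local field via its h parameter, which should be checked against the graph-tool documentation for your version, and the range for θ is illustrative:

from graph_tool.all import *
import numpy as np

# in ising.py: a uniform field theta applied to every node
def gen_state(N, ak, beta, theta):
    g = random_graph(N, lambda: np.random.poisson(ak), model="erdos", directed=False)
    return IsingGlauberState(g, beta=beta, h=theta)  # assumes the `h` keyword

# in run_hpc.py: the new parameter simply becomes another entry in the ranges
vranges = {
    "N": [10000],
    "ak": [3, 5],
    "beta": np.linspace(0, 1, 40),
    "theta": np.linspace(0, 0.5, 5),   # illustrative range
}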

Solutions:

Exercise 3

Implement the following project: given a set of empirical networks from Netzschleuder, you will fit several versions of the SBM (flat, nested, ordered, degree-corrected, and non-degree-corrected). For each network, you will plot the description length according to each model, relative to the best-fitting one.

Be sure to run the inference algorithm multiple times, and analyze only the best result.
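A rough sketch of the core loop for a single network is given below; it covers only some of the model variants, the network name is just an example, and the exact keyword arguments may vary between graph-tool versions:

import graph_tool.all as gt

g = gt.collection.ns["dolphins"]   # fetch a network from Netzschleuder

def best_of(fit, n=10):
    # run the stochastic inference several times, keep the lowest description length
    return min((fit() for _ in range(n)), key=lambda s: s.entropy())

fits = {
    "SBM":    best_of(lambda: gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=False))),
    "DC-SBM": best_of(lambda: gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=True))),
    "nested": best_of(lambda: gt.minimize_nested_blockmodel_dl(g)),
}

dls = {name: s.entropy() for name, s in fits.items()}
best = min(dls.values())
for name, dl in dls.items():
    print(name, dl - best)   # description length relative to the best-fitting model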

Solutions:

Exercise 4

Implement a project that, given a set of empirical networks from Netzschleuder, does the following:

  1. Obtain $M$ independent equilibrium configurations of the Ising model with $\beta = 1/\left<k\right>$, where $\left<k\right>$ is the average degree of the network.
  2. From the equilibrium configurations you will reconstruct the original network using maximum a posteriori estimation.
  3. Plot the similarity of the inferred network with the true one as a function of $M$.

Be sure to run the inference algorithm multiple times, and analyze only the best result.
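A sketch of step 1, reusing the template's mcmc_sweep (the equilibration length n_equil is an illustrative choice that should be checked per network):

import numpy as np
from graph_tool.all import IsingGlauberState
from ising import mcmc_sweep   # reuse the template's sweep helper

def sample_configs(g, M, n_equil=1000):
    """Collect M independent equilibrium spin configurations at beta = 1/<k>."""
    ak = 2 * g.num_edges() / g.num_vertices()   # average degree <k>
    configs = []
    for _ in range(M):
        state = IsingGlauberState(g, beta=1/ak)
        mcmc_sweep(state, n=n_equil)            # equilibrate from a fresh random start
        configs.append(state.s.fa.copy())       # snapshot of the spins
    return np.array(configs)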

Solutions:

Exercise 5

Modify the project of exercise 4 so that, in addition to maximum a posteriori estimation, you also perform posterior averaging and marginal a posteriori estimation. Compare the similarity obtained in both cases.
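(Recall that the marginal a posteriori estimator includes each edge whenever its posterior marginal probability exceeds 1/2; these marginal probabilities can be estimated by averaging over samples from the posterior distribution.)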

Solutions:

Footnotes
  1. Emacs is the correct text editor. In a free society, one is allowed to make the wrong choice, but that will never make it the right one. Using vim is in itself more of a penance than a sin, so it deserves no additional retribution, but repentance is nevertheless advised.