Basic hygiene for scientific coding
Authors: Tiago P. Peixoto
Affiliations: Inverse Complexity Lab
License: CC-BY
This document describes general principles and recipes to keep your scientific programming pipeline sane and efficient, and to facilitate collaboration.
Project organization principles
A scientific project differs in one important way from typical programming tasks: initially, we usually have only a tentative idea of the methods and implementations we want to use, and a substantial amount of experimentation and revision is unavoidable. Instead of being designed from well-defined requirements, scientific code usually evolves over the time scale of a project. This needs to be taken into account during planning and development, to minimize the increase of entropy, which tends to create inefficiencies and mistakes, and hence wasted time and frustration.
The general organizational principles to follow are:
Avoid unnecessary complexity.
Avoid code duplication.
Modularize your code. In particular, keep the following kinds of code separate:
- Boilerplate code, e.g. submitting jobs to the HPC, scanning through parameter ranges, etc.
- Ad hoc code, i.e. things you will likely never use again for another project.
- General functions that will be useful for other projects.
- Core algorithms for your project.
Take the time to refactor your code when entropy increase has caused the above principles to be sufficiently violated!
Use version control (git) to track your changes and share the code with collaborators.
A corollary of the above is the following: notebooks (e.g. Jupyter) are convenient for prototyping and quick experimentation, but they are a poor medium for organizing an entire project, since they encourage code duplication and discourage modularization.
A project template: sampling from the Ising model
A simple project template that demonstrates how to implement the above principles is available in the repository under the folder proj_template.
This example project consists of sampling states from an Ising model on an Erdős–Rényi network, to characterize the phase transition in the magnetization as a function of the inverse temperature β.
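The template contains the following main components, whose roles follow the modularization principle above (a rough sketch of the layout; the individual files are described below):

proj_template/
├── ising.py         # core, project-specific code
├── run_hpc.py       # boilerplate: parameter scan and task bookkeeping
├── plot_mag.py      # local analysis and plotting
├── sync.sh          # pushing/pulling to and from the HPC
├── submit_jobs.sh   # sample SLURM submission script
├── data_cache/      # general-purpose caching module (cloned separately)
├── cache/           # persistent results produced by run_hpc.py
└── figs/            # figures produced by plot_mag.py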
The project-specific code is contained in the file ising.py:
#!/usr/bin/env python3

# This should contain project-specific code

from graph_tool.all import *
import numpy as np

def gen_state(N, ak, beta):
    # Erdős–Rényi network with N nodes and average degree ak, with Glauber
    # dynamics at inverse temperature beta
    g = random_graph(N, lambda: np.random.poisson(ak), model="erdos",
                     directed=False)
    state = IsingGlauberState(g, beta=beta)
    return state

def mcmc_sweep(state, n=1):
    # perform n MCMC sweeps, i.e. n updates per node on average
    return state.iterate_async(niter=n * state.g.num_vertices())

def get_mag(state):
    # average magnetization of the current spin configuration
    return state.s.fa.mean()
This is fairly minimal in this skeleton project, since it just leverages the IsingGlauberState implementation available in graph-tool. In actual projects this file will be substantially larger, and will reflect the actual substance of the study.
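For instance, assuming graph-tool is installed, these functions can be exercised directly (the parameter values here are arbitrary):

from ising import gen_state, mcmc_sweep, get_mag

# a small network at inverse temperature beta = 0.6
state = gen_state(N=1000, ak=5, beta=0.6)
mcmc_sweep(state, n=100)   # equilibrate with 100 sweeps
print(get_mag(state))      # average magnetization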
Typically we will want to run our analysis in an HPC environment, where we spread our tasks among many jobs running in parallel on different machines (a.k.a. nodes), which all have access to the same shared filesystem.
An elegant infrastructure that simplifies this kind of coordination is implemented in the data_cache module. You should clone it into your project’s folder so it can be used:
cd proj_template
git clone gitlab@git.skewed.de:count0/data_cache.git
The script run_hpc.py takes care of actually running the individual tasks:
#!/bin/env python

import os.path
import itertools
import time
import numpy as np
from collections import defaultdict

from ising import *  # project-specific code
from data_cache import FSDataCache

cache = FSDataCache()  # this is the persistent cache for all results

# the following sets the parameter ranges we will iterate over
vranges = {
    "N": [10000],
    "ak": [3, 5],
    "beta": np.linspace(0, 1, 40)
}

def iter_ranges(*args):
    return itertools.product(*[vranges[a] for a in args])

if __name__ == "__main__":
    params_list = []
    nparams = list(vranges.keys())
    for vals in iter_ranges(*nparams):
        params = dict(zip(nparams, vals))
        params_list.append(params)

    print("Parameter combinations:", len(params_list))

    # follow them in random order
    np.random.shuffle(params_list)

    while True:
        for params in params_list:
            print(params)
            with cache.get_lock("state", params, block=False) as lock:
                if lock is None:
                    # if lock is None, another process already claimed
                    # this set of parameters
                    continue

                # at this point, we hold a 'lock' on this particular
                # combination of parameters

                try:
                    # load previous state if available
                    state = cache.get("state", params)
                except KeyError:
                    # initialize our state
                    state = gen_state(**params)

                # keep track of some statistics
                try:
                    stats = cache.get("stats", params)
                except KeyError:
                    stats = defaultdict(list)

                def chkpt():
                    cache.put("state", params, state)
                    cache.put("stats", params, stats)

                # perform some computation:
                last_time = time.time()
                for i in range(100):
                    mcmc_sweep(state, n=10)
                    stats["mag"].append(get_mag(state))
                    print(params, stats["mag"][-1])

                    # if 30 mins (real time) have passed, we save the
                    # state to the filesystem
                    if time.time() - last_time > 30 * 60:  # 30 mins
                        chkpt()
                        last_time = time.time()
                        print("saved...")

                chkpt()
The script above will cycle through all parameter combinations and store the results in the cache directory. Importantly, if we run the same script in parallel, e.g.

./run_hpc.py &
./run_hpc.py &

each job will be assigned a disjoint set of tasks due to the data_cache locking mechanism, even if they are running on different nodes! This means that the level of parallelism is completely decoupled from the task specification, and we can tune it later when we submit the jobs to the HPC queue.
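The locking itself is handled internally by data_cache, but the basic idea can be illustrated with a minimal sketch (an illustration of the concept only, not data_cache’s actual implementation): each parameter combination maps to a lock file on the shared filesystem, which a job claims via an atomic exclusive creation that fails if another job got there first.

import os

def try_lock(lockdir, key):
    # Hypothetical sketch: atomically claim the task identified by `key`
    # by creating a lock file; returns False if another process, possibly
    # on a different node sharing the same filesystem, already holds it.
    os.makedirs(lockdir, exist_ok=True)
    try:
        fd = os.open(os.path.join(lockdir, key),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False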
After the jobs have completed, the results will be stored in the cache directory. We can then use a separate script called plot_mag.py to plot the results, and perform the local part of the analysis:
#!/bin/env python

import os.path
from pylab import *
import style

from run_hpc import *  # parameter list and cache are imported here

for N in vranges["N"]:
    output = f"figs/N{N}-mag.pdf"
    if not os.path.exists(output):
        fig, ax = subplots(1, 1, figsize=(4, 3))
        for ak in vranges["ak"]:
            betas = []
            mags = []
            for beta in vranges["beta"]:
                params = dict(N=N, ak=ak, beta=beta)
                try:
                    stats = cache.get("stats", params)
                except KeyError:
                    # this combination has not been computed yet
                    continue
                betas.append(beta)
                mags.append(abs(stats["mag"][-1]))
            ax.plot(betas, mags, "s-", label=rf"$\left<k\right>={ak}$")
            # mean-field estimate of the critical point at beta = 1/<k>
            ax.axvline(1/ak, ls="--", color="grey", zorder=-1)
        ax.legend(loc="best")
        ax.set_xlabel(r"$\beta$")
        ax.set_ylabel(r"Average magnetization")
        fig.tight_layout(pad=0.05)
        fig.savefig(output, transparent=True, bbox_inches="tight",
                    pad_inches=0.05)
Note how the plot script does not need to repeat the parameter ranges, since they are imported from run_hpc.py, which serves as a central repository for that information. Unlike with notebooks, this separation of code and analysis is much cleaner and more efficient.
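Any additional analysis script can reuse the same information in the same way; a hypothetical example:

#!/bin/env python
# hypothetical additional analysis script, reusing the central parameter
# ranges and result cache from run_hpc.py
from run_hpc import vranges, cache, iter_ranges

for N, ak, beta in iter_ranges("N", "ak", "beta"):
    params = dict(N=N, ak=ak, beta=beta)
    try:
        stats = cache.get("stats", params)
    except KeyError:
        continue  # this combination has not been computed yet
    # ... further analysis of stats["mag"] goes here ...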
The template also provides a sync.sh script that simplifies pushing and pulling from the HPC environment using rsync (a sketch of such a script is given after this list). The pipeline then becomes:

1. Implement the code via ising.py and run_hpc.py.
2. Test by running run_hpc.py locally.
3. When satisfied, push the project to the HPC with

   ./sync.sh push

   This will synchronize the entire project’s directory with the HPC, except the cache directory, so as not to overwrite any new results obtained there.
4. Submit the jobs to the HPC queue. A sample SLURM submission script is given in submit_jobs.sh (see also the sketch after this list). Note how the number of jobs can be configured there independently of everything else.
5. When the results are done, or intermediary results are to be inspected locally, the cache directory can be downloaded from the HPC with

   ./sync.sh cache

6. The results are plotted with plot_mag.py.
7. Goto 1.
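The template ships a working sync.sh; purely as an illustration, a minimal version of such a script could look like the following, where the remote host and path are placeholders:

#!/bin/bash
# minimal illustrative sketch of sync.sh; the template’s actual script
# may differ
REMOTE=user@hpc.example.org:proj_template  # placeholder remote location

case "$1" in
push)
    # synchronize the project with the HPC, excluding the cache
    # directory, so as not to overwrite results obtained there
    rsync -avz --exclude=cache/ ./ "$REMOTE/"
    ;;
cache)
    # download the accumulated results from the HPC
    rsync -avz "$REMOTE/cache/" cache/
    ;;
esac

Similarly, a SLURM job array keeps the number of parallel jobs as a single independent knob; a hypothetical submit_jobs.sh in this spirit (resource requests are placeholders) could be:

#!/bin/bash
#SBATCH --job-name=ising
#SBATCH --array=1-20     # number of parallel jobs, tunable on its own
#SBATCH --time=24:00:00  # placeholder resource requests
#SBATCH --mem=4G

# every array task runs the same script; the data_cache locks ensure
# they work on disjoint parameter combinations
./run_hpc.py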
This pipeline can be easily customized and augmented by modifying the respective scripts, or creating additional ones. A few rules of thumb to keep in mind:
- There’s no need to have a single script for HPC tasks and a single script for plotting. On the contrary: for independent analyses in the same project, it is better to have multiple scripts for running and plotting.
- Avoid adding too much project substance to run_hpc.py. It should contain mostly boilerplate for task bookkeeping.
- Try to keep things simple!
Backup your cache directory!
In most projects, your cache directory is very precious, since it contains all the results obtained so far, which could have taken weeks of real time and hundreds of thousands of core-hours to obtain. It’s important to periodically download it from the HPC and keep it locally as a backup, in case you make a mistake that invalidates it in some way; then you can just restore it from your local copy.
You can keep several time-stamped local snapshots by doing:
cp -arv cache cache-$(date +%Y-%m-%d)
This would allow you to go back in case a mistake is found only later.
Your own software stack in the HPC: Apptainer (Singularity)
A typical problem with HPC environments is that they do not contain up-to-date software. On top of this, if you need a customized or development version of graph-tool, it has to be compiled from scratch, making it difficult to use on an HPC system. A satisfactory solution is Apptainer, which, similarly to Docker, allows you to create containers that include all of your desired software. However, unlike Docker, an Apptainer image can be used on an HPC system, since it does not require special user privileges.
See the repository gt-apptainer for instructions on how to build an apptainer image containing a compiled version of graph-tool.
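For example, assuming the image has been built as graph-tool.sif (the file name is arbitrary), the jobs can then be run through the container without any special privileges:

apptainer exec graph-tool.sif ./run_hpc.py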
Exercises
Provide a separate modification of the template for each exercise.
When done, please add below a link to your solution in your own fork of the repository.
Don’t forget to rename the project from proj_template to something more appropriate.
Exercise 1
Modify the sample project so that not only the magnetization is computed and plotted for each β, but also the susceptibility χ, given by

χ = βN(⟨m²⟩ − ⟨m⟩²),

where m is the magnetization per node and the averages are taken over the sampled equilibrium configurations.
Solutions:
Exercise 2
Consider the situation where, after an initial analysis, you have now decided to add an additional parameter to your model: a global field θ affecting every node. Modify the scripts to add this as an additional parameter, varying within some range.
Solutions:
Don’t invalidate your cache!
As mentioned before, it is important not to invalidate your cache as your project evolves, since you want to preserve your results. When you augment your parameter list, you must make sure that your cache with the previous parameter list is still accessible and coherent. The data_cache infrastructure allows for an elegant solution in this case. Try to figure out how to implement it!
Hint: a parameter value of None passed to cache.get() or cache.put() is equivalent to not passing that parameter at all.
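As an illustration of the hint (a hypothetical sketch, with theta as the new parameter):

# results cached before "theta" existed remain accessible, since passing
# theta=None is the same as not passing "theta" at all
params = dict(N=10000, ak=3, beta=0.5, theta=None)
stats = cache.get("stats", params)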
Exercise 3
Implement the following project: given a set of empirical networks from Netzschleuder, you will fit several versions of the SBM (flat, nested, ordered, degree-corrected, and non-degree-corrected). For each network, you will plot the description length according to each model, relative to the best-fitting one.
Be sure to run the inference algorithm multiple times, and analyze only the best result.
Solutions:
Exercise 4
Implement a project that, given a set of empirical networks from Netzschleuder, does the following:
- Obtain M independent equilibrium configurations of the Ising model with β = 1/⟨k⟩, where ⟨k⟩ is the average degree of the network.
- From the equilibrium configurations, reconstruct the original network using maximum a posteriori estimation.
- Plot the similarity of the inferred network with the true one as a function of M.
Be sure to run the inference algorithm multiple times, and analyze only the best result.
Solutions:
Exercise 5
Modify the project of Exercise 4 so that, in addition to maximum a posteriori estimation, you also perform posterior averaging and marginal a posteriori estimation. Compare the similarity obtained in both cases.
Solutions: