.\" Point size fails when rending html in the code blocks
.\".nr PS 11p
.nr GROWPS 3
.nr PSINCR 2p
.fam P
.\" ===================================================================
.\" Some useful macros
.\" ===================================================================
.\"
.\" Code start (CS) and end (CE) blocks
.de CS
.DS L
\fC
..
.de CE
\fP
.DE
..
.\" Code inline:
.\" .CI "inline code"
.de CI
\fC\\$1\fP\\$2
..
.\" ===================================================================
.\" \&
.\" .sp 3c
.\" .LG
.\" .LG
.\" .LG
.\" .LG
.\" Garlic: User guide
.\" .br
.\" .NL
.\" Rodrigo Arias Mallo
.\" .br
.\" .I "Barcelona Supercomputing Center"
.\" .br
.\" \*[curdate]
.\" .sp 17c
.\" .DE
.\" .CI \*[gitcommit]
.TL
Garlic: User Guide
.AU
Rodrigo Arias Mallo
.AI
Barcelona Supercomputing Center
.AB
.LP
This document contains all the information to configure and use the
garlic benchmark. All stages from the development to the publication
are covered, as well as the introductory steps required to setup the
machines.
.DS L
.SM
\fC
Generated on \*[curdate]
Git commit: \*[gitcommit]
\fP
.DE
.AE
.\" ===================================================================
.NH 1
Introduction
.LP
The garlic framework is designed to fulfill all the requirements of an
experimenter in all the steps up to publication. The experience gained
while using it suggests that we move along three stages despicted in the
following diagram:
.DS L
.SM
.PS 5
linewid=1.4;
arcrad=1;
right
S: box "Source" "code"
line "Development" invis
P: box "Program"
line "Experimentation" invis
R:box "Results"
line "Data" "exploration" invis
F:box "Figures"
# Creates a "cycle" around two boxes
define cycle {
  arc cw from 1/2 of the way between $1.n and $1.ne \
    to 1/2 of the way between $2.nw and $2.n ->;
  arc cw from 1/2 of the way between $2.s and $2.sw \
    to 1/2 of the way between $1.se and $1.s ->;
}
cycle(S, P)
cycle(P, R)
cycle(R, F)
.PE
.DE
In the development phase the experimenter changes the source code in
order to introduce new features or fix bugs. Once the program is
considered functional, the next phase is the experimentation, where
several experiment configurations are tested to evaluate the program. It
is common that some problems are spotted during this phase, which lead
the experimenter to go back to the development phase and change the
source code.
.PP
Finally, when the experiment is considered completed, the
experimenter moves to the next phase, which envolves the exploration of
the data generated by the experiment. During this phase, it is common to
generate results in the form of plots or tables which provide a clear
insight in those quantities of interest. It is also common that after
looking at the figures, some changes in the experiment configuration
need to be introduced (or even in the source code of the program).
.PP
Therefore, the experimenter may move forward and backwards along three
phases several times. The garlic framework provides support for all the
three stages (with different degrees of madurity).
.\" ===================================================================
.NH 2
Machines and clusters
.LP
Our current setup employs multiple machines to build and execute the
experiments. Each cluster and node has it's own name and will be
different in other clusters. Therefore, instead of using the names of
the machines we use machine classes to generalize our setup. Those
machine clases currently correspond to a physical machine each:
.IP \(bu 12p
.B Builder
(xeon07): runs the nix-daemon and performs the builds in /nix. Requires
root access to setup the
.I nix-daemon
with multiple users.
.IP \(bu
.B Target
(MareNostrum 4 compute nodes): the nodes where the experiments 
are executed. It doesn't need to have /nix installed or root access.
.IP \(bu
.B Login
(MareNostrum 4 login nodes): used to allocate resources and run jobs. It
doesn't need to have /nix installed or root access.
.IP \(bu
.B Laptop
(where the keyboard is attached, can be anything): used to connect to the other machines.
No root access is required or /nix, but needs to be able to connect to
the builder.
.LP
The machines don't need to be different of each others, as one machine
can implement several classes. For example the laptop can act as the
builder too but is not recommended. Or the login machine can also
perform the builds, but is not possible yet in our setup.
.\" ===================================================================
.NH 2
Reproducibility
.LP
An effort to facilitate the reproducibility of the experiments has been
done, with varying degrees of success. The names of the different levels
of reproducibility have not been yet standarized, so we define our own
to avoid any confusion. We define three levels of reproducibility based
on the people and the machine involved:
.IP \(bu 12p
R0: The \fIsame\fP people on the \fIsame\fP machine obtain the same result
.IP \(bu
R1: \fIDifferent\fP people on the \fIsame\fP machine obtain the same result
.IP \(bu
R2: \fIDifferent\fP people on a \fIdifferent\fP machine obtain the same result
.LP
The garlic framework distinguishes two types of results: the result of
\fIbuilding a derivation\fP (usually building a binary or a library from the
sources) and the results of the \fIexecution of an experiment\fP (typically
those are the measurements performed during the execution of the program
of study).
.PP
For those two types, the meaning of
.I "same result"
is different. In the case of building a binary, we define the same
result if it is bit-by-bit identical. In the packages provided by nixos
is usually the case except some rare cases. One example is that during the build process,
a directory is listed by the order of the inodes, giving a random order
which is different between builds. These problems are tracked by the
.URL https://r13y.com/ r13y
project. About 99% of the derivations of the minimal package set achieve
the R2 property.
.PP
On the other hand, the results of the experiments are always bit-by-bit
different. So we change the definition to state that they are the same
if the conclusions that can be obtained are the same. In particular, we
assume that the results are within the confidence interval. With this
definition, all experiments are currently R1. The reproducibility level
R2 is not posible yet as the software is compiled to support only the
target machine, with an specific interconnection.
.\" ===================================================================
.bp
.NH 1
Preliminary steps
.LP
The peculiarities of our setup require that users perform some actions
to use the garlic framework. The content of this section is only
intended for the users of our machines, but can serve as reference in
other machines.
.PP
The names of the machine classes are used in the command line prompt
instead of the actual name of the machine, to indicate that the command
needs to be executed in the stated machine class, for example:
.CS
builder% echo hi
hi
.CE
When the machine class is not important, it is ignored and only the
.CI "%"
prompt appears.
.\" ===================================================================
.NH 2
Configure your laptop
.LP
To easily connect to the builder (xeon07) in one step, configure the SSH
client to perform a jump over the Cobi login node. The
.I ProxyJump
directive is only available in version 7.3 and upwards. Add the
following lines in the
.CI \(ti/.ssh/config
file of your laptop:
.CS
Host cobi
      HostName ssflogin.bsc.es
      User your-username-here
 
Host xeon07
      ProxyJump cobi
      HostName xeon07
      User your-username-here
.CE
You should be able to connect to the builder typing:
.CS
laptop$ ssh xeon07
.CE
To spot any problems try with the
.CI -v
option to enable verbose output.
.\" ===================================================================
.NH 2
Configure the builder (xeon07)
.LP
In order to use nix you would need to be able to download the sources 
from Internet. Usually the download requires the ports 22, 80 and 443 
to be open for outgoing traffic.
.PP
Check that you have network access in
xeon07 provided by the environment variables \fIhttp_proxy\fP and
\fIhttps_proxy\fP. Try to fetch a webpage with curl, to ensure the proxy
is working:
.CS
xeon07$ curl x.com
x
.CE
.\" ===================================================================
.NH 3
Create a new SSH key
.LP
There is one DSA key in your current home called "cluster" that is no
longer supported in recent SSH versions and should not be used. Before
removing it, create a new one without password protection leaving the
passphrase empty (in case that you don't have one already created) by
running:
.CS
xeon07$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (\(ti/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in \(ti/.ssh/id_rsa.
Your public key has been saved in \(ti/.ssh/id_rsa.pub.
\&...
.CE
By default it will create the public key at \f(CW\(ti/.ssh/id_rsa.pub\fP.
Then add the newly created key to the authorized keys, so you can
connect to other nodes of the Cobi cluster:
.CS
xeon07$ cat \(ti/.ssh/id_rsa.pub >> \(ti/.ssh/authorized_keys
.CE
Finally, delete the old "cluster" key:
.CS
xeon07$ rm \(ti/.ssh/cluster \(ti/.ssh/cluster.pub
.CE
And remove the section in the configuration \f(CW\(ti/.ssh/config\fP
where the key was assigned to be used in all hosts along with the
\f(CWStrictHostKeyChecking=no\fP option. Remove the following lines (if
they exist):
.CS
Host *
    IdentityFile \(ti/.ssh/cluster
    StrictHostKeyChecking=no
.CE
By default, the SSH client already searchs for a keypair called
\f(CW\(ti/.ssh/id_rsa\fP and \f(CW\(ti/.ssh/id_rsa.pub\fP, so there is
no need to manually specify them.
.PP
You should be able to access the login node with your new key by using:
.CS
xeon07$ ssh ssfhead
.CE
.\" ===================================================================
.NH 3
Authorize access to the repository
.LP
The sources of BSC packages are usually downloaded directly from the PM
git server, so you must be able to access all repositories without a
password prompt.
.PP
Most repositories are open to read for logged in users, but there are
some exceptions (for example the nanos6 repository) where you must have
explicitly granted read access.
.PP
Copy the contents of your public SSH key in \f(CW\(ti/.ssh/id_rsa.pub\fP
and paste it in GitLab at
.CS
https://pm.bsc.es/gitlab/profile/keys
.CE
Finally verify the SSH connection to the server works and you get a 
greeting from the GitLab server with your username:
.CS
xeon07$ ssh git@bscpm03.bsc.es
PTY allocation request failed on channel 0
Welcome to GitLab, @rarias!
Connection to bscpm03.bsc.es closed.
.CE
Verify that you can access the nanos6 repository (otherwise you 
first need to ask to be granted read access), at:
.CS
https://pm.bsc.es/gitlab/nanos6/nanos6
.CE
Finally, you should be able to download the nanos6 git 
repository without any password interaction by running:
.CS
xeon07$ git clone git@bscpm03.bsc.es:nanos6/nanos6.git
.CE
Which will create the nanos6 directory.
.\" ===================================================================
.NH 3
Authorize access to MareNostrum 4
.LP
You will also need to access MareNostrum 4 from the xeon07 machine, in 
order to run experiments. Add the following lines to the 
\f(CW\(ti/.ssh/config\fP file and set your user name:
.CS
Host mn0 mn1 mn2
    User <your user name in MN4>
.CE
Then copy your SSH key to MareNostrum 4 (it will ask you for your login
password):
.CS
xeon07$ ssh-copy-id -i \(ti/.ssh/id_rsa.pub mn1
.CE
Finally, ensure that you can connect without a password:
.CS
xeon07$ ssh mn1
\&...
login1$
.CE
.\" ===================================================================
.NH 3
Clone the bscpkgs repository
.LP
Once you have Internet and you have granted access to the PM GitLab 
repositories you can begin building software with nix. First ensure 
that the nix binaries are available from your shell in xeon07:
.CS
xeon07$ nix --version
nix (Nix) 2.3.6
.CE
Now you are ready to build and install packages with nix. Clone the 
bscpkgs repository:
.CS
xeon07$ git clone git@bscpm03.bsc.es:rarias/bscpkgs.git
.CE
Nix looks in the current folder for a file named \f(CWdefault.nix\fP for
packages, so go to the bscpkgs directory:
.CS
xeon07$ cd bscpkgs
.CE
Now you should be able to build nanos6 (which is probably already
compiled):
.CS
xeon07$ nix-build -A bsc.nanos6
\&...
/nix/store/...2cm1ldx9smb552sf6r1-nanos6-2.4-6f10a32
.CE
The installation is placed in the nix store (with the path stated in 
the last line of the build process), with the \f(CWresult\fP symbolic
link pointing to the same location:
.CS
xeon07$ readlink result
/nix/store/...2cm1ldx9smb552sf6r1-nanos6-2.4-6f10a32
.CE
.\" ###################################################################
.NH 3
Configure garlic
.LP
In order to launch experiments in the
.I target
machine, it is required to configure nix to allow a directory to be
available during the build process, where the results will be stored 
before being copied in the nix store. Create a new
.CI garlic
directory in your
personal cache directory and copy the full path:
.CS
xeon07$ mkdir -p \(ti/.cache/garlic
xeon07$ readlink -f \(ti/.cache/garlic
/home/Computational/rarias/.cache/garlic
.CE
Then create the nix configuration directory (if it has not already been
created):
.CS
xeon07$ mkdir -p \(ti/.config/nix
.CE
And add the following line in the
.CI \(ti/.config/nix/nix.conf
file, replacing it with the path you copied before:
.CS
.SM
extra-sandbox-paths = /garlic=/home/Computational/rarias/.cache/garlic
.CE
This option creates a virtual directory called
.CI /garlic
inside the build environment, whose contents are the ones you specify at
the right hand side of the equal sign (in this case the
.CI \(ti/.cache/garlic
directory). It will be used to allow the results of the experiments to
be passed to nix from the
.I target
machine.
.\" ###################################################################
.NH 3
Run the garlic daemon (optional)
.LP
The garlic benchmark has a daemon which can be used to
automatically launch the experiments in the
.I target
machine on demand, when they are required to build other derivations, so
they can be launched without user interaction. The daemon creates some
FIFO pipes to communicate with the build environment, and must be
running to be able to run the experiments. To execute it, go to the 
.CI bscpkgs/garlic
directory and run
.CS
xeon07$ nix-shell
nix-shell$
.CE
to enter the nix shell (or specify the path to the
.CI garlic/shell.nix
file as argument). Then, run the daemon inside the nix shell:
.CS
nix-shell$ garlicd
garlicd: Waiting for experiments ...
.CE
Notice that the daemon stays running in the foreground, waiting for
experiments. At this moment, it can only process one experiment at a
time.
.\" ===================================================================
.NH 2
Configure the login and target (MareNostrum 4)
.LP
In order to execute the programs in MareNostrum 4, you first need load
some utilities in the PATH. Add to the end of the file
\f(CW\(ti/.bashrc\fP in MareNostrum 4 the following line:
.CS
export PATH=/gpfs/projects/bsc15/nix/bin:$PATH
.CE
Then logout and login again (our source the \f(CW\(ti/.bashrc\fP file)
and check that now you have the \f(CWnix-develop\fP command available:
.CS
login1$ which nix-develop
/gpfs/projects/bsc15/nix/bin/nix-develop
.CE
The new utilities are available both in the login nodes and in the
compute (target) nodes, as they share the file system over the network.
.\" ===================================================================
.bp
.NH 1
Development
.LP
During the development phase, a functional program is produced by
modifying its source code. This process is generally cyclic: the
developer needs to compile, debug and correct mistakes. We want to
minimize the delay times, so the programs can be executed as soon as
needed, but under a controlled environment so that the same behavior
occurs during the experimentation phase.
.PP
In particular, we want that several developers can reproduce the
same development environment so they can debug each other programs
when reporting bugs. Therefore, the environment must be carefully
controlled to avoid non-reproducible scenarios.
.PP
The current development environment provides an isolated shell with a
clean environment, which runs in a new mount namespace where access to
the filesystem is restricted. Only the project directory and the nix
store are available (with some other exceptions), to ensure that you
cannot accidentally link with the wrong library or modify the build
process with a forgotten environment variable in the \f(CW\(ti/.bashrc\fP
file.
.\" ===================================================================
.NH 2
Getting the development tools
.LP
To create a development
environment, first copy or download the sources of your program (not the
dependencies) in a new directory placed in the target machine
(MareNostrum\~4).
.PP
The default environment contains packages commonly used to develop
programs, listed in the \fIgarlic/index.nix\fP file:
.\" FIXME: Unify garlic.unsafeDevelop in garlic.develop, so we can
.\" specify the packages directly
.CS
develop = let 
  commonPackages = with self; [
    coreutils htop procps-ng vim which strace
    tmux gdb kakoune universal-ctags bashInteractive
    glibcLocales ncurses git screen curl
    # Add more nixpkgs packages here...
  ];  
  bscPackages = with bsc; [
    slurm clangOmpss2 icc mcxx perf tampi impi
    # Add more bsc packages here...
  ];
  ...
.CE
If you need additional packages, add them to the list, so that they
become available in the environment. Those may include any dependency
required to build your program.
.PP
Then use the build machine (xeon07) to build the
.I garlic.develop
derivation:
.CS
build% nix-build -A garlic.develop
\&...
build% grep ln result
ln -fs /gpfs/projects/.../bin/stage1 .nix-develop
.CE
Copy the \fIln\fP command and run it in the target machine
(MareNostrum\~4), inside the new directory used for your program
development, to create the link \fI.nix-develop\fP (which is used to
remember your environment). Several environments can be stored in
different directories using this method, with different packages in each
environment. You will need
to rebuild the
.I garlic.develop
derivation and update the
.I .nix-develop
link after the package list is changed. Once the
environment link is created, there is no need to repeat these steps again.
.PP
Before entering the environment, you will need to access the required
resources for your program, which may include several compute nodes.
.\" ===================================================================
.NH 2
Allocating resources for development
.LP
Our target machine (MareNostrum 4) provides an interactive shell, that
can be requested with the number of computational resources required for
development. To do so, connect to the login node and allocate an
interactive session:
.CS
% ssh mn1
login% salloc ...
target%
.CE
This operation may take some minutes to complete depending on the load
of the cluster. But once the session is ready, any subsequent execution
of programs will be immediate.
.\" ===================================================================
.NH 2
Accessing the developement environment
.PP
The utility program \fInix-develop\fP has been designed to access the
development environment of the current directory, by looking for the
\fI.nix-develop\fP file. It creates a namespace where the required
packages are installed and ready to be used. Now you can access the
newly created environment by running:
.CS
target% nix-develop
develop%
.CE
The spawned shell contains all the packages pre-defined in the
\fIgarlic.develop\fP derivation, and can now be accessed by typing the
name of the commands.
.CS
develop% which gcc
/nix/store/azayfhqyg9...s8aqfmy-gcc-wrapper-9.3.0/bin/gcc
develop% which gdb
/nix/store/1c833b2y8j...pnjn2nv9d46zv44dk-gdb-9.2/bin/gdb
.CE
If you need additional packages, you can add them in the
\fIgarlic/index.nix\fP file as mentioned previously. To keep the
same current resources, so you don't need to wait again for the
resources to be allocated, exit only from the development shell:
.CS
develop% exit
target%
.CE
Then update the
.I .nix-develop
link and enter into the new develop environment:
.CS
target% nix-develop
develop%
.CE
.\" ===================================================================
.NH 2
Execution
.LP
The allocated shell can only execute tasks in the current node, which
may be enough for some tests. To do so, you can directly run your
program as:
.CS
develop$ ./program
.CE
If you need to run a multi-node program, typically using MPI
communications, then you can do so by using srun. Notice that you need
to allocate several nodes when calling salloc previously. The srun
command will execute the given program \fBoutside\fP the development
environment if executed as-is. So we re-enter the develop environment by
calling nix-develop as a wrapper of the program:
.\" FIXME: wrap srun to reenter the develop environment by its own
.CS
develop$ srun nix-develop ./program
.CE
.\" ===================================================================
.NH 2
Debugging
.LP
The debugger can be used to directly execute the program if is executed
in only one node by using:
.CS
develop$ gdb ./program
.CE
Or it can be attached to an already running program by using its PID.
You will need to first connect to the node running it (say target2), and
run gdb inside the nix-develop environment. Use
.I squeue
to see the compute nodes running your program: 
.CS
login$ ssh target2
target2$ cd project-develop
target2$ nix-develop
develop$ gdb -p $pid
.CE
You can repeat this step to control the execution of programs running in
different nodes simultaneously.
.PP
In those cases where the program crashes before being able to attach the
debugger, enable the generation of core dumps:
.CS
develop$ ulimit -c unlimited
.CE
And rerun the program, which will generate a core file that can be
opened by gdb and contains the state of the memory when the crash
happened. Beware that the core dump file can be very large, depending on
the memory used by your program at the crash.
.\" ===================================================================
.NH 2
Git branch name convention
.LP
The garlic benchmark imposes a set of requirements to be meet for each 
application in order to coordinate the execution of the benchmark and 
the gathering process of the results.
.PP
Each application must be available in a git repository so it can be 
included into the garlic benchmark. The different combinations of 
programming models and communication schemes should be each placed in 
one git branch, which are referred to as \fIbenchmark branches\fP. At
least one benchmark branch should exist and they all must begin with the
prefix \f(CWgarlic/\fP (other branches will be ignored).
.PP
The branch name is formed by adding keywords separated by the "+" 
character. The keywords must follow the given order and can only 
appear zero or once each. At least one keyword must be included. The 
following keywords are available:
.IP \f(CWmpi\fP 5m
A significant fraction of the communications uses only the standard MPI
(without extensions like TAMPI).
.IP \f(CWtampi\fP
A significant fraction of the communications uses TAMPI.
.IP \f(CWsend\fP
A significant part of the MPI communication uses the blocking family of
methods
.I MPI_Send , (
.I MPI_Recv ,
.I MPI_Gather "...)."
.IP \f(CWisend\fP
A significant part of the MPI communication uses the non-blocking family
of methods
.I MPI_Isend , (
.I MPI_Irecv ,
.I MPI_Igather "...)."
.IP \f(CWrma\fP
A significant part of the MPI communication uses remote memory access
(one-sided) methods
.I MPI_Get , (
.I MPI_Put "...)."
.IP \f(CWseq\fP
The complete execution is sequential in each process (one thread per
process).
.IP \f(CWomp\fP
A significant fraction of the execution uses the OpenMP programming
model.
.IP \f(CWoss\fP
A significant fraction of the execution uses the OmpSs-2 programming
model.
.IP \f(CWtask\fP
A significant part of the execution involves the use of the tasking
model.
.IP \f(CWtaskfor\fP
A significant part of the execution uses the taskfor construct.
.IP \f(CWfork\fP
A significant part of the execution uses the fork-join model (including
hybrid programming techniques with  parallel computations and sequential
communications).
.IP \f(CWsimd\fP
A significant part of the computation has been optimized to use SIMD
instructions.
.LP
In the
.URL #appendixA "Appendix A"
there is a flowchart to help the decision
process of the branch name. Additional user defined keywords may be
added at the end using the separator "+" as well. User keywords must
consist of capital alphanumeric characters only and be kept short. These
additional keywords must be different (case insensitive) to the already
defined above. Some examples:
.CS
garlic/mpi+send+seq
garlic/mpi+send+omp+fork
garlic/mpi+isend+oss+task
garlic/tampi+isend+oss+task
garlic/tampi+isend+oss+task+COLOR
garlic/tampi+isend+oss+task+COLOR+BTREE
.CE
.\" ===================================================================
.NH 2
Initialization time
.LP
It is common for programs to have an initialization phase prior to the
execution of the main computation task which is the objective of the study.
The initialization phase is usually not considered when taking
measurements, but the time it takes to complete can limit seriously the
amount of information that can be extracted from the computation phase.
As an example, if the computation phase is in the order of seconds, but
the initialization phase takes several minutes, the number of runs would
need to be set low, as the units could exceed the time limits. Also, the
experimenter may be reluctant to modify the experiments to test other
parameters, as the waiting time for the results is unavoidably large. 
.PP
To prevent this problem the programs must reduce the time of the
initialization phase to be no larger than the computation time. To do
so, the initialization phase can be optimized either with
parallelization, or it can be modified to store the result of the
initialization to the disk to be later at the computation phase. In the
garlic framework an experiment can have a dependency over the results of
another experiment (the results of the initialization). The
initialization results will be cached if the derivation is kept
invariant, when modifying the computation phase parameters.
.\" ===================================================================
.NH 2
Measurement of the execution time
.LP
The programs must measure the wall time of the computation phase following a
set of rules. The way in which the wall time is measured is very important to
get accurate results. The measured time must be implemented by using a
monotonic clock which is able to correct the drift of the oscillator of
the internal clock due to changes in temperature. This clock must be
measured in C and C++ with:
.CS
clock_gettime(CLOCK_MONOTONIC, &ts);
.CE
A helper function can be used the approximate value of the clock in a
double precision float, in seconds:
.CS
double get_time()
{
    struct timespec tv;
    if(clock_gettime(CLOCK_MONOTONIC, &tv) != 0)
    {
        perror("clock_gettime failed");
        exit(EXIT_FAILURE);
    }
    return (double)(ts.tv_sec) +
        (double)ts.tv_nsec * 1.0e-9;
}
.CE
The start and end points must be measured after the synchronization of
all the processes and threads, so the complete computation work can be
bounded to fit inside the measured interval. An example for a MPI
program:
.CS
double start, end, delta_time;
MPI_Barrier();
start = get_time();
run_simulation();
MPI_Barrier();
end = get_time();
delta_time = end - start;
.CE
.\" ===================================================================
.NH 2
Format of the execution time
.LP
The measured execution time must be printed to the standard output
(stdout) in scientific notation with at least 7 significative digits.
The following the printf format (or the strict equivalent in other languages)
must be used:
.CS
printf("time %e\\n", delta_time);
.CE
The line must be printed alone and only once: for MPI programs,
only one process shall print the time:
.CS
if(rank == 0) printf("time %e\\n", delta_time);
.CE
Other lines can be printed in the stdout, but without the
.I time
prefix, so that the following pipe can be used to capture the line:
.CS
% ./app | grep "^time"
1.234567e-01
.CE
Ensure that your program follows this convention by testing it with the
above
.I grep
filter; otherwise the results will fail to be parsed when building
the dataset with the execution time.
.\" ===================================================================
.bp
.NH 1
Experimentation
.LP
During the experimentation, a program is studied by running it and
measuring some properties. The experimenter is in charge of the
experiment design, which is typically controlled by a single
.I nix
file placed in the
.CI garlic/exp
subdirectory.
Experiments are formed by several
.I "experimental units"
or simply
.I units .
A unit is the result of each unique configuration of the experiment 
(typically involves the cartesian product of all factors) and
consists of several shell scripts executed sequentially to setup the
.I "execution environment" ,
which finally launch the actual program being analyzed.
The scripts that prepare the environment and the program itself are
called the
.I stages
of the execution and altogether form the
.I "execution pipeline"
or simply the
.I pipeline .
The experimenter must know with very good details all the stages
involved in the pipeline, as they have a large impact on the execution.
.PP
Additionally, the execution time is impacted by the target machine in
which the experiments run. The software used for the benchmark is
carefully configured and tuned for the hardware used in the execution;
in particular, the experiments are designed to run in MareNostrum 4
cluster with the SLURM workload manager and the Omni-Path
interconnection network. In the future we plan to add
support for other clusters in order to execute the experiments in other
machines.
.\"#####################################################################
.NH 2
Isolation
.LP
The benchmark is designed so that both the compilation of every software
package and the execution of the experiment is performed under strict
conditions. We can ensure that two executions of the same experiment are
actually running the same program in the same software environment.
.PP
All the software used by an experiment is included in the
.I "nix store"
which is, by convention, located at the
.CI /nix
directory. Unfortunately, it is common for libraries to try to load
software from other paths like
.CI /usr
or
.CI /lib .
It is also common that configuration files are loaded from
.CW /etc
and from the home directory of the user that runs the experiment.
Additionally, some environment variables are recognized by the libraries
used in the experiment, which change their behavior. As we cannot
control the software and configuration files in those directories, we
couldn't guarantee that the execution behaves as intended.
.PP
In order to avoid this problem, we create a
.I sandbox
where only the files in the nix store are available (with some other
exceptions). Therefore, even if the libraries try to access any path
outside the nix store, they will find that the files are not there
anymore. Additionally, the environment variables are cleared before
entering the environment (with some exceptions as well).
.\"#####################################################################
.NH 2
Execution pipeline
.LP
Several predefined stages form the
.I standard
execution pipeline and are defined in the
.I stdPipeline
array. The standard pipeline prepares the resources and the environment
to run a program (usually in parallel) in the compute nodes. It is
divided in two main parts:
connecting to the target machine to submit a job and executing the job.
Finally, the complete execution pipeline ends by running the actual
program, which is not part of the standard pipeline, as should be
defined differently for each program.
.\"#####################################################################
.NH 3
Job submission
.LP
Some stages are involved in the job submission: the
.I trebuchet
stage connects via
.I ssh
to the target machine and executes the next stage there. Once in the
target machine, the
.I runexp
stage computes the output path to store the experiment results, using
the user in the target machine and changes the working directory there.
In MareNostrum 4 the output path is at
.CI /gpfs/projects/bsc15/garlic/$user/out .
Then the
.I isolate
stage is executed to enter the sandbox and the
.I experiment
stage begins, which creates a directory to store the experiment output,
and launches several
.I unit
stages.
.PP
Each unit executes a
.I sbatch
stage which runs the
.I sbatch(1)
program with a job script that simply calls the next stage. The
sbatch program internally reads the
.CW /etc/slurm/slurm.conf
file from outside the sandbox, so we must explicitly allow this file to
be available, as well as the
.I munge
socket used for authentication by the SLURM daemon. Once the jobs are
submitted to SLURM, the experiment stage ends and the trebuchet finishes
the execution. The jobs will be queued for execution without any other
intervention from the user.
.PP
The rationale behind running sbatch from the sandbox is because the
options provided in environment variables override the options from the
job script. Therefore, we avoid this problem by running sbatch from the
sandbox, where the interfering environment variables are removed. The
sbatch program is also provided in the
.I "nix store" ,
with a version compatible with the SLURM daemon running in the target
machine.
.\"#####################################################################
.NH 3
Job execution
.LP
Once an unit job has been selected for execution, SLURM
allocates the resources (usually several nodes) and then selects one of
the nodes to run the job script: it is not executed in parallel yet.
The job script runs from a child process forked from on of the SLURM
daemon processes, which are outside the sandbox. Therefore, we first run the
.I isolate
stage
to enter the sandbox again.
.PP
The next stage is called
.I control
and determines if enough data has been generated by the experiment unit
or if it should continue repeating the execution. At the current time,
it is only implemented as a simple loop that runs the next stage a fixed
amount of times (by default, it is repeated 30 times).
.PP
The following stage is
.I srun
which launches several copies of the next stage to run in
parallel (when using more than one task). Runs one copy per task,
effectively creating one process per task. The CPUs affinity is
configured by the parameter
.I --cpu-bind
and is important to set it correctly (see more details in the
.I srun(1)
manual). Appending the
.I verbose
value to the cpu bind option causes srun to print the assigned affinity
of each task, which is very valuable when examining the execution log.
.PP
The mechanism by which srun executes multiple processes is the same used
by sbatch, it forks from a SLURM daemon running in the computing nodes.
Therefore, the execution begins outside the sandbox. The next stage is
.I isolate
which enters again the sandbox in every task. All remaining stages are
running now in parallel.
.\" ###################################################################
.NH 3
The program
.LP
At this point in the execution, the standard pipeline has been
completely executed, and we are ready to run the actual program that is
the matter of the experiment. Usually, programs require some arguments
to be passed in the command line. The
.I exec
stage sets the arguments (and optionally some environment variables) and
executes the last stage, the
.I program .
.PP
The experimenters are required to define these last stages, as they
define the specific way in which the program must be executed.
Additional stages may be included before or after the program run, so
they can perform additional steps.
.\" ###################################################################
.NH 3
Stage overview
.LP
The complete execution pipeline using the standard pipeline is shown in
the Table 1. Some properties are also reflected about the execution
stages.
.DS L
.TS
center;
lB cB cB cB cB cB
l  c  c  c  c  c.
_
Stage     	Where	Safe	Copies	User	Std
_
trebuchet	*	no	no	yes	yes
runexp  	login	no	no	no	yes
isolate 	login	no	no	no	yes
experiment	login	yes	no	no	yes
unit    	login	yes	no	no	yes
sbatch  	login	yes	no	no	yes
_
isolate 	target	no	no	no	yes
control 	target	yes	no	no	yes
srun    	target	yes	no	no	yes
isolate    	target	no	yes	no	yes
_
exec    	target	yes	yes	no	no
program    	target	yes	yes	no	no
_
.TE
.DE
.QS
.SM
.B "Table 1" :
The stages of a complete execution pipeline. The
.I where
column determines where the stage is running,
.I safe
states if the stage begins the execution inside the sandbox,
.I user
if it can be executed directly by the user,
.I copies
if there are several instances running in parallel and
.I std
if is part of the standard execution pipeline.
.QE
.\" ###################################################################
.NH 2
Writing the experiment
.LP
The experiments are generally written in the
.I nix
language as it provides very easy management for the packages an their
customization. An experiment file is formed by several parts, which
produce the execution pipeline when built. The experiment file describes
a function (which is typical in nix) and takes as argument an
attribute set with some common packages, tools and options:
.CS
{ stdenv, bsc, stdexp, targetMachine, stages, garlicTools }:
.CE
The
.I bsc
attribute contains all the BSC and nixpkgs packages, as defined in the
overlay. The
.I stdexp
contains some useful tools and functions to build the experiments, like
the standard execution pipeline, so you don't need to redefine the
stages in every experiment. The configuration of the target machine is
specified in the
.I targetMachine
attribute which includes information like the number of CPUs per node or
the cache line length. It is used to define the experiments in such a
way that they are not tailored to an specific machine hardware
(sometimes this is not posible). All the execution stages are available
in the
.I stages
attribute which are used when some extra stage is required. And finally,
the
.I garlicTools
attribute provide some functions to aid common tasks when defining the
experiment configuration
.\" ###################################################################
.NH 3
Experiment configuration
.LP
The next step is to define some variables in a
.CI let
\&...
.CI in
\&...
.CI ;
construct, to be used later. The first one, is the variable
configuration of the experiment called
.I varConf ,
which include all
the factors that will be changed. All the attributes of this set
.I must
be arrays, even if they only contain one element:
.CS
varConf = {
  blocks = [ 1 2 4 ];
  nodes = [ 1 ];
};
.CE
In this example, the variable
.I blocks
will be set to the values 1, 2 and 4; while
.I nodes
will remain set to 1 always. These variables are used later to build the
experiment configuration. The
.I varConf
is later converted to a list of attribute sets, where every attribute
contains only one value, covering all the combinations (the Cartesian
product is computed):
.CS
[ { blocks = 1; nodes = 1; }
  { blocks = 2; nodes = 1; }
  { blocks = 4; nodes = 1; } ]
.CE
These configurations are then passed to the
.I genConf
function one at a time, which is the central part of the description of
the experiment:
.CS
genConf = var: fix (self: targetMachine.config // {
  expName = "example";
  unitName = self.expName + "-b" + toString self.blocks;
  blocks = var.blocks;
  cpusPerTask = 1;
  tasksPerNode = self.hw.socketsPerNode;
  nodes = var.nodes;
});
.CE
It takes as input
.I one
configuration from the Cartesian product, for example:
.CS
{ blocks = 2; nodes = 1; }
.CE
And returns the complete configuration for that input, which usually
expand the input configuration with some derived variables along with
other constant parameters. The return value can be inspected by calling
the function in the interactive
.I "nix repl"
session:
.CS
nix-repl> genConf { blocks = 2; nodes = 1; }
{
  blocks = 2;
  cpusPerTask = 1;
  expName = "example";
  hw = { ... };
  march = "skylake-avx512";
  mtune = "skylake-avx512";
  name = "mn4";
  nixPrefix = "/gpfs/projects/bsc15/nix";
  nodes = 1;
  sshHost = "mn1";
  tasksPerNode = 2;
  unitName = "example-b2";
}
.CE
Some configuration parameters were added by
.I targetMachine.config ,
such as the
.I nixPrefix ,
.I sshHost
or the
.I hw
attribute set, which are specific for the cluster they experiment is
going to run. Also, the
.I unitName
got assigned the proper name based on the number of blocks, but the
number of tasks per node were assigned based on the hardware description
of the target machine.
.PP
By following this rule, the experiments can easily be ported to machines
with other hardware characteristics, and we only need to define the
hardware details once. Then all the experiments will be updated based on
those details.
.\" ###################################################################
.NH 3
Adding the stages
.LP
Once the configuration is ready, it will be passed to each stage of the
execution pipeline which will take the parameters it needs. The
connection between the parameters and how they are passed to each stage
is done either by convention or manually. There is a list of parameters that
are recognized by the standard pipeline stages. For example the
attribute
.I nodes ,
it is recognized as the number of nodes in the standard
.I sbatch
stage when allocating resources:
.DS L
.TS
center;
lB lB cB cB lB
l  l  c  c  l.
_
Stage	Attribute     	Std	Req	Description
_
*	nixPrefix	yes	yes	Path to the nix store in the target
unit	expName     	yes	yes	Name of the experiment
unit	unitName     	yes	yes	Name of the unit
control	loops   	yes	yes	Number of runs of each unit
sbatch	cpusPerTask 	yes	yes	Number of CPUs per task (process)
sbatch	jobName   	yes	yes	Name of the job
sbatch	nodes   	yes	yes	Number of nodes allocated
sbatch	ntasksPerNode	yes	yes	Number of tasks (processes) per node
sbatch	qos     	yes	no	Name of the QoS queue
sbatch	reservation	yes	no	Name of the reservation
sbatch	time    	yes	no	Maximum allocated time (string)
_
exec	argv    	no	no	Array of arguments to execve
exec	env     	no	no	Environment variable settings
exec	pre     	no	no	Code before the execution
exec	post     	no	no	Code after the execution
_
.TE
.DE
.QS
.SM
.B "Table 2" :
The attributes recognized by the stages in the execution pipeline. The
column
.I std
indicates if they are part of the standard execution pipeline. Some
attributes are required as indicated by the
.I req
column.
.QE
.LP
Other attribute names can be used to specify custom information used in
additional stages. The two most common stages required to complete the
pipeline are the
.I exec
and the
.I program .
Let see an example of
.I exec :
.CS
exec = {nextStage, conf, ...}: stages.exec {
  inherit nextStage;
  argv = [ "--blocks" conf.blocks ];
};
.CE
The
.I exec
stage is defined as a function that uses the predefined
.I stages.exec
stage, which accepts the
.I argv
array, and sets the argv of the program. In our case, we fill the
.I argv
array by setting the
.I --blocks
parameter to the number of blocks, specified in the configuration in the
attribute
.I blocks .
The name of this attribute can be freely choosen, as long as the
.I exec
stage refers to it properly. The
.I nextStage
attribute is mandatory in all stages, and is automatically set when
building the pipeline.
.PP
The last step is to configure the actual program to be executed,
which can be specified as another stage:
.CS
program = {nextStage, conf, ...}: bsc.apps.example;
.CE
Notice that this function only returns the
.I bsc.apps.example
derivation, which will be translated to the path where the example
program is installed. If the program is located inside a directory
(typically
.I bin ),
it must define the attribute
.I programPath
in the
.I bsc.apps.example
derivation, which points to the executable program. An example:
.CS
stdenv.mkDerivation {
\&  ...
  programPath = "/bin/example";
\&  ...
};
.CE
.\" ###################################################################
.NH 3
Building the pipeline
.LP
With the
.I exec
and
.I program
stages defined and the ones provided by the standard pipeline, the
complete execution pipeline can be formed. To do so, the stages are
placed in an array, in the order they will be executed:
.CS
pipeline = stdexp.stdPipeline ++ [ exec program ];
.CE
The attribute
.I stdexp.stdPipeline
contains the standard pipeline stages, and we only append our two
defined stages
.I exec
and
.I program .
The
.I pipeline
is an array of functions, and must be transformed in something that can
be executed in the target machine. For that purpose, the
.I stdexp
provides the
.I genExperiment
function, which takes the
.I pipeline
array and the list of configurations and builds the execution pipeline:
.CS
stdexp.genExperiment { inherit configs pipeline; }
.CE
The complete example experiment can be shown here:
.CS
{ stdenv, stdexp, bsc, targetMachine, stages }:
with stdenv.lib;
let
  # Initial variable configuration
  varConf = {
    blocks = [ 1 2 4 ];
    nodes = [ 1 ];
  };
  # Generate the complete configuration for each unit
  genConf = c: targetMachine.config // rec {
    expName = "example";
    unitName = "${expName}-b${toString blocks}";
    inherit (targetMachine.config) hw;
    inherit (c) blocks nodes;
    loops = 30;
    ntasksPerNode = hw.socketPerNode;
    cpusPerTask = hw.cpusPerSocket;
    jobName = unitName;
  };
  # Compute the array of configurations
  configs = stdexp.buildConfigs {
    inherit varConf genConf;
  };
  exec = {nextStage, conf, ...}: stages.exec {
    inherit nextStage;
    argv = [ "--blocks" conf.blocks ];
  };
  program = {nextStage, conf, ...}: bsc.garlic.apps.example;
  pipeline = stdexp.stdPipeline ++ [ exec program ];
in
  stdexp.genExperiment { inherit configs pipeline; }
.CE
.\" ###################################################################
.NH 3
Adding the experiment to the index
.LP
The experiment file must be located in a named directory inside the
.I garlic/exp
directory. The name is usually the program name. Once the experiment is
placed in a nix file, it must be added to the index of experiments, so
it can be build. The index is hyerarchically organized as attribute
sets, with
.I exp
containing all the experiments;
.I exp.example
the experiments of the
.I example
program; and
.I exp.example.test1
referring to the
.I test1
experiment of the
.I example
program. Additional attributes can be added, like
.I exp.example.test1.variantA
to handle more details.
.PP
For this example we are going to use the attribute path
.I exp.example.test
and add it to the index, in the
.I garlic/exp/index.nix
file. We append to the end of the attribute set, the following
definition:
.CS
\&...
  example = {
    test = callPackage ./example/test.nix { };
  };
}
.CE
The experiment can now be built with:
.CS
builder% nix-build -A exp.example.test
.CE
.\" ###################################################################
.NH 2
Recommendations
.PP
The complete results generally take a long time to be finished, so it is
advisable to design the experiments iteratively, in order to quickly
obtain some feedback. Some recommendations:
.BL
.LI
Start with one unit only.
.LI
Set the number of runs low (say 5) but more than one.
.LI
Use a small problem size, so the execution time is low.
.LI
Set the time limit low, so deadlocks are caught early.
.LE
.PP
As soon as the first runs are complete, examine the results and test
that everything looks good. You would likely want to check:
.BL
.LI
The resources where assigned as intended (nodes and CPU affinity).
.LI
No errors or warnings: look at stderr and stdout logs.
.LI
If a deadlock happens, it will run out of the time limit.
.LE
.PP
As you gain confidence over that the execution went as planned, begin
increasing the problem size, the number of runs, the time limit and
lastly the number of units. The rationale is that each unit that is
shared among experiments gets assigned the same hash. Therefore, you can
iteratively add more units to an experiment, and if they are already
executed (and the results were generated) is reused.
.\" ###################################################################
.bp
.NH 1
Post-processing
.LP
After the correct execution of an experiment the results are stored for
further investigation. Typically the time of the execution or other
quantities are measured and presented later in a figure (generally a
plot or a table). The
.I "postprocess pipeline"
consists of all the steps required to create a set of figures from the
results. Similarly to the execution pipeline where several stages run
sequentially,
.[
garlic execution
.]
the postprocess pipeline is also formed by multiple stages executed
in order.
.PP
The rationale behind dividing execution and postprocess is
that usually the experiments are costly to run (they take a long time to
complete) while generating a figure require less time. Refining the
figures multiple times reusing the same experimental results doesn't
require the execution of the complete experiment, so the experimenter
can try multiple ways to present the data without waiting a large delay.
.NH 2
Results
.LP
The results are generated in the same
.I "target"
machine where the experiment is executed and are stored in the garlic
\fCout\fP
directory, organized into a tree structure following the experiment
name, the unit name and the run number (governed by the
.I control
stage):
.DS L
\fC
|-- 6lp88vlj7m8hvvhpfz25p5mvvg7ycflb-experiment
|   |-- 8lpmmfix52a8v7kfzkzih655awchl9f1-unit 
|   |   |-- 1 
|   |   |   |-- stderr.log
|   |   |   |-- stdout.log
|   |   |   |-- ...
|   |   |-- 2 
\&...
\fP
.DE
In order to provide an easier access to the results, an index is also
created by taking the
.I expName
and
.I unitName
attributes (defined in the experiment configuration) and linking them to
the appropriate experiment and unit directories. These links are
overwritten by the last experiment with the same names so they are only
valid for the last execution. The out and index directories are
placed into a per-user directory, as we cannot guarantee the complete
execution of each unit when multiple users share units.
.PP
The messages printed to 
.I stdout
and
.I stderr
are stored in the log files with the same name inside each run
directory. Additional data is sometimes generated by the experiments,
and is found in each run directory. As the generated data can be very
large, is ignored by default when fetching the results.
.NH 2
Fetching the results
.LP
Consider a program of interest for which an experiment has been designed to
measure some properties that the experimenter wants to present in a
visual plot. When the experiment is launched, the execution
pipeline (EP) is completely executed and it will generate some
results. In this escenario, the execution pipeline depends on the
program\[em]any changes in the program will cause nix to build the
pipeline again
using the updated program. The results will also depend on the
execution pipeline as well as the postprocess pipeline (PP) and the plot
on the results. This chain of dependencies can be shown in the
following dependency graph:
.PS
circlerad=0.22;
linewid=0.3;
right
circle "Prog"
arrow
circle "EP"
arrow
circle "Result"
arrow
circle "PP"
arrow
circle "Plot"
.PE
Ideally, the dependencies should be handled by nix, so it can detect any
change and rebuild the necessary parts automatically. Unfortunately, nix
is not able to build the result as a derivation directly, as it requires
access to the
.I "target"
machine with several user accounts. In order to let several users reuse
the same results from a shared cache, we would like to use the
.I "nix store" .
.PP
To generate the results from the
experiment, we add some extra steps that must be executed manually:
.PS
circle "Prog"
arrow
diag=linewid + circlerad;
far=circlerad*3 + linewid*4
E: circle "EP"
R: circle "Result" at E + (far,0)
RUN: circle "Run" at E + (diag,-diag) dashed
FETCH: circle "Fetch" at R + (-diag,-diag) dashed
move to R.e
arrow
P: circle "PP"
arrow
circle "Plot"
arrow dashed from E to RUN chop
arrow dashed from RUN to FETCH chop
arrow dashed from FETCH to R chop
arrow from E to R chop
.PE
The run and fetch steps are provided by the helper tool
.I "garlic(1)" ,
which launches the experiment using the user credentials at the
.I "target"
machine and then fetches the results, placing them in a directory known
by nix.  When the result derivation needs to be built, nix will look in
this directory for the results of the execution. If the directory is not
found, a message is printed to suggest the user to launch the experiment
and the build process is stopped. When the result is successfully built
by any user, is stored in the
.I "nix store"
and it won't need to be rebuilt again until the experiment changes, as
the hash only depends on the experiment and not on the contents of the
results.
.PP
Notice that this mechanism violates the deterministic nature of the nix
store, as from a given input (the experiment) we can generate different
outputs (each result from different executions). We knowingly relaxed
this restriction by providing a guarantee that the results are
equivalent and there is no need to execute an experiment more than once.
.PP
To force the execution of an experiment you can use the
.I rev
attribute which is a number assigned to each experiment
and can be incremented to create copies that only differs on that
number. The experiment hash will change but the experiment will be the
same, as long as the revision number is ignored along the execution
stages.
.NH 2
Postprocess stages
.LP
Once the results are completely generated in the
.I "target"
machine there are several stages required to build a set of figures:
.PP
.I fetch \[em]
waits until all the experiment units are completed and then executes the
next stage. This stage is performed by the
.I garlic(1)
tool using the
.I -F
option and also reports the current state of the execution.
.PP
.I store \[em]
copies from the
.I target
machine into the nix store all log files generated by the experiment, 
keeping the same directory structure. It tracks the execution state of
each unit and only copies the results once the experiment is complete.
Other files are ignored as they are often very large and not required
for the subsequent stages.
.PP
.I timetable \[em]
converts the results of the experiment into a NDJSON file with one
line per run for each unit. Each line is a valid JSON object, containing
the
.I exp ,
.I unit
and
.I run
keys and the unit configuration (as a JSON object) in the
.I config
key. The execution time is captured from the standard output and is
added in the
.I time
key.
.PP
.I merge \[em]
one or more timetable datasets are joined, by simply concatenating them.
This step allows building one dataset to compare multiple experiments in
the same figure.
.PP
.I rPlot \[em]
one ot more figures are generated by a single R script
.[
r cookbook
.]
which takes as input the previously generated dataset.
The path of the dataset is recorded in the figure as well, which
contains enough information to determine all the stages in the execution
and postprocess pipelines.
.NH 2
Current setup
.LP
As of this moment, the
.I build
machine which contains the nix store is
.I xeon07
and the
.I "target"
machine used to run the experiments is Mare Nostrum 4 with the
.I output
directory placed at
.CW /gpfs/projects/bsc15/garlic .
By default, the experiment results are never deleted from the
.I target
so you may want to remove the ones already stored in the nix store to
free space.
.\" ###################################################################
.bp
.SH 1
Appendix A: Branch name diagram
.LP
.TAG appendixA
.DS B
.SM
.PS 4.4/25.4
copy "gitbranch.pic"
.PE
.DE