From e8d884a627b5d218a579283927497e3eeab9db07 Mon Sep 17 00:00:00 2001
From: Rodrigo Arias Mallo
Date: Wed, 7 Oct 2020 18:34:08 +0200
Subject: [PATCH] Document the execution pipeline

---
 garlic/doc/Makefile     |   9 ++
 garlic/doc/execution.ms | 203 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 212 insertions(+)
 create mode 100644 garlic/doc/Makefile
 create mode 100644 garlic/doc/execution.ms

diff --git a/garlic/doc/Makefile b/garlic/doc/Makefile
new file mode 100644
index 0000000..f768139
--- /dev/null
+++ b/garlic/doc/Makefile
@@ -0,0 +1,9 @@
+all: execution.pdf execution.txt
+
+%.pdf: %.ms
+	groff -ms -tbl -Tpdf $^ > $@
+	#pdfms $^ 2>&1 >$@ | sed 's/^troff: //g'
+	killall -HUP mupdf
+
+%.txt: %.ms
+	groff -ms -tbl -Tutf8 $^ > $@
diff --git a/garlic/doc/execution.ms b/garlic/doc/execution.ms
new file mode 100644
index 0000000..b38f81c
--- /dev/null
+++ b/garlic/doc/execution.ms
@@ -0,0 +1,203 @@
+.TL
+Garlic execution
+.AU
+Rodrigo Arias Mallo
+.AI
+Barcelona Supercomputing Center
+.AB
+.LP
+This document covers the execution of experiments in the Garlic
+benchmark, which are performed under strict conditions. The various
+stages of the execution are documented so that the experimenter can get
+a global overview of how the benchmark runs under the hood.
+During the execution of the experiments, the results are
+stored in a file which will be used in later processing steps.
+.AE
+.\"#####################################################################
+.nr GROWPS 3
+.nr PSINCR 1.5p
+.\".nr PD 0.5m
+.nr PI 2m
+\".2C
+.\"#####################################################################
+.NH 1
+Introduction
+.LP
+Every experiment in the Garlic
+benchmark is controlled by one
+.I nix
+file.
+An experiment consists of several shell scripts which are executed
+sequentially and perform the tasks needed to set up the
+.I "execution environment" ,
+finally launching the actual program that is being analyzed.
+The scripts that prepare the environment and the program itself are
+called the
+.I stages
+of the execution, which altogether form the
+.I "execution pipeline"
+or simply the
+.I pipeline .
+The experimenter must know in detail all the stages involved in the
+pipeline, as they can have a great impact on the result of the
+execution.
+.PP
+The experiments depend heavily on the cluster where they run, as the
+results are strongly affected by the machine. The software used for the
+benchmark is carefully configured for the hardware used in the
+execution. In particular, the experiments are designed to run on the
+MareNostrum 4 cluster with the SLURM workload manager. In the future we
+plan to add support for other clusters, so that the experiments can be
+executed on other machines.
+.\"#####################################################################
+.NH 1
+Isolation
+.LP
+The benchmark is designed so that both the compilation of every software
+package and the execution of the experiment are performed under strict
+conditions. Therefore, we can guarantee that two executions of the same
+experiment actually run the same program in the same environment.
+.PP
+All the software used by an experiment is included in the
+.I "nix store"
+which is, by convention, located in the
+.CW /nix
+directory. Unfortunately, it is common for libraries to try to load
+software from other paths like
+.CW /usr
+or
+.CW /lib .
+It is also common for configuration files to be loaded from
+.CW /etc
+and from the home directory of the user that runs the experiment.
+Additionally, some environment variables are recognized by the libraries
+used in the experiment and change their behavior. As we cannot
+control the software and configuration files in those directories, we
+cannot guarantee that the execution behaves as intended.
+.PP
+In order to avoid this problem, we create a secure
+.I sandbox
+where only the files in the nix store are available (with some
+exceptions). Therefore, even if the libraries try to access any path
+outside the nix store, they will find that the files are not there
+anymore.
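+.PP
+As an illustration only, the effect of the sandbox is roughly the same
+as running the next stage under a bind-mounting tool such as
+.I bwrap(1) ,
+where only the nix store and a few required paths are visible. The
+actual
+.I isolate
+stage may rely on a different mechanism, and the
+.CW stage.sh
+name below is hypothetical:
+.DS I
+.CW "# sketch only, not the actual isolate implementation"
+.CW "bwrap --ro-bind /nix /nix --dev /dev --proc /proc sh stage.sh"
+.DE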
+.\"#####################################################################
+.NH 1
+Execution stages
+.LP
+There are several predefined stages which form the
+.I standard
+execution pipeline. The standard pipeline is divided into two main
+parts: 1) connecting to the target machine and submitting a job to
+SLURM, and 2) executing the job itself.
+.NH 2
+Job submission
+.LP
+Three stages are involved in the job submission. The
+.I trebuchet
+stage connects via
+.I ssh
+to the target machine and executes the next stage there. Once on the
+target machine, the
+.I isolate
+stage is executed to enter the sandbox. Finally, the
+.I sbatch
+stage runs the
+.I sbatch(1)
+program with a job script that simply executes the next stage. The
+sbatch program reads the
+.CW /etc/slurm/slurm.conf
+file from outside the sandbox, so we must explicitly make this file
+available inside the sandbox, as well as the
+.I munge
+socket, which is used for authentication.
+.PP
+The rationale behind running sbatch from the sandbox is that options
+provided in environment variables override the options set in the job
+script. By running sbatch from the sandbox, where potentially dangerous
+environment variables have been removed, we avoid this problem.
+.NH 2
+Setting up the environment
+.LP
+Once the job has been selected for execution, the SLURM daemon allocates
+the resources and then selects one of the nodes to run the job script
+(the job script itself is not executed in parallel). Additionally, the
+job script is executed by a child process, forked from one of the SLURM
+processes, which is outside the sandbox. Therefore, we first run the
+.I isolate
+stage
+to enter the sandbox again.
+.PP
+The next stage is called
+.I control
+and determines whether enough data has been generated by the experiment
+or whether it should continue repeating the execution. At the current
+time, it is only implemented as a simple loop that runs the next stage a
+fixed number of times.
+.PP
+The following stage is
+.I srun ,
+which usually launches several copies of the next stage to run in
+parallel (when using more than one task), creating one process per task.
+The set of CPUs available to each process is controlled by the
+.I --cpu-bind
+parameter, which is crucial to set correctly and is documented in the
+.I srun(1)
+manual. Appending the
+.I verbose
+value to the cpu bind option causes srun to print the affinity assigned
+to each task, so that it can be reviewed in the execution log.
+.PP
+The mechanism by which srun executes multiple processes is the same as
+the one used by sbatch: the processes are forked from a SLURM daemon
+running in the compute nodes. Therefore, the execution begins outside
+the sandbox. The next stage is
+.I isolate ,
+which enters the sandbox again in every task (from this point on, all
+stages run in parallel).
+.PP
+At this point in the execution, we are ready to run the actual program
+that is the subject of the experiment. Usually, the program requires
+some options to be passed on the command line. The
+.I argv
+stage sets the arguments and, optionally, some environment variables,
+and then executes the last stage, the
+.I program .
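+.PP
+As a rough illustration only, the last stages could reduce to something
+like the following; the program name, its options and the environment
+variable are hypothetical, and the real stages are generated scripts
+rather than commands typed by hand:
+.DS I
+.CW "srun --cpu-bind=cores,verbose ./isolate"
+.CW "# then, inside the sandbox, the argv stage runs:"
+.CW "export OMP_NUM_THREADS=48"
+.CW "exec ./program --size 1024"
+.DE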
+.NH 2
+Stage overview
+.LP
+The standard execution pipeline contains the stages listed in Table 1,
+in order of execution. Additional stages can be placed before the argv
+stage to modify the execution; debugging programs and other wrappers are
+usually included there (a sketch of such a stage is shown after the
+table).
+.KF
+.TS
+center;
+lB cB cB cB
+l c c c.
+_
+Stage	Target	Safe	Copies
+_
+trebuchet	no	no	no
+isolate	yes	no	no
+sbatch	yes	yes	no
+isolate	yes	no	no
+control	yes	yes	no
+srun	yes	yes	no
+isolate	yes	no	yes
+argv	yes	yes	yes
+program	yes	yes	yes
+_
+.TE
+.QP
+.B "Table 1" :
+The stages of a standard execution pipeline. The
+.B target
+column states whether the stage runs in the target cluster;
+.B safe
+whether the stage runs inside the sandbox, and
+.B copies
+whether several instances of the stage run in parallel.
+.QE
+.KE
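+.PP
+As an example, a debugging stage placed before
+.I argv
+could wrap the rest of the pipeline with a tool such as valgrind. The
+script below is only a sketch and does not correspond to any stage
+shipped with the benchmark; the
+.CW next-stage
+name is hypothetical:
+.DS I
+.CW "#!/bin/sh"
+.CW "# sketch of an extra debugging stage, not a real one"
+.CW "exec valgrind ./next-stage"
+.DE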