.TL
Garlic execution
.AU
Rodrigo Arias Mallo
.AI
Barcelona Supercomputing Center
.AB
.LP
This document covers the execution of experiments in the Garlic
benchmark, which are performed under strict conditions. The several
stages of the execution are documented, so that the experimenter can get
a global overview of how the benchmark runs under the hood.
During the execution of the experiments, the results are
stored in a file which is used in later processing steps.
.AE
.\"#####################################################################
.nr GROWPS 3
.nr PSINCR 1.5p
.\".nr PD 0.5m
.nr PI 2m
\".2C
.\"#####################################################################
.NH 1
Introduction
.LP
Every experiment in the Garlic
benchmark is controlled by one
.I nix
file.
An experiment consists of several shell scripts which are executed
sequentially and perform several tasks to set up the
.I "execution environment" ,
and finally launch the actual program that is being analyzed.
The scripts that prepare the environment and the program itself are
called the
.I stages
of the execution, which altogether form the
.I "execution pipeline"
or simply the
.I pipeline .
The experimenter must know in detail all the stages
involved in the pipeline, as they can have a great impact on the
result of the execution.
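.PP
As a rough sketch of this chaining (the store path and the environment
variable shown here are hypothetical), a generated stage script performs
its task and then hands control to the next stage:
.DS L
.ft CW
#!/bin/sh
# Hypothetical stage: perform this stage's task, then run the
# next stage of the pipeline.
export EXAMPLE_VAR=1
/nix/store/...-next-stage
.ft
.DE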
.PP
The experiments have a strong dependency on the cluster where they
run, as the results are heavily affected by it. The software used for
the benchmark is carefully configured for the hardware used in the
execution. In particular, the experiments are designed to run on the
MareNostrum 4 cluster with the SLURM workload manager. In the future we
plan to add support for other clusters, in order to execute the
experiments on other machines.
.\"#####################################################################
.NH 1
Isolation
.LP
The benchmark is designed so that both the compilation of every software
package and the execution of the experiment are performed under strict
conditions. Therefore, we can guarantee that two executions
of the same experiment actually run the same program in the same
environment.
.PP
All the software used by an experiment is included in the
.I "nix store"
which is, by convention, located in the
.CW /nix
directory. Unfortunately, it is common for libraries to try to load
software from other paths like
.CW /usr
or
.CW /lib .
It is also common that configuration files are loaded from
.CW /etc
and from the home directory of the user that runs the experiment.
Additionally, some environment variables are recognized by the libraries
used in the experiment and change their behavior. As we cannot
control the software and configuration files in those directories, we
cannot guarantee that the execution behaves as intended.
.PP
In order to avoid this problem, we create a secure
.I sandbox
where only the files in the nix store are available (with a few
exceptions). Therefore, even if the libraries try to access any path
outside the nix store, they will find that the files are not there.
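.PP
As a rough illustration (the exact set of visible paths depends on how
the sandbox is configured), a shell running inside the sandbox would
observe something like the following:
.DS L
.ft CW
# Inside the sandbox only the nix store (and the few allowed
# exceptions) is visible; host directories are not found.
ls /usr    # fails: No such file or directory
ls /nix    # succeeds: the nix store is available
.ft
.DE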
.\"#####################################################################
.NH 1
Execution stages
.LP
There are several predefined stages which form the
.I standard
execution pipeline. The standard pipeline is divided into two main
parts: 1) connecting to the target machine and submitting a job to
SLURM, and 2) executing the job itself.
.NH 2
Job submission
.LP
Three stages are involved in the job submission. The
.I trebuchet
stage connects via
.I ssh
to the target machine and executes the next stage there. Once in the
target machine, the
.I isolate
stage is executed to enter the sandbox. Finally, the
.I sbatch
stage runs the
.I sbatch(1)
program with a job script which simply executes the next stage. The
sbatch program reads the
.CW /etc/slurm/slurm.conf
file from outside the sandbox, so we must explicitly make this file
available inside the sandbox, as well as the
.I munge
socket, which is used for authentication.
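.PP
Conceptually, the submission part of the pipeline behaves like the
following sketch, where the host name, the store paths and the job
script name are hypothetical:
.DS L
.ft CW
# trebuchet (local machine): run the next stage on the target
ssh target-cluster /nix/store/...-isolate

# isolate (target machine): enter the sandbox, exposing only the
# nix store, /etc/slurm/slurm.conf and the munge socket.

# sbatch (inside the sandbox): submit the job script, which
# simply executes the next stage once the job starts.
sbatch job.sh
.ft
.DE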
.PP
The rationale behind running sbatch from the sandbox is that the options
provided in environment variables override the options set in the job
script. By running sbatch from the sandbox, where potentially dangerous
environment variables have been removed, we avoid this problem.
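.PP
As a concrete example of this behavior, the
.I sbatch(1)
manual documents input environment variables such as
.CW SBATCH_TIMELIMIT ,
which take precedence over the equivalent directive in the job script:
.DS L
.ft CW
# A variable left in the submitting environment would silently
# override the "#SBATCH --time=..." directive of the job script.
SBATCH_TIMELIMIT=10 sbatch job.sh
.ft
.DE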
.NH 2
Setting up the environment
.LP
Once the job has been selected for execution, the SLURM daemon allocates
the resources and then selects one of the nodes to run the job script
(it is not executed in parallel). Additionally, the job script is
executed by a child process, forked from one of the SLURM processes,
which is outside the sandbox. Therefore, we first run the
.I isolate
stage
to enter the sandbox again.
.PP
The next stage is called
.I control
and determines whether enough data has been generated by the experiment
or whether it should continue repeating the execution. At the current
time, it is only implemented as a simple loop that runs the next stage a
fixed number of times.
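.PP
A minimal sketch of this loop, assuming a hypothetical variable holding
the fixed number of repetitions, would be:
.DS L
.ft CW
# control: repeat the next stage a fixed number of times
i=0
while [ "$i" -lt "$repetitions" ]; do
    /nix/store/...-next-stage
    i=$((i + 1))
done
.ft
.DE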
.PP
The following stage is
.I srun ,
which launches one copy of the next stage per task, effectively creating
one process per task; when more than one task is used, the copies run in
parallel. The set of CPUs available to each process is controlled by the
.I --cpu-bind
parameter, and it is crucial to set it correctly; it is documented in the
.I srun(1)
manual. Appending the
.I verbose
value to the cpu bind option causes srun to print the assigned affinity
of each task, so that it can be reviewed in the execution log.
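.PP
For example (the binding type is only illustrative and must be adapted
to the experiment), an invocation that binds each task to its own cores
and reports the affinity may look like:
.DS L
.ft CW
# One process per task; the verbose flag prints the affinity
# of every task in the execution log.
srun --cpu-bind=verbose,cores /nix/store/...-next-stage
.ft
.DE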
.PP
The mechanism by which srun executes multiple processes is the same as
the one used by sbatch: it forks from a SLURM daemon running on the
compute nodes. Therefore, the execution begins outside the sandbox. The
next stage is
.I isolate ,
which enters the sandbox again in every task (from this point on, all
stages run in parallel).
.PP
At this point in the execution, we are ready to run the actual program
that is the subject of the experiment. Usually, the program requires
some arguments to be passed on the command line. The
.I argv
stage sets the arguments and, optionally, some environment variables,
and then executes the last stage, the
.I program .
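.PP
A minimal sketch of such a stage (the environment variable and the
arguments are hypothetical) would be:
.DS L
.ft CW
# argv: set the environment and the arguments, then run the
# actual program under study.
export OMP_NUM_THREADS=4
exec ./program --size 1024 --iterations 10
.ft
.DE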
.NH 2
Stage overview
.LP
The standard execution pipeline contains the stages listed in Table 1,
ordered by execution time. Additional stages can be placed before the
argv stage to modify the execution; debugging programs and other options
are usually included there.
.KF
.TS
center;
lB cB cB cB
l c c c.
_
Stage Target Safe Copies
_
trebuchet no no no
isolate yes no no
sbatch yes yes no
isolate yes no no
control yes yes no
srun yes yes no
isolate yes no yes
argv yes yes yes
program yes yes yes
_
.TE
.QP
.B "Table 1" :
The stages of a standard execution pipeline. The
.B target
column indicates whether the stage runs in the target cluster;
.B safe
whether it runs inside the sandbox; and
.B copies
whether several instances of the stage run in parallel.
.QE
.KE