.TL
Garlic: the execution pipeline
.AU
Rodrigo Arias Mallo
.AI
Barcelona Supercomputing Center
.AB
.LP
This document covers the execution of experiments in the Garlic
benchmark, which are performed under strict conditions. The several
stages of the execution are documented, so that the experimenter can
have a global overview of how the benchmark runs under the hood.
The results of the experiment are stored at a known path to be used in
subsequent processing steps.
.AE
.\"#####################################################################
.nr GROWPS 3
.nr PSINCR 1.5p
.\".nr PD 0.5m
.nr PI 2m
\".2C
.\"#####################################################################
.NH 1
Introduction
.LP
Every experiment in the Garlic
benchmark is controlled by a single
.I nix
file placed in the
.CW garlic/exp
subdirectory.
Experiments are formed by several
.I "experimental units"
or simply
.I units .
A unit corresponds to one unique configuration of the experiment
(typically one point in the cartesian product of all factors) and
consists of several shell scripts executed sequentially to set up the
.I "execution environment"
and finally launch the actual program being analyzed.
The scripts that prepare the environment, as well as the program
itself, are called the
.I stages
of the execution and altogether form the
.I "execution pipeline"
or simply the
.I pipeline .
The experimenter must know in detail all the stages
involved in the pipeline, as they have a large impact on the execution.
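.PP
To illustrate the idea, here is a minimal sketch of a stage script;
the store path of the next stage is a hypothetical placeholder, not
taken from the actual benchmark. Each stage performs its own setup and
then hands control over to the next stage.
.DS I
.ft CW
#!/bin/sh
# Hypothetical stage: prepare one aspect of the execution
# environment, then chain to the next stage.
export EXAMPLE_VAR=1           # setup performed by this stage
exec /nix/store/...-next-stage # replace this process with the next stage
.ft
.DE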
.PP
Additionally, the execution time is affected by the target machine on
which the experiments run. The software used for the benchmark is
carefully configured and tuned for the hardware used in the execution;
in particular, the experiments are designed to run on the MareNostrum 4
cluster with the SLURM workload manager and the Omni-Path
interconnection network. In the future we plan to add
support for other clusters in order to execute the experiments on other
machines.
.\"#####################################################################
.NH 1
Isolation
.LP
The benchmark is designed so that both the compilation of every software
package and the execution of the experiment are performed under strict
conditions. We can thus ensure that two executions of the same
experiment actually run the same program in the same software
environment.
.PP
All the software used by an experiment is included in the
.I "nix store"
which is, by convention, located at the
.CW /nix
directory. Unfortunately, it is common for libraries to try to load
software from other paths like
.CW /usr
or
.CW /lib .
It is also common that configuration files are loaded from
.CW /etc
and from the home directory of the user that runs the experiment.
Additionally, some environment variables are recognized by the libraries
used in the experiment, which change their behavior. As we cannot
control the software and configuration files in those directories, we
cannot guarantee that the execution behaves as intended.
.PP
In order to avoid this problem, we create a
.I sandbox
where only the files in the nix store are available (with a few
exceptions). Therefore, even if the libraries try to access any path
outside the nix store, they will find that the files are not there.
Additionally, the environment variables are cleared before
entering the sandbox (with some exceptions as well).
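.PP
Conceptually, entering the sandbox can be pictured as the following
sketch, which is not the actual implementation of the
.I isolate
stage: the environment is emptied, a minimal set of variables is set
again, and the next stage (a hypothetical store path here) is executed.
.DS I
.ft CW
#!/bin/sh
# Conceptual sketch only: clear all environment variables,
# keep a minimal set, and run the next stage.
exec env -i HOME="$HOME" TERM="$TERM" /nix/store/...-next-stage
.ft
.DE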
.\"#####################################################################
.NH 1
Execution pipeline
.LP
Several predefined stages form the
.I standard
execution pipeline and are defined in the
.I stdPipeline
array. The standard pipeline prepares the resources and the environment
to run a program (usually in parallel) on the compute nodes. It is
divided into two main parts:
connecting to the target machine to submit a job and executing the job.
Finally, the complete execution pipeline ends by running the actual
program, which is not part of the standard pipeline, as it must be
defined specifically for each program.
.NH 2
Job submission
.LP
Several stages are involved in the job submission: the
.I trebuchet
stage connects via
.I ssh
to the target machine and executes the next stage there. Once on the
target machine, the
.I runexp
stage computes the output path in which to store the experiment
results, based on the user name on the target machine, and changes the
working directory to it.
On MareNostrum 4 the output path is at
.CW /gpfs/projects/bsc15/garlic/$user/out .
Then the
.I isolate
stage is executed to enter the sandbox, and the
.I experiment
stage begins, which creates a directory to store the experiment output
and launches several
.I unit
stages.
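.PP
In essence, the trebuchet performs a remote execution similar to the
following; the host name and the store path are hypothetical
placeholders.
.DS I
.ft CW
# Connect to the target machine and run the next stage there.
ssh mn1.example.com /nix/store/...-runexp
.ft
.DE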
.PP
Each unit executes an
.I sbatch
stage, which runs the
.I sbatch(1)
program with a job script that simply calls the next stage. The
sbatch program internally reads the
.CW /etc/slurm/slurm.conf
file from outside the sandbox, so we must explicitly allow this file to
be available, as well as the
.I munge
socket used for authentication by the SLURM daemon. Once the jobs are
submitted to SLURM, the experiment stage ends and the trebuchet finishes
the execution. The jobs remain queued for execution without any further
intervention from the user.
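.PP
The following sketch shows the shape of such a job script; the option
values and the store path are hypothetical examples, not those
generated by the benchmark.
.DS I
.ft CW
#!/bin/sh
#SBATCH --job-name=unit
#SBATCH --nodes=2
#SBATCH --exclusive
# The job script defers all the work to the next stage, which
# re-enters the sandbox on the compute node.
exec /nix/store/...-isolate
.ft
.DE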
.PP
The rationale behind running sbatch from the sandbox is that options
provided in environment variables override the options from the job
script; since the sandbox clears the environment, the interfering
variables are removed. The
sbatch program is also provided in the
.I "nix store" ,
with a version compatible with the SLURM daemon running on the target
machine.
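.PP
As an example of the interference the sandbox avoids, the
.CW SBATCH_PARTITION
environment variable recognized by sbatch silently overrides the
partition requested in the job script (here
.CW job.sh
is a stand-in name):
.DS I
.ft CW
# Outside the sandbox, this submission ignores the partition
# set in job.sh and uses "debug" instead.
SBATCH_PARTITION=debug sbatch job.sh
.ft
.DE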
.NH 2
Job execution
.LP
Once a unit job has been selected for execution, SLURM
allocates the resources (usually several nodes) and then selects one of
the nodes to run the job script: it is not executed in parallel yet.
The job script runs as a child process forked from one of the SLURM
daemon processes, which are outside the sandbox. Therefore, we first run the
.I isolate
stage
to enter the sandbox again.
.PP
The next stage is called
.I control
and determines whether enough data has been generated by the experiment
unit or whether it should continue repeating the execution. At the
current time, it is implemented as a simple loop that runs the next
stage a fixed number of times (30 by default).
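.PP
A minimal sketch of this behavior in shell follows; the variable names
and the store path are illustrative, not taken from the implementation.
.DS I
.ft CW
#!/bin/sh
# Run the next stage a fixed number of times (30 by default).
runs=${runs:-30}
i=0
while test "$i" -lt "$runs"
do
    /nix/store/...-next-stage
    i=$((i + 1))
done
.ft
.DE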
.PP
The following stage is
.I srun ,
which launches several copies of the next stage to run in
parallel (when using more than one task). It runs one copy per task,
effectively creating one process per task. The CPU affinity is
configured by the
.I --cpu-bind
parameter, and it is important to set it correctly (see more details in
the
.I srun(1)
manual). Appending the
.I verbose
value to the
.I --cpu-bind
option causes srun to print the assigned affinity
of each task, which is very valuable when examining the execution log.
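.PP
For example, a hypothetical invocation binding each task to a core and
reporting the affinity may look as follows:
.DS I
.ft CW
# Launch one process per task; "verbose" prints the affinity
# assigned to each task in the execution log.
srun --cpu-bind=verbose,cores /nix/store/...-next-stage
.ft
.DE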
.PP
The mechanism by which srun executes multiple processes is the same
used by sbatch: it forks from a SLURM daemon running on the compute
nodes. Therefore, the execution begins outside the sandbox. The next
stage is
.I isolate ,
which enters the sandbox again in every task. All remaining stages
now run in parallel.
.\" ###################################################################
.NH 2
The program
.LP
At this point in the execution, the standard pipeline has been
completely executed, and we are ready to run the actual program that is
the subject of the experiment. Usually, programs require some arguments
to be passed on the command line. The
.I exec
stage sets the arguments (and optionally some environment variables) and
executes the last stage, the
.I program .
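.PP
A sketch of an exec stage follows; the program name, the arguments and
the environment variable are hypothetical examples.
.DS I
.ft CW
#!/bin/sh
# Set the environment and arguments for the program, then
# replace this process with the program itself.
export OMP_NUM_THREADS=48
exec /nix/store/...-app --size 1024
.ft
.DE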
2020-10-13 18:13:56 +08:00
.PP
The experimenters are required to define these last stages, as they
specify the particular way in which the program must be executed.
Additional stages may be included before or after the program run to
perform extra steps.
.\" ###################################################################
.NH 2
Stage overview
.LP
The complete execution pipeline built upon the standard pipeline is
shown in Table 1, along with some properties of the execution stages.
.KF
.TS
center;
lB cB cB cB cB cB
l c c c c c.
_
Stage Target Safe Copies User Std
_
trebuchet xeon no no yes yes
runexp login no no yes yes
isolate login no no no yes
experiment login yes no no yes
unit login yes no no yes
sbatch login yes no no yes
_
isolate comp no no no yes
control comp yes no no yes
srun comp yes no no yes
isolate comp no yes no yes
_
exec comp yes yes no no
program comp yes yes no no
_
.TE
.QS
.B "Table 1" :
The stages of a complete execution pipeline. The
.B target
column indicates where the stage runs,
.B safe
states whether the stage begins its execution inside the sandbox,
.B copies
whether several instances run in parallel,
.B user
whether the stage can be executed directly by the user, and
.B std
whether it is part of the standard execution pipeline.
.QE
.KE