Known sources of noise
in MareNostrum 4
ABSTRACT
The experiments run at MareNostrum 4 show that there are several
factors that can affect the execution time. Some may even become the
dominant part of the time, rendering the experiment invalid.
This document lists all known sources of variability and tries to
give an overview of how to detect and correct the problems.
1. Notable sources of variability
All of these sources were found in the MareNostrum 4 cluster, but
they may apply to other machines as well. Some have a detection
mechanism so their effect can be neglected, but others don't. Also,
some problems only occur with low probability.
Other sources of variability with a low effect, say lower than 1% of
the mean time, are not listed here.
1.1 The daemon slurmstepd eats sys CPU in a new thread
For a period of about 10 seconds, a thread created by the slurmstepd
process while a job is running uses quite a lot of sys CPU. This
event happens from time to time with unknown frequency. It was first
observed in the nbody program, where it almost doubles the time per
iteration, since the other processes wait for the one with the slowed
CPU before continuing to the next iteration. The SLURM version was
17.11.7 and the program was executed with sbatch+srun. See the issue
for more details:
https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19
It can be detected by looking at the cycles per us view with Extrae,
with the PAPI counters enabled. It shows a slowdown in one process
when the problem occurs. Also, perf-sched(1) can be used to trace
context switches to other processes, but it requires access to
debugfs.
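A minimal sketch of the perf-sched(1) approach, assuming access to
debugfs (usually root) and a sampling window long enough to catch the
roughly 10-second event:
  # Record scheduler events system-wide while the job is running.
  perf sched record -- sleep 15
  # Report per-task scheduling latencies; look for slurmstepd
  # stealing time from the application ranks.
  perf sched latency -s max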
1.2 MPICH uses ethernet rather than infiniband
Some MPI implementations (like MPICH) can silently use a non-optimal
fabric such as Ethernet rather than InfiniBand because they are
misconfigured.
It can be detected by running latency benchmarks like the OSU micro
benchmarks, which should report the low latency expected from
InfiniBand. It can also be detected by using strace(1) to check which
network card is being used.
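A hedged example of both checks; the osu_latency binary path and the
expected numbers are assumptions about the local installation:
  # Point-to-point latency between two nodes; over InfiniBand the
  # small-message latency is typically a few microseconds, over
  # Ethernet it is usually an order of magnitude higher.
  srun -N 2 -n 2 ./osu_latency
  # Check which devices are opened; an InfiniBand run should open
  # /dev/infiniband/* rather than plain TCP sockets.
  strace -f -e trace=openat,socket ./app 2>&1 | grep infiniband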
1.3 CPU binding
A thread may switch between CPUs while running, leading to a drop in
performance. To ensure that it remains on the same CPU, it can be
bound with srun(1) or sbatch(1) using the --cpu-bind option, or with
taskset(1).
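For example, a hedged sketch; the binding granularity and the core
list are assumptions to be adapted to the experiment:
  # Pin each task to its own set of cores via SLURM.
  srun --cpu-bind=cores ./app
  # Or pin a single process to cores 0-3 with taskset.
  taskset -c 0-3 ./app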
It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, all processes must have a different color from
each other (the assigned CPU) and keep it constant; otherwise, the
threads are migrating between CPUs.
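Outside Paraver, a quick alternative (not mentioned above) is to
sample the CPU each thread last ran on; the interval and the process
name are placeholders:
  # PSR is the processor the thread last ran on; it should stay
  # constant for every thread when the binding is effective.
  watch -n 1 'ps -L -o pid,tid,psr,comm -C app'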
1.4 Libraries that use dlopen(3)
Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime.
This behavior can cause the execution time of a program to change
depending on environment variables such as LD_LIBRARY_PATH.
It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.
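A hedged sketch of both detection methods, where ./app is a
placeholder for the program under study:
  # Log every library lookup done by the dynamic loader.
  LD_DEBUG=all ./app 2> ld-debug.log
  # Or watch which shared objects are opened at run time.
  strace -f -e trace=openat ./app 2>&1 | grep '\.so'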
1.5 Intel MPI library selection
The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
variable controls whether the multithreaded capabilities are enabled
or not.
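A hedged way to check which variant is actually resolved, assuming
the Intel MPI installation keeps each variant in its own directory
(e.g. release/ vs release_mt/):
  # The resolved path of libmpi.so reveals the selected variant.
  ldd ./app | grep libmpi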
1.6 LLVM and OpenMP problem
The LLVM OpenMP implementation is installed as libomp.so; however,
two symbolic links are created for libgomp.so and libiomp5.so:
libgomp.so -> libomp.so
libiomp5.so -> libomp.so
libomp.so
So applications compiled with OpenMP by other compilers may end up
using the LLVM implementation. This can be observed by setting
LD_DEBUG=all or using strace(1) and looking for the libomp.so library
being loaded.
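For instance, a hedged one-liner using the libs subset of LD_DEBUG,
where ./app is a placeholder:
  # A GCC-built binary should resolve libgomp.so to the real GNU
  # runtime; if it points to libomp.so, LLVM's runtime is in use.
  LD_DEBUG=libs ./app 2>&1 | grep -E 'libgomp|libiomp5|libomp'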
In bscpkgs the symbolic links have been removed for the clangOmpss2
compiler.
/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */