From 0cc5fe92e5b8549fef8587d402b5b7b97ac56552 Mon Sep 17 00:00:00 2001
From: Rodrigo Arias Mallo
Date: Fri, 28 Aug 2020 20:01:58 +0200
Subject: [PATCH] Add documentation on sources of variability

---
 NOISE | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 NOISE

diff --git a/NOISE b/NOISE
new file mode 100644
index 0000000..913d33a
--- /dev/null
+++ b/NOISE
@@ -0,0 +1,85 @@
+
+                       Known sources of noise
+                          in MareNostrum 4
+
+
+ABSTRACT
+
+  The experiments run at MareNostrum 4 show that there are several
+  factors that can affect the execution time. Some may even become the
+  dominant part of the time, rendering the experiment invalid.
+
+  This document lists all known sources of variability and tries to
+  give an overview of how to detect and correct the problems.
+
+1. Notable sources of variability
+
+  All the sources listed here were found in the MareNostrum 4 cluster,
+  but they may apply to other machines as well. Some have a detection
+  mechanism, so their effect can be ruled out; others don't. Also,
+  some problems only occur with low probability. Example invocations
+  of the detection commands mentioned below are collected at the end
+  of section 1.4.
+
+  Other sources of variability with a small effect, say lower than 1%
+  of the mean time, are not listed here.
+
+1.1 The slurmstepd daemon eats sys CPU in a new thread
+
+  While a job is running, the slurmstepd process occasionally spawns a
+  thread that uses quite a lot of CPU for a period of about 10
+  seconds. This event happens from time to time with unknown
+  frequency. It was first observed in the nbody program, where it
+  almost doubled the time per iteration, since the other processes
+  wait for the one on the slowed-down CPU before continuing to the
+  next iteration. The SLURM version was 17.11.7 and the program was
+  executed with sbatch+srun. See the issue for more details:
+
+    https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19
+
+  It can be detected by looking at the cycles per us view with Extrae,
+  with the PAPI counters enabled: it shows a slowdown in one process
+  when the problem occurs. Also, perf-sched(1) can be used to trace
+  context switches to other programs, but it requires access to the
+  debugfs.
+
+1.2 MPICH uses ethernet rather than infiniband
+
+  Some MPI implementations (like MPICH) can silently use non-optimal
+  fabrics, like ethernet rather than infiniband, when they are
+  misconfigured.
+
+  It can be detected by running a latency benchmark such as the OSU
+  micro benchmarks, which should report a low latency when infiniband
+  is in use. It can also be checked with strace(1) to see which
+  network interface is being used.
+
+1.3 CPU binding
+
+  A thread may switch between CPUs while running, leading to a drop in
+  performance. To ensure that it remains on the same CPU, it can be
+  bound with srun(1) or sbatch(1) using the --cpu-bind option, or with
+  taskset(1).
+
+  It can be detected by running the program with Extrae and using the
+  General/view/executing_cpu.cfg configuration in Paraver. After
+  adjusting the scale, every process must have a different color (the
+  assigned CPU) and keep it constant; otherwise, CPU migrations are
+  happening.
+
+1.4 Libraries that use dlopen(3)
+
+  Some libraries or programs try to determine which components are
+  available in the system by looking for specific libraries in the
+  search path determined at run time.
+
+  This behavior can cause the execution time of a program to change
+  depending on environment variables like LD_LIBRARY_PATH.
+
+  It can be detected by setting LD_DEBUG=all (see ld.so(8)) or by
+  using strace(1) when running the program.
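+
+  As an illustration, the following commands sketch one way to invoke
+  the detection methods mentioned in sections 1.1 to 1.4. The program
+  name (./app), the task counts and the log file names are
+  placeholders and have to be adapted to the actual experiment:
+
+    # 1.1: trace context switches with perf-sched(1) (needs debugfs)
+    perf sched record -- ./app
+    perf sched latency
+
+    # 1.2: measure latency with the OSU micro benchmark and log the
+    # network-related system calls made by each rank
+    srun -n 2 ./osu_latency
+    srun -n 2 strace -f -e trace=network ./app
+
+    # 1.3: bind each task to its assigned cores
+    srun --cpu-bind=cores -n 2 ./app
+    taskset -c 0 ./app
+
+    # 1.4: log the libraries searched by the dynamic linker
+    LD_DEBUG=all ./app 2> ld-debug.log
+    strace -f -e trace=openat ./app 2> strace.log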
+
+1.5 Intel MPI library selection
+
+  The Intel MPI library has several variants which are loaded at run
+  time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
+  variable controls whether the multithread capabilities (the _mt
+  variants) are enabled or not.
+
+/* vim: set ts=2 sw=2 tw=72 fo=watqc expandtab spell autoindent: */
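+
+  As a sketch, one way to check which variant is actually loaded is to
+  inspect the resolved path of the MPI library. This assumes that each
+  variant lives in a directory named after it (release/, release_mt/,
+  ...), as in recent Intel MPI installations:
+
+    # print the libmpi.so that the dynamic linker resolves
+    ldd ./app | grep libmpi
+    # or log the library search while running a single task
+    srun -n 1 env LD_DEBUG=libs ./app 2> ld-debug.log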