Known sources of noise in MareNostrum 4

ABSTRACT

The experiments run on MareNostrum 4 show that several factors can
affect the execution time. Some may even become the dominant part of
the time, rendering the experiment invalid. This document lists all
known sources of variability and gives an overview of how to detect and
correct the problems.

1. Notable sources of variability

The sources listed here were generally found in the MareNostrum 4
cluster, but they may also apply to other machines. Some have a
detection mechanism, so their effect can be ruled out, but others do
not. Also, some problems only occur with low probability. Other sources
of variability with a low effect, say lower than 1% of the mean time,
are not listed here.

1.1 The daemon slurmstepd eats sys CPU in a new thread

While a job is running, the slurmstepd process creates a thread that
uses a lot of system CPU time for a period of about 10 seconds. This
event happens from time to time with unknown frequency. It was first
observed in the nbody program, where it almost doubles the time per
iteration, because the other processes wait for the one with the slow
CPU before continuing to the next iteration. The SLURM version was
17.11.7 and the program was executed with sbatch+srun. See the issue
for more details:

    https://pm.bsc.es/gitlab/rarias/bsc-nixpkgs/-/issues/19

It can be detected by looking at the cycles per us view of a trace
taken with Extrae, with the PAPI counters enabled. The view shows a
slowdown in one process when the problem occurs. Also, perf-sched(1)
can be used to trace context switches to other programs, but it
requires access to debugfs.

1.2 MPICH uses Ethernet rather than InfiniBand

Some MPI implementations (like MPICH) can silently use non-optimal
fabrics such as Ethernet rather than InfiniBand because they are
misconfigured. This can be detected by running a latency benchmark such
as the OSU micro-benchmarks, which should report a low latency when
InfiniBand is in use.
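
As an illustration of the latency check in section 1.2, the following
Python sketch parses the text printed by the osu_latency benchmark (two
columns, message size in bytes and latency in us, with header lines
starting with '#') and flags a run whose smallest-message latency looks
too high. The function names and the 5 us threshold are assumptions
made for this example, not values measured on MareNostrum 4; the
threshold should be calibrated against a known good run.

```python
def parse_osu_latency(output):
    """Return {message_size_bytes: latency_us} from osu_latency output."""
    latencies = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines of the benchmark.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            size, latency = int(fields[0]), float(fields[1])
        except ValueError:
            continue
        latencies[size] = latency
    return latencies


def looks_like_slow_fabric(output, max_small_latency_us=5.0):
    """True if the smallest message size exceeds the latency threshold.

    The default threshold is a hypothetical value, not a measured one.
    """
    latencies = parse_osu_latency(output)
    if not latencies:
        raise ValueError("no latency samples found in the output")
    smallest_size = min(latencies)
    return latencies[smallest_size] > max_small_latency_us
```

The check could be run on the saved output of the benchmark before each
experiment, so a misconfigured fabric is noticed before any timing is
recorded.
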
It can also be verified by using strace(1) to check which network card
is being used.

1.3 CPU binding

A thread may switch between CPUs when running, leading to a drop in
performance. To ensure that it remains on the same CPU, it can be bound
with srun(1) or sbatch(1) using the --cpu-bind option, or using
taskset(1).

It can be detected by running the program with Extrae and using the
General/view/executing_cpu.cfg configuration in Paraver. After
adjusting the scale, every process must have a color different from the
others (the assigned CPU) and keep it constant. Otherwise, processes
are migrating between CPUs.

1.4 Libraries that use dlopen(3)

Some libraries or programs try to determine which components are
available in a system by looking for specific libraries in the search
path determined at runtime. This behavior can make the execution time
of a program depend on environment variables such as LD_LIBRARY_PATH.

It can be detected by setting LD_DEBUG=all (see ld.so(8)) or using
strace(1) when running the program.

1.5 Intel MPI library selection

The Intel MPI library has several variants which are loaded at run
time: debug, release, debug_mt and release_mt. The I_MPI_THREAD_SPLIT
variable controls whether the multithread capabilities are enabled or
not.

1.6 LLVM and OpenMP problem

The LLVM OpenMP implementation is installed in libomp.so, however two
symbolic links are created for libgomp.so and libiomp5.so:

    libgomp.so -> libomp.so
    libiomp5.so -> libomp.so
    libomp.so

So applications compiled with OpenMP by other compilers may end up
using the LLVM implementation. This can be observed by setting
LD_DEBUG=all or using strace(1) and looking for the libomp.so library
being loaded.

In bscpkgs the symbolic links have been removed for the clangOmpss2
compiler.

1.7 Nix-shell does not allow isolation

Nix-shell is not isolated, so the compilation process may try to use
headers and libraries from /usr. This can cause compilation errors that
do not happen inside nix-build.
Do not use nix-shell to ensure reproducibility.

1.8 Make doesn't rebuild objects

When using a local repository as the source code (e.g. with developer
mode on), a make clean at the preBuild stage is required. Nix sets the
same modification date, one second after the Epoch (1970-01-01 at
00:00:01 UTC), on all the files in the nix store, including those
copied from repositories. Make checks the modification date of the
files to decide whether to run the compilation commands. If any object
or binary file exists outside Nix at the time we build within Nix, it
is copied together with the sources and, because all the files share
the same modification date, it is not updated during the Nix
compilation process.

1.9 Sbatch silently fails on parsing

When submitting a job with a wrong specification in MN4 with SLURM
17.11.9-2, for example this bogus line:

    #SBATCH --nodes=1 2

sbatch silently fails to parse the options and falls back to the
defaults without reporting any error. We have improved our checking to
detect bogus options passed to SLURM, so we prevent this problem from
happening.

1.10 The srun program misses signals after MPI_Finalize

When a program receives a signal such as SIGSEGV after calling
MPI_Finalize, srun at version 17.11.7 doesn't return an error code but
exits with 0. This can cause faulty programs to go undetected when only
the return code of srun is checked. A better approach is to check the
exit code with sacct(1), or to write the exit code to a file and check
it later.
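
The last suggestion in section 1.10, writing the exit code to a file
and checking it later, could be sketched in Python as follows. This is
a minimal illustration, not the mechanism used in our experiments: the
function names and the file layout are assumptions. The idea is that
srun launches a small wrapper instead of the program itself, and the
wrapper records the status of the real program independently of what
srun reports.

```python
import subprocess


def run_and_record(cmd, exit_code_file):
    """Run cmd, write its exit status to exit_code_file and return it.

    On POSIX systems subprocess reports a child killed by signal N as
    the negative value -N (for example -11 for SIGSEGV).
    """
    result = subprocess.run(cmd)
    with open(exit_code_file, "w") as f:
        f.write(f"{result.returncode}\n")
    return result.returncode


def read_exit_code(exit_code_file):
    """Read back a status previously written by run_and_record."""
    with open(exit_code_file) as f:
        return int(f.read().strip())


def all_succeeded(exit_code_files):
    """True only if every recorded status is exactly 0."""
    return all(read_exit_code(path) == 0 for path in exit_code_files)
```

With several MPI ranks, each process needs its own file; one option is
to include the rank that SLURM exports in SLURM_PROCID in the file
name. After the job ends, all_succeeded() can be run over the collected
files, and any non-zero or negative value marks the run as invalid.
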