Add documentation for smpi/shared-malloc-hugepage and SMPI_PARTIAL_SHARED_MALLOC.

diff --git a/doc/doxygen/options.doc b/doc/doxygen/options.doc
index 5fc516d..4c5af4b 100644
--- a/doc/doxygen/options.doc
+++ b/doc/doxygen/options.doc
@@ -202,9 +202,17 @@ price of a reduced numerical precision.
  \subsection options_concurrency_limit Concurrency limit
  
  The maximum number of variables per resource can be tuned through
-the \b maxmin/concurrency_limit item (default value: -1, meaning no such limitation). 
-Setting a higher value can lift some limitations, such as the number of 
-concurrent processes running on a single host.
+the \b maxmin/concurrency-limit item. The default value is -1, meaning that
+there is no such limitation: you can have as many simultaneous actions per
+resource as you want. If your simulation presents a very high level of
+concurrency, it may help to use e.g. 100 as a value here. It means that at
+most 100 actions can consume a resource at a given time. The extraneous actions
+are queued and wait until the amount of concurrency of the considered resource
+drops below the given bound.
+
+Such limitations help both simulation speed and simulation accuracy
+on highly constrained scenarios, but simulation speed suffers from this
+setting on regular (less constrained) scenarios, so it is off by default.
  
  \subsection options_model_network Configuring the Network model
  
@@ -374,7 +382,7 @@ For now, this configuration variable can take 2 values:
   * none: Do not apply any kind of reduction (mandatory for now for
     liveness properties)
   * dpor: Apply Dynamic Partial Ordering Reduction. Only valid if you
-   verify local safety properties.
+   verify local safety properties (default value for safety checks).
  
  \subsection options_modelchecking_visited model-check/visited, Cycle detection
  
@@ -842,6 +850,16 @@ to 1, \c smpirun will display this information when the simulation ends. \verbat
  Simulation time: 1e3 seconds.
  \endverbatim
  
+\subsection options_smpi_temps smpi/keep-temps: not cleaning up after simulation
+
+\b Default: 0 (false)
+
+Under some conditions, SMPI generates a lot of temporary files.  They
+usually get cleaned up, but you may use this option to keep these
+files. This is useful, for example, when debugging or profiling
+executions using the dlopen privatization scheme, as missing binary
+files tend to fool the debuggers.
+
  \subsection options_model_smpi_lat_factor smpi/lat-factor: Latency factors
  
  The motivation and syntax for this option is identical to the motivation/syntax
@@ -887,40 +905,56 @@ of counters, the "default" set.
  --cfg=smpi/papi-events:"default:PAPI_L3_LDM:PAPI_L2_LDM"
  \endverbatim
  
-\subsection options_smpi_global smpi/privatize-global-variables: Automatic privatization of global variables
+\subsection options_smpi_privatization smpi/privatization: Automatic privatization of global variables
  
-MPI executables are meant to be executed in separated processes, but SMPI is
+MPI executables are usually meant to be executed in separate processes, but SMPI is
  executed in only one process. Global variables from executables will be placed
-in the same memory zone and shared between processes, causing hard to find bugs.
-To avoid this, several options are possible :
-  - Manual edition of the code, for example to add __thread keyword before data
-  declaration, which allows the resulting code to work with SMPI, but only
-  if the thread factory (see \ref options_virt_factory) is used, as global
-  variables are then placed in the TLS (thread local storage) segment.
-  - Source-to-source transformation, to add a level of indirection
-  to the global variables. SMPI does this for F77 codes compiled with smpiff,
-  and used to provide coccinelle scripts for C codes, which are not functional anymore.
-  - Compilation pass, to have the compiler automatically put the data in
-  an adapted zone.
-  - Runtime automatic switching of the data segments. SMPI stores a copy of
-  each global data segment for each process, and at each context switch replaces
-  the actual data with its copy from the right process. This mechanism uses mmap,
-  and is for now limited to systems supporting this functionnality (all Linux
-  and some BSD should be compatible).
-  Another limitation is that SMPI only accounts for global variables defined in
-  the executable. If the processes use external global variables from dynamic
-  libraries, they won't be switched correctly. To avoid this, using static
-  linking is advised (but not with the simgrid library, to avoid replicating
-  its own global variables).
-
-  To use this runtime automatic switching, the variable \b smpi/privatize-global-variables
-  should be set to yes
+in the same memory zone and shared between processes, causing intricate bugs.
+Several options are possible to avoid this, as described in the main
+<a href="https://hal.inria.fr/hal-01415484">SMPI publication</a>.
+SimGrid provides two ways of automatically privatizing the globals,
+and this option lets you choose between them.
+
+  - <b>no</b> (default): Do not automatically privatize variables.
+  - <b>mmap</b> or <b>yes</b>: Runtime automatic switching of the data segments.\n
+    SMPI stores a copy of each global data segment for each process,
+    and at each context switch replaces the actual data with its copy
+    from the right process. No copy actually occurs as this mechanism
+    uses mmap for efficiency. As such, it is for now limited to
+    systems supporting this functionality (all Linux and most BSD).\n
+    Another limitation is that SMPI only accounts for global variables
+    defined in the executable. If the processes use external global
+    variables from dynamic libraries, they won't be switched
+    correctly. The easiest way to solve this is to statically link
+    against the library with these globals (but you should never
+    statically link against the simgrid library itself).
+  - <b>dlopen</b>: Link multiple times against the binary.\n  
+    SMPI loads several copies of the same binary in memory, resulting in
+    the natural duplication of global variables. Since the dynamic linker
+    refuses to link the same file several times, the binary is copied
+    in a temporary file before being dl-loaded (it is erased right
+    after loading).\n
+    Note that this feature is somewhat experimental at time of writing
+    (v3.16) but seems to work.\n
+    This approach greatly speeds up the context switching, down to
+    about 40 CPU cycles with our raw contexts, instead of requiring
+    several syscalls with the \c mmap approach. Another advantage is
+    that it allows running the SMPI contexts in parallel, which is
+    obviously not possible with the \c mmap approach.\n
+    Further work may be possible to alleviate the memory and disk
+    overconsumption. It seems that we could 
+    <a href="https://lwn.net/Articles/415889/">punch holes</a>
+    in the files before dl-loading them to remove the code and
+    constants, and mmap these areas onto a unique copy. This requires
+    understanding the ELF layout of the file, but would 
+    reduce the disk and memory usage to the bare minimum. In
+    addition, this would reduce the pressure on the CPU caches (in
+    particular on the instruction cache).
  
  \warning
    This configuration option cannot be set in your platform file. You can only
    pass it as an argument to smpirun.
  
-
  \subsection options_model_smpi_detached Simulating MPI detached send
  
  This threshold specifies the size in bytes under which the send will return
@@ -944,6 +978,18 @@ uses naive version of collective operations). Each collective operation can be m
  The behavior and motivation for this configuration option is identical with \a smpi/test, see
  Section \ref options_model_smpi_test for details.
  
+\subsection options_model_smpi_iprobe_cpu_usage smpi/iprobe-cpu-usage: Reduce speed for iprobe calls
+
+\b Default value: 1 (no change from default behavior)
+
+MPI_Iprobe calls can be heavily used in applications. To account correctly for the energy
+cores spend probing, it is necessary to reduce the load that these calls cause inside
+SimGrid.
+
+For instance, we measured a max power consumption of 220 W for a particular application but 
+only 180 W while this application was probing. Hence, the correct factor that should
+be passed to this option would be 180/220, i.e. approximately 0.82.
+
  \subsection options_model_smpi_init smpi/init: Inject constant times for calls to MPI_Init
  
  \b Default value: 0
@@ -1036,14 +1082,67 @@ Here is an example:
      also disable this behavior for MPI_Iprobe.
  
  
-\subsection options_model_smpi_use_shared_malloc smpi/use-shared-malloc: Factorize malloc()s
+\subsection options_model_smpi_shared_malloc smpi/shared-malloc: Factorize malloc()s
+
+\b Default: global
+
+If your simulation consumes too much memory, you may want to modify
+your code so that the working areas are shared by all MPI ranks. For
+example, in a block-cyclic matrix multiplication, you will only
+allocate one set of blocks, and all processes will share them.
+Naturally, this will lead to very wrong results, but this will save a
+lot of memory so this is still desirable for some studies. For more on
+the motivation for that feature, please refer to the 
+<a href="https://simgrid.github.io/SMPI_CourseWare/topic_understanding_performance/matrixmultiplication/">relevant
+section</a> of the SMPI CourseWare (see Activity #2.2 of the pointed
+assignment). In practice, replace the calls to malloc() and free() with
+SMPI_SHARED_MALLOC() and SMPI_SHARED_FREE().
  
-\b Default: 1
+SMPI provides 2 algorithms for this feature. The first one, called \c
+local, allocates one block per call to SMPI_SHARED_MALLOC() in your
+code (each call location gets its own block) and this block is shared
+amongst all MPI ranks.  This is implemented with the shm_* functions
+to create a new POSIX shared memory object (kept in RAM, in /dev/shm)
+for each shared block.
  
-SMPI can use shared memory by calling shm_* functions; this might speed up the simulation.
-This opens or creates a new POSIX shared memory object, kept in RAM, in /dev/shm.
+With the \c global algorithm, each call to SMPI_SHARED_MALLOC()
+returns a new address, but it only points to a shadow block: its memory
+area is mapped on a 1 MiB file on disk. If the returned block is of size
+N MiB, then the same file is mapped N times to cover the whole block. 
+At the end, no matter how many SMPI_SHARED_MALLOC you do, this will
+only consume 1 MiB in memory.
+
+You can disable this behavior and come back to regular mallocs (for
+example for debugging purposes) using \c "no" as a value.
+
+If you want to keep some parts of the buffer private, for instance if these
+parts are used by the application logic and should not be corrupted, you
+can use SMPI_PARTIAL_SHARED_MALLOC(size, offsets, offsets_count).
+
+As an example,
+
+\code{.unparsed}
+    mem = SMPI_PARTIAL_SHARED_MALLOC(500, {27,42 , 100,200}, 2);
+\endcode
+
+will allocate 500 bytes to mem, such that mem[27..41] and mem[100..199]
+are shared and the other areas remain private.
+
+Then, it can be deallocated by calling SMPI_SHARED_FREE(mem).
+
+When smpi/shared-malloc:global is used, it is possible to reduce the
+memory consumption and the simulation time even further by using huge pages.
+To do so, it is required to mount a hugetlbfs on your system and allocate
+at least one huge page:
+
+\code{.unparsed}
+    mkdir /home/huge
+    sudo mount none /home/huge -t hugetlbfs -o rw,mode=0777
+    sudo sh -c 'echo 1 > /proc/sys/vm/nr_hugepages'
+\endcode
  
-If you want to disable this behavior, set the value to 0.
+Then, you can pass the option --cfg=smpi/shared-malloc-hugepage:/home/huge
+to smpirun.
  
  \subsection options_model_smpi_wtime smpi/wtime: Inject constant times for calls to MPI_Wtime
  
@@ -1143,6 +1242,7 @@ It can be done by using XBT. Go to \ref XBT_log for more details.
  - \c host/model: \ref options_model_select
  
  - \c maxmin/precision: \ref options_model_precision
+- \c maxmin/concurrency-limit: \ref options_concurrency_limit
  
  - \c msg/debug-multiple-use: \ref options_msg_debug_multiple_use
  
@@ -1191,17 +1291,19 @@ It can be done by using XBT. Go to \ref XBT_log for more details.
  - \c smpi/host-speed: \ref options_smpi_bench
  - \c smpi/IB-penalty-factors: \ref options_model_network_coefs
  - \c smpi/iprobe: \ref options_model_smpi_iprobe
+- \c smpi/iprobe-cpu-usage: \ref options_model_smpi_iprobe_cpu_usage
  - \c smpi/init: \ref options_model_smpi_init
+- \c smpi/keep-temps: \ref options_smpi_temps
  - \c smpi/lat-factor: \ref options_model_smpi_lat_factor
  - \c smpi/ois: \ref options_model_smpi_ois
  - \c smpi/or: \ref options_model_smpi_or
  - \c smpi/os: \ref options_model_smpi_os
  - \c smpi/papi-events: \ref options_smpi_papi_events
-- \c smpi/privatize-global-variables: \ref options_smpi_global
+- \c smpi/privatization: \ref options_smpi_privatization
  - \c smpi/send-is-detached-thresh: \ref options_model_smpi_detached
+- \c smpi/shared-malloc: \ref options_model_smpi_shared_malloc
  - \c smpi/simulate-computation: \ref options_smpi_bench
  - \c smpi/test: \ref options_model_smpi_test
-- \c smpi/use-shared-malloc: \ref options_model_smpi_use_shared_malloc
  - \c smpi/wtime: \ref options_model_smpi_wtime
  
  - \c <b>Tracing configuration options can be found in Section \ref tracing_tracing_options</b>.
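
Note on the smpi/shared-malloc hunk: the \c global algorithm it documents (one 1 MiB file mapped N times to cover an N MiB block) can be illustrated with the following minimal sketch. It is not SimGrid's implementation. The names alias_alloc, alias_free and CHUNK_SIZE and the temporary file path are hypothetical and chosen only for illustration; the sketch assumes a POSIX system where MAP_ANONYMOUS and MAP_FIXED are available (e.g. Linux).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK_SIZE ((size_t)1 << 20) /* 1 MiB backing file, as in the text */

/* Hypothetical helper (not a SimGrid function): returns a buffer of
 * n_chunks MiB in which every 1 MiB window aliases the same file. */
static void *alias_alloc(size_t n_chunks)
{
    assert(n_chunks > 0);

    char path[] = "/tmp/shared-sketch-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path); /* the file disappears once the last mapping is gone */
    if (ftruncate(fd, (off_t)CHUNK_SIZE) != 0) {
        close(fd);
        return NULL;
    }

    /* Reserve a contiguous range of virtual addresses first. */
    char *base = mmap(NULL, n_chunks * CHUNK_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Map the same 1 MiB file over each window of the reserved range. */
    for (size_t i = 0; i < n_chunks; i++) {
        void *p = mmap(base + i * CHUNK_SIZE, CHUNK_SIZE,
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            munmap(base, n_chunks * CHUNK_SIZE);
            close(fd);
            return NULL;
        }
    }
    close(fd); /* the mappings keep the file alive */
    return base;
}

/* Releases a buffer obtained from alias_alloc() with the same n_chunks. */
static void alias_free(void *buf, size_t n_chunks)
{
    munmap(buf, n_chunks * CHUNK_SIZE);
}
```

A write at offset k in such a buffer is visible at offset k + CHUNK_SIZE, k + 2 * CHUNK_SIZE and so on, which is why results computed in shared blocks are wrong while resident memory stays bounded by one chunk.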