X-Git-Url: http://info.iut-bm.univ-fcomte.fr/pub/gitweb/simgrid.git/blobdiff_plain/a4ab179f11ae2afec1467c4ca9256fc5ba6fa85b..999c6ca0248ff66351ef2ebd0901622384212bc6:/doc/doxygen/options.doc

diff --git a/doc/doxygen/options.doc b/doc/doxygen/options.doc
index 596c5fa761..3fad32cb85 100644
--- a/doc/doxygen/options.doc
+++ b/doc/doxygen/options.doc
@@ -201,8 +201,18 @@ price of a reduced numerical precision.
 
 \subsection options_concurrency_limit Concurrency limit
 
-The maximum number of variables in a system can be tuned through
-the \b maxmin/concurrency_limit item (default value: 100). Setting a higher value can lift some limitations, such as the number of concurrent processes running on a single host.
+The maximum number of variables per resource can be tuned through
+the \b maxmin/concurrency-limit item. The default value is -1, meaning that
+there is no such limitation: you can have as many simultaneous actions per
+resource as you want. If your simulation presents a very high level of
+concurrency, it may help to use e.g. 100 as a value here. It means that at
+most 100 actions can consume a resource at any given time. The extra actions
+are queued and wait until the concurrency of the considered resource drops
+below the given bound.
+
+Such a limitation helps both simulation speed and simulation accuracy on
+highly constrained scenarios, but it hurts simulation speed on regular
+(less constrained) scenarios, so it is off by default.
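+
+For instance, to cap each resource at 100 concurrent actions as discussed
+above, the option can be passed on the command line of your simulator (the
+program and file names below are only placeholders):
+
+\verbatim
+./my_simulator platform.xml deployment.xml --cfg=maxmin/concurrency-limit:100
+\endverbatim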
 
 \subsection options_model_network Configuring the Network model
 
@@ -372,7 +382,7 @@ For now, this configuration variable can take 2 values:
  * none: Do not apply any kind of reduction (mandatory for now for
    liveness properties)
  * dpor: Apply Dynamic Partial Ordering Reduction. Only valid if you
-   verify local safety properties.
+   verify local safety properties (default value for safety checks).
 
 \subsection options_modelchecking_visited model-check/visited, Cycle detection
 
@@ -840,6 +850,16 @@ to 1, \c smpirun will display this information when the simulation ends. \verbatim
 Simulation time: 1e3 seconds.
 \endverbatim
 
+\subsection options_smpi_temps smpi/keep-temps: Not cleaning up after the simulation
+
+\b Default: 0 (false)
+
+Under some conditions, SMPI generates a lot of temporary files. They
+usually get cleaned up, but you may use this option to keep them around
+instead. This is for example useful when debugging or profiling
+executions that use the dlopen privatization scheme, as missing binary
+files tend to fool the debuggers.
+
 \subsection options_model_smpi_lat_factor smpi/lat-factor: Latency factors
 
 The motivation and syntax for this option is identical to the motivation/syntax
@@ -885,40 +905,56 @@ of counters, the "default" set.
    --cfg=smpi/papi-events:"default:PAPI_L3_LDM:PAPI_L2_LDM"
 \endverbatim
 
-\subsection options_smpi_global smpi/privatize-global-variables: Automatic privatization of global variables
+\subsection options_smpi_privatization smpi/privatization: Automatic privatization of global variables
 
-MPI executables are meant to be executed in separated processes, but SMPI is
+MPI executables are usually meant to be executed in separate processes, but SMPI is
 executed in only one process. Global variables from executables will be placed
-in the same memory zone and shared between processes, causing hard to find bugs.
-
-To avoid this, several options are possible :
- - Manual edition of the code, for example to add __thread keyword before data
-   declaration, which allows the resulting code to work with SMPI, but only
-   if the thread factory (see \ref options_virt_factory) is used, as global
-   variables are then placed in the TLS (thread local storage) segment.
- - Source-to-source transformation, to add a level of indirection
-   to the global variables. SMPI does this for F77 codes compiled with smpiff,
-   and used to provide coccinelle scripts for C codes, which are not functional anymore.
- - Compilation pass, to have the compiler automatically put the data in
-   an adapted zone.
- - Runtime automatic switching of the data segments. SMPI stores a copy of
-   each global data segment for each process, and at each context switch replaces
-   the actual data with its copy from the right process. This mechanism uses mmap,
-   and is for now limited to systems supporting this functionnality (all Linux
-   and some BSD should be compatible).
-   Another limitation is that SMPI only accounts for global variables defined in
-   the executable. If the processes use external global variables from dynamic
-   libraries, they won't be switched correctly. To avoid this, using static
-   linking is advised (but not with the simgrid library, to avoid replicating
-   its own global variables).
-
-   To use this runtime automatic switching, the variable \b smpi/privatize-global-variables
-   should be set to yes
+in the same memory zone and shared between processes, causing intricate, hard-to-find bugs.
+Several ways of avoiding this are described in the main SMPI publication.
+SimGrid provides two mechanisms to automatically privatize the globals,
+and this option lets you choose between them (a command-line example is
+given after the list).
+
+ - no (default): Do not automatically privatize variables.
+ - mmap or yes: Runtime automatic switching of the data segments.\n
+   SMPI stores a copy of each global data segment for each process,
+   and at each context switch replaces the actual data with the copy
+   of the right process. No data is actually copied, as this mechanism
+   relies on mmap for efficiency. As such, it is for now limited to
+   systems supporting this functionality (all Linux and most BSD systems).\n
+   Another limitation is that SMPI only accounts for global variables
+   defined in the executable. If the processes use external global
+   variables from dynamic libraries, these won't be switched
+   correctly. The easiest way to solve this is to link statically
+   against the libraries defining these globals (but you should never
+   link statically against the simgrid library itself).
+ - dlopen: Link multiple times against the binary.\n
+   SMPI loads several copies of the same binary in memory, which
+   naturally duplicates the global variables. Since the dynamic linker
+   refuses to link the same file several times, the binary is copied
+   to a temporary file before being dl-loaded (the copy is erased right
+   after loading).\n
+   Note that this feature is somewhat experimental at the time of
+   writing (v3.16), but it seems to work.\n
+   This approach greatly speeds up the context switching, down to
+   about 40 CPU cycles with our raw contexts, instead of the several
+   system calls required by the \c mmap approach. Another advantage is
+   that it permits running the SMPI contexts in parallel, which is
+   obviously not possible with the \c mmap approach.\n
+   Further work may be possible to alleviate the memory and disk
+   overconsumption. It seems that we could punch holes in the files
+   before dl-loading them to remove the code and constants, and mmap
+   these areas onto a unique copy. This requires understanding the ELF
+   layout of the file, but it would reduce the disk and memory usage
+   to the bare minimum. In addition, it would reduce the pressure on
+   the CPU caches (in particular on the instruction cache).
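+
+For instance, the privatization scheme can be selected directly on the
+\c smpirun command line (the application and platform names below are
+only placeholders):
+
+\verbatim
+smpirun --cfg=smpi/privatization:dlopen -np 4 -platform platform.xml ./my_mpi_app
+\endverbatim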
 
 \warning This configuration option cannot be set in your platform
 file. You can only pass it as an argument to smpirun.
 
-
 \subsection options_model_smpi_detached Simulating MPI detached send
 
 This threshold specifies the size in bytes under which the send will return
@@ -942,6 +978,18 @@ uses naive version of collective operations). Each collective operation can be m
 The behavior and motivation for this configuration option is identical with
 \a smpi/test, see Section \ref options_model_smpi_test for details.
 
+\subsection options_model_smpi_iprobe_cpu_usage smpi/iprobe-cpu-usage: Reduce speed for iprobe calls
+
+\b Default value: 1 (the load of iprobe calls is not reduced)
+
+MPI_Iprobe calls can be heavily used in applications. To account correctly
+for the energy that cores spend while probing, it is necessary to reduce the
+load that these calls cause inside SimGrid.
+
+For instance, we measured a maximum power consumption of 220 W for a particular
+application, but only 180 W while this application was probing. Hence, the
+correct factor that should be passed to this option would be 180/220 = 0.82.
+
 \subsection options_model_smpi_init smpi/init: Inject constant times for calls to MPI_Init
 
 \b Default value: 0
@@ -1034,14 +1082,69 @@ Here is an example:
 also disable this behavior for MPI_Iprobe.
 
 
-\subsection options_model_smpi_use_shared_malloc smpi/use-shared-malloc: Factorize malloc()s
+\subsection options_model_smpi_shared_malloc smpi/shared-malloc: Factorize malloc()s
+
+\b Default: global
 
-\b Default: 1
+If your simulation consumes too much memory, you may want to modify
+your code so that the working areas are shared by all MPI ranks. For
+example, in a block-cyclic matrix multiplication, you will only
+allocate one set of blocks, and every process will share them.
+Naturally, this leads to completely wrong results, but it saves a
+lot of memory, so it is still desirable for some studies. For more on
+the motivation for this feature, please refer to the relevant section
+of the SMPI CourseWare (see Activity #2.2 of the pointed assignment).
+In practice, change the calls to malloc() and free() into
+SMPI_SHARED_MALLOC() and SMPI_SHARED_FREE(), as shown in the sketch below.
 
-SMPI can use shared memory by calling shm_* functions; this might speed up the simulation.
-This opens or creates a new POSIX shared memory object, kept in RAM, in /dev/shm.
+SMPI provides two algorithms for this feature. The first one, called
+\c local, allocates one block per call to SMPI_SHARED_MALLOC() in your
+code (each call location gets its own block), and this block is shared
+amongst all MPI ranks. This is implemented with the shm_* functions,
+which create a new POSIX shared memory object (kept in RAM, in /dev/shm)
+for each shared block.
+
+With the \c global algorithm, each call to SMPI_SHARED_MALLOC()
+returns a new address, but it only points to a shadow block: its memory
+area is mapped onto a 1 MiB file on disk. If the returned block is of size
+N MiB, then the same file is mapped N times to cover the whole block.
+In the end, no matter how many calls to SMPI_SHARED_MALLOC() you make,
+this will only consume 1 MiB of memory.
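+
+As a minimal sketch (the buffer name and size are only illustrative),
+replacing a regular allocation then looks like this:
+
+\code{.C}
+double *buf = SMPI_SHARED_MALLOC(1000 * sizeof(double)); /* shared among all MPI ranks */
+/* ... use buf as scratch space whose actual content does not matter ... */
+SMPI_SHARED_FREE(buf);                                   /* instead of free(buf) */
+\endcode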
+
+You can disable this behavior and fall back to regular mallocs (for
+example, for debugging purposes) by using \c "no" as a value.
+
+If you want to keep some parts of the buffer private, for instance if these
+parts are used by the application logic and should not be corrupted, you
+can use SMPI_PARTIAL_SHARED_MALLOC(size, offsets, offsets_count).
+
+As an example,
+
+\code{.C}
+    size_t offsets[] = {27, 42, 100, 200};
+    char *mem = SMPI_PARTIAL_SHARED_MALLOC(500, offsets, 2);
+\endcode
+
+allocates 500 bytes to mem, such that mem[27..41] and mem[100..199]
+are shared while the other areas remain private.
+
+The buffer can then be deallocated by calling SMPI_SHARED_FREE(mem).
+
+When smpi/shared-malloc:global is used, the memory consumption problem
+is solved, but it may induce too much load on the kernel's page table.
+In this case, you should use huge pages so that only one page table entry
+is created per MiB of malloc'ed data instead of one entry per 4 KiB page.
+To activate this, you must mount a hugetlbfs on your system and allocate
+at least one huge page:
+
+\code{.sh}
+    mkdir /home/huge
+    sudo mount none /home/huge -t hugetlbfs -o rw,mode=0777
+    sudo sh -c 'echo 1 > /proc/sys/vm/nr_hugepages' # echo more if you need more
+\endcode
 
-If you want to disable this behavior, set the value to 0.
+Then, you can pass the option --cfg=smpi/shared-malloc-hugepage:/home/huge
+to smpirun to actually activate the huge page support in shared mallocs.
 
 \subsection options_model_smpi_wtime smpi/wtime: Inject constant times for calls to MPI_Wtime
 
@@ -1141,6 +1244,7 @@ It can be done by using XBT. Go to \ref XBT_log for more details.
 
 - \c host/model: \ref options_model_select
 
 - \c maxmin/precision: \ref options_model_precision
+- \c maxmin/concurrency-limit: \ref options_concurrency_limit
 
 - \c msg/debug-multiple-use: \ref options_msg_debug_multiple_use
 
@@ -1189,17 +1293,20 @@ It can be done by using XBT. Go to \ref XBT_log for more details.
 - \c smpi/host-speed: \ref options_smpi_bench
 - \c smpi/IB-penalty-factors: \ref options_model_network_coefs
 - \c smpi/iprobe: \ref options_model_smpi_iprobe
+- \c smpi/iprobe-cpu-usage: \ref options_model_smpi_iprobe_cpu_usage
 - \c smpi/init: \ref options_model_smpi_init
+- \c smpi/keep-temps: \ref options_smpi_temps
 - \c smpi/lat-factor: \ref options_model_smpi_lat_factor
 - \c smpi/ois: \ref options_model_smpi_ois
 - \c smpi/or: \ref options_model_smpi_or
 - \c smpi/os: \ref options_model_smpi_os
 - \c smpi/papi-events: \ref options_smpi_papi_events
-- \c smpi/privatize-global-variables: \ref options_smpi_global
+- \c smpi/privatization: \ref options_smpi_privatization
 - \c smpi/send-is-detached-thresh: \ref options_model_smpi_detached
+- \c smpi/shared-malloc: \ref options_model_smpi_shared_malloc
+- \c smpi/shared-malloc-hugepage: \ref options_model_smpi_shared_malloc
 - \c smpi/simulate-computation: \ref options_smpi_bench
 - \c smpi/test: \ref options_model_smpi_test
-- \c smpi/use-shared-malloc: \ref options_model_smpi_use_shared_malloc
 - \c smpi/wtime: \ref options_model_smpi_wtime
 
 - \c Tracing configuration options can be found in Section \ref tracing_tracing_options.