doc/doxygen/module-smpi.doc

   1 /**
   2 @defgroup SMPI_API      SMPI: Simulate real MPI applications
   3 @brief Programming environment for the simulation of MPI applications
   4
   5 @tableofcontents
   6
   7 [TOC]
   8
   9 SMPI enables the study of MPI applications by emulating them on top of
  10 the SimGrid simulator. This is particularly interesting to study
  11 existing MPI applications within the comfort of the simulator. The
  12 SMPI reference article is available at
  13 https://hal.inria.fr/hal-01415484. You should also read the
  14 <a href="http://simgrid.org/tutorials/simgrid-smpi-101.pdf">SMPI
  15 introductory slides</a>.
  16
  17 Our goal is to enable the study of **unmodified MPI applications**.
  18 Some constructs and features are still missing, but we can probably
  19 add them on demand.  If you already used MPI before, SMPI should sound
  20 very familiar to you: Use smpicc instead of mpicc, and smpirun instead
  21 of mpirun. The main difference is that smpirun takes a virtual
  22 platform as extra parameter (see @ref platform).
  23
  24 If you are new to MPI, you should first take our online [SMPI
  25 CourseWare](https://simgrid.github.io/SMPI_CourseWare/). It consists
  26 of several projects that progressively introduce the MPI concepts. It
  27 proposes to use SimGrid and SMPI to run the experiments, but the
  28 learning objectives are centered on MPI itself.
  29
  30 For **further scalability**, you may modify your code to speed up your
  31 studies or save memory space.  Maximal **simulation accuracy**
  32 requires some specific care from you.
  33
  34  - @ref SMPI_use
  35    - @ref SMPI_use_compile
  36    - @ref SMPI_use_exec
  37    - @ref SMPI_use_colls
  38      - @ref SMPI_use_colls_algos
  39      - @ref SMPI_use_colls_tracing
  40  - @ref SMPI_what
  41    - @ref SMPI_what_coverage
  42    - @ref SMPI_what_globals
  43  - @ref SMPI_adapting
  44    - @ref SMPI_adapting_size
  45    - @ref SMPI_adapting_speed
  46  - @ref SMPI_accuracy
  47  - @ref SMPI_troubleshooting
  48    - @ref SMPI_trouble_configure_refuses_smpicc
  49    - @ref SMPI_trouble_configure_dont_find_smpicc
  50    - @ref SMPI_trouble_useconds_t
  51
  52
  53 @section SMPI_use Using SMPI
  54
  55 @subsection SMPI_use_compile Compiling your code
  56
  57 If your application is in C, then simply use <tt>smpicc</tt> as a
  58 compiler just like you use mpicc with other MPI implementations. This
  59 script still calls your default compiler (gcc, clang, ...) and adds
  60 the right compilation flags along the way. If your application is in
  61 C++, Fortran 77 or Fortran 90, use respectively <tt>smpicxx</tt>,
  62 <tt>smpiff</tt> or <tt>smpif90</tt>.
  63
  64 @subsection SMPI_use_exec Executing your code on the simulator
  65
  66 Use the <tt>smpirun</tt> script as follows for that:
  67
  68 @verbatim
  69 smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
  70 @endverbatim
  71
  72  - <tt>my_hostfile.txt</tt> is a classical MPI hostfile (that is, this
  73    file lists the machines on which the processes must be dispatched, one
  74    per line)
  75  - <tt>my_platform.xml</tt> is a classical SimGrid platform file. Of
  76    course, the hosts of the hostfile must exist in the provided
  77    platform.
  78  - <tt>./program</tt> is the MPI program to simulate, that you
  79    compiled with <tt>smpicc</tt>
  80  - <tt>-blah</tt> is a command-line parameter passed to this program.
  81
  82 <tt>smpirun</tt> accepts other parameters, such as <tt>-np</tt> if you
  83 don't want to use all the hosts defined in the hostfile, <tt>-map</tt>
  84 to display on which host each rank gets mapped, or <tt>-trace</tt> to
  85 activate the tracing during the simulation. You can get the full list
  86 by running
  87
  88 @verbatim
  89 smpirun -help
  90 @endverbatim
  91
  92 @subsection SMPI_use_colls Simulating collective operations
  93
  94 MPI collective operations are crucial to the performance of MPI
  95 applications and must be carefully optimized according to many
  96 parameters. Every existing implementation provides several algorithms
  97 for each collective operation, and selects by default the best suited
  98 one, depending on the sizes sent, the number of nodes, the
  99 communicator, or the communication library being used.  These
 100 decisions are based on empirical results and theoretical complexity
 101 estimation, and are very different between MPI implementations. In
 102 most cases, the users can also manually tune the algorithm used for
 103 each collective operation.
 104
 105 SMPI can simulate the behavior of several MPI implementations:
 106 OpenMPI, MPICH,
 107 <a href="http://star-mpi.sourceforge.net/">STAR-MPI</a>, and
 108 MVAPICH2. For that, it provides 115 collective algorithms and several
 109 selector algorithms, that were collected directly in the source code
 110 of the targeted MPI implementations.
 111
 112 You can switch the automatic selector through the
 113 \c smpi/coll-selector configuration item. Possible values:
 114
 115  - <b>ompi</b>: default selection logic of OpenMPI (version 1.7)
 116  - <b>mpich</b>: default selection logic of MPICH (version 3.0.4)
 117  - <b>mvapich2</b>: selection logic of MVAPICH2 (version 1.9) tuned
 118    on the Stampede cluster
 119  - <b>impi</b>: preliminary version of an Intel MPI selector (version
 120    4.1.3, also tuned for the Stampede cluster). Due to the closed-source
 121    nature of Intel MPI, some of the algorithms described in the
 122    documentation are not available, and are replaced by mvapich ones.
 123  - <b>default</b>: legacy algorithms used in the earlier days of
 124    SimGrid. Do not use for serious performance studies.
 125
 126
 127 @subsubsection SMPI_use_colls_algos Available algorithms
 128
 129 You can also pick the algorithm used for each collective with the
 130 corresponding configuration item. For example, to use the pairwise
 131 alltoall algorithm, add \c --cfg=smpi/alltoall:pair to the command
 132 line. This overrides the selector (if any) for this collective, so
 133 the algorithm you picked is the one that gets used.
 134
 135 Warning: Some collectives may require specific conditions to be
 136 executed correctly (for instance having a communicator with a power of
 137 two number of nodes only), which are currently not enforced by
 138 SimGrid. Some crashes can be expected while trying these algorithms
 139 with unusual sizes/parameters.
 140
 141 #### MPI_Alltoall
 142
 143 Most of these are best described in <a href="http://www.cs.arizona.edu/~dkl/research/papers/ics06.pdf">STAR-MPI</a>
 144
 145  - default: naive one, by default
 146  - ompi: use openmpi selector for the alltoall operations
 147  - mpich: use mpich selector for the alltoall operations
 148  - mvapich2: use mvapich2 selector for the alltoall operations
 149  - impi: use intel mpi selector for the alltoall operations
 150  - automatic (experimental): use an automatic self-benchmarking algorithm
 151  - bruck: Described by Bruck et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">this paper</a>
 152  - 2dmesh: organizes the nodes as a two dimensional mesh, and performs allgather
 153    along the dimensions
 154  - 3dmesh: adds a third dimension to the previous algorithm
 155  - rdb: recursive doubling: extends the mesh to an nth dimension, each one
 156    containing two nodes
 157  - pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
 158    each process sends and receives from the same process at each step
 159  - pair_light_barrier: same, with small barriers between steps to avoid
 160    contention
 161  - pair_mpi_barrier: same, with MPI_Barrier used
 162  - pair_one_barrier: only one barrier at the beginning
 163  - ring: size-1 steps, at each step i, process n sends to process (n+i)%size and receives from process (n-i)%size
 164  - ring_light_barrier: same, with small barriers between some phases to avoid contention
 165  - ring_mpi_barrier: same, with MPI_Barrier used
 166  - ring_one_barrier: only one barrier at the beginning
 167  - basic_linear: posts all receives and all sends,
 168 starts the communications, and waits for all communication to finish
 169  - mvapich2_scatter_dest: isend/irecv with scattered destinations, posting only a few messages at the same time
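
The peer selection of the ring and pair schedules described above can be sketched in plain C. This is only an illustration and not the SMPI implementation: the function names are hypothetical, and the XOR formulation of the pairwise exchange is an assumption (it is a common way to obtain a partner that is the same for sending and receiving, and it explains the power-of-two restriction).

```c
#include <assert.h>

/* Peer that `rank` sends to at step `step` of the ring schedule. */
int ring_send_peer(int rank, int step, int size)
{
  assert(size > 0 && rank >= 0 && rank < size && step >= 0);
  return (rank + step) % size;
}

/* Peer that `rank` receives from at the same step. The `+ size` keeps the
 * left operand of % non-negative: in C, % takes the sign of the dividend. */
int ring_recv_peer(int rank, int step, int size)
{
  assert(size > 0 && rank >= 0 && rank < size && step >= 0);
  return (rank - step % size + size) % size;
}

/* Returns 1 if `size` is a power of two, 0 otherwise. */
int is_power_of_two(int size)
{
  return size > 0 && (size & (size - 1)) == 0;
}

/* Partner of `rank` at step `step` of a pairwise exchange, assuming the XOR
 * formulation. Returns -1 when `size` is not a power of two, since the
 * partner could then fall outside of the communicator. */
int pair_peer(int rank, int step, int size)
{
  if (!is_power_of_two(size))
    return -1;
  return rank ^ step;
}
```

With 4 processes, rank 3 sends to rank 0 at step 1 of the ring schedule, while rank 0 receives from rank 3. In the pairwise sketch, applying the same step twice leads back to the initial rank, which is why each process sends to and receives from the same partner.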
 170
 171 #### MPI_Alltoallv
 172
 173  - default: naive one, by default
 174  - ompi: use openmpi selector for the alltoallv operations
 175  - mpich: use mpich selector for the alltoallv operations
 176  - mvapich2: use mvapich2 selector for the alltoallv operations
 177  - impi: use intel mpi selector for the alltoallv operations
 178  - automatic (experimental): use an automatic self-benchmarking algorithm
 179  - bruck: same as alltoall
 180  - pair: same as alltoall
 181  - pair_light_barrier: same as alltoall
 182  - pair_mpi_barrier: same as alltoall
 183  - pair_one_barrier: same as alltoall
 184  - ring: same as alltoall
 185  - ring_light_barrier: same as alltoall
 186  - ring_mpi_barrier: same as alltoall
 187  - ring_one_barrier: same as alltoall
 188  - ompi_basic_linear: same as alltoall
 189
 190 #### MPI_Gather
 191
 192  - default: naive one, by default
 193  - ompi: use openmpi selector for the gather operations
 194  - mpich: use mpich selector for the gather operations
 195  - mvapich2: use mvapich2 selector for the gather operations
 196  - impi: use intel mpi selector for the gather operations
 197  - automatic (experimental): use an automatic self-benchmarking algorithm
 198 which will iterate over all implemented versions and output the best
 199  - ompi_basic_linear: basic linear algorithm from openmpi, each process sends to the root
 200  - ompi_binomial: binomial tree algorithm
 201  - ompi_linear_sync: same as basic linear, but with a synchronization at the
 202  beginning and message cut into two segments.
 203  - mvapich2_two_level: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 204
 205 #### MPI_Barrier
 206
 207  - default: naive one, by default
 208  - ompi: use openmpi selector for the barrier operations
 209  - mpich: use mpich selector for the barrier operations
 210  - mvapich2: use mvapich2 selector for the barrier operations
 211  - impi: use intel mpi selector for the barrier operations
 212  - automatic (experimental): use an automatic self-benchmarking algorithm
 213  - ompi_basic_linear: all processes send to root
 214  - ompi_two_procs: special case for two processes
 215  - ompi_bruck: nsteps = ceil(log2(size)), at each step k, exchange data with rank-2^k and rank+2^k
 216  - ompi_recursivedoubling: recursive doubling algorithm
 217  - ompi_tree: recursive doubling type algorithm, with tree structure
 218  - ompi_doublering: double ring algorithm
 219  - mvapich2_pair: pairwise algorithm
 220
 221 #### MPI_Scatter
 222
 223  - default: naive one, by default
 224  - ompi: use openmpi selector for the scatter operations
 225  - mpich: use mpich selector for the scatter operations
 226  - mvapich2: use mvapich2 selector for the scatter operations
 227  - impi: use intel mpi selector for the scatter operations
 228  - automatic (experimental): use an automatic self-benchmarking algorithm
 229  - ompi_basic_linear: basic linear scatter
 230  - ompi_binomial: binomial tree scatter
 231  - mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 232  - mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 233
 234 #### MPI_Reduce
 235
 236  - default: naive one, by default
 237  - ompi: use openmpi selector for the reduce operations
 238  - mpich: use mpich selector for the reduce operations
 239  - mvapich2: use mvapich2 selector for the reduce operations
 240  - impi: use intel mpi selector for the reduce operations
 241  - automatic (experimental): use an automatic self-benchmarking algorithm
 242  - arrival_pattern_aware: root exchanges with the first process to arrive
 243  - binomial: uses a binomial tree
 244  - flat_tree: uses a flat tree
 245  - NTSL: Non-topology-specific pipelined linear-bcast function
 246    0->1, 1->2 ,2->3, ....., ->last node: in a pipeline fashion, with segments
 247  of 8192 bytes
 248  - scatter_gather: scatter then gather
 249  - ompi_chain: openmpi reduce algorithms are built on the same basis, but the
 250  topology is generated differently for each flavor
 251 chain = chain with spacing of size/2, and segment size of 64KB
 252  - ompi_pipeline: same with pipeline (chain with spacing of 1), segment size
 253 depends on the communicator size and the message size
 254  - ompi_binary: same with binary tree, segment size of 32KB
 255  - ompi_in_order_binary: same with binary tree, enforcing order on the
 256 operations
 257  - ompi_binomial: same with binomial algo (redundant with default binomial
 258 one in most cases)
 259  - ompi_basic_linear: basic algorithm, each process sends to root
 260  - mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
 261  - mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 262  - rab: <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a>'s reduce algorithm
 263
 264 #### MPI_Allreduce
 265
 266  - default: naive one, by default
 267  - ompi: use openmpi selector for the allreduce operations
 268  - mpich: use mpich selector for the allreduce operations
 269  - mvapich2: use mvapich2 selector for the allreduce operations
 270  - impi: use intel mpi selector for the allreduce operations
 271  - automatic (experimental): use an automatic self-benchmarking algorithm
 272  - lr: logical ring reduce-scatter then logical ring allgather
 273  - rab1: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: reduce_scatter then allgather
 274  - rab2: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: alltoall then allgather
 275  - rab_rsag: variation of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: recursive doubling
 276 reduce_scatter then recursive doubling allgather
 277  - rdb: recursive doubling
 278  - smp_binomial: binomial tree with smp: binomial intra
 279 SMP reduce, inter reduce, inter broadcast then intra broadcast
 280  - smp_binomial_pipeline: same with segment size = 4096 bytes
 281  - smp_rdb: intra: binomial allreduce, inter: Recursive
 282 doubling allreduce, intra: binomial broadcast
 283  - smp_rsag: intra: binomial allreduce, inter: reduce-scatter,
 284 inter:allgather, intra: binomial broadcast
 285  - smp_rsag_lr: intra: binomial allreduce, inter: logical ring
 286 reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
 287  - smp_rsag_rab: intra: binomial allreduce, inter: rab
 288 reduce-scatter, rab inter:allgather, intra: binomial broadcast
 289  - redbcast: reduce then broadcast, using default or tuned algorithms if specified
 290  - ompi_ring_segmented: ring algorithm used by OpenMPI
 291  - mvapich2_rs: rdb for small messages, reduce-scatter then allgather else
 292  - mvapich2_two_level: SMP-aware algorithm, with mpich as intra algorithm, and rdb as inter (change this behavior by using the mvapich2 selector to use tuned values)
 293  - rab: default <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> implementation
 294
 295 #### MPI_Reduce_scatter
 296
 297  - default: naive one, by default
 298  - ompi: use openmpi selector for the reduce_scatter operations
 299  - mpich: use mpich selector for the reduce_scatter operations
 300  - mvapich2: use mvapich2 selector for the reduce_scatter operations
 301  - impi: use intel mpi selector for the reduce_scatter operations
 302  - automatic (experimental): use an automatic self-benchmarking algorithm
 303  - ompi_basic_recursivehalving: recursive halving version from OpenMPI
 304  - ompi_ring: ring version from OpenMPI
 305  - mpich_pair: pairwise exchange version from MPICH
 306  - mpich_rdb: recursive doubling version from MPICH
 307  - mpich_noncomm: only works for power of 2 procs, recursive doubling for noncommutative ops
 308
 309
 310 #### MPI_Allgather
 311
 312  - default: naive one, by default
 313  - ompi: use openmpi selector for the allgather operations
 314  - mpich: use mpich selector for the allgather operations
 315  - mvapich2: use mvapich2 selector for the allgather operations
 316  - impi: use intel mpi selector for the allgather operations
 317  - automatic (experimental): use an automatic self-benchmarking algorithm
 318  - 2dmesh: see alltoall
 319  - 3dmesh: see alltoall
 320  - bruck: Described by Bruck et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">
 321 Efficient algorithms for all-to-all communications in multiport message-passing systems</a>
 322  - GB: Gather - Broadcast (uses tuned version if specified)
 323  - loosely_lr: Logical Ring with grouping by core (hardcoded, default
 324 processes/node: 4)
 325  - NTSLR: Non Topology Specific Logical Ring
 326  - NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
 327  - pair: see alltoall
 328  - rdb: see alltoall
 329  - rhv: only power of 2 number of processes
 330  - ring: see alltoall
 331  - SMP_NTS: gather to root of each SMP, then every root of each SMP node
 332 post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 333 using logical ring algorithm (hardcoded, default processes/SMP: 8)
 334  - smp_simple: gather to root of each SMP, then every root of each SMP node
 335 post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 336 using simple algorithm (hardcoded, default processes/SMP: 8)
 337  - spreading_simple: from node i, order of communications is i -> i + 1, i ->
 338  i + 2, ..., i -> (i + p -1) % P
 339  - ompi_neighborexchange: Neighbor Exchange algorithm for allgather.
 340 Described by Chen et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302">Performance Evaluation of Allgather Algorithms on Terascale Linux Cluster with Fast Ethernet</a>
 341  - mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
 342
 343
 344 #### MPI_Allgatherv
 345
 346  - default: naive one, by default
 347  - ompi: use openmpi selector for the allgatherv operations
 348  - mpich: use mpich selector for the allgatherv operations
 349  - mvapich2: use mvapich2 selector for the allgatherv operations
 350  - impi: use intel mpi selector for the allgatherv operations
 351  - automatic (experimental): use an automatic self-benchmarking algorithm
 352  - GB: Gatherv - Broadcast (uses tuned version if specified, but only for
 353 Bcast, gatherv is not tuned)
 354  - pair: see alltoall
 355  - ring: see alltoall
 356  - ompi_neighborexchange: see allgather
 357  - ompi_bruck: see allgather
 358  - mpich_rdb: recursive doubling algorithm from MPICH
 359  - mpich_ring: ring algorithm from MPICH - performs differently from the one from STAR-MPI
 360
 361 #### MPI_Bcast
 362
 363  - default: naive one, by default
 364  - ompi: use openmpi selector for the bcast operations
 365  - mpich: use mpich selector for the bcast operations
 366  - mvapich2: use mvapich2 selector for the bcast operations
 367  - impi: use intel mpi selector for the bcast operations
 368  - automatic (experimental): use an automatic self-benchmarking algorithm
 369  - arrival_pattern_aware: root exchanges with the first process to arrive
 370  - arrival_pattern_aware_wait: same with slight variation
 371  - binomial_tree: binomial tree exchange
 372  - flattree: flat tree exchange
 373  - flattree_pipeline: flat tree exchange, message split into 8192 bytes pieces
 374  - NTSB: Non-topology-specific pipelined binary tree with 8192 bytes pieces
 375  - NTSL: Non-topology-specific pipelined linear with 8192 bytes pieces
 376  - NTSL_Isend: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications
 377  - scatter_LR_allgather: scatter followed by logical ring allgather
 378  - scatter_rdb_allgather: scatter followed by recursive doubling allgather
 379  - arrival_scatter: arrival pattern aware scatter-allgather
 380  - SMP_binary: binary tree algorithm with 8 cores/SMP
 381  - SMP_binomial: binomial tree algorithm with 8 cores/SMP
 382  - SMP_linear: linear algorithm with 8 cores/SMP
 383  - ompi_split_bintree: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces
 384  - ompi_pipeline: pipeline algorithm from OpenMPI, with message split in 128KB pieces
 385  - mvapich2_inter_node: Inter node default mvapich worker
 386  - mvapich2_intra_node: Intra node default mvapich worker
 387  - mvapich2_knomial_intra_node:  k-nomial intra node default mvapich worker. default factor is 4.
 388
 389 #### Automatic evaluation
 390
 391 (Warning: This is still very experimental)
 392
 393 An automatic version is available for each collective (or even as a selector). This specific
 394 version loops over all other implemented algorithms for this particular collective, and applies
 395 them while benchmarking the time taken for each process. It then outputs the quickest for
 396 each process, and the global quickest. This is still unstable, and a few algorithms that need
 397 a specific number of nodes may crash.
 398
 399 #### Adding an algorithm
 400
 401 To add a new algorithm, one should check in the src/smpi/colls folder how other algorithms
 402 are coded. Plain MPI code cannot be used inside SimGrid, so algorithms have to be
 403 changed to use the smpi version of the calls instead (MPI_Send will become smpi_mpi_send). Some functions may have signatures that differ from their MPI counterparts; please check the other algorithms or contact us using the <a href="http://lists.gforge.inria.fr/mailman/listinfo/simgrid-devel">SimGrid developers mailing list</a>.
 404
 405 Example: adding a "pair" version of the Alltoall collective.
 406
 407  - Implement it in a file called alltoall-pair.c in the src/smpi/colls folder. This file should include colls_private.hpp.
 408
 409  - The name of the new algorithm function should be smpi_coll_tuned_alltoall_pair, with the same signature as MPI_Alltoall.
 410
 411  - Once the adaptation to SMPI code is done, add a reference to the file ("src/smpi/colls/alltoall-pair.c") in the SMPI_SRC part of the DefinePackages.cmake file inside buildtools/cmake, to allow the file to be built and distributed.
 412
 413  - To register the new version of the algorithm, simply add a line to the corresponding macro in src/smpi/colls/colls.h (add a "COLL_APPLY(action, COLL_ALLTOALL_SIG, pair)" to the COLL_ALLTOALLS macro). The algorithm should now be compiled and be selected when using --cfg=smpi/alltoall:pair at runtime.
 414
 415  - To add a test for the algorithm inside SimGrid's test suite, just add the new algorithm name to the ALLTOALL_COLL list found inside teshsuite/smpi/CMakeLists.txt. When running ctest, a test for the new algorithm should be generated and executed. If it does not pass, please check your code or contact us.
 416
 417  - Please submit your patch for inclusion in SMPI, for example through a pull request on GitHub or directly per email.
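
The registration step relies on a list macro that is expanded several times with a different action each time. The following self-contained C sketch illustrates that technique only. Every name in it (DEMO_ALLTOALLS, demo_alltoall_ring, demo_find_alltoall, ...) is hypothetical, and the toy signature is a stand-in that has nothing to do with the real MPI_Alltoall signature or with the actual SMPI macros.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical list macro, in the spirit of COLL_ALLTOALLS: adding an
 * algorithm boils down to adding one line here. */
#define DEMO_ALLTOALLS(action) \
  action(ring)                 \
  action(pair)

/* Toy signature shared by all demo algorithms (not the MPI_Alltoall one). */
typedef int (*demo_alltoall_fn)(int nprocs);

/* First expansion: declare one function per registered name. */
#define DEMO_DECLARE(name) int demo_alltoall_##name(int nprocs);
DEMO_ALLTOALLS(DEMO_DECLARE)

/* Second expansion: build the table mapping names to functions. */
struct demo_entry {
  const char *name;
  demo_alltoall_fn fn;
};
#define DEMO_ENTRY(name) {#name, demo_alltoall_##name},
static const struct demo_entry demo_table[] = {DEMO_ALLTOALLS(DEMO_ENTRY)};

/* Toy bodies: they only return the number of communication steps. */
int demo_alltoall_ring(int nprocs) { return nprocs - 1; }
int demo_alltoall_pair(int nprocs) { return nprocs - 1; }

/* Resolve a name such as "pair" to the matching function, or NULL. */
demo_alltoall_fn demo_find_alltoall(const char *name)
{
  assert(name != NULL);
  for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++)
    if (strcmp(demo_table[i].name, name) == 0)
      return demo_table[i].fn;
  return NULL; /* unknown algorithm name */
}
```

The benefit of this design is that the declaration and the lookup table cannot get out of sync, since both are generated from the same list.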
 418
 419 @subsubsection SMPI_use_colls_tracing Tracing of internal communications
 420
 421 By default, the collective operations are traced as a unique operation
 422 because tracing all point-to-point communications composing them could
 423 result in overloaded, hard to interpret traces. If you want to debug
 424 and compare collective algorithms, you should set the
 425 \c tracing/smpi/internals configuration item to 1 instead of 0.
 426
 427 Here are examples of two alltoall collective algorithm runs on 16 nodes,
 428 the first one with a ring algorithm, the second with a pairwise one:
 429
 430 @htmlonly
 431 <a href="smpi_simgrid_alltoall_ring_16.png" border=0><img src="smpi_simgrid_alltoall_ring_16.png" width="30%" border=0 align="center"></a>
 432 <a href="smpi_simgrid_alltoall_pair_16.png" border=0><img src="smpi_simgrid_alltoall_pair_16.png" width="30%" border=0 align="center"></a>
 433 <br/>
 434 @endhtmlonly
 435
 436 @section SMPI_what What can run within SMPI?
 437
 438 You can run unmodified MPI applications (both C/C++ and Fortran) within
 439 SMPI, provided that you only use MPI calls that we implemented. Global
 440 variables should be handled correctly on Linux systems.
 441
 442 @subsection SMPI_what_coverage MPI coverage of SMPI
 443
 444 Our coverage of the interface is very decent, but still incomplete.
 445 Given the size of the MPI standard, we may well never manage to
 446 implement absolutely all existing primitives. Currently, we have
 447 almost no support for I/O primitives, but we still pass a very large
 448 amount of the MPICH coverage tests.
 449
 450 The full list of not yet implemented functions is documented in the
 451 file @ref include/smpi/smpi.h, between two lines containing the
 452 <tt>FIXME</tt> marker. If you really miss a feature, please get in
 453 touch with us: we can guide you through the SimGrid code to help you
 454 implement it, and we'd be glad to integrate your contribution into the
 455 main project afterward.
 456
 457 @subsection SMPI_what_globals Privatization of global variables
 458
 459 Concerning the globals, the problem comes from the fact that usually,
 460 MPI processes run as real UNIX processes while they are all folded
 461 into threads of a unique system process in SMPI. Global variables are
 462 usually private to each MPI process while they become shared between
 463 the processes in SMPI.  The problem and some potential solutions are
 464 discussed in this article: "Automatic Handling of Global Variables for
 465 Multi-threaded MPI Programs", available at
 466 http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf (note that this
 467 article does not deal with SMPI but with a competing solution called
 468 AMPI that suffers from the same issue).  This point used to be
 469 problematic in SimGrid, but the problem should now be handled
 470 automatically on Linux.
 471
 472 Older versions of SimGrid came with a script that automatically
 473 privatized the globals through static analysis of the source code. But
 474 our implementation was not robust enough to be used in production, so
 475 it was removed at some point. Currently, SMPI comes with two
 476 privatization mechanisms that you can @ref options_smpi_privatization
 477 "select at runtime". At the time of writing (v3.18), the dlopen
 478 approach is considered to be very fast (it's used by default) while
 479 the mmap approach is considered to be rather slow but very robust.
 480
 481 With the <b>mmap approach</b>, SMPI duplicates and dynamically switches
 482 the \c .data and \c .bss segments of the ELF process when switching
 483 the MPI ranks. This allows each rank to have its own copy of the
 484 global variables.  No copy actually occurs as this mechanism uses \c
 485 mmap for efficiency. This mechanism is considered to be very robust on
 486 all systems supporting \c mmap (Linux and most BSDs). Its performance
 487 is questionable since each context switch between MPI ranks induces
 488 several syscalls to change the \c mmap that redirects the \c .data and
 489 \c .bss segments to the copies of the new rank. The code will also be
 490 copied several times in memory, inducing a slight increase of memory
 491 occupation.
 492
 493 Another limitation is that SMPI only accounts for global variables
 494 defined in the executable. If the processes use external global
 495 variables from dynamic libraries, they won't be switched
 496 correctly. The easiest way to solve this is to statically link against
 497 the library with these globals. This way, each MPI rank will get its
 498 own copy of these libraries. Of course you should never statically
 499 link against the SimGrid library itself.
 500
 501 With the <b>dlopen approach</b>, SMPI loads several copies of the same
 502 executable in memory as if it were a library, so that the global
 503 variables get naturally duplicated. It first requires the executable
 504 to be compiled as a relocatable binary, which is less common for
 505 programs than for libraries. But most distributions are now compiled
 506 this way for security reasons, as it allows randomizing the address
 507 space layout. It should thus be safe to compile most (any?) program
 508 this way.  The second trick is that the dynamic linker refuses to link
 509 the exact same file several times, be it a library or a relocatable
 510 executable. It makes perfect sense in the general case, but we need
 511 to circumvent this rule of thumb in our case. To that end, the
 512 binary is copied in a temporary file before being re-linked against.
 513 `dlmopen()` cannot be used as it only allows 256 contexts, and as it
 514 would also duplicate SimGrid itself.
 515
 516 This approach greatly speeds up the context switching, down to about
 517 40 CPU cycles with our raw contexts, instead of requesting several
 518 syscalls with the \c mmap approach. Another advantage is that it
 519 allows running the SMPI contexts in parallel, which is obviously not
 520 possible with the \c mmap approach. It was tricky to implement, but we
 521 are not aware of any flaws, so smpirun activates it by default.
 522
 523 In the future, it may be possible to further reduce the memory and
 524 disk consumption. It seems that we could <a
 525 href="https://lwn.net/Articles/415889/">punch holes</a> in the files
 526 before dl-loading them to remove the code and constants, and mmap
 527 these areas onto a unique copy. If done correctly, this would reduce
 528 the disk and memory usage to the bare minimum, and would also reduce
 529 the pressure on the CPU instruction cache. See
 530 <a href="https://github.com/simgrid/simgrid/issues/137">the relevant
 531 bug</a> on GitHub for implementation leads.\n
 532
 533 Also, currently, only the binary is copied and dlopen-ed for each MPI
 534 rank. We could probably extend this to external dependencies, but for
 535 now, any external dependencies must be statically linked into your
 536 application. As usual, SimGrid itself shall never be statically linked
 537 in your app. You don't want to give a copy of SimGrid to each MPI rank:
 538 that's way too much for them to deal with.
 539
 540 @section SMPI_adapting Adapting your MPI code for further scalability
 541
 542 As detailed in the reference article (available at
 543 http://hal.inria.fr/hal-01415484), you may want to adapt your code
 544 to improve the simulation performance. But these tricks may seriously
 545 hinder the result quality (or even prevent the app from running) if used
 546 wrongly. We assume that if you want to simulate an HPC application,
 547 you know what you are doing. Don't prove us wrong!
 548
 549 @subsection SMPI_adapting_size Reducing your memory footprint
 550
 551 If you get short on memory (the whole app is executed on a single node when
 552 simulated), you should have a look at the SMPI_SHARED_MALLOC and
 553 SMPI_SHARED_FREE macros. They allow sharing memory areas between processes: the
 554 purpose of these macros is that the same malloc line on each process will point
 555 to the exact same memory area. So if you have a malloc of 2M and you have 16
 556 processes, these macros will change your memory consumption from 2M*16 to 2M
 557 only. Only one block for all processes.
 558
 559 If your program is ok with a block containing garbage values because all
 560 processes write to and read from the same place without any kind of coordination,
 561 then these macros can dramatically shrink your memory consumption. For example,
 562 that will be very beneficial to a matrix multiplication code, as all blocks will
 563 be stored in the same area. Of course, the resulting computations will be useless,
 564 but you can still study the application behavior this way.
 565
 566 Naturally, this won't work if your code is data-dependent. For example, a Jacobi
 567 iterative computation depends on the result computed by the code to detect
 568 convergence conditions, so turning this result into garbage by sharing the same memory
 569 area between processes does not seem very wise. You cannot use the
 570 SMPI_SHARED_MALLOC macro in this case, sorry.
 571
 572 This feature is demoed by the example file
 573 <tt>examples/smpi/NAS/dt.c</tt>
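
A minimal sketch of the pattern described above. It assumes a hypothetical
<tt>USE_SMPI</tt> compile flag that you would define yourself when building
with smpicc (it is not an SMPI option), and the functions
<tt>touch_scratch()</tt> and <tt>footprint()</tt> are illustrative names only.
Without the flag, the macros fall back to plain malloc/free so that the sketch
also builds with a regular C compiler, in which case nothing is shared.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef USE_SMPI            // hypothetical flag: define it yourself when building with smpicc
#include <smpi/smpi.h>     // provides SMPI_SHARED_MALLOC and SMPI_SHARED_FREE
#else
// Fallback so that the sketch also builds outside of SMPI: nothing is shared here.
#define SMPI_SHARED_MALLOC(size) malloc(size)
#define SMPI_SHARED_FREE(ptr) free(ptr)
#endif

// Hypothetical compute kernel: it fills a scratch block whose content never
// drives the control flow, and returns the number of bytes it requested.
size_t touch_scratch(size_t count)
{
  // Under SMPI, every rank reaching this source line gets the same memory area.
  double* scratch = (double*)SMPI_SHARED_MALLOC(count * sizeof(double));
  if (scratch == NULL)
    return 0;
  for (size_t i = 0; i < count; i++)
    scratch[i] = (double)i; // garbage under SMPI: the other ranks write here too
  SMPI_SHARED_FREE(scratch);
  return count * sizeof(double);
}

// Memory needed by nranks processes that each request block_bytes bytes,
// with or without sharing (the 2M*16 versus 2M arithmetic given above).
size_t footprint(size_t block_bytes, size_t nranks, int shared)
{
  return shared ? block_bytes : block_bytes * nranks;
}
```

The macro is called directly in the kernel rather than through a common
allocation helper, because the sharing is tied to the source line of the call.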
 574
 575 @subsection SMPI_adapting_speed Toward faster simulations
 576
 577 If your application is too slow, try using SMPI_SAMPLE_LOCAL,
 578 SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
 579 be sampled. Some of the loop iterations will be executed to measure
 580 their duration, and this duration will be used for the subsequent
 581 iterations. These samples are done per processor with
 582 SMPI_SAMPLE_LOCAL, and shared between all processors with
 583 SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
 584 times of your loop iterations are not stable.
 585
 586 This feature is demoed by the example file
 587 <tt>examples/smpi/NAS/ep.c</tt>
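
The extrapolation behind sampling can be illustrated in plain C. This is only
a sketch of the idea, not SMPI's implementation, and it does not use the SMPI
macros: <tt>sampled_total()</tt>, <tt>exact_total()</tt> and their parameters
are hypothetical names. The first iterations are measured, and their mean
duration is billed for the remaining ones.

```c
#include <stddef.h>

// Hypothetical helper, not part of SMPI. durations[i] is the time that
// iteration i would take if it were executed. Only the first `measured`
// iterations are looked at; the others are billed at the mean of these samples.
double sampled_total(const double* durations, size_t total, size_t measured)
{
  if (total == 0)
    return 0.0;
  if (measured == 0 || measured > total)
    measured = total; // nothing to extrapolate: measure everything

  double sum = 0.0;
  for (size_t i = 0; i < measured; i++)
    sum += durations[i];

  double mean = sum / (double)measured;
  return sum + mean * (double)(total - measured);
}

// Reference value: the time obtained when every iteration is executed.
double exact_total(const double* durations, size_t total)
{
  double sum = 0.0;
  for (size_t i = 0; i < total; i++)
    sum += durations[i];
  return sum;
}
```

When every iteration takes the same time, both functions return the same
total. When the iterations get slower after the sampled ones, the extrapolated
total is too small, which is why unstable loops should not be sampled.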
 588
 589 @section SMPI_accuracy Ensuring accurate simulations
 590
 591 Out of the box, SimGrid may give you fairly accurate results, but
 592 there are plenty of factors that could go wrong and make your results
 593 inaccurate or even plainly wrong. Actually, you can only get accurate
 594 results from a carefully built model, including both the system hardware
 595 and your application. Such models are hard to pass on and reuse in
 596 other settings, because elements that are not relevant to one
 597 application (say, the latency of point-to-point communications,
 598 collective operation implementation details or CPU-network
 599 interaction) may be crucial to another application. The dream of
 600 the perfect model, encompassing every aspect, is only a chimera, as
 601 the only perfect model of reality is reality itself. If you go for
 602 simulation, then you have to ignore some irrelevant aspects of
 603 reality, but which aspects are irrelevant is actually
 604 application-dependent...
 605
 606 The only way to assess whether your settings provide accurate results
 607 is to double-check these results. If possible, you should first run
 608 the same experiment in simulation and in real life, gathering as much
 609 information as you can. Try to understand the discrepancies in the
 610 results that you observe between both settings (visualization can be
 611 precious for that). Then, try to modify your model (of the platform,
 612 of the collective operations) to reduce the most prominent differences.
 613
 614 If the discrepancies come from the computing time, try adapting the \c
 615 smpi/host-speed option: reduce it if your simulation runs faster than in
 616 reality. If the error comes from the communication, then you need to
 617 fiddle with your platform file.
 618
 619 Be inventive in your modeling. Don't be afraid if the names given by
 620 SimGrid do not match the real names: we got very good results by
 621 modeling multicore/GPU machines with a set of separate hosts
 622 interconnected with very fast networks (but don't trust your model
 623 because it has the right names in the right place either).
 624
 625 Finally, you may want to check [this
 626 article](https://hal.inria.fr/hal-00907887) on the classical pitfalls
 627 in modeling distributed systems.
 628
 629 @section SMPI_troubleshooting Troubleshooting with SMPI
 630
 631 @subsection SMPI_trouble_configure_refuses_smpicc ./configure refuses to use smpicc
 632
 633 If your <tt>./configure</tt> reports that the compiler is not
 634 functional or that you are cross-compiling, try to define the
 635 <tt>SMPI_PRETEND_CC</tt> environment variable before running the
 636 configuration.
 637
 638 @verbatim
 639 SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
 640 make
 641 @endverbatim
 642
 643 Indeed, the programs compiled with <tt>smpicc</tt> cannot be executed
 644 without <tt>smpirun</tt> (they are shared libraries, and they do weird
 645 things on startup), while configure wants to test them directly.
 646 With <tt>SMPI_PRETEND_CC</tt>, smpicc does not compile as shared,
 647 and the SMPI initialization stops and returns 0 before doing anything
 648 that would fail without <tt>smpirun</tt>.
 649
 650 \warning
 651
 652   Make sure that SMPI_PRETEND_CC is only set when calling ./configure,
 653   not during the actual execution, or any program compiled with smpicc
 654   will stop before starting.
 655
 656 @subsection SMPI_trouble_configure_dont_find_smpicc ./configure does not pick smpicc as a compiler
 657
 658 In addition to the previous answer, some projects also need to be
 659 explicitly told what compiler to use, as follows:
 660
 661 @verbatim
 662 SMPI_PRETEND_CC=1 ./configure CC=smpicc # here come the other configure parameters
 663 make
 664 @endverbatim
 665
 666 Maybe your configure is using another variable, such as <tt>cc</tt> or
 667 similar. Just check the logs.
 668
 669 @subsection SMPI_trouble_useconds_t  error: unknown type name 'useconds_t'
 670
 671 Try to add <tt>-D_GNU_SOURCE</tt> to your compilation line to get rid
 672 of that error.
 673
 674 The reason is that SMPI provides its own version of <tt>usleep(3)</tt>
 675 to override it and to block in the simulation world, not in the real
 676 one. It needs the <tt>useconds_t</tt> type for that, which is declared
 677 only if you declare <tt>_GNU_SOURCE</tt> before including
 678 <tt>unistd.h</tt>. If your project includes that header file before
 679 SMPI, then you need to ensure that you pass the right configuration
 680 defines as advised above.
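
If you would rather fix this in the source than on the compilation line, the
same define can be placed at the very top of the file, before any system
header. The sketch below assumes that you can edit the file in question;
<tt>to_useconds()</tt> is a hypothetical helper that only shows the type
being available.

```c
// Request the GNU extensions before the very first system header, which is
// what -D_GNU_SOURCE does from the compilation line.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>

// Hypothetical helper: convert milliseconds to the type expected by usleep(3).
useconds_t to_useconds(unsigned int milliseconds)
{
  return (useconds_t)milliseconds * 1000u;
}
```

The guard avoids a redefinition warning when <tt>-D_GNU_SOURCE</tt> is also
passed to the compiler.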
 681
 682
 683 */
 684
 685
 686 /** @example include/smpi/smpi.h */