   1 /**
   2 @defgroup SMPI_API      SMPI: Simulate real MPI applications
   3 @brief Programming environment for the simulation of MPI applications
   4
   5 @tableofcontents
   6
   7 SMPI enables the study of MPI applications by emulating them on top of
   8 the SimGrid simulator. This is particularly useful for studying
   9 existing MPI applications within the comfort of the simulator. The
  10 SMPI reference article is available at
  11 https://hal.inria.fr/hal-01415484. You should also read the
  12 <a href="http://simgrid.org/tutorials/simgrid-smpi-101.pdf">SMPI
  13 introductory slides</a>.
  14
  15 Our goal is to enable the study of **unmodified MPI applications**.
  16 Some constructs and features are still missing, but we can probably
  17 add them on demand.  If you have used MPI before, SMPI should feel
  18 very familiar to you: use smpicc instead of mpicc, and smpirun instead
  19 of mpirun. The main difference is that smpirun takes a virtual
  20 platform as an extra parameter (see @ref platform).
  21
  22 If you are new to MPI, you should first take our online [SMPI
  23 CourseWare](https://simgrid.github.io/SMPI_CourseWare/). It consists
  24 of several projects that progressively introduce the MPI concepts. It
  25 uses SimGrid and SMPI to run the experiments, but the
  26 learning objectives are centered on MPI itself.
  27
  28 For **further scalability**, you may modify your code to speed up your
  29 studies or save memory space.  Maximal **simulation accuracy**
  30 requires some specific care from you.
  31
  32  - @ref SMPI_use
  33    - @ref SMPI_use_compile
  34    - @ref SMPI_use_exec
  35    - @ref SMPI_use_debug
  36    - @ref SMPI_use_colls
  37      - @ref SMPI_use_colls_algos
  38      - @ref SMPI_use_colls_tracing
  39  - @ref SMPI_what
  40    - @ref SMPI_what_coverage
  41    - @ref SMPI_what_globals
  42  - @ref SMPI_adapting
  43    - @ref SMPI_adapting_size
  44    - @ref SMPI_adapting_speed
  45  - @ref SMPI_accuracy
  46  - @ref SMPI_troubleshooting
  47    - @ref SMPI_trouble_configure_refuses_smpicc
  48    - @ref SMPI_trouble_configure_dont_find_smpicc
  49    - @ref SMPI_trouble_useconds_t
  50
  51
  52 @section SMPI_use Using SMPI
  53
  54 @subsection SMPI_use_compile Compiling your code
  55
  56 If your application is in C, then simply use <tt>smpicc</tt> as a
  57 compiler just like you use mpicc with other MPI implementations. This
  58 script still calls your default compiler (gcc, clang, ...) and adds
  59 the right compilation flags along the way. If your application is in
  60 C++, Fortran 77 or Fortran 90, use respectively <tt>smpicxx</tt>,
  61 <tt>smpiff</tt> or <tt>smpif90</tt>.
  62
  63 @subsection SMPI_use_exec Executing your code on the simulator
  64
  65 Use the <tt>smpirun</tt> script as follows for that:
  66
  67 @verbatim
  68 smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
  69 @endverbatim
  70
  71  - <tt>my_hostfile.txt</tt> is a classical MPI hostfile (that is, this
  72    file lists the machines on which the processes must be dispatched, one
  73    per line)
  74  - <tt>my_platform.xml</tt> is a classical SimGrid platform file. Of
  75    course, the hosts of the hostfile must exist in the provided
  76    platform.
  77  - <tt>./program</tt> is the MPI program to simulate, that you
  78    compiled with <tt>smpicc</tt>
  79  - <tt>-blah</tt> is a command-line parameter passed to this program.
  80
  81 <tt>smpirun</tt> accepts other parameters, such as <tt>-np</tt> if you
  82 don't want to use all the hosts defined in the hostfile, <tt>-map</tt>
  83 to display on which host each rank gets mapped, or <tt>-trace</tt> to
  84 activate the tracing during the simulation. You can get the full list
  85 by running:
  86
  87 @verbatim
  88 smpirun -help
  89 @endverbatim
  90
  91 @subsection SMPI_use_debug Debugging your code on top of SMPI
  92
  93 If you want to explore the automatic platform and deployment files
  94 that are generated by @c smpirun, add @c -keep-temps to the command
  95 line.
  96
  97 You can also run your simulation within valgrind or gdb using the
  98 following commands. Once in GDB, each MPI rank is represented as
  99 a regular thread, and you can explore the state of each of them as
 100 usual.
 101 @verbatim
 102 smpirun -wrapper valgrind ...other args...
 103 smpirun -wrapper "gdb --args" --cfg=contexts/factory:thread ...other args...
 104 @endverbatim
 105
 106 @subsection SMPI_use_colls Simulating collective operations
 107
 108 MPI collective operations are crucial to the performance of MPI
 109 applications and must be carefully optimized according to many
 110 parameters. Every existing implementation provides several algorithms
 111 for each collective operation, and selects by default the best suited
 112 one, depending on the sizes sent, the number of nodes, the
 113 communicator, or the communication library being used.  These
 114 decisions are based on empirical results and theoretical complexity
 115 estimation, and are very different between MPI implementations. In
 116 most cases, the users can also manually tune the algorithm used for
 117 each collective operation.
 118
 119 SMPI can simulate the behavior of several MPI implementations:
 120 OpenMPI, MPICH,
 121 <a href="http://star-mpi.sourceforge.net/">STAR-MPI</a>, and
 122 MVAPICH2. For that, it provides 115 collective algorithms and several
 123 selector algorithms, which were collected directly from the source code
 124 of the targeted MPI implementations.
 125
 126 You can switch the automatic selector through the
 127 \c smpi/coll-selector configuration item. Possible values:
 128
 129  - <b>ompi</b>: default selection logic of OpenMPI (version 1.7)
 130  - <b>mpich</b>: default selection logic of MPICH (version 3.0.4)
 131  - <b>mvapich2</b>: selection logic of MVAPICH2 (version 1.9) tuned
 132    on the Stampede cluster
 133  - <b>impi</b>: preliminary version of an Intel MPI selector (version
 134    4.1.3, also tuned for the Stampede cluster). Due to the closed-source
 135    nature of Intel MPI, some of the algorithms described in the
 136    documentation are not available, and are replaced by mvapich ones.
 137  - <b>default</b>: legacy algorithms used in the earlier days of
 138    SimGrid. Do not use for serious performance studies.
 139
 140
 141 @subsubsection SMPI_use_colls_algos Available algorithms
 142
 143 You can also pick the algorithm used for each collective with the
 144 corresponding configuration item. For example, to use the pairwise
 145 alltoall algorithm, add \c --cfg=smpi/alltoall:pair to the command
 146 line. This overrides the selector (if any) for this collective: the
 147 algorithm you picked is used whatever the selector would have chosen.
 148
 149 Warning: Some collectives may require specific conditions to be
 150 executed correctly (for instance having a communicator with a power of
 151 two number of nodes only), which are currently not enforced by
 152 SimGrid. Some crashes can be expected while trying these algorithms
 153 with unusual sizes/parameters.
 154
 155 #### MPI_Alltoall
 156
 157 Most of these are best described in <a href="http://www.cs.arizona.edu/~dkl/research/papers/ics06.pdf">STAR-MPI</a>
 158
 159  - default: naive one, by default
 160  - ompi: use openmpi selector for the alltoall operations
 161  - mpich: use mpich selector for the alltoall operations
 162  - mvapich2: use mvapich2 selector for the alltoall operations
 163  - impi: use intel mpi selector for the alltoall operations
 164  - automatic (experimental): use an automatic self-benchmarking algorithm
 165  - bruck: Described by Bruck et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">this paper</a>
 166  - 2dmesh: organizes the nodes as a two dimensional mesh, and performs allgather
 167    along the dimensions
 168  - 3dmesh: adds a third dimension to the previous algorithm
 169  - rdb: recursive doubling: extends the mesh to an nth dimension, each one
 170    containing two nodes
 171  - pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
 172    each process sends and receives from the same process at each step
 173  - pair_light_barrier: same, with small barriers between steps to avoid
 174    contention
 175  - pair_mpi_barrier: same, with MPI_Barrier used
 176  - pair_one_barrier: only one barrier at the beginning
 177  - ring: size-1 steps, at each step i, process n sends to process (n+i)%size and receives from process (n-i)%size
 178  - ring_light_barrier: same, with small barriers between some phases to avoid contention
 179  - ring_mpi_barrier: same, with MPI_Barrier used
 180  - ring_one_barrier: only one barrier at the beginning
 181  - basic_linear: posts all receives and all sends,
 182    starts the communications, and waits for all communications to finish
 183  - mvapich2_scatter_dest: isend/irecv with scattered destinations, posting only a few messages at the same time
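
The peer computations behind the ring and pairwise descriptions above can be sketched in plain C as follows. This is only an illustration, not the SMPI source: the function names are made up for this example, and the pairwise partner is assumed to be <tt>rank XOR step</tt>, which is why that variant needs a power-of-two number of processes. Note that the receive peer adds \c size before taking the modulo, since <tt>(n-i)%size</tt> may be negative in C.

```c
#include <assert.h>

/* Returns 1 when n is a positive power of two, 0 otherwise. */
int is_power_of_two(int n)
{
  return n > 0 && (n & (n - 1)) == 0;
}

/* Ring exchange: at step i, rank n sends to (n + i) % size. */
int ring_send_peer(int rank, int step, int size)
{
  assert(size > 0 && rank >= 0 && rank < size && step >= 0 && step < size);
  return (rank + step) % size;
}

/* Ring exchange: at step i, rank n receives from (n - i) modulo size.
 * Adding size first keeps the left operand of % non-negative. */
int ring_recv_peer(int rank, int step, int size)
{
  assert(size > 0 && rank >= 0 && rank < size && step >= 0 && step < size);
  return (rank - step + size) % size;
}

/* Pairwise exchange (assumption: partner = rank XOR step).
 * Sender and receiver are the same peer, and the pairing is only
 * consistent when size is a power of two. */
int pair_peer(int rank, int step, int size)
{
  assert(is_power_of_two(size));
  assert(rank >= 0 && rank < size && step >= 0 && step < size);
  return rank ^ step;
}
```

With 8 processes, rank 5 is paired with rank 6 at step 3, and rank 6 is paired back with rank 5 at the same step, so each step is a symmetric exchange.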
 184
 185 #### MPI_Alltoallv
 186
 187  - default: naive one, by default
 188  - ompi: use openmpi selector for the alltoallv operations
 189  - mpich: use mpich selector for the alltoallv operations
 190  - mvapich2: use mvapich2 selector for the alltoallv operations
 191  - impi: use intel mpi selector for the alltoallv operations
 192  - automatic (experimental): use an automatic self-benchmarking algorithm
 193  - bruck: same as alltoall
 194  - pair: same as alltoall
 195  - pair_light_barrier: same as alltoall
 196  - pair_mpi_barrier: same as alltoall
 197  - pair_one_barrier: same as alltoall
 198  - ring: same as alltoall
 199  - ring_light_barrier: same as alltoall
 200  - ring_mpi_barrier: same as alltoall
 201  - ring_one_barrier: same as alltoall
 202  - ompi_basic_linear: same as alltoall
 203
 204 #### MPI_Gather
 205
 206  - default: naive one, by default
 207  - ompi: use openmpi selector for the gather operations
 208  - mpich: use mpich selector for the gather operations
 209  - mvapich2: use mvapich2 selector for the gather operations
 210  - impi: use intel mpi selector for the gather operations
 211  - automatic (experimental): use an automatic self-benchmarking algorithm
 212 which will iterate over all implemented versions and output the best
 213  - ompi_basic_linear: basic linear algorithm from openmpi, each process sends to the root
 214  - ompi_binomial: binomial tree algorithm
 215  - ompi_linear_sync: same as basic linear, but with a synchronization at the
 216  beginning and message cut into two segments.
 217  - mvapich2_two_level: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 218
 219 #### MPI_Barrier
 220
 221  - default: naive one, by default
 222  - ompi: use openmpi selector for the barrier operations
 223  - mpich: use mpich selector for the barrier operations
 224  - mvapich2: use mvapich2 selector for the barrier operations
 225  - impi: use intel mpi selector for the barrier operations
 226  - automatic (experimental): use an automatic self-benchmarking algorithm
 227  - ompi_basic_linear: all processes send to root
 228  - ompi_two_procs: special case for two processes
 229  - ompi_bruck: nsteps = ceil(log2(size)), at step k, exchange data with rank-2^k and rank+2^k
 230  - ompi_recursivedoubling: recursive doubling algorithm
 231  - ompi_tree: recursive doubling type algorithm, with tree structure
 232  - ompi_doublering: double ring algorithm
 233  - mvapich2_pair: pairwise algorithm
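
As an illustration of the Bruck-style barrier listed above, here is a small sketch of its communication pattern, written independently of the SMPI code (all function names are hypothetical). Since the peer distance doubles at each round (1, 2, 4, ...), the number of rounds is the smallest k such that 2^k >= size.

```c
#include <assert.h>

/* Number of rounds: the distance doubles until it reaches size. */
int bruck_rounds(int size)
{
  assert(size > 0);
  int rounds = 0;
  for (int distance = 1; distance < size; distance <<= 1)
    rounds++;
  return rounds;
}

/* Peer that `rank` signals at round k: rank + 2^k, modulo size. */
int bruck_send_peer(int rank, int round, int size)
{
  assert(rank >= 0 && rank < size && round >= 0 && round < bruck_rounds(size));
  return (rank + (1 << round)) % size;
}

/* Peer that `rank` waits for at round k: rank - 2^k, modulo size. */
int bruck_recv_peer(int rank, int round, int size)
{
  assert(rank >= 0 && rank < size && round >= 0 && round < bruck_rounds(size));
  return (rank - (1 << round) + size) % size;
}
```

For example, 9 processes need 4 rounds, while 8 processes need only 3.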
 234
 235 #### MPI_Scatter
 236
 237  - default: naive one, by default
 238  - ompi: use openmpi selector for the scatter operations
 239  - mpich: use mpich selector for the scatter operations
 240  - mvapich2: use mvapich2 selector for the scatter operations
 241  - impi: use intel mpi selector for the scatter operations
 242  - automatic (experimental): use an automatic self-benchmarking algorithm
 243  - ompi_basic_linear: basic linear scatter
 244  - ompi_binomial: binomial tree scatter
 245  - mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 246  - mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 247
 248 #### MPI_Reduce
 249
 250  - default: naive one, by default
 251  - ompi: use openmpi selector for the reduce operations
 252  - mpich: use mpich selector for the reduce operations
 253  - mvapich2: use mvapich2 selector for the reduce operations
 254  - impi: use intel mpi selector for the reduce operations
 255  - automatic (experimental): use an automatic self-benchmarking algorithm
 256  - arrival_pattern_aware: root exchanges with the first process to arrive
 257  - binomial: uses a binomial tree
 258  - flat_tree: uses a flat tree
 259  - NTSL: Non-topology-specific pipelined linear-bcast function
 260    (0->1, 1->2, 2->3, ..., ->last node), in a pipeline fashion, with segments
 261    of 8192 bytes
 262  - scatter_gather: scatter then gather
 263  - ompi_chain: openmpi reduce algorithms are built on the same basis, but the
 264    topology is generated differently for each flavor.
 265    chain = chain with spacing of size/2, and segment size of 64KB
 266  - ompi_pipeline: same with pipeline (chain with spacing of 1), segment size
 267 depends on the communicator size and the message size
 268  - ompi_binary: same with binary tree, segment size of 32KB
 269  - ompi_in_order_binary: same with binary tree, enforcing order on the
 270 operations
 271  - ompi_binomial: same with binomial algo (redundant with default binomial
 272 one in most cases)
 273  - ompi_basic_linear: basic algorithm, each process sends to root
 274  - mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
 275  - mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 276  - rab: <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a>'s reduce algorithm
 277
 278 #### MPI_Allreduce
 279
 280  - default: naive one, by default
 281  - ompi: use openmpi selector for the allreduce operations
 282  - mpich: use mpich selector for the allreduce operations
 283  - mvapich2: use mvapich2 selector for the allreduce operations
 284  - impi: use intel mpi selector for the allreduce operations
 285  - automatic (experimental): use an automatic self-benchmarking algorithm
 286  - lr: logical ring reduce-scatter then logical ring allgather
 287  - rab1: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: reduce_scatter then allgather
 288  - rab2: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: alltoall then allgather
 289  - rab_rsag: variation of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: recursive doubling
 290 reduce_scatter then recursive doubling allgather
 291  - rdb: recursive doubling
 292  - smp_binomial: binomial tree with smp: binomial intra
 293 SMP reduce, inter reduce, inter broadcast then intra broadcast
 294  - smp_binomial_pipeline: same with segment size = 4096 bytes
 295  - smp_rdb: intra: binomial allreduce, inter: Recursive
 296 doubling allreduce, intra: binomial broadcast
 297  - smp_rsag: intra: binomial allreduce, inter: reduce-scatter,
 298 inter:allgather, intra: binomial broadcast
 299  - smp_rsag_lr: intra: binomial allreduce, inter: logical ring
 300 reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
 301  - smp_rsag_rab: intra: binomial allreduce, inter: rab
 302 reduce-scatter, rab inter:allgather, intra: binomial broadcast
 303  - redbcast: reduce then broadcast, using default or tuned algorithms if specified
 304  - ompi_ring_segmented: ring algorithm used by OpenMPI
 305  - mvapich2_rs: rdb for small messages, reduce-scatter then allgather else
 306  - mvapich2_two_level: SMP-aware algorithm, with mpich as intra algorithm, and rdb as inter (change this behavior by using the mvapich2 selector to use tuned values)
 307  - rab: default <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> implementation
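
To make the recursive doubling idea (the \c rdb entry above) more concrete, the following sketch simulates a sum-allreduce inside a single process. It is not SMPI code: the ranks are just array slots, the function name and the \c RDB_MAX_RANKS bound are invented for this example, and only power-of-two sizes are handled. At each round, rank r combines its partial result with the one of rank <tt>r XOR mask</tt>, so all ranks hold the full sum after log2(size) rounds.

```c
#include <assert.h>

#define RDB_MAX_RANKS 64 /* arbitrary bound for this sketch */

/* values[r] is the contribution of rank r. On return, every entry
 * holds the global sum. Returns the number of rounds performed. */
int rdb_allreduce_sum(long *values, int size)
{
  assert(size > 0 && size <= RDB_MAX_RANKS && (size & (size - 1)) == 0);
  long next[RDB_MAX_RANKS];
  int rounds = 0;
  for (int mask = 1; mask < size; mask <<= 1) {
    /* Every rank exchanges its partial sum with its partner. */
    for (int rank = 0; rank < size; rank++)
      next[rank] = values[rank] + values[rank ^ mask];
    for (int rank = 0; rank < size; rank++)
      values[rank] = next[rank];
    rounds++;
  }
  return rounds;
}
```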
 308
 309 #### MPI_Reduce_scatter
 310
 311  - default: naive one, by default
 312  - ompi: use openmpi selector for the reduce_scatter operations
 313  - mpich: use mpich selector for the reduce_scatter operations
 314  - mvapich2: use mvapich2 selector for the reduce_scatter operations
 315  - impi: use intel mpi selector for the reduce_scatter operations
 316  - automatic (experimental): use an automatic self-benchmarking algorithm
 317  - ompi_basic_recursivehalving: recursive halving version from OpenMPI
 318  - ompi_ring: ring version from OpenMPI
 319  - mpich_pair: pairwise exchange version from MPICH
 320  - mpich_rdb: recursive doubling version from MPICH
 321  - mpich_noncomm: only works for power of 2 procs, recursive doubling for noncommutative ops
 322
 323
 324 #### MPI_Allgather
 325
 326  - default: naive one, by default
 327  - ompi: use openmpi selector for the allgather operations
 328  - mpich: use mpich selector for the allgather operations
 329  - mvapich2: use mvapich2 selector for the allgather operations
 330  - impi: use intel mpi selector for the allgather operations
 331  - automatic (experimental): use an automatic self-benchmarking algorithm
 332  - 2dmesh: see alltoall
 333  - 3dmesh: see alltoall
 334  - bruck: Described by Bruck et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">
 335 Efficient algorithms for all-to-all communications in multiport message-passing systems</a>
 336  - GB: Gather - Broadcast (uses tuned version if specified)
 337  - loosely_lr: Logical Ring with grouping by core (hardcoded, default
 338 processes/node: 4)
 339  - NTSLR: Non Topology Specific Logical Ring
 340  - NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
 341  - pair: see alltoall
 342  - rdb: see alltoall
 343  - rhv: only power of 2 number of processes
 344  - ring: see alltoall
 345  - SMP_NTS: gather to root of each SMP, then every root of each SMP node
 346 post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 347 using logical ring algorithm (hardcoded, default processes/SMP: 8)
 348  - smp_simple: gather to root of each SMP, then every root of each SMP node
 349 post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 350 using simple algorithm (hardcoded, default processes/SMP: 8)
 351  - spreading_simple: from node i, order of communications is i -> i + 1, i ->
 352  i + 2, ..., i -> (i + p -1) % P
 353  - ompi_neighborexchange: Neighbor Exchange algorithm for allgather.
 354 Described by Chen et al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302">Performance Evaluation of Allgather Algorithms on Terascale Linux Cluster with Fast Ethernet</a>
 355  - mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
 356
 357
 358 #### MPI_Allgatherv
 359
 360  - default: naive one, by default
 361  - ompi: use openmpi selector for the allgatherv operations
 362  - mpich: use mpich selector for the allgatherv operations
 363  - mvapich2: use mvapich2 selector for the allgatherv operations
 364  - impi: use intel mpi selector for the allgatherv operations
 365  - automatic (experimental): use an automatic self-benchmarking algorithm
 366  - GB: Gatherv - Broadcast (uses tuned version if specified, but only for
 367 Bcast, gatherv is not tuned)
 368  - pair: see alltoall
 369  - ring: see alltoall
 370  - ompi_neighborexchange: see allgather
 371  - ompi_bruck: see allgather
 372  - mpich_rdb: recursive doubling algorithm from MPICH
 373  - mpich_ring: ring algorithm from MPICH, which performs differently from the one from STAR-MPI
 374
 375 #### MPI_Bcast
 376
 377  - default: naive one, by default
 378  - ompi: use openmpi selector for the bcast operations
 379  - mpich: use mpich selector for the bcast operations
 380  - mvapich2: use mvapich2 selector for the bcast operations
 381  - impi: use intel mpi selector for the bcast operations
 382  - automatic (experimental): use an automatic self-benchmarking algorithm
 383  - arrival_pattern_aware: root exchanges with the first process to arrive
 384  - arrival_pattern_aware_wait: same with slight variation
 385  - binomial_tree: binomial tree exchange
 386  - flattree: flat tree exchange
 387  - flattree_pipeline: flat tree exchange, message split into 8192 bytes pieces
 388  - NTSB: Non-topology-specific pipelined binary tree with 8192 bytes pieces
 389  - NTSL: Non-topology-specific pipelined linear with 8192 bytes pieces
 390  - NTSL_Isend: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications
 391  - scatter_LR_allgather: scatter followed by logical ring allgather
 392  - scatter_rdb_allgather: scatter followed by recursive doubling allgather
 393  - arrival_scatter: arrival pattern aware scatter-allgather
 394  - SMP_binary: binary tree algorithm with 8 cores/SMP
 395  - SMP_binomial: binomial tree algorithm with 8 cores/SMP
 396  - SMP_linear: linear algorithm with 8 cores/SMP
 397  - ompi_split_bintree: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces
 398  - ompi_pipeline: pipeline algorithm from OpenMPI, with message split in 128KB pieces
 399  - mvapich2_inter_node: Inter node default mvapich worker
 400  - mvapich2_intra_node: Intra node default mvapich worker
 401  - mvapich2_knomial_intra_node:  k-nomial intra node default mvapich worker. default factor is 4.
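
Several of the algorithms above split the message into fixed-size pieces (8192 bytes for the NTSL variants, for instance). The sketch below shows the corresponding arithmetic under an idealized model that is an assumption of this example, not something measured in SMPI: each hop of a linear chain forwards one segment per step. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed to carry message_bytes (rounded up). */
size_t segment_count(size_t message_bytes, size_t segment_bytes)
{
  assert(segment_bytes > 0);
  return (message_bytes + segment_bytes - 1) / segment_bytes;
}

/* Steps before the last process of a linear chain 0->1->...->nprocs-1
 * holds the whole message: the first segment needs nprocs-1 hops, and
 * each remaining segment arrives one step after the previous one. */
size_t linear_pipeline_steps(size_t message_bytes, size_t segment_bytes, size_t nprocs)
{
  assert(nprocs >= 2);
  size_t segments = segment_count(message_bytes, segment_bytes);
  if (segments == 0)
    return 0; /* empty message: nothing to forward */
  return segments + nprocs - 2;
}
```

In this model, a 64 KiB message cut into 8192-byte pieces reaches the end of a 4-process chain in 10 steps, compared to 3 steps of 64 KiB each without pipelining.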
 402
 403 #### Automatic evaluation
 404
 405 (Warning: This is still very experimental)
 406
 407 An automatic version is available for each collective (or even as a selector). This specific
 408 version will loop over all other implemented algorithms for this particular collective, and apply
 409 them while benchmarking the time taken for each process. It will then output the quickest for
 410 each process, and the global quickest. This is still unstable, and a few algorithms which need
 411 a specific number of nodes may crash.
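
The selection step described above could look like the following sketch. It only covers the final comparison, not the benchmarking itself, and it rests on two assumptions that are not taken from the SMPI source: the timings are stored in a row-major array, and the "global quickest" is the algorithm whose slowest rank finishes first. The function names are hypothetical.

```c
#include <assert.h>

/* times[algo * nranks + rank] is the time measured by `rank` for `algo`. */

/* Index of the quickest algorithm as seen by one given rank. */
int fastest_for_rank(const double *times, int nalgos, int nranks, int rank)
{
  assert(nalgos > 0 && rank >= 0 && rank < nranks);
  int best = 0;
  for (int algo = 1; algo < nalgos; algo++)
    if (times[algo * nranks + rank] < times[best * nranks + rank])
      best = algo;
  return best;
}

/* Index of the globally quickest algorithm (assumption: the one with
 * the smallest completion time, i.e. the smallest per-rank maximum). */
int fastest_overall(const double *times, int nalgos, int nranks)
{
  assert(nalgos > 0 && nranks > 0);
  int best = 0;
  double best_time = 0.0;
  for (int algo = 0; algo < nalgos; algo++) {
    double slowest = times[algo * nranks];
    for (int rank = 1; rank < nranks; rank++)
      if (times[algo * nranks + rank] > slowest)
        slowest = times[algo * nranks + rank];
    if (algo == 0 || slowest < best_time) {
      best = algo;
      best_time = slowest;
    }
  }
  return best;
}
```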
 412
 413 #### Adding an algorithm
 414
 415 To add a new algorithm, one should check in the src/smpi/colls folder how other algorithms
 416 are coded. Using plain MPI code inside SimGrid can't be done, so algorithms have to be
 417 changed to use the SMPI version of the calls instead (MPI_Send will become smpi_mpi_send). Some functions may have different signatures than their MPI counterparts; please check the other algorithms or contact us using the <a href="http://lists.gforge.inria.fr/mailman/listinfo/simgrid-devel">SimGrid developers mailing list</a>.
 418
 419 Example: adding a "pair" version of the Alltoall collective.
 420
 421  - Implement it in a file called alltoall-pair.c in the src/smpi/colls folder. This file should include colls_private.hpp.
 422
 423  - The name of the new algorithm function should be smpi_coll_tuned_alltoall_pair, with the same signature as MPI_Alltoall.
 424
 425  - Once the adaptation to SMPI code is done, add a reference to the file ("src/smpi/colls/alltoall-pair.c") in the SMPI_SRC part of the DefinePackages.cmake file inside buildtools/cmake, to allow the file to be built and distributed.
 426
 427  - To register the new version of the algorithm, simply add a line to the corresponding macro in src/smpi/colls/colls.h (add a "COLL_APPLY(action, COLL_ALLTOALL_SIG, pair)" to the COLL_ALLTOALLS macro). The algorithm should now be compiled and be selected when using --cfg=smpi/alltoall:pair at runtime.
 428
 429  - To add a test for the algorithm inside SimGrid's test suite, just add the new algorithm name in the ALLTOALL_COLL list found inside teshsuite/smpi/CMakeLists.txt. When running ctest, a test for the new algorithm should be generated and executed. If it does not pass, please check your code or contact us.
 430
 431  - Please submit your patch for inclusion in SMPI, for example through a pull request on GitHub or directly per email.
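
The registration step above relies on a list macro that is expanded several times with different actions. The following self-contained sketch shows how this kind of pattern works in general. It is a simplified stand-in: the \c DEMO_ names, the one-argument action and the <tt>int (int)</tt> signature are all invented for this example and do not match the actual SMPI macros, which take more parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The list of registered algorithms: adding one line here is enough. */
#define DEMO_ALLTOALLS(action) \
  action(ring)                 \
  action(pair)

/* First expansion: declare one function per registered name. */
#define DEMO_DECLARE(name) int demo_alltoall_##name(int nprocs);
DEMO_ALLTOALLS(DEMO_DECLARE)

/* Second expansion: build the table used to look algorithms up by name. */
typedef int (*demo_alltoall_fn)(int nprocs);
struct demo_entry {
  const char *name;
  demo_alltoall_fn fn;
};
#define DEMO_TABLE_ROW(name) {#name, demo_alltoall_##name},
static const struct demo_entry demo_alltoall_table[] = {DEMO_ALLTOALLS(DEMO_TABLE_ROW)};

/* Placeholder bodies: they only return the number of steps. */
int demo_alltoall_ring(int nprocs) { return nprocs - 1; }
int demo_alltoall_pair(int nprocs) { return nprocs - 1; }

/* Returns the function registered under `name`, or NULL if unknown. */
demo_alltoall_fn demo_find_alltoall(const char *name)
{
  size_t count = sizeof demo_alltoall_table / sizeof demo_alltoall_table[0];
  for (size_t i = 0; i < count; i++)
    if (strcmp(demo_alltoall_table[i].name, name) == 0)
      return demo_alltoall_table[i].fn;
  return NULL;
}
```

The benefit of this design is that the declaration and the lookup table cannot get out of sync, since both are generated from the same list.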
 432
 433 @subsubsection SMPI_use_colls_tracing Tracing of internal communications
 434
 435 By default, the collective operations are traced as a unique operation
 436 because tracing all point-to-point communications composing them could
 437 result in overloaded, hard to interpret traces. If you want to debug
 438 and compare collective algorithms, you should set the
 439 \c tracing/smpi/internals configuration item to 1 instead of 0.
 440
 441 Here are examples of two alltoall collective algorithm runs on 16 nodes,
 442 the first one with a ring algorithm, the second with a pairwise one:
 443
 444 @htmlonly
 445 <a href="smpi_simgrid_alltoall_ring_16.png" border=0><img src="smpi_simgrid_alltoall_ring_16.png" width="30%" border=0 align="center"></a>
 446 <a href="smpi_simgrid_alltoall_pair_16.png" border=0><img src="smpi_simgrid_alltoall_pair_16.png" width="30%" border=0 align="center"></a>
 447 <br/>
 448 @endhtmlonly
 449
 450 @section SMPI_what What can run within SMPI?
 451
 452 You can run unmodified MPI applications (both C/C++ and Fortran) within
 453 SMPI, provided that you only use MPI calls that we implemented. Global
 454 variables should be handled correctly on Linux systems.
 455
 456 @subsection SMPI_what_coverage MPI coverage of SMPI
 457
 458 Our coverage of the interface is very decent, but still incomplete.
 459 Given the size of the MPI standard, we may well never manage to
 460 implement absolutely all existing primitives. Currently, we have
 461 almost no support for I/O primitives, but we still pass a very large
 462 share of the MPICH coverage tests.
 463
 464 The full list of not yet implemented functions is documented in the
 465 file @ref include/smpi/smpi.h, between two lines containing the
 466 <tt>FIXME</tt> marker. If you really miss a feature, please get in
 467 touch with us: we can guide you through the SimGrid code to help you
 468 implement it, and we'd be glad to integrate your contribution to the
 469 main project afterward.
 470
 471 @subsection SMPI_what_globals Privatization of global variables
 472
 473 Concerning the globals, the problem comes from the fact that usually,
 474 MPI processes run as real UNIX processes while they are all folded
 475 into threads of a unique system process in SMPI. Global variables are
 476 usually private to each MPI process while they become shared between
 477 the processes in SMPI.  The problem and some potential solutions are
 478 discussed in this article: "Automatic Handling of Global Variables for
 479 Multi-threaded MPI Programs", available at
 480 http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf (note that this
 481 article does not deal with SMPI but with a competing solution called
 482 AMPI that suffers from the same issue).  This point used to be
 483 problematic in SimGrid, but the problem should now be handled
 484 automatically on Linux.
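
The issue can be reproduced without any MPI at all. In the sketch below, the "ranks" are simply run one after the other inside the same process, which is enough to show the effect of folding: a plain global ends up shared, while a hand-privatized one (one slot per rank) keeps the per-process semantics. This is only an illustration of the problem with made-up names, not the mechanism that SMPI uses.

```c
#include <assert.h>

#define NRANKS 4 /* number of folded ranks in this illustration */

/* A global as the application declares it: one copy per UNIX process. */
int shared_counter = 0;

/* Hand-privatized variant: one slot per rank. */
int private_counter[NRANKS];

/* Application code executed once by each rank. */
void rank_body_shared(void) { shared_counter++; }
void rank_body_private(int rank) { private_counter[rank]++; }

/* Runs every rank inside the current process, as folded ranks would. */
void run_folded_ranks(void)
{
  shared_counter = 0;
  for (int rank = 0; rank < NRANKS; rank++)
    private_counter[rank] = 0;
  for (int rank = 0; rank < NRANKS; rank++) {
    rank_body_shared();
    rank_body_private(rank);
  }
}
```

After <tt>run_folded_ranks()</tt>, each rank expects its counter to be 1, but \c shared_counter is equal to \c NRANKS. Only the privatized slots hold the expected value.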
 485
 486 Older versions of SimGrid came with a script that automatically
 487 privatized the globals through static analysis of the source code. But
 488 our implementation was not robust enough to be used in production, so
 489 it was removed at some point. Currently, SMPI comes with two
 490 privatization mechanisms that you can @ref options_smpi_privatization
 491 "select at runtime". At the time of writing (v3.18), the dlopen
 492 approach is considered to be very fast (it's used by default) while
 493 the mmap approach is considered to be rather slow but very robust.
 494
 495 With the <b>mmap approach</b>, SMPI duplicates and dynamically switches
 496 the \c .data and \c .bss segments of the ELF process when switching
 497 the MPI ranks. This allows each rank to have its own copy of the
 498 global variables.  No copy actually occurs as this mechanism uses \c
 499 mmap for efficiency. This mechanism is considered to be very robust on
 500 all systems supporting \c mmap (Linux and most BSDs). Its performance
 501 is questionable since each context switch between MPI ranks induces
 502 several syscalls to change the \c mmap that redirects the \c .data and
 503 \c .bss segments to the copies of the new rank. The code will also be
 504 copied several times in memory, inducing a slight increase of memory
 505 occupation.
 506
 507 Another limitation is that SMPI only accounts for global variables
 508 defined in the executable. If the processes use external global
 509 variables from dynamic libraries, they won't be switched
 510 correctly. The easiest way to solve this is to statically link against
 511 the library with these globals. This way, each MPI rank will get its
 512 own copy of these libraries. Of course you should never statically
 513 link against the SimGrid library itself.
 514
 515 With the <b>dlopen approach</b>, SMPI loads several copies of the same
 516 executable in memory as if it were a library, so that the global
 517 variables get naturally duplicated. It first requires the executable
 518 to be compiled as a relocatable binary, which is less common for
 519 programs than for libraries. But most distributions are now compiled
 520 this way for security reasons, as it allows randomizing the address
 521 space layout. It should thus be safe to compile most (any?) program
 522 this way.  The second trick is that the dynamic linker refuses to link
 523 the exact same file several times, be it a library or a relocatable
 524 executable. It makes perfect sense in the general case, but we need
 525 to circumvent this rule of thumb in our case. To that end, the
 526 binary is copied in a temporary file before being re-linked against.
 527 `dlmopen()` cannot be used as it only allows 256 contexts, and as it
 528 would also duplicate SimGrid itself.
 529
 530 This approach greatly speeds up the context switching, down to about
 531 40 CPU cycles with our raw contexts, instead of requiring several
 532 syscalls with the \c mmap approach. Another advantage is that it
 533 allows running the SMPI contexts in parallel, which is obviously not
 534 possible with the \c mmap approach. It was tricky to implement, but we
 535 are not aware of any flaws, so smpirun activates it by default.
 536
 537 In the future, it may be possible to further reduce the memory and
 538 disk consumption. It seems that we could <a
 539 href="https://lwn.net/Articles/415889/">punch holes</a> in the files
 540 before dl-loading them to remove the code and constants, and mmap
 541 these areas onto a unique copy. If done correctly, this would reduce
 542 the disk and memory usage to the bare minimum, and would also reduce
 543 the pressure on the CPU instruction cache. See
 544 <a href="https://github.com/simgrid/simgrid/issues/137">the relevant
 545 bug</a> on GitHub for implementation leads.\n
 546
 547 Also, currently, only the binary is copied and dlopen-ed for each MPI
 548 rank. We could probably extend this to external dependencies, but for
 549 now, any external dependencies must be statically linked into your
 550 application. As usual, SimGrid itself shall never be statically linked
 551 in your app. You don't want to give a copy of SimGrid to each MPI rank:
 552 that's way too much for them to deal with.
 553
 554 @section SMPI_adapting Adapting your MPI code for further scalability
 555
 556 As detailed in the reference article (available at
 557 http://hal.inria.fr/hal-01415484), you may want to adapt your code
 558 to improve the simulation performance. But these tricks may seriously
 559 hinder the result quality (or even prevent the app from running) if used
 560 wrongly. We assume that if you want to simulate an HPC application,
 561 you know what you are doing. Don't prove us wrong!
 562
 563 @subsection SMPI_adapting_size Reducing your memory footprint
 564
 565 If you get short on memory (the whole app is executed on a single node when
 566 simulated), you should have a look at the SMPI_SHARED_MALLOC and
 567 SMPI_SHARED_FREE macros. They allow sharing memory areas between processes:
 568 the same malloc line executed by each process will point to the exact same
 569 memory area. So if you have a malloc of 2M and you have 16 processes, these
 570 macros will change your memory consumption from 2M*16 to 2M only: a single
 571 block for all processes.
 572
 573 If your program is ok with a block containing garbage values because all
 574 processes write and read to the same place without any kind of coordination,
 575 then this macro can dramatically shrink your memory consumption. For example,
 576 that will be very beneficial to a matrix multiplication code, as all blocks will
 577 be stored on the same area. Of course, the resulting computations will be useless,
 578 but you can still study the application behavior this way.
 579
 580 Naturally, this won't work if your code is data-dependent. For example, a Jacobi
 581 iterative computation depends on the result computed by the code to detect
 582 convergence conditions, so turning them into garbage by sharing the same memory
 583 area between processes does not seem very wise. You cannot use the
 584 SMPI_SHARED_MALLOC macro in this case, sorry.
 585
 586 This feature is demoed by the example file
 587 <tt>examples/smpi/NAS/dt.c</tt>.
 588
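As an illustration, here is a minimal sketch of a shared allocation in C.
The helper name `touch_work_buffer`, the `USE_SMPI` switch and the plain
`malloc`/`free` fallbacks are assumptions made for this example only: they
are not part of SMPI, and the fallbacks merely let the file build with a
regular C compiler, where nothing is shared.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifdef USE_SMPI // hypothetical switch: define it when building with smpicc
#include <smpi/smpi.h>
#endif

// Assumption of this sketch: outside of SMPI, fall back to the regular
// allocator so that the example remains buildable and testable.
#ifndef SMPI_SHARED_MALLOC
#define SMPI_SHARED_MALLOC(size) malloc(size)
#define SMPI_SHARED_FREE(ptr) free(ptr)
#endif

// Hypothetical helper: allocates a work buffer, touches it and releases it.
// Returns the number of bytes obtained, or 0 if the allocation failed.
// Under SMPI, every rank reaching this allocation line gets the same block,
// so the buffer content must be considered as garbage.
size_t touch_work_buffer(size_t bytes)
{
  unsigned char *buffer = SMPI_SHARED_MALLOC(bytes);
  if (buffer == NULL)
    return 0;
  memset(buffer, 0, bytes); // stands for the real computation kernel
  SMPI_SHARED_FREE(buffer);
  return bytes;
}
```

Calling `touch_work_buffer(2 * 1024 * 1024)` on 16 ranks corresponds to the
2M example above: the memory is expected to be allocated once under SMPI,
instead of 16 times.
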
 589 @subsection SMPI_adapting_speed Toward faster simulations
 590
 591 If your application is too slow, try using SMPI_SAMPLE_LOCAL,
 592 SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
 593 be sampled. Some of the loop iterations will be executed to measure
 594 their duration, and this duration will be used for the subsequent
 595 iterations. These samples are done per processor with
 596 SMPI_SAMPLE_LOCAL, and shared between all processors with
 597 SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
 598 time of your loop iterations is not stable.
 599
 600 This feature is demoed by the example file
 601 <tt>examples/smpi/NAS/ep.c</tt>.
 602
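The principle of sampling can be sketched in plain C as follows. This is
not how the SMPI macros are implemented: `estimate_loop_duration` is a
hypothetical helper, and using the mean of the measured iterations for the
skipped ones is an assumption of this illustration.

```c
#include <stddef.h>

// Hypothetical helper, not part of SMPI: estimates the duration of a loop
// of `total` iterations when only the first `sampled` ones were timed.
// `measured` holds the timings (in seconds) of the sampled iterations.
// Returns a negative value when there is nothing to extrapolate from.
double estimate_loop_duration(const double *measured, size_t sampled, size_t total)
{
  if (sampled == 0 || sampled > total)
    return -1.0;

  double sum = 0.0;
  for (size_t i = 0; i < sampled; i++)
    sum += measured[i];

  double mean = sum / (double)sampled;
  // Executed iterations cost their real time, skipped ones cost the mean.
  return sum + mean * (double)(total - sampled);
}
```

With 4 sampled iterations of 2 seconds each and 10 iterations in total, the
estimate is 8 + 2 * 6 = 20 seconds. If the iteration times vary a lot, the
mean is a poor predictor, which is why unstable loops should not be sampled.
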
 603 @section SMPI_accuracy Ensuring accurate simulations
 604
 605 Out of the box, SimGrid may give you fairly accurate results, but
 606 there are plenty of factors that could go wrong and make your results
 607 inaccurate or even plainly wrong. Actually, you can only get accurate
 608 results from a carefully built model, including both the system hardware
 609 and your application. Such models are hard to pass over and reuse in
 610 other settings, because elements that are not relevant to an
 611 application (say, the latency of point-to-point communications,
 612 collective operation implementation details or CPU-network
 613 interaction) may be very relevant to another application. The dream of
 614 the perfect model, encompassing every aspect, is only a chimera, as
 615 the only perfect model of the reality is the reality. If you go for
 616 simulation, then you have to ignore some irrelevant aspects of the
 617 reality, but which aspects are irrelevant is actually
 618 application-dependent...
 619
 620 The only way to assess whether your settings provide accurate results
 621 is to double-check these results. If possible, you should first run
 622 the same experiment in simulation and in real life, gathering as much
 623 information as you can. Try to understand the discrepancies in the
 624 results that you observe between both settings (visualization can be
 625 precious for that). Then, try to modify your model (of the platform,
 626 of the collective operations) to reduce the most prominent differences.
 627
 628 If the discrepancies come from the computing time, try adapting the \c
 629 smpi/host-speed option: reduce it if your simulation runs faster than in
 630 reality. If the error comes from the communication, then you need to
 631 fiddle with your platform file.
 632
 633 Be inventive in your modeling. Don't be afraid if the names given by
 634 SimGrid do not match the real names: we got very good results by
 635 modeling multicore/GPU machines with a set of separate hosts
 636 interconnected with very fast networks (but don't trust your model
 637 because it has the right names in the right place either).
 638
 639 Finally, you may want to check [this
 640 article](https://hal.inria.fr/hal-00907887) on the classical pitfalls
 641 in modeling distributed systems.
 642
 643 @section SMPI_troubleshooting Troubleshooting with SMPI
 644
 645 @subsection SMPI_trouble_configure_refuses_smpicc ./configure refuses to use smpicc
 646
 647 If your <tt>./configure</tt> reports that the compiler is not
 648 functional or that you are cross-compiling, try to define the
 649 <tt>SMPI_PRETEND_CC</tt> environment variable before running the
 650 configuration.
 651
 652 @verbatim
 653 SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
 654 make
 655 @endverbatim
 656
 657 Indeed, the programs compiled with <tt>smpicc</tt> cannot be executed
 658 without <tt>smpirun</tt> (they are shared libraries, and they do weird
 659 things on startup), while configure wants to test them directly.
 660 With <tt>SMPI_PRETEND_CC</tt>, smpicc does not compile as shared,
 661 and the SMPI initialization stops and returns 0 before doing anything
 662 that would fail without <tt>smpirun</tt>.
 663
 664 \warning
 665
 666   Make sure that SMPI_PRETEND_CC is only set when calling ./configure,
 667   not during the actual execution, or any program compiled with smpicc
 668   will stop before starting.
 669
 670 @subsection SMPI_trouble_configure_dont_find_smpicc ./configure does not pick smpicc as a compiler
 671
 672 In addition to the previous answers, some projects also need to be
 673 explicitly told what compiler to use, as follows:
 674
 675 @verbatim
 676 SMPI_PRETEND_CC=1 ./configure CC=smpicc # here come the other configure parameters
 677 make
 678 @endverbatim
 679
 680 Maybe your configure is using another variable, such as <tt>cc</tt> or
 681 similar. Just check the logs.
 682
 683 @subsection SMPI_trouble_useconds_t  error: unknown type name 'useconds_t'
 684
 685 Try to add <tt>-D_GNU_SOURCE</tt> to your compilation line to get rid
 686 of that error.
 687
 688 The reason is that SMPI provides its own version of <tt>usleep(3)</tt>
 689 to override it and to block in the simulation world, not in the real
 690 one. It needs the <tt>useconds_t</tt> type for that, which is declared
 691 only if you define <tt>_GNU_SOURCE</tt> before including
 692 <tt>unistd.h</tt>. If your project includes that header file before
 693 SMPI, then you need to ensure that you pass the right configuration
 694 defines as advised above.
 695
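If you prefer to fix this in the source rather than on the compilation
line, the following sketch shows the same idea. The helper
`ms_to_useconds` is hypothetical and only serves to use the type; the
important part is that the define comes before the first system header.

```c
// Must be defined before the first system header is included, which is
// what -D_GNU_SOURCE on the compilation line guarantees.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>

// Hypothetical helper: converts milliseconds into the useconds_t value
// that usleep(3) expects. It only compiles if useconds_t is declared.
useconds_t ms_to_useconds(unsigned int ms)
{
  return (useconds_t)ms * 1000u;
}
```

The guard around the define avoids a redefinition warning when
<tt>-D_GNU_SOURCE</tt> is also passed on the compilation line.
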
 696
 697 */
 698
 699
 700 /** @example include/smpi/smpi.h */