docs/source/app_smpi.rst

   1 .. _SMPI_doc:
   2
   3 ===============================
   4 SMPI: Simulate MPI Applications
   5 ===============================
   6
   7 .. raw:: html
   8
   9    <object id="TOC" data="graphical-toc.svg" width="100%" type="image/svg+xml"></object>
  10    <script>
  11    window.onload=function() { // Wait for the SVG to be loaded before changing it
  12      var elem=document.querySelector("#TOC").contentDocument.getElementById("SMPIBox")
  13      elem.style="opacity:0.93999999;fill:#ff0000;fill-opacity:0.1";
  14    }
  15    </script>
  16    <br/>
  17    <br/>
  18
  19 SMPI enables the study of MPI applications by emulating them on top of
  20 the SimGrid simulator. This is particularly useful for studying
  21 existing MPI applications within the comfort of the simulator.
  22
  23 To get started with SMPI, you should head to `the SMPI tutorial
  24 <usecase_smpi>`_. You may also want to read the `SMPI reference
  25 article <https://hal.inria.fr/hal-01415484>`_ or these `introductory
  26 slides <http://simgrid.org/tutorials/simgrid-smpi-101.pdf>`_.  If you
  27 are new to MPI, you should first take our online `SMPI CourseWare
  28 <https://simgrid.github.io/SMPI_CourseWare/>`_. It consists of several
  29 projects that progressively introduce the MPI concepts. It uses
  30 SimGrid and SMPI to run the experiments, but the learning
  31 objectives are centered on MPI itself.
  32
  33 Our goal is to enable the study of **unmodified MPI applications**.
  34 Some constructs and features are still missing, but we can probably
  35 add them on demand.  If you have used MPI before, SMPI should feel
  36 very familiar: use ``smpicc`` instead of ``mpicc``, and ``smpirun``
  37 instead of ``mpirun``. The main difference is that ``smpirun`` takes a
  38 :ref:`simulated platform <platform>` as an extra parameter.
  39
  40 For **further scalability**, you may modify your code to speed up your
  41 studies or save memory space.  Maximal **simulation accuracy**
  42 requires some specific care from you.
  43
  44 ----------
  45 Using SMPI
  46 ----------
  47
  48 ...................
  49 Compiling your Code
  50 ...................
  51
  52 If your application is in C, then simply use ``smpicc`` as a
  53 compiler, just like you use ``mpicc`` with other MPI implementations. This
  54 script still calls your default compiler (gcc, clang, ...) and adds
  55 the right compilation flags along the way. If your application is in
  56 C++, Fortran 77 or Fortran 90, use ``smpicxx``, ``smpiff`` or
  57 ``smpif90`` respectively.
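
The choice of wrapper can be summarized as a small shell helper. This is
only a sketch: the function name ``smpi_compiler_for`` and the file
extensions it recognizes are assumptions made for illustration. Only the
four wrapper names come from SMPI.

.. code-block:: shell

   # Hypothetical helper: print the SMPI compiler wrapper matching a source file.
   # The extension-to-language mapping is an assumption, adapt it to your project.
   smpi_compiler_for() {
     case "$1" in
       *.c)              echo smpicc ;;
       *.cpp|*.cc|*.cxx) echo smpicxx ;;
       *.f|*.f77)        echo smpiff ;;
       *.f90)            echo smpif90 ;;
       *) echo "unknown source type: $1" >&2; return 1 ;;
     esac
   }

   smpi_compiler_for my_app.c     # prints "smpicc"
   smpi_compiler_for solver.f90   # prints "smpif90"

You would then compile with, for example,
``$(smpi_compiler_for my_app.c) -O2 my_app.c -o my_app``, passing the same
flags as to your usual compiler.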
  58
  59 ....................
  60 Simulating your Code
  61 ....................
  62
  63 Use the ``smpirun`` script as follows:
  64
  65 .. code-block:: shell
  66
  67    smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
  68
  69 - ``my_hostfile.txt`` is a classical MPI hostfile (that is, this file
  70   lists the machines on which the processes must be dispatched, one
  71   per line)
  72 - ``my_platform.xml`` is a classical SimGrid platform file. Of course,
  73   the hosts of the hostfile must exist in the provided platform.
  74 - ``./program`` is the MPI program to simulate, that you compiled with ``smpicc``
  75 - ``-blah`` is a command-line parameter passed to this program.
  76
  77 ``smpirun`` accepts other parameters, such as ``-np`` if you don't
  78 want to use all the hosts defined in the hostfile, ``-map`` to display
  79 on which host each rank gets mapped, or ``-trace`` to activate
  80 tracing during the simulation. You can get the full list by running
  81 ``smpirun -help``.
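
As an illustration, the following sketch writes a four-host hostfile and
starts only two ranks on it. The host names, ``my_platform.xml`` and
``./program`` are placeholders: the hosts must be declared in your own
platform file. The launch is guarded so that the snippet does nothing on a
machine without SimGrid.

.. code-block:: shell

   # Write a hostfile with four hypothetical hosts, one per line.
   for i in 1 2 3 4; do
     echo "node-$i.example.org"
   done > my_hostfile.txt

   # Use only 2 of the 4 hosts and display the rank-to-host mapping.
   if command -v smpirun >/dev/null 2>&1; then
     smpirun -np 2 -map -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
   fi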
  82
  83 ...............................
  84 Debugging your Code within SMPI
  85 ...............................
  86
  87 If you want to explore the automatic platform and deployment files
  88 that are generated by ``smpirun``, add ``-keep-temps`` to the command
  89 line.
  90
  91 You can also run your simulation within Valgrind or GDB using the
  92 following commands. Once in GDB, each MPI rank is represented as
  93 a regular thread, and you can explore the state of each of them as
  94 usual.
  95
  96 .. code-block:: shell
  97
  98    smpirun -wrapper valgrind ...other args...
  99    smpirun -wrapper "gdb --args" --cfg=contexts/factory:thread ...other args...
 100
 101 ................................
 102 Simulating Collective Operations
 103 ................................
 104
 105 MPI collective operations are crucial to the performance of MPI
 106 applications and must be carefully optimized according to many
 107 parameters. Every existing implementation provides several algorithms
 108 for each collective operation, and selects by default the best suited
 109 one, depending on the sizes sent, the number of nodes, the
 110 communicator, or the communication library being used.  These
 111 decisions are based on empirical results and theoretical complexity
 112 estimation, and are very different between MPI implementations. In
 113 most cases, the users can also manually tune the algorithm used for
 114 each collective operation.
 115
 116 SMPI can simulate the behavior of several MPI implementations:
 117 OpenMPI, MPICH, `STAR-MPI <http://star-mpi.sourceforge.net/>`_, and
 118 MVAPICH2. For that, it provides 115 collective algorithms and several
 119 selector algorithms, that were collected directly in the source code
 120 of the targeted MPI implementations.
 121
 122 You can switch the automatic selector through the
 123 ``smpi/coll-selector`` configuration item. Possible values:
 124
 125  - **ompi**: default selection logic of OpenMPI (version 3.1.2)
 126  - **mpich**: default selection logic of MPICH (version 3.3b)
 127  - **mvapich2**: selection logic of MVAPICH2 (version 1.9) tuned
 128    on the Stampede cluster
 129  - **impi**: preliminary version of an Intel MPI selector (version
 130    4.1.3, also tuned for the Stampede cluster). Due to the closed-source
 131    nature of Intel MPI, some of the algorithms described in the
 132    documentation are not available, and are replaced by MVAPICH ones.
 133  - **default**: legacy algorithms used in the earlier days of
 134    SimGrid. Do not use for serious performance studies.
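
To compare how the selectors affect a given application, one possible
approach is to run the same simulation once per selector. The sketch below
only prints the command lines. The hostfile, platform and program names are
placeholders, and the ``selector_cmd`` helper is made up for this example.

.. code-block:: shell

   # Hypothetical helper: print the smpirun command line for a given selector.
   selector_cmd() {
     echo "smpirun -hostfile my_hostfile.txt -platform my_platform.xml --cfg=smpi/coll-selector:$1 ./program"
   }

   for selector in ompi mpich mvapich2 impi; do
     selector_cmd "$selector"
   done

Piping this output to ``sh`` would run the four simulations in sequence.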
 135
 136 .. todo:: default should not even exist.
 137
 138 ....................
 139 Available Algorithms
 140 ....................
 141
 142 You can also pick the algorithm used for each collective with the
 143 corresponding configuration item. For example, to use the pairwise
 144 alltoall algorithm, one should add ``--cfg=smpi/alltoall:pair`` to the
 145 command line. This overrides the selector (if any) for this collective:
 146 the algorithm you pick is used whatever the selector would have chosen.
 147
 148 .. Warning:: Some collectives may require specific conditions to be
 149    executed correctly (for instance having a communicator with a power
 150    of two number of nodes only), which are currently not enforced by
 151    SimGrid. Some crashes can be expected while trying these algorithms
 152    with unusual sizes/parameters.
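
Since such conditions are not checked for you, it can be worth testing them
in your launch script. The following sketch picks the ``pair`` alltoall
algorithm only when the number of ranks is a power of two. Falling back to
``ring`` otherwise is an arbitrary choice for this example, and
``is_power_of_two`` is a helper defined here, not part of SMPI.

.. code-block:: shell

   # Succeeds if $1 is a positive power of two (a single bit set).
   is_power_of_two() {
     [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
   }

   np=16   # number of MPI ranks (example value)
   if is_power_of_two "$np"; then
     algo=pair
   else
     algo=ring   # arbitrary fallback for this example
   fi
   echo "--cfg=smpi/alltoall:$algo"   # prints "--cfg=smpi/alltoall:pair" for np=16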
 153
 154 MPI_Alltoall
 155 ^^^^^^^^^^^^
 156
 157 Most of these are best described in `STAR-MPI <http://www.cs.arizona.edu/~dkl/research/papers/ics06.pdf>`_.
 158
 159  - default: naive one, by default
 160  - ompi: use openmpi selector for the alltoall operations
 161  - mpich: use mpich selector for the alltoall operations
 162  - mvapich2: use mvapich2 selector for the alltoall operations
 163  - impi: use intel mpi selector for the alltoall operations
 164  - automatic (experimental): use an automatic self-benchmarking algorithm
 165  - bruck: Described by Bruck et al. in `this paper <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949>`_
 166  - 2dmesh: organizes the nodes as a two dimensional mesh, and performs allgather
 167    along the dimensions
 168  - 3dmesh: adds a third dimension to the previous algorithm
 169  - rdb: recursive doubling: extends the mesh to an nth dimension, each one
 170    containing two nodes
 171  - pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
 172    each process sends and receives from the same process at each step
 173  - pair_light_barrier: same, with small barriers between steps to avoid
 174    contention
 175  - pair_mpi_barrier: same, with MPI_Barrier used
 176  - pair_one_barrier: only one barrier at the beginning
 177  - ring: size-1 steps, at each step a process sends to process (n+i)%size, and receives from (n-i)%size
 178  - ring_light_barrier: same, with small barriers between some phases to avoid contention
 179  - ring_mpi_barrier: same, with MPI_Barrier used
 180  - ring_one_barrier: only one barrier at the beginning
 181  - basic_linear: posts all receives and all sends,
 182    starts the communications, and waits for all communications to finish
 183  - mvapich2_scatter_dest: isend/irecv with scattered destinations, posting only a few messages at the same time
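
The ring schedule can be checked by hand with a few lines of shell. This
sketch prints the partners of one rank at each step, using the formulas
given in the ``ring`` entry. ``size`` and ``n`` are example values. Note
that ``size`` is added before taking the modulo of ``n-i``, because shell
arithmetic would otherwise yield a negative remainder.

.. code-block:: shell

   size=4   # number of processes (example value)
   n=1      # rank whose schedule is printed (example value)
   i=1
   while [ "$i" -lt "$size" ]; do
     to=$(( (n + i) % size ))
     from=$(( (n - i + size) % size ))
     echo "step $i: send to $to, receive from $from"
     i=$(( i + 1 ))
   done
   # prints:
   #   step 1: send to 2, receive from 0
   #   step 2: send to 3, receive from 3
   #   step 3: send to 0, receive from 2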
 184
 185 MPI_Alltoallv
 186 ^^^^^^^^^^^^^
 187  - default: naive one, by default
 188  - ompi: use openmpi selector for the alltoallv operations
 189  - mpich: use mpich selector for the alltoallv operations
 190  - mvapich2: use mvapich2 selector for the alltoallv operations
 191  - impi: use intel mpi selector for the alltoallv operations
 192  - automatic (experimental): use an automatic self-benchmarking algorithm
 193  - bruck: same as alltoall
 194  - pair: same as alltoall
 195  - pair_light_barrier: same as alltoall
 196  - pair_mpi_barrier: same as alltoall
 197  - pair_one_barrier: same as alltoall
 198  - ring: same as alltoall
 199  - ring_light_barrier: same as alltoall
 200  - ring_mpi_barrier: same as alltoall
 201  - ring_one_barrier: same as alltoall
 202  - ompi_basic_linear: same as alltoall
 203
 204 MPI_Gather
 205 ^^^^^^^^^^
 206
 207  - default: naive one, by default
 208  - ompi: use openmpi selector for the gather operations
 209  - mpich: use mpich selector for the gather operations
 210  - mvapich2: use mvapich2 selector for the gather operations
 211  - impi: use intel mpi selector for the gather operations
 212  - automatic (experimental): use an automatic self-benchmarking algorithm which will iterate over all implemented versions and output the best
 213  - ompi_basic_linear: basic linear algorithm from openmpi, each process sends to the root
 214  - ompi_binomial: binomial tree algorithm
 215  - ompi_linear_sync: same as basic linear, but with a synchronization at the
 216    beginning and message cut into two segments.
 217  - mvapich2_two_level: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 218
 219 MPI_Barrier
 220 ^^^^^^^^^^^
 221
 222  - default: naive one, by default
 223  - ompi: use openmpi selector for the barrier operations
 224  - mpich: use mpich selector for the barrier operations
 225  - mvapich2: use mvapich2 selector for the barrier operations
 226  - impi: use intel mpi selector for the barrier operations
 227  - automatic (experimental): use an automatic self-benchmarking algorithm
 228  - ompi_basic_linear: all processes send to root
 229  - ompi_two_procs: special case for two processes
 230  - ompi_bruck: nsteps = sqrt(size), at each step, exchange data with rank-2^k and rank+2^k
 231  - ompi_recursivedoubling: recursive doubling algorithm
 232  - ompi_tree: recursive doubling type algorithm, with tree structure
 233  - ompi_doublering: double ring algorithm
 234  - mvapich2_pair: pairwise algorithm
 235  - mpich_smp: barrier intra-node, then inter-node
 236
 237 MPI_Scatter
 238 ^^^^^^^^^^^
 239
 240  - default: naive one, by default
 241  - ompi: use openmpi selector for the scatter operations
 242  - mpich: use mpich selector for the scatter operations
 243  - mvapich2: use mvapich2 selector for the scatter operations
 244  - impi: use intel mpi selector for the scatter operations
 245  - automatic (experimental): use an automatic self-benchmarking algorithm
 246  - ompi_basic_linear: basic linear scatter
 247  - ompi_binomial: binomial tree scatter
 248  - mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 249  - mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 250
 251 MPI_Reduce
 252 ^^^^^^^^^^
 253
 254  - default: naive one, by default
 255  - ompi: use openmpi selector for the reduce operations
 256  - mpich: use mpich selector for the reduce operations
 257  - mvapich2: use mvapich2 selector for the reduce operations
 258  - impi: use intel mpi selector for the reduce operations
 259  - automatic (experimental): use an automatic self-benchmarking algorithm
 260  - arrival_pattern_aware: root exchanges with the first process to arrive
 261  - binomial: uses a binomial tree
 262  - flat_tree: uses a flat tree
 263  - NTSL: Non-topology-specific pipelined linear-bcast function
 264    (0->1, 1->2, 2->3, ..., ->last node), in a pipeline fashion, with segments
 265    of 8192 bytes
 266  - scatter_gather: scatter then gather
 267  - ompi_chain: openmpi reduce algorithms are built on the same basis, but the
 268    topology is generated differently for each flavor
 269    chain = chain with spacing of size/2, and segment size of 64KB
 270  - ompi_pipeline: same with pipeline (chain with spacing of 1), segment size
 271    depends on the communicator size and the message size
 272  - ompi_binary: same with binary tree, segment size of 32KB
 273  - ompi_in_order_binary: same with binary tree, enforcing order on the
 274    operations
 275  - ompi_binomial: same with binomial algo (redundant with default binomial
 276    one in most cases)
 277  - ompi_basic_linear: basic algorithm, each process sends to root
 278  - mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
 279  - mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
 280  - rab: `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_'s reduce algorithm
 281
 282 MPI_Allreduce
 283 ^^^^^^^^^^^^^
 284
 285  - default: naive one, by default
 286  - ompi: use openmpi selector for the allreduce operations
 287  - mpich: use mpich selector for the allreduce operations
 288  - mvapich2: use mvapich2 selector for the allreduce operations
 289  - impi: use intel mpi selector for the allreduce operations
 290  - automatic (experimental): use an automatic self-benchmarking algorithm
 291  - lr: logical ring reduce-scatter then logical ring allgather
 292  - rab1: variations of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: reduce_scatter then allgather
 293  - rab2: variations of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: alltoall then allgather
 294  - rab_rsag: variation of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: recursive doubling
 295    reduce_scatter then recursive doubling allgather
 296  - rdb: recursive doubling
 297  - smp_binomial: binomial tree with smp: binomial intra
 298    SMP reduce, inter reduce, inter broadcast then intra broadcast
 299  - smp_binomial_pipeline: same with segment size = 4096 bytes
 300  - smp_rdb: intra: binomial allreduce, inter: Recursive
 301    doubling allreduce, intra: binomial broadcast
 302  - smp_rsag: intra: binomial allreduce, inter: reduce-scatter,
 303    inter:allgather, intra: binomial broadcast
 304  - smp_rsag_lr: intra: binomial allreduce, inter: logical ring
 305    reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
 306  - smp_rsag_rab: intra: binomial allreduce, inter: rab
 307    reduce-scatter, rab inter:allgather, intra: binomial broadcast
 308  - redbcast: reduce then broadcast, using default or tuned algorithms if specified
 309  - ompi_ring_segmented: ring algorithm used by OpenMPI
 310  - mvapich2_rs: rdb for small messages, reduce-scatter then allgather else
 311  - mvapich2_two_level: SMP-aware algorithm, with mpich as intra algorithm, and rdb as inter (Change this behavior by using mvapich2 selector to use tuned values)
 312  - rab: default `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ implementation
 313
 314 MPI_Reduce_scatter
 315 ^^^^^^^^^^^^^^^^^^
 316
 317  - default: naive one, by default
 318  - ompi: use openmpi selector for the reduce_scatter operations
 319  - mpich: use mpich selector for the reduce_scatter operations
 320  - mvapich2: use mvapich2 selector for the reduce_scatter operations
 321  - impi: use intel mpi selector for the reduce_scatter operations
 322  - automatic (experimental): use an automatic self-benchmarking algorithm
 323  - ompi_basic_recursivehalving: recursive halving version from OpenMPI
 324  - ompi_ring: ring version from OpenMPI
 325  - mpich_pair: pairwise exchange version from MPICH
 326  - mpich_rdb: recursive doubling version from MPICH
 327  - mpich_noncomm: only works for power of 2 procs, recursive doubling for noncommutative ops
 328
 329
 330 MPI_Allgather
 331 ^^^^^^^^^^^^^
 332
 333  - default: naive one, by default
 334  - ompi: use openmpi selector for the allgather operations
 335  - mpich: use mpich selector for the allgather operations
 336  - mvapich2: use mvapich2 selector for the allgather operations
 337  - impi: use intel mpi selector for the allgather operations
 338  - automatic (experimental): use an automatic self-benchmarking algorithm
 339  - 2dmesh: see alltoall
 340  - 3dmesh: see alltoall
 341  - bruck: Described by Bruck et al. in `Efficient algorithms for all-to-all communications in multiport message-passing systems
 342    <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949>`_
 343  - GB: Gather - Broadcast (uses tuned version if specified)
 344  - loosely_lr: Logical Ring with grouping by core (hardcoded, default
 345    processes/node: 4)
 346  - NTSLR: Non Topology Specific Logical Ring
 347  - NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
 348  - pair: see alltoall
 349  - rdb: see alltoall
 350  - rhv: only power of 2 number of processes
 351  - ring: see alltoall
 352  - SMP_NTS: gather to root of each SMP, then every root of each SMP node
 353    post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 354    using logical ring algorithm (hardcoded, default processes/SMP: 8)
 355  - smp_simple: gather to root of each SMP, then every root of each SMP node
 356    post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
 357    using simple algorithm (hardcoded, default processes/SMP: 8)
 358  - spreading_simple: from node i, order of communications is i -> i + 1, i ->
 359    i + 2, ..., i -> (i + p -1) % P
 360  - ompi_neighborexchange: Neighbor Exchange algorithm for allgather.
 361    Described by Chen et al. in `Performance Evaluation of Allgather
 362    Algorithms on Terascale Linux Cluster with Fast Ethernet <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302>`_
 363  - mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
 364
 365 MPI_Allgatherv
 366 ^^^^^^^^^^^^^^
 367
 368  - default: naive one, by default
 369  - ompi: use openmpi selector for the allgatherv operations
 370  - mpich: use mpich selector for the allgatherv operations
 371  - mvapich2: use mvapich2 selector for the allgatherv operations
 372  - impi: use intel mpi selector for the allgatherv operations
 373  - automatic (experimental): use an automatic self-benchmarking algorithm
 374  - GB: Gatherv - Broadcast (uses tuned version if specified, but only for Bcast, gatherv is not tuned)
 375  - pair: see alltoall
 376  - ring: see alltoall
 377  - ompi_neighborexchange: see allgather
 378  - ompi_bruck: see allgather
 379  - mpich_rdb: recursive doubling algorithm from MPICH
 380  - mpich_ring: ring algorithm from MPICH - performs differently from the one from STAR-MPI
 381
 382 MPI_Bcast
 383 ^^^^^^^^^
 384
 385  - default: naive one, by default
 386  - ompi: use openmpi selector for the bcast operations
 387  - mpich: use mpich selector for the bcast operations
 388  - mvapich2: use mvapich2 selector for the bcast operations
 389  - impi: use intel mpi selector for the bcast operations
 390  - automatic (experimental): use an automatic self-benchmarking algorithm
 391  - arrival_pattern_aware: root exchanges with the first process to arrive
 392  - arrival_pattern_aware_wait: same with slight variation
 393  - binomial_tree: binomial tree exchange
 394  - flattree: flat tree exchange
 395  - flattree_pipeline: flat tree exchange, message split into 8192 bytes pieces
 396  - NTSB: Non-topology-specific pipelined binary tree with 8192 bytes pieces
 397  - NTSL: Non-topology-specific pipelined linear with 8192 bytes pieces
 398  - NTSL_Isend: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications
 399  - scatter_LR_allgather: scatter followed by logical ring allgather
 400  - scatter_rdb_allgather: scatter followed by recursive doubling allgather
 401  - arrival_scatter: arrival pattern aware scatter-allgather
 402  - SMP_binary: binary tree algorithm with 8 cores/SMP
 403  - SMP_binomial: binomial tree algorithm with 8 cores/SMP
 404  - SMP_linear: linear algorithm with 8 cores/SMP
 405  - ompi_split_bintree: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces
 406  - ompi_pipeline: pipeline algorithm from OpenMPI, with message split in 128KB pieces
 407  - mvapich2_inter_node: Inter node default mvapich worker
 408  - mvapich2_intra_node: Intra node default mvapich worker
 409  - mvapich2_knomial_intra_node:  k-nomial intra node default mvapich worker. default factor is 4.
 410
 411 Automatic Evaluation
 412 ^^^^^^^^^^^^^^^^^^^^
 413
 414 .. warning:: This is still very experimental.
 415
 416 An automatic version is available for each collective (or even as a selector). This specific
 417 version loops over all other algorithms implemented for this particular collective, and applies
 418 them while benchmarking the time taken for each process. It then outputs the quickest for
 419 each process, and the global quickest. This is still unstable, and a few algorithms that need
 420 a specific number of nodes may crash.
 421
 422 Adding an algorithm
 423 ^^^^^^^^^^^^^^^^^^^
 424
 425 To add a new algorithm, one should check in the src/smpi/colls folder
 426 how other algorithms are coded. Using plain MPI code inside SimGrid
 427 can't be done, so algorithms have to be changed to use the SMPI version of
 428 the calls instead (MPI_Send will become smpi_mpi_send). Some functions
 429 may have different signatures than their MPI counterparts; please check
 430 the other algorithms or contact us using the `SimGrid
 431 developers mailing list <http://lists.gforge.inria.fr/mailman/listinfo/simgrid-devel>`_.
 432
 433 Example: adding a "pair" version of the Alltoall collective.
 434
 435  - Implement it in a file called alltoall-pair.c in the src/smpi/colls folder. This file should include colls_private.hpp.
 436
 437  - The name of the new algorithm function should be smpi_coll_tuned_alltoall_pair, with the same signature as MPI_Alltoall.
 438
 439  - Once the adaptation to SMPI code is done, add a reference to the file ("src/smpi/colls/alltoall-pair.c") in the SMPI_SRC part of the DefinePackages.cmake file inside buildtools/cmake, to allow the file to be built and distributed.
 440
 441  - To register the new version of the algorithm, simply add a line to the corresponding macro in src/smpi/colls/colls.h (add a "COLL_APPLY(action, COLL_ALLTOALL_SIG, pair)" to the COLL_ALLTOALLS macro). The algorithm should now be compiled and be selected when using --cfg=smpi/alltoall:pair at runtime.
 442
 443  - To add a test for the algorithm inside SimGrid's test suite, just add the new algorithm name in the ALLTOALL_COLL list found inside teshsuite/smpi/CMakeLists.txt. When running ctest, a test for the new algorithm should be generated and executed. If it does not pass, please check your code or contact us.
 444
 445  - Please submit your patch for inclusion in SMPI, for example through a pull request on GitHub or directly by email.
 446
 447
 448 Tracing of Internal Communications
 449 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 450
 451 By default, the collective operations are traced as a unique operation
 452 because tracing all point-to-point communications composing them could
 453 result in overloaded, hard to interpret traces. If you want to debug
 454 and compare collective algorithms, you should set the
 455 ``tracing/smpi/internals`` configuration item to 1 instead of 0.
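
For example, to compare the ring and pairwise alltoall algorithms, you could
trace one run of each with the internals enabled. The sketch below only
prints the two command lines. The hostfile, platform and program names are
placeholders, and ``trace_cmd`` is a helper invented for this example.

.. code-block:: shell

   # Hypothetical helper: print a traced smpirun command for one alltoall algorithm.
   trace_cmd() {
     echo "smpirun -trace --cfg=tracing/smpi/internals:1 --cfg=smpi/alltoall:$1 -hostfile my_hostfile.txt -platform my_platform.xml ./program"
   }

   trace_cmd ring
   trace_cmd pair

If both runs would write their trace to the same file, save the first trace
before starting the second run.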
 456
 457 Here are two example runs of alltoall collective algorithms on 16 nodes,
 458 the first one with a ring algorithm, the second with a pairwise one.
 459
 460 .. image:: /img/smpi_simgrid_alltoall_ring_16.png
 461    :align: center
 462
 463 Alltoall on 16 Nodes with the Ring Algorithm.
 464
 465 .. image:: /img/smpi_simgrid_alltoall_pair_16.png
 466    :align: center
 467
 468 Alltoall on 16 Nodes with the Pairwise Algorithm.
 469
 470 -------------------------
 471 What can run within SMPI?
 472 -------------------------
 473
 474 You can run unmodified MPI applications (both C/C++ and Fortran) within
 475 SMPI, provided that you only use MPI calls that we implemented. Global
 476 variables should be handled correctly on Linux systems.
 477
 478 ....................
 479 MPI coverage of SMPI
 480 ....................
 481
 482 Our coverage of the interface is very decent, but still incomplete.
 483 Given the size of the MPI standard, we may well never manage to
 484 implement absolutely all existing primitives. Currently, we have
 485 almost no support for I/O primitives, but we still pass a very large
 486 number of the MPICH coverage tests.
 487
 488 The full list of not yet implemented functions is documented in the
 489 file `include/smpi/smpi.h
 490 <https://framagit.org/simgrid/simgrid/tree/master/include/smpi/smpi.h>`_
 491 in your version of SimGrid, between two lines containing the ``FIXME``
 492 marker. If you really miss a feature, please get in touch with us: we
 493 can guide you through the SimGrid code to help you implement it, and
 494 we'd be glad to integrate your contribution into the main project.
 495
 496 .................................
 497 Privatization of global variables
 498 .................................
 499
 500 Concerning the globals, the problem comes from the fact that usually,
 501 MPI processes run as real UNIX processes while they are all folded
 502 into threads of a unique system process in SMPI. Global variables are
 503 usually private to each MPI process while they become shared between
 504 the processes in SMPI.  The problem and some potential solutions are
 505 discussed in this article: `Automatic Handling of Global Variables for
 506 Multi-threaded MPI Programs
 507 <http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf>`_ (note that
 508 this article does not deal with SMPI but with a competing solution
 509 called AMPI that suffers from the same issue).  This point used to be
 510 problematic in SimGrid, but the problem should now be handled
 511 automatically on Linux.
 512
 513 Older versions of SimGrid came with a script that automatically
 514 privatized the globals through static analysis of the source code. But
 515 our implementation was not robust enough to be used in production, so
 516 it was removed at some point. Currently, SMPI comes with two
 517 privatization mechanisms that you can :ref:`select at runtime
 518 <options_smpi_privatization>`.  The dlopen approach is used by
 519 default as it is much faster and still very robust.  The mmap approach
 520 is an older approach that proves to be slower.
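
For instance, to fall back to the mmap mechanism for one run, the
configuration item could be set as follows. This assumes that the item is
named ``smpi/privatization`` and accepts the values ``dlopen`` and ``mmap``,
so check the option reference of your SimGrid version. The file names are
placeholders, and the launch is guarded so that the snippet is a no-op
without SimGrid.

.. code-block:: shell

   # Assumed item name and value; verify them against your SimGrid version.
   if command -v smpirun >/dev/null 2>&1; then
     smpirun --cfg=smpi/privatization:mmap \
             -hostfile my_hostfile.txt -platform my_platform.xml ./program
   fi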
 521
 522 With the **mmap approach**, SMPI duplicates and dynamically switches the
 523 ``.data`` and ``.bss`` segments of the ELF process when switching the
 524 MPI ranks. This allows each rank to have its own copy of the global
 525 variables.  No copy actually occurs as this mechanism uses ``mmap()``
 526 for efficiency. This mechanism is considered to be very robust on all
 527 systems supporting ``mmap()`` (Linux and most BSDs). Its performance
 528 is questionable since each context switch between MPI ranks induces
 529 several syscalls to change the ``mmap`` that redirects the ``.data``
 530 and ``.bss`` segments to the copies of the new rank. The code will
 531 also be copied several times in memory, inducing a slight increase of
 532 memory occupation.
 533
 534 Another limitation is that SMPI only accounts for global variables
 535 defined in the executable. If the processes use external global
 536 variables from dynamic libraries, they won't be switched
 537 correctly. The easiest way to solve this is to statically link against
 538 the library with these globals. This way, each MPI rank will get its
 539 own copy of these libraries. Of course you should never statically
 540 link against the SimGrid library itself.
 541
 542 With the **dlopen approach**, SMPI loads several copies of the same
 543 executable in memory as if it were a library, so that the global
 544 variables get naturally duplicated. It first requires the executable
 545 to be compiled as a relocatable binary, which is less common for
 546 programs than for libraries. But most distributions are now compiled
 547 this way for security reasons, as it allows randomizing the address
 548 space layout. It should thus be safe to compile most (any?) program
 549 this way.  The second trick is that the dynamic linker refuses to link
 550 the exact same file several times, be it a library or a relocatable
 551 executable. This makes perfect sense in the general case, but we need
 552 to circumvent this rule of thumb in our case. To that end, the
 553 binary is copied in a temporary file before being re-linked against.
 554 ``dlmopen()`` cannot be used as it only allows 256 contexts, and as it
 555 would also duplicate SimGrid itself.
 556
 557 This approach greatly speeds up the context switching, down to about
 558 40 CPU cycles with our raw contexts, instead of requiring several
 559 syscalls with the ``mmap()`` approach. Another advantage is that it
 560 allows running the SMPI contexts in parallel, which is obviously not
 561 possible with the ``mmap()`` approach. It was tricky to implement, but
 562 we are not aware of any flaws, so smpirun activates it by default.
 563
 564 In the future, it may be possible to further reduce the memory and
 565 disk consumption. It seems that we could `punch holes
 566 <https://lwn.net/Articles/415889/>`_ in the files before dl-loading
 567 them to remove the code and constants, and mmap these areas onto a
 568 unique copy. If done correctly, this would reduce the disk and
 569 memory usage to the bare minimum, and would also reduce the pressure
 570 on the CPU instruction cache. See the `relevant bug
 571 <https://github.com/simgrid/simgrid/issues/137>`_ on GitHub for
 572 implementation leads.
 573
 574 Also, currently, only the binary is copied and dlopen-ed for each MPI
 575 rank. We could probably extend this to external dependencies, but for
 576 now, any external dependencies must be statically linked into your
 577 application. As usual, SimGrid itself must never be statically linked
 578 into your app. You don't want to give a copy of SimGrid to each MPI rank:
 579 that's way too much for them to deal with.
 580
.. todo: speak of smpi/privatize-libs here

----------------------------------------------
Adapting your MPI code for further scalability
----------------------------------------------

As detailed in the `reference article
<http://hal.inria.fr/hal-01415484>`_, you may want to adapt your code
to improve the simulation performance. But these tricks may seriously
hinder the result quality (or even prevent the app from running) if used
wrongly. We assume that if you want to simulate an HPC application,
you know what you are doing. Don't prove us wrong!
 593
..............................
Reducing your memory footprint
..............................

If you run short on memory (the whole app is executed on a single node when
simulated), you should have a look at the SMPI_SHARED_MALLOC and
SMPI_SHARED_FREE macros. They let you share memory areas between processes: the
purpose of these macros is that the same malloc line in each process will point
to the exact same memory area. So if you have a malloc of 2M and you have 16
processes, these macros will change your memory consumption from 2M*16 to 2M
only. Only one block for all processes.

If your program is OK with a block containing garbage values because all
processes write and read to the same place without any kind of coordination,
then these macros can dramatically shrink your memory consumption. For example,
that will be very beneficial to a matrix multiplication code, as all blocks will
be stored in the same area. Of course, the resulting computations will be useless,
but you can still study the application behavior this way.

Naturally, this won't work if your code is data-dependent. For example, a Jacobi
iterative computation depends on the result computed by the code to detect
convergence conditions, so turning them into garbage by sharing the same memory
area between processes does not seem very wise. You cannot use the
SMPI_SHARED_MALLOC macro in this case, sorry.
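The following sketch shows how these macros might be used in place of
``malloc()`` and ``free()``. The helper names are hypothetical, and
the fallback definitions are an assumption of this sketch, added so
that it also builds with a regular C compiler. ``total_footprint``
merely redoes the arithmetic above, reading "2M" as 2 MiB.

.. code-block:: c

   #include <stddef.h>
   #include <stdlib.h>

   /* With smpicc, <mpi.h> brings in SMPI_SHARED_MALLOC and SMPI_SHARED_FREE. */
   #if defined(__has_include)
   #if __has_include(<mpi.h>)
   #include <mpi.h>
   #endif
   #endif

   /* Assumption of this sketch: fall back to the libc outside of SMPI. */
   #ifndef SMPI_SHARED_MALLOC
   #define SMPI_SHARED_MALLOC(size) malloc(size)
   #define SMPI_SHARED_FREE(ptr) free(ptr)
   #endif

   /* Allocates the work buffer of one rank. Under SMPI, every rank executing
    * this very line gets the same area, so its content must be seen as garbage. */
   double* alloc_work_buffer(size_t elem_count)
   {
     return (double*)SMPI_SHARED_MALLOC(elem_count * sizeof(double));
   }

   void free_work_buffer(double* buffer)
   {
     SMPI_SHARED_FREE(buffer);
   }

   /* Footprint of `ranks` processes each allocating `bytes`: a single block
    * when the allocation is shared, one block per rank otherwise. */
   size_t total_footprint(size_t bytes, int ranks, int shared)
   {
     return shared ? bytes : bytes * (size_t)ranks;
   }

Only the buffers whose content does not influence the control flow
should be allocated this way. Everything else should keep using the
regular ``malloc()``.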
 618
This feature is demoed by the example file
`examples/smpi/NAS/dt.c <https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/NAS/dt.c>`_.

.........................
Toward Faster Simulations
.........................

If your application is too slow, try using SMPI_SAMPLE_LOCAL,
SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
be sampled. Some of the loop iterations will be executed to measure
their duration, and this duration will be used for the subsequent
iterations. These samples are done per processor with
SMPI_SAMPLE_LOCAL, and shared between all processors with
SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
time of your loop iterations is not stable.
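The principle behind sampling can be sketched in plain C as follows.
This is not how SMPI implements the ``SMPI_SAMPLE_*`` macros, only an
illustration of the idea. The function ``extrapolate_loop_time`` and
its parameters are hypothetical. A few iterations are timed, and their
mean duration is then charged for every remaining iteration instead of
executing it.

.. code-block:: c

   #include <stddef.h>

   /* Estimates the duration of a loop of `total_iters` iterations from the
    * measured durations (in seconds) of its first `sampled_iters` iterations.
    * The remaining iterations are not executed: each one is assumed to last
    * as long as the mean of the samples. Returns a negative value on invalid
    * input. */
   double extrapolate_loop_time(const double* sample_durations, size_t sampled_iters, size_t total_iters)
   {
     if (sample_durations == NULL || sampled_iters == 0 || sampled_iters > total_iters)
       return -1.0;

     double measured = 0.0;
     for (size_t i = 0; i < sampled_iters; i++)
       measured += sample_durations[i];

     double mean = measured / (double)sampled_iters;
     return measured + mean * (double)(total_iters - sampled_iters);
   }

Such an estimate is only meaningful when all iterations take about the
same time, which is why loops with unstable iteration times cannot be
sampled.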
 634
This feature is demoed by the example file
`examples/smpi/NAS/ep.c <https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/NAS/ep.c>`_.

.............................
Ensuring Accurate Simulations
.............................

Out of the box, SimGrid may give you fairly accurate results, but
there are plenty of factors that could go wrong and make your results
inaccurate or even plainly wrong. Actually, you can only get accurate
results from a nicely built model, including both the system hardware
and your application. Such models are hard to pass over and reuse in
other settings, because elements that are not relevant to an
application (say, the latency of point-to-point communications,
collective operation implementation details or CPU-network
interaction) may be relevant to another application. The dream of
the perfect model, encompassing every aspect, is only a chimera, as
the only perfect model of reality is reality itself. If you go for
simulation, then you have to ignore some irrelevant aspects of
reality, but which aspects are irrelevant is actually
application-dependent...

The only way to assess whether your settings provide accurate results
is to double-check these results. If possible, you should first run
the same experiment in simulation and in real life, gathering as much
information as you can. Try to understand the discrepancies in the
results that you observe between both settings (visualization can be
precious for that). Then, try to modify your model (of the platform,
of the collective operations) to reduce the most prominent differences.
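One possible way to spot the most prominent differences is to compare
the real and simulated timings phase by phase. The sketch below is
only an illustration: the function names and the idea of splitting the
run into timed phases are assumptions, not something that SimGrid
provides.

.. code-block:: c

   #include <stddef.h>

   /* Relative error of a simulated duration with respect to the real one.
    * The real duration is assumed to be strictly positive. */
   double relative_error(double simulated, double real)
   {
     double diff = simulated - real;
     if (diff < 0.0)
       diff = -diff;
     return diff / real;
   }

   /* Index of the phase whose simulated timing is the furthest from reality,
    * that is, the part of the model to revisit first. `count` must be >= 1. */
   size_t worst_phase(const double* simulated, const double* real, size_t count)
   {
     size_t worst = 0;
     for (size_t i = 1; i < count; i++)
       if (relative_error(simulated[i], real[i]) > relative_error(simulated[worst], real[worst]))
         worst = i;
     return worst;
   }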
 664
If the discrepancies come from the computing time, try adapting the
``smpi/host-speed``: reduce it if your simulation runs faster than in
reality. If the errors come from the communication, then you need to
fiddle with your platform file.

Be inventive in your modeling. Don't be afraid if the names given by
SimGrid do not match the real names: we got very good results by
modeling multicore/GPU machines with a set of separate hosts
interconnected with very fast networks (but don't trust your model
because it has the right names in the right place either).

Finally, you may want to check `this article
<https://hal.inria.fr/hal-00907887>`_ on the classical pitfalls in
modeling distributed systems.
 679
-------------------------
Troubleshooting with SMPI
-------------------------

.................................
./configure refuses to use smpicc
.................................

If your ``./configure`` reports that the compiler is not
functional or that you are cross-compiling, try to define the
``SMPI_PRETEND_CC`` environment variable before running the
configuration.

.. code-block:: shell

   SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
   make

Indeed, the programs compiled with ``smpicc`` cannot be executed
without ``smpirun`` (they are shared libraries and do weird things on
startup), while configure wants to test them directly.  With
``SMPI_PRETEND_CC`` smpicc does not compile as shared, and the SMPI
initialization stops and returns 0 before doing anything that would
fail without ``smpirun``.

.. warning::

  Make sure that SMPI_PRETEND_CC is only set when calling ./configure,
  not during the actual execution, or any program compiled with smpicc
  will stop before starting.
 710
..............................................
./configure does not pick smpicc as a compiler
..............................................

In addition to the previous answer, some projects also need to be
explicitly told which compiler to use, as follows:

.. code-block:: shell

   SMPI_PRETEND_CC=1 ./configure CC=smpicc # here come the other configure parameters
   make

Maybe your configure is using another variable, such as ``cc`` (in
lower case) or similar. Just check the logs.

.....................................
error: unknown type name 'useconds_t'
.....................................

Try to add ``-D_GNU_SOURCE`` to your compilation line to get rid
of that error.

The reason is that SMPI provides its own version of ``usleep(3)``
to override it and to block in the simulation world, not in the real
one. It needs the ``useconds_t`` type for that, which is declared
only if you define ``_GNU_SOURCE`` before including
``unistd.h``. If your project includes that header file before
SMPI, then you need to ensure that you pass the right configuration
defines as advised above.
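If you would rather fix this in the source than on the command line,
the macro has to be defined before the very first system header is
included. Here is a minimal sketch, in which the function
``half_second`` is hypothetical and only there to use the type:

.. code-block:: c

   /* Must come before any #include: otherwise the system headers may already
    * have been processed without the GNU extensions. */
   #ifndef _GNU_SOURCE
   #define _GNU_SOURCE
   #endif

   #include <unistd.h>

   /* Hypothetical helper using the type that triggered the error. */
   useconds_t half_second(void)
   {
     return (useconds_t)500000;
   }

The ``#ifndef`` guard avoids a redefinition warning when
``-D_GNU_SOURCE`` is also passed on the compilation line.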