python3-dev is another dependency of the Python bindings

diff --git a/docs/source/app_smpi.rst b/docs/source/app_smpi.rst

index 037c764..ddea04e 100644 (file)
--- a/docs/source/app_smpi.rst
+++ b/docs/source/app_smpi.rst
@@ -74,9 +74,9 @@ If you use cmake, set the variables ``MPI_C_COMPILER``, ``MPI_CXX_COMPILER`` and
  ``MPI_Fortran_COMPILER`` to the full path of smpicc, smpicxx and smpiff (or
  smpif90), respectively. Example:
  
-.. code-block:: shell
+.. code-block:: console
  
-   cmake -DMPI_C_COMPILER=/opt/simgrid/bin/smpicc -DMPI_CXX_COMPILER=/opt/simgrid/bin/smpicxx -DMPI_Fortran_COMPILER=/opt/simgrid/bin/smpiff .
+   $ cmake -DMPI_C_COMPILER=/opt/simgrid/bin/smpicc -DMPI_CXX_COMPILER=/opt/simgrid/bin/smpicxx -DMPI_Fortran_COMPILER=/opt/simgrid/bin/smpiff .
  
  ....................
  Simulating your Code
@@ -84,13 +84,15 @@ Simulating your Code
  
  Use the ``smpirun`` script as follows:
  
-.. code-block:: shell
+.. code-block:: console
  
-   smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
+   $ smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
  
  - ``my_hostfile.txt`` is a classical MPI hostfile (that is, this file
    lists the machines on which the processes must be dispatched, one
-  per line)
+  per line). Using the ``hostname:num_procs`` syntax will deploy ``num_procs``
+  MPI processes on the host, sharing available cores (equivalent to listing
+  the same host ``num_procs`` times on different lines).
  - ``my_platform.xml`` is a classical SimGrid platform file. Of course,
    the hosts of the hostfile must exist in the provided platform.
  - ``./program`` is the MPI program to simulate, that you compiled with ``smpicc``
@@ -104,7 +106,7 @@ tracing during the simulation. You can get the full list by running
  
  Finally, you can pass :ref:`any valid SimGrid parameter <options>` to your
  program. In particular, you can pass ``--cfg=network/model:ns-3`` to
-switch to use :ref:`model_ns3`. These parameters should be placed after
+switch to use :ref:`models_ns3`. These parameters should be placed after
  the name of your binary on the command line.
  
  ...............................
@@ -120,10 +122,34 @@ following commands. Once in GDB, each MPI ranks will be represented as
  a regular thread, and you can explore the state of each of them as
  usual.
  
-.. code-block:: shell
+.. code-block:: console
+
+   $ smpirun -wrapper valgrind ...other args...
+   $ smpirun -wrapper "gdb --args" --cfg=contexts/factory:thread ...other args...
+
+Some shortcuts are available:
+
+- ``-gdb`` is equivalent to ``-wrapper "gdb --args" -keep-temps``, to run within the gdb debugger
+- ``-lldb`` is equivalent to ``-wrapper "lldb --" -keep-temps``, to run within the lldb debugger
+- ``-vgdb`` is equivalent to ``-wrapper "valgrind --vgdb=yes --vgdb-error=0" -keep-temps``,
+  to run within valgrind and allow a debugger to be attached
+
+To help locate bottlenecks and the largest allocations in the simulated application,
+the ``-analyze`` flag can be passed to ``smpirun``. It activates the
+:ref:`smpi/display-timing<cfg=smpi/display-timing>` and
+:ref:`smpi/display-allocs<cfg=smpi/display-allocs>` options and provides hints
+at the end of execution.
+
+SMPI will also report MPI handle (Comm, Request, Op, Datatype...) leaks
+at the end of execution. This can help identify memory leaks that can trigger
+crashes and slowdowns.
+By default, it only displays the number of leaked items detected.
+Option :ref:`smpi/list-leaks:n<cfg=smpi/list-leaks>` can be used to display the
+first n leaks encountered and their type. To get more information, running ``smpirun``
+with ``-wrapper "valgrind --leak-check=full --track-origins=yes"`` should show
+the exact origin of leaked handles.
+Known issue: ``MPI_Cancel`` may trigger internal leaks within SMPI.
  
-   smpirun -wrapper valgrind ...other args...
-   smpirun -wrapper "gdb --args" --cfg=contexts/factory:thread ...other args...
  
  .. _SMPI_use_colls:
  
@@ -151,7 +177,7 @@ of the targeted MPI implementations.
  You can switch the automatic selector through the
  ``smpi/coll-selector`` configuration item. Possible values:
  
- - **ompi:** default selection logic of OpenMPI (version 3.1.2)
+ - **ompi:** default selection logic of OpenMPI (version 4.1.2)
   - **mpich**: default selection logic of MPICH (version 3.3b)
   - **mvapich2**: selection logic of MVAPICH2 (version 1.9) tuned
     on the Stampede cluster
@@ -185,257 +211,232 @@ MPI_Alltoall
  
  Most of these are best described in `STAR-MPI's white paper <https://doi.org/10.1145/1183401.1183431>`_.
  
- - default: naive one, by default
- - ompi: use openmpi selector for the alltoall operations
- - mpich: use mpich selector for the alltoall operations
- - mvapich2: use mvapich2 selector for the alltoall operations
- - impi: use intel mpi selector for the alltoall operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - bruck: Described by Bruck et.al. in `this paper <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949>`_
- - 2dmesh: organizes the nodes as a two dimensional mesh, and perform allgather
-   along the dimensions
- - 3dmesh: adds a third dimension to the previous algorithm
- - rdb: recursive doubling: extends the mesh to a nth dimension, each one
-   containing two nodes
- - pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
-   each process sends and receives from the same process at each step
- - pair_light_barrier: same, with small barriers between steps to avoid
-   contention
- - pair_mpi_barrier: same, with MPI_Barrier used
- - pair_one_barrier: only one barrier at the beginning
- - ring: size-1 steps, at each step a process send to process (n+i)%size, and receives from (n-i)%size
- - ring_light_barrier: same, with small barriers between some phases to avoid contention
- - ring_mpi_barrier: same, with MPI_Barrier used
- - ring_one_barrier: only one barrier at the beginning
- - basic_linear: posts all receives and all sends,
-   starts the communications, and waits for all communication to finish
- - mvapich2_scatter_dest: isend/irecv with scattered destinations, posting only a few messages at the same time
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the alltoall operations. |br|
+``mpich``: use mpich selector for the alltoall operations. |br|
+``mvapich2``: use mvapich2 selector for the alltoall operations. |br|
+``impi``: use intel mpi selector for the alltoall operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``bruck``: Described by Bruck et al. in `this paper <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949>`_. |br|
+``2dmesh``: organizes the nodes as a two-dimensional mesh, and performs allgather along the dimensions. |br|
+``3dmesh``: adds a third dimension to the previous algorithm. |br|
+``rdb``: recursive doubling: extends the mesh to an nth dimension, each one containing two nodes. |br|
+``pair``: pairwise exchange, only works for power of 2 procs, size-1 steps, each process sends and receives from the same process at each step. |br|
+``pair_light_barrier``: same, with small barriers between steps to avoid contention. |br|
+``pair_mpi_barrier``: same, with MPI_Barrier used. |br|
+``pair_one_barrier``: only one barrier at the beginning. |br|
+``ring``: size-1 steps, at each step a process sends to process (n+i)%size, and receives from (n-i)%size. |br|
+``ring_light_barrier``: same, with small barriers between some phases to avoid contention. |br|
+``ring_mpi_barrier``: same, with MPI_Barrier used. |br|
+``ring_one_barrier``: only one barrier at the beginning. |br|
+``basic_linear``: posts all receives and all sends, starts the communications, and waits for all communication to finish. |br|
+``mvapich2_scatter_dest``: isend/irecv with scattered destinations, posting only a few messages at the same time. |br|
  
  MPI_Alltoallv
  ^^^^^^^^^^^^^
- - default: naive one, by default
- - ompi: use openmpi selector for the alltoallv operations
- - mpich: use mpich selector for the alltoallv operations
- - mvapich2: use mvapich2 selector for the alltoallv operations
- - impi: use intel mpi selector for the alltoallv operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - bruck: same as alltoall
- - pair: same as alltoall
- - pair_light_barrier: same as alltoall
- - pair_mpi_barrier: same as alltoall
- - pair_one_barrier: same as alltoall
- - ring: same as alltoall
- - ring_light_barrier: same as alltoall
- - ring_mpi_barrier: same as alltoall
- - ring_one_barrier: same as alltoall
- - ompi_basic_linear: same as alltoall
+
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the alltoallv operations. |br|
+``mpich``: use mpich selector for the alltoallv operations. |br|
+``mvapich2``: use mvapich2 selector for the alltoallv operations. |br|
+``impi``: use intel mpi selector for the alltoallv operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``bruck``: same as alltoall. |br|
+``pair``: same as alltoall. |br|
+``pair_light_barrier``: same as alltoall. |br|
+``pair_mpi_barrier``: same as alltoall. |br|
+``pair_one_barrier``: same as alltoall. |br|
+``ring``: same as alltoall. |br|
+``ring_light_barrier``: same as alltoall. |br|
+``ring_mpi_barrier``: same as alltoall. |br|
+``ring_one_barrier``: same as alltoall. |br|
+``ompi_basic_linear``: same as alltoall. |br|
  
  MPI_Gather
  ^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the gather operations
- - mpich: use mpich selector for the gather operations
- - mvapich2: use mvapich2 selector for the gather operations
- - impi: use intel mpi selector for the gather operations
- - automatic (experimental): use an automatic self-benchmarking algorithm which will iterate over all implemented versions and output the best
- - ompi_basic_linear: basic linear algorithm from openmpi, each process sends to the root
- - ompi_binomial: binomial tree algorithm
- - ompi_linear_sync: same as basic linear, but with a synchronization at the
-   beginning and message cut into two segments.
- - mvapich2_two_level: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the gather operations. |br|
+``mpich``: use mpich selector for the gather operations. |br|
+``mvapich2``: use mvapich2 selector for the gather operations. |br|
+``impi``: use intel mpi selector for the gather operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm which will iterate over all implemented versions and output the best. |br|
+``ompi_basic_linear``: basic linear algorithm from openmpi, each process sends to the root. |br|
+``ompi_binomial``: binomial tree algorithm. |br|
+``ompi_linear_sync``: same as basic linear, but with a synchronization at the beginning and message cut into two segments. |br|
+``mvapich2_two_level``: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. |br|
  
  MPI_Barrier
  ^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the barrier operations
- - mpich: use mpich selector for the barrier operations
- - mvapich2: use mvapich2 selector for the barrier operations
- - impi: use intel mpi selector for the barrier operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - ompi_basic_linear: all processes send to root
- - ompi_two_procs: special case for two processes
- - ompi_bruck: nsteps = sqrt(size), at each step, exchange data with rank-2^k and rank+2^k
- - ompi_recursivedoubling: recursive doubling algorithm
- - ompi_tree: recursive doubling type algorithm, with tree structure
- - ompi_doublering: double ring algorithm
- - mvapich2_pair: pairwise algorithm
- - mpich_smp: barrier intra-node, then inter-node
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the barrier operations. |br|
+``mpich``: use mpich selector for the barrier operations. |br|
+``mvapich2``: use mvapich2 selector for the barrier operations. |br|
+``impi``: use intel mpi selector for the barrier operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``ompi_basic_linear``: all processes send to root. |br|
+``ompi_two_procs``: special case for two processes. |br|
+``ompi_bruck``: nsteps = sqrt(size), at each step, exchange data with rank-2^k and rank+2^k. |br|
+``ompi_recursivedoubling``: recursive doubling algorithm. |br|
+``ompi_tree``: recursive doubling type algorithm, with tree structure. |br|
+``ompi_doublering``: double ring algorithm. |br|
+``mvapich2_pair``: pairwise algorithm. |br|
+``mpich_smp``: barrier intra-node, then inter-node. |br|
  
  MPI_Scatter
  ^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the scatter operations
- - mpich: use mpich selector for the scatter operations
- - mvapich2: use mvapich2 selector for the scatter operations
- - impi: use intel mpi selector for the scatter operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - ompi_basic_linear: basic linear scatter
- - ompi_binomial: binomial tree scatter
- - mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. 
- - mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the scatter operations. |br|
+``mpich``: use mpich selector for the scatter operations. |br|
+``mvapich2``: use mvapich2 selector for the scatter operations. |br|
+``impi``: use intel mpi selector for the scatter operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``ompi_basic_linear``: basic linear scatter. |br|
+``ompi_linear_nb``: linear scatter, non blocking sends. |br|
+``ompi_binomial``: binomial tree scatter. |br|
+``mvapich2_two_level_direct``: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. |br|
+``mvapich2_two_level_binomial``: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. |br|
  
  MPI_Reduce
  ^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the reduce operations
- - mpich: use mpich selector for the reduce operations
- - mvapich2: use mvapich2 selector for the reduce operations
- - impi: use intel mpi selector for the reduce operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - arrival_pattern_aware: root exchanges with the first process to arrive
- - binomial: uses a binomial tree
- - flat_tree: uses a flat tree
- - NTSL: Non-topology-specific pipelined linear-bcast function
-   0->1, 1->2 ,2->3, ....., ->last node: in a pipeline fashion, with segments
-   of 8192 bytes
- - scatter_gather: scatter then gather
- - ompi_chain: openmpi reduce algorithms are built on the same basis, but the
-   topology is generated differently for each flavor
-   chain = chain with spacing of size/2, and segment size of 64KB
- - ompi_pipeline: same with pipeline (chain with spacing of 1), segment size
-   depends on the communicator size and the message size
- - ompi_binary: same with binary tree, segment size of 32KB
- - ompi_in_order_binary: same with binary tree, enforcing order on the
-   operations
- - ompi_binomial: same with binomial algo (redundant with default binomial
-   one in most cases)
- - ompi_basic_linear: basic algorithm, each process sends to root
- - mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
- - mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
- - rab: `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_'s reduce algorithm
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the reduce operations. |br|
+``mpich``: use mpich selector for the reduce operations. |br|
+``mvapich2``: use mvapich2 selector for the reduce operations. |br|
+``impi``: use intel mpi selector for the reduce operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``arrival_pattern_aware``: root exchanges with the first process to arrive. |br|
+``binomial``: uses a binomial tree. |br|
+``flat_tree``: uses a flat tree. |br|
+``NTSL``: Non-topology-specific pipelined linear-bcast function: 0->1, 1->2, 2->3, ..., ->last node, in a pipeline fashion, with segments of 8192 bytes. |br|
+``scatter_gather``: scatter then gather. |br|
+``ompi_chain``: openmpi reduce algorithms are built on the same basis, but the topology is generated differently for each flavor. chain = chain with spacing of size/2, and segment size of 64KB. |br|
+``ompi_pipeline``: same with pipeline (chain with spacing of 1), segment size depends on the communicator size and the message size. |br|
+``ompi_binary``: same with binary tree, segment size of 32KB. |br|
+``ompi_in_order_binary``: same with binary tree, enforcing order on the operations. |br|
+``ompi_binomial``: same with binomial algo (redundant with default binomial one in most cases). |br|
+``ompi_basic_linear``: basic algorithm, each process sends to root. |br|
+``mvapich2_knomial``: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning). |br|
+``mvapich2_two_level``: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. |br|
+``rab``: `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_'s reduce algorithm. |br|
  
  MPI_Allreduce
  ^^^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the allreduce operations
- - mpich: use mpich selector for the allreduce operations
- - mvapich2: use mvapich2 selector for the allreduce operations
- - impi: use intel mpi selector for the allreduce operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - lr: logical ring reduce-scatter then logical ring allgather
- - rab1: variations of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: reduce_scatter then allgather
- - rab2: variations of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: alltoall then allgather
- - rab_rsag: variation of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: recursive doubling
-   reduce_scatter then recursive doubling allgather
- - rdb: recursive doubling
- - smp_binomial: binomial tree with smp: binomial intra
-   SMP reduce, inter reduce, inter broadcast then intra broadcast
- - smp_binomial_pipeline: same with segment size = 4096 bytes
- - smp_rdb: intra: binomial allreduce, inter: Recursive
-   doubling allreduce, intra: binomial broadcast
- - smp_rsag: intra: binomial allreduce, inter: reduce-scatter,
-   inter:allgather, intra: binomial broadcast
- - smp_rsag_lr: intra: binomial allreduce, inter: logical ring
-   reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
- - smp_rsag_rab: intra: binomial allreduce, inter: rab
-   reduce-scatter, rab inter:allgather, intra: binomial broadcast
- - redbcast: reduce then broadcast, using default or tuned algorithms if specified
- - ompi_ring_segmented: ring algorithm used by OpenMPI
- - mvapich2_rs: rdb for small messages, reduce-scatter then allgather else
- - mvapich2_two_level: SMP-aware algorithm, with mpich as intra algorithm, and rdb as inter (Change this behavior by using mvapich2 selector to use tuned values)
- - rab: default `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ implementation
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the allreduce operations. |br|
+``mpich``: use mpich selector for the allreduce operations. |br|
+``mvapich2``: use mvapich2 selector for the allreduce operations. |br|
+``impi``: use intel mpi selector for the allreduce operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``lr``: logical ring reduce-scatter then logical ring allgather. |br|
+``rab1``: variations of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: reduce_scatter then allgather. |br|
+``rab2``: variations of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: alltoall then allgather. |br|
+``rab_rsag``: variation of the  `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: recursive doubling reduce_scatter then recursive doubling allgather. |br|
+``rdb``: recursive doubling. |br|
+``smp_binomial``: binomial tree with smp: binomial intra SMP reduce, inter reduce, inter broadcast then intra broadcast. |br|
+``smp_binomial_pipeline``: same with segment size = 4096 bytes. |br|
+``smp_rdb``: intra: binomial allreduce, inter: recursive doubling allreduce, intra: binomial broadcast. |br|
+``smp_rsag``: intra: binomial allreduce, inter: reduce-scatter, inter:allgather, intra: binomial broadcast. |br|
+``smp_rsag_lr``: intra: binomial allreduce, inter: logical ring reduce-scatter, logical ring inter:allgather, intra: binomial broadcast. |br|
+``smp_rsag_rab``: intra: binomial allreduce, inter: rab reduce-scatter, rab inter:allgather, intra: binomial broadcast. |br|
+``redbcast``: reduce then broadcast, using default or tuned algorithms if specified. |br|
+``ompi_ring_segmented``: ring algorithm used by OpenMPI. |br|
+``mvapich2_rs``: rdb for small messages, reduce-scatter then allgather else. |br|
+``mvapich2_two_level``: SMP-aware algorithm, with mpich as intra algorithm, and rdb as inter (Change this behavior by using mvapich2 selector to use tuned values). |br|
+``rab``: default `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ implementation. |br|
  
  MPI_Reduce_scatter
  ^^^^^^^^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the reduce_scatter operations
- - mpich: use mpich selector for the reduce_scatter operations
- - mvapich2: use mvapich2 selector for the reduce_scatter operations
- - impi: use intel mpi selector for the reduce_scatter operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - ompi_basic_recursivehalving: recursive halving version from OpenMPI
- - ompi_ring: ring version from OpenMPI
- - mpich_pair: pairwise exchange version from MPICH
- - mpich_rdb: recursive doubling version from MPICH
- - mpich_noncomm: only works for power of 2 procs, recursive doubling for noncommutative ops
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the reduce_scatter operations. |br|
+``mpich``: use mpich selector for the reduce_scatter operations. |br|
+``mvapich2``: use mvapich2 selector for the reduce_scatter operations. |br|
+``impi``: use intel mpi selector for the reduce_scatter operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``ompi_basic_recursivehalving``: recursive halving version from OpenMPI. |br|
+``ompi_ring``: ring version from OpenMPI. |br|
+``ompi_butterfly``: butterfly version from OpenMPI. |br|
+``mpich_pair``: pairwise exchange version from MPICH. |br|
+``mpich_rdb``: recursive doubling version from MPICH. |br|
+``mpich_noncomm``: only works for power of 2 procs, recursive doubling for noncommutative ops. |br|
  
  
  MPI_Allgather
  ^^^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the allgather operations
- - mpich: use mpich selector for the allgather operations
- - mvapich2: use mvapich2 selector for the allgather operations
- - impi: use intel mpi selector for the allgather operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - 2dmesh: see alltoall
- - 3dmesh: see alltoall
- - bruck: Described by Bruck et.al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">
-   Efficient algorithms for all-to-all communications in multiport message-passing systems</a>
- - GB: Gather - Broadcast (uses tuned version if specified)
- - loosely_lr: Logical Ring with grouping by core (hardcoded, default
-   processes/node: 4)
- - NTSLR: Non Topology Specific Logical Ring
- - NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
- - pair: see alltoall
- - rdb: see alltoall
- - rhv: only power of 2 number of processes
- - ring: see alltoall
- - SMP_NTS: gather to root of each SMP, then every root of each SMP node
-   post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
-   using logical ring algorithm (hardcoded, default processes/SMP: 8)
- - smp_simple: gather to root of each SMP, then every root of each SMP node
-   post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
-   using simple algorithm (hardcoded, default processes/SMP: 8)
- - spreading_simple: from node i, order of communications is i -> i + 1, i ->
-   i + 2, ..., i -> (i + p -1) % P
- - ompi_neighborexchange: Neighbor Exchange algorithm for allgather.
-   Described by Chen et.al. in  `Performance Evaluation of Allgather
-   Algorithms on Terascale Linux Cluster with Fast Ethernet <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302>`_
- - mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the allgather operations. |br|
+``mpich``: use mpich selector for the allgather operations. |br|
+``mvapich2``: use mvapich2 selector for the allgather operations. |br|
+``impi``: use intel mpi selector for the allgather operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``2dmesh``: see alltoall. |br|
+``3dmesh``: see alltoall. |br|
+``bruck``: Described by Bruck et al. in `Efficient algorithms for all-to-all communications in multiport message-passing systems <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949>`_. |br|
+``GB``: Gather - Broadcast (uses tuned version if specified). |br|
+``loosely_lr``: Logical Ring with grouping by core (hardcoded, default processes/node: 4). |br|
+``NTSLR``: Non Topology Specific Logical Ring. |br|
+``NTSLR_NB``: Non Topology Specific Logical Ring, Non Blocking operations. |br|
+``pair``: see alltoall. |br|
+``rdb``: see alltoall. |br|
+``rhv``: only power of 2 number of processes. |br|
+``ring``: see alltoall. |br|
+``SMP_NTS``: gather to root of each SMP, then every root of each SMP node post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message, using logical ring algorithm (hardcoded, default processes/SMP: 8). |br|
+``smp_simple``: gather to root of each SMP, then every root of each SMP node post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message, using simple algorithm (hardcoded, default processes/SMP: 8). |br|
+``spreading_simple``: from node i, order of communications is i -> i + 1, i -> i + 2, ..., i -> (i + p -1) % P. |br|
+``ompi_neighborexchange``: Neighbor Exchange algorithm for allgather. Described by Chen et.al. in  `Performance Evaluation of Allgather Algorithms on Terascale Linux Cluster with Fast Ethernet <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302>`_. |br|
+``mvapich2_smp``: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node.
  
  MPI_Allgatherv
  ^^^^^^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the allgatherv operations
- - mpich: use mpich selector for the allgatherv operations
- - mvapich2: use mvapich2 selector for the allgatherv operations
- - impi: use intel mpi selector for the allgatherv operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - GB: Gatherv - Broadcast (uses tuned version if specified, but only for Bcast, gatherv is not tuned)
- - pair: see alltoall
- - ring: see alltoall
- - ompi_neighborexchange: see allgather
- - ompi_bruck: see allgather
- - mpich_rdb: recursive doubling algorithm from MPICH
- - mpich_ring: ring algorithm from MPICh - performs differently from the  one from STAR-MPI
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the allgatherv operations. |br|
+``mpich``: use mpich selector for the allgatherv operations. |br|
+``mvapich2``: use mvapich2 selector for the allgatherv operations. |br|
+``impi``: use intel mpi selector for the allgatherv operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``GB``: Gatherv - Broadcast (uses tuned version if specified, but only for Bcast, gatherv is not tuned). |br|
+``pair``: see alltoall. |br|
+``ring``: see alltoall. |br|
+``ompi_neighborexchange``: see allgather. |br|
+``ompi_bruck``: see allgather. |br|
+``mpich_rdb``: recursive doubling algorithm from MPICH. |br|
+``mpich_ring``: ring algorithm from MPICH - performs differently from the one from STAR-MPI.
  
  MPI_Bcast
  ^^^^^^^^^
  
- - default: naive one, by default
- - ompi: use openmpi selector for the bcast operations
- - mpich: use mpich selector for the bcast operations
- - mvapich2: use mvapich2 selector for the bcast operations
- - impi: use intel mpi selector for the bcast operations
- - automatic (experimental): use an automatic self-benchmarking algorithm
- - arrival_pattern_aware: root exchanges with the first process to arrive
- - arrival_pattern_aware_wait: same with slight variation
- - binomial_tree: binomial tree exchange
- - flattree: flat tree exchange
- - flattree_pipeline: flat tree exchange, message split into 8192 bytes pieces
- - NTSB: Non-topology-specific pipelined binary tree with 8192 bytes pieces
- - NTSL: Non-topology-specific pipelined linear with 8192 bytes pieces
- - NTSL_Isend: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications
- - scatter_LR_allgather: scatter followed by logical ring allgather
- - scatter_rdb_allgather: scatter followed by recursive doubling allgather
- - arrival_scatter: arrival pattern aware scatter-allgather
- - SMP_binary: binary tree algorithm with 8 cores/SMP
- - SMP_binomial: binomial tree algorithm with 8 cores/SMP
- - SMP_linear: linear algorithm with 8 cores/SMP
- - ompi_split_bintree: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces
- - ompi_pipeline: pipeline algorithm from OpenMPI, with message split in 128KB pieces
- - mvapich2_inter_node: Inter node default mvapich worker
- - mvapich2_intra_node: Intra node default mvapich worker
- - mvapich2_knomial_intra_node:  k-nomial intra node default mvapich worker. default factor is 4.
+``default``: naive one, by default. |br|
+``ompi``: use openmpi selector for the bcast operations. |br|
+``mpich``: use mpich selector for the bcast operations. |br|
+``mvapich2``: use mvapich2 selector for the bcast operations. |br|
+``impi``: use intel mpi selector for the bcast operations. |br|
+``automatic (experimental)``: use an automatic self-benchmarking algorithm. |br|
+``arrival_pattern_aware``: root exchanges with the first process to arrive. |br|
+``arrival_pattern_aware_wait``: same with slight variation. |br|
+``binomial_tree``: binomial tree exchange. |br|
+``flattree``: flat tree exchange. |br|
+``flattree_pipeline``: flat tree exchange, message split into 8192 bytes pieces. |br|
+``NTSB``: Non-topology-specific pipelined binary tree with 8192 bytes pieces. |br|
+``NTSL``: Non-topology-specific pipelined linear with 8192 bytes pieces. |br|
+``NTSL_Isend``: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications. |br|
+``scatter_LR_allgather``: scatter followed by logical ring allgather. |br|
+``scatter_rdb_allgather``: scatter followed by recursive doubling allgather. |br|
+``arrival_scatter``: arrival pattern aware scatter-allgather. |br|
+``SMP_binary``: binary tree algorithm with 8 cores/SMP. |br|
+``SMP_binomial``: binomial tree algorithm with 8 cores/SMP. |br|
+``SMP_linear``: linear algorithm with 8 cores/SMP. |br|
+``ompi_split_bintree``: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces. |br|
+``ompi_pipeline``: pipeline algorithm from OpenMPI, with message split in 128KB pieces. |br|
+``mvapich2_inter_node``: Inter node default mvapich worker. |br|
+``mvapich2_intra_node``: Intra node default mvapich worker. |br|
+``mvapich2_knomial_intra_node``: k-nomial intra node default mvapich worker. Default factor is 4.
  
  Automatic Evaluation
  ^^^^^^^^^^^^^^^^^^^^
@@ -457,7 +458,8 @@ can't be done, so algorithms have to be changed to use smpi version of
  the calls instead (MPI_Send will become smpi_mpi_send). Some functions
  may have different signatures than their MPI counterpart, please check
  the other algorithms or contact us using the `>SimGrid
-developers mailing list <http://lists.gforge.inria.fr/mailman/listinfo/simgrid-devel>`_.
+user mailing list <https://sympa.inria.fr/sympa/info/simgrid-community>`_,
+or on `>Mattermost <https://framateam.org/simgrid/channels/town-square>`_.
  
  Example: adding a "pair" version of the Alltoall collective.
  
@@ -508,19 +510,14 @@ variables should be handled correctly on Linux systems.
  MPI coverage of SMPI
  ....................
  
-Our coverage of the interface is very decent, but still incomplete;
-Given the size of the MPI standard, we may well never manage to
-implement absolutely all existing primitives. Currently, we have
-almost no support for I/O primitives, but we still pass a very large
-amount of the MPICH coverage tests.
+SMPI supports a large fraction of the MPI interface: we pass many of the MPICH coverage tests, and many of the existing
+:ref:`proxy apps <SMPI_proxy_apps>` run almost unmodified on top of SMPI. But our support is still incomplete, with I/O
+primitives being one of the major missing features.
  
-The full list of not yet implemented functions is documented in the
-file `include/smpi/smpi.h
-<https://framagit.org/simgrid/simgrid/tree/master/include/smpi/smpi.h>`_
-in your version of SimGrid, between two lines containing the ``FIXME``
-marker. If you really miss a feature, please get in touch with us: we
-can guide you though the SimGrid code to help you implementing it, and
-we'd be glad to integrate your contribution to the main project.
+The full list of functions that remain to be implemented is documented in the file `include/smpi/smpi.h
+<https://framagit.org/simgrid/simgrid/tree/master/include/smpi/smpi.h>`_ in your version of SimGrid, between two lines
+containing the ``FIXME`` marker. If you miss a feature, please get in touch with us: we can guide you through the SimGrid
+code to help you implement it, and we'd be glad to integrate your contribution to the main project.
  
  .. _SMPI_what_globals:
  
@@ -663,7 +660,13 @@ their duration, and this duration will be used for the subsequent
  iterations. These samples are done per processor with
  SMPI_SAMPLE_LOCAL, and shared between all processors with
  SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
-time of your loop iteration are not stable.
+time of your loop iterations is not stable. If some parameters have an
+impact on the timing of a kernel, and if they are reused often
+(same kernel launched with a few different sizes during the run, for example),
+SMPI_SAMPLE_LOCAL_TAG and SMPI_SAMPLE_GLOBAL_TAG can be used, with a tag
+as last parameter, to differentiate between calls. The tag is a character
+string crafted by the user, with a maximum size of 128, and should include
+what is necessary to group calls of a given size together.
  
  This feature is demoed by the example file
  `examples/smpi/NAS/ep.c <https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/NAS/ep.c>`_
@@ -710,23 +713,55 @@ Finally, you may want to check `this article
  <https://hal.inria.fr/hal-00907887>`_ on the classical pitfalls in
  modeling distributed systems.
  
+.. _SMPI_proxy_apps:
+
+----------------------
+Examples of SMPI Usage
+----------------------
+
+A small number of examples can be found directly in the SimGrid
+archive, under `examples/smpi <https://framagit.org/simgrid/simgrid/-/tree/master/examples/smpi>`_.
+Some show how to simply run MPI code in SimGrid, how to use the
+tracing/replay mechanism, or how to use plugins written in S4U to
+extend the simulator's abilities.
+
+Another source of examples lies in the SimGrid archive, under
+`teshsuite/smpi <https://framagit.org/simgrid/simgrid/-/tree/master/teshsuite/smpi>`_.
+They are not in the ``examples`` directory because they probably don't
+constitute pedagogical examples. Instead, they are intended to stress
+our implementation during the tests. Some of you may be interested
+anyway.
+
+But the best source of SMPI examples is certainly the `proxy app
+<https://framagit.org/simgrid/SMPI-proxy-apps>`_ external project.
+Proxy apps are scale models of real, massive HPC applications: each of
+them exhibits the same communication and computation patterns as the
+massive application that it stands for. But they span only a few
+thousand lines instead of some millions of lines. These proxy apps
+are usually provided for educational purposes, and also to ensure that
+the represented large HPC applications will correctly work with the
+next generation of runtimes and hardware. `This project
+<https://framagit.org/simgrid/SMPI-proxy-apps>`_ gathers proxy apps
+from different sources, along with the patches needed (if any) to run
+them on top of SMPI.
+
  -------------------------
  Troubleshooting with SMPI
  -------------------------
  
-.................................
-./configure refuses to use smpicc
-.................................
+.........................................
+./configure or cmake refuse to use smpicc
+.........................................
  
-If your ``./configure`` reports that the compiler is not
+If your configuration script (such as ``./configure`` or ``cmake``) reports that the compiler is not
  functional or that you are cross-compiling, try to define the
  ``SMPI_PRETEND_CC`` environment variable before running the
  configuration.
  
-.. code-block:: shell
+.. code-block:: console
  
-   SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
-   make
+   $ SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
+   $ make
  
  Indeed, the programs compiled with ``smpicc`` cannot be executed
  without ``smpirun`` (they are shared libraries and do weird things on
@@ -737,21 +772,21 @@ fail without ``smpirun``.
  
  .. warning::
  
-  Make sure that SMPI_PRETEND_CC is only set when calling ./configure,
+  Make sure that SMPI_PRETEND_CC is only set when calling the configuration script but
    not during the actual execution, or any program compiled with smpicc
    will stop before starting.
  
-..............................................
-./configure does not pick smpicc as a compiler
-..............................................
+.....................................................
+./configure or cmake do not pick smpicc as a compiler
+.....................................................
  
  In addition to the previous answers, some projects also need to be
  explicitly told what compiler to use, as follows:
  
-.. code-block:: shell
+.. code-block:: console
  
-   SMPI_PRETEND_CC=1 ./configure CC=smpicc # here come the other configure parameters
-   make
+   $ SMPI_PRETEND_CC=1 cmake CC=smpicc # here come the other configure parameters
+   $ make
  
  Maybe your configure is using another variable, such as ``cc`` (in
  lower case) or similar. Just check the logs.
@@ -760,7 +795,7 @@ lower case) or similar. Just check the logs.
  error: unknown type name 'useconds_t'
  .....................................
  
-Try to add ``-D_GNU_SOURCE`` to your compilation line to get ride
+Try to add ``-D_GNU_SOURCE`` to your compilation line to get rid
  of that error.
  
  The reason is that SMPI provides its own version of ``usleep(3)``
@@ -787,7 +822,7 @@ SimGrid uses time-independent traces, in which each actor is given a
  script of the actions to do sequentially. These trace files can
  actually be captured with the online version of SMPI, as follows:
  
-.. code-block:: shell
+.. code-block:: console
  
     $ smpirun -trace-ti --cfg=tracing/filename:LU.A.32 -np 32 -platform ../cluster_backbone.xml bin/lu.A.32
  
@@ -800,12 +835,14 @@ To replay this with SMPI, you need to first compile the provided
  `simgrid/examples/smpi/replay
  <https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/replay>`_.
  
-.. code-block:: shell
+.. code-block:: console
  
     $ smpicxx ../replay.cpp -O3 -o ../smpi_replay
  
  Afterward, you can replay your trace in SMPI as follows:
  
+.. code-block:: console
+
     $ smpirun -np 32 -platform ../cluster_torus.xml -ext smpi_replay ../smpi_replay LU.A.32
  
  All the outputs are gone, as the application is not really simulated
@@ -814,3 +851,7 @@ simulation and the replay, you will see that the behavior is
  unchanged. The simulation does not run much faster on this very
  example, but this becomes very interesting when your application
  is computationally hungry.
+
+.. |br| raw:: html
+
+   <br />