In this mode, your application is actually executed. Every computation
occurs for real while every communication is simulated. In addition,
the executions are automatically benchmarked so that their timings can
be applied within the simulator.
SMPI can also go offline by replaying a trace. :ref:`Trace replay
<SMPI_offline>` is usually much faster than online simulation (because
nothing is actually executed).
- **ompi**: default selection logic of OpenMPI (version 3.1.2)
- **mpich**: default selection logic of MPICH (version 3.3b)
- **mvapich2**: selection logic of MVAPICH2 (version 1.9) tuned
  for the Stampede cluster
- **impi**: preliminary version of an Intel MPI selector (version
  4.1.3, also tuned for the Stampede cluster). Due to the closed-source
  nature of Intel MPI, some of the algorithms described in the
- **default**: legacy algorithms used in the earlier days of
  SimGrid. Do not use for serious performance studies.
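
Whichever selector or per-collective algorithm you pick, the application code
itself is not modified: the choice is made at simulation time, typically
through configuration items such as ``smpi/coll-selector`` or ``smpi/alltoall``
passed to ``smpirun``. As a minimal, purely illustrative sketch, the
``MPI_Alltoall`` call below is simply mapped by SMPI onto whichever algorithm
was selected:

.. code-block:: c

   #include <mpi.h>
   #include <stdlib.h>

   int main(int argc, char *argv[])
   {
     MPI_Init(&argc, &argv);
     int rank, size;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     int *sendbuf = malloc(size * sizeof(int));
     int *recvbuf = malloc(size * sizeof(int));
     for (int i = 0; i < size; i++)
       sendbuf[i] = rank * size + i;

     /* SMPI implements this call with the algorithm chosen by the selector
      * or by the smpi/alltoall configuration item; the code is unchanged. */
     MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

     free(sendbuf);
     free(recvbuf);
     MPI_Finalize();
     return 0;
   }
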
- mpich: use mpich selector for the alltoall operations
- mvapich2: use mvapich2 selector for the alltoall operations
- impi: use intel mpi selector for the alltoall operations
containing two nodes
- pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
each process sends and receives from the same process at each step
- mpich: use mpich selector for the alltoallv operations
- mvapich2: use mvapich2 selector for the alltoallv operations
- impi: use intel mpi selector for the alltoallv operations
- mpich: use mpich selector for the barrier operations
- mvapich2: use mvapich2 selector for the barrier operations
- impi: use intel mpi selector for the barrier operations
- ompi_basic_linear: all processes send to root
- ompi_two_procs: special case for two processes
- ompi_bruck: nsteps = sqrt(size), at each step, exchange data with rank-2^k and rank+2^k
- mpich: use mpich selector for the scatter operations
- mvapich2: use mvapich2 selector for the scatter operations
- impi: use intel mpi selector for the scatter operations
- automatic (experimental): use an automatic self-benchmarking algorithm
- ompi_basic_linear: basic linear scatter
- ompi_binomial: binomial tree scatter
- mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
- mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
- mpich: use mpich selector for the reduce operations
- mvapich2: use mvapich2 selector for the reduce operations
- impi: use intel mpi selector for the reduce operations
- arrival_pattern_aware: root exchanges with the first process to arrive
- binomial: uses a binomial tree
- flat_tree: uses a flat tree
0->1, 1->2, 2->3, ..., ->last node: in a pipeline fashion, with segments
of 8192 bytes
- scatter_gather: scatter then gather
- ompi_chain: openmpi reduce algorithms are built on the same basis, but the
topology is generated differently for each flavor
  chain = chain with spacing of size/2, and segment size of 64KB
- ompi_pipeline: same with pipeline (chain with spacing of 1), segment size
  depends on the communicator size and the message size
- ompi_binary: same with binary tree, segment size of 32KB
one in most cases)
- ompi_basic_linear: basic algorithm, each process sends to root
- mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
- mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
- mpich: use mpich selector for the allreduce operations
- mvapich2: use mvapich2 selector for the allreduce operations
- impi: use intel mpi selector for the allreduce operations
- lr: logical ring reduce-scatter then logical ring allgather
- rab1: variation of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: reduce_scatter then allgather
- rab2: variation of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: alltoall then allgather
- rab_rsag: variation of the `Rabenseifner <https://fs.hlrs.de/projects/par/mpi//myreduce.html>`_ algorithm: recursive doubling
  reduce_scatter then recursive doubling allgather
SMP reduce, inter reduce, inter broadcast then intra broadcast
- smp_binomial_pipeline: same with segment size = 4096 bytes
reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
- smp_rsag_rab: intra: binomial allreduce, inter: rab
reduce-scatter, rab inter:allgather, intra: binomial broadcast
- mpich: use mpich selector for the reduce_scatter operations
- mvapich2: use mvapich2 selector for the reduce_scatter operations
- impi: use intel mpi selector for the reduce_scatter operations
- ompi_basic_recursivehalving: recursive halving version from OpenMPI
- ompi_ring: ring version from OpenMPI
- mpich_pair: pairwise exchange version from MPICH
- mpich: use mpich selector for the allgather operations
- mvapich2: use mvapich2 selector for the allgather operations
- impi: use intel mpi selector for the allgather operations
processes/node: 4)
- NTSLR: Non Topology Specific Logical Ring
- NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
- SMP_NTS: gather to root of each SMP, then every root of each SMP node
  post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
- smp_simple: gather to root of each SMP, then every root of each SMP node
  post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message,
using simple algorithm (hardcoded, default processes/SMP: 8)
- spreading_simple: from node i, order of communications is i -> i + 1, i ->
i + 2, ..., i -> (i + p -1) % P
Described by Chen et al. in `Performance Evaluation of Allgather
Algorithms on Terascale Linux Cluster with Fast Ethernet <http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302>`_
- mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
- mpich: use mpich selector for the allgatherv operations
- mvapich2: use mvapich2 selector for the allgatherv operations
- impi: use intel mpi selector for the allgatherv operations
- GB: Gatherv - Broadcast (uses tuned version if specified, but only for Bcast, gatherv is not tuned)
- pair: see alltoall
- ring: see alltoall
- mpich: use mpich selector for the bcast operations
- mvapich2: use mvapich2 selector for the bcast operations
- impi: use intel mpi selector for the bcast operations
- arrival_pattern_aware: root exchanges with the first process to arrive
- arrival_pattern_aware_wait: same with slight variation
- binomial_tree: binomial tree exchange
- SMP_linear: linear algorithm with 8 cores/SMP
- ompi_split_bintree: binary tree algorithm from OpenMPI, with the message split into 8192-byte pieces
- ompi_pipeline: pipeline algorithm from OpenMPI, with the message split into 128KB pieces
- mvapich2_intra_node: Intra-node default mvapich worker
- mvapich2_knomial_intra_node: k-nomial intra-node default mvapich worker. Default factor is 4.
An automatic version is available for each collective (or even as a selector). This specific
version will loop over all other implemented algorithms for this particular collective, and apply
them while benchmarking the time taken for each process. It will then output the quickest for
each process, and the global quickest. This is still unstable, and a few algorithms which need
and compare collective algorithms, you should set the
``tracing/smpi/internals`` configuration item to 1 instead of 0.
the first one with a ring algorithm, the second with a pairwise one.
.. image:: /img/smpi_simgrid_alltoall_ring_16.png
:align: center
implement absolutely all existing primitives. Currently, we have
almost no support for I/O primitives, but we still pass a very large
fraction of the MPICH coverage tests.
then this macro can dramatically shrink your memory consumption. For example,
that will be very beneficial to a matrix multiplication code, as all blocks will
be stored on the same area. Of course, the resulting computations will be useless.
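
As a sketch of this trick, the three blocks of a naive multiplication can be
allocated with the ``SMPI_SHARED_MALLOC`` macro from ``<smpi/smpi.h>`` so that
they all alias the same physical pages (the sizes and the loop below are
purely illustrative):

.. code-block:: c

   #include <smpi/smpi.h>
   #include <stdlib.h>

   /* Illustrative size only: every matrix is backed by the same shared
    * pages, so the memory footprint stays small even for a large n. The
    * numerical result is meaningless, but the control flow, the amount of
    * computation and the communications of the real code are preserved. */
   static const size_t n = 2048;

   void multiply_blocks(void)
   {
     double *A = SMPI_SHARED_MALLOC(n * n * sizeof(double));
     double *B = SMPI_SHARED_MALLOC(n * n * sizeof(double));
     double *C = SMPI_SHARED_MALLOC(n * n * sizeof(double));

     for (size_t i = 0; i < n; i++)
       for (size_t j = 0; j < n; j++)
         for (size_t k = 0; k < n; k++)
           C[i * n + j] += A[i * n + k] * B[k * n + j];

     SMPI_SHARED_FREE(A);
     SMPI_SHARED_FREE(B);
     SMPI_SHARED_FREE(C);
   }
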
Naturally, this won't work if your code is data-dependent. For example, a Jacobi
iterative computation depends on the result computed by the code to detect
convergence.
SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
times of your loop iterations are not stable.
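
A minimal sketch of such a sampled loop is shown below, assuming that the two
macro parameters give the number of iterations to really benchmark and the
target relative precision; ``compute_kernel`` is a hypothetical kernel with a
stable duration:

.. code-block:: c

   #include <smpi/smpi.h>

   /* Hypothetical kernel whose duration is roughly identical at each call,
    * which is the condition for sampling to give meaningful timings. */
   static void compute_kernel(double *data, int n)
   {
     for (int i = 0; i < n; i++)
       data[i] = data[i] * 0.5 + 1.0;
   }

   void solver_loop(double *data, int n)
   {
     for (int iter = 0; iter < 1000; iter++) {
       /* Only a few iterations are executed and timed for real; the
        * remaining ones are replaced by injecting the measured average. */
       SMPI_SAMPLE_GLOBAL(10, 0.01) {
         compute_kernel(data, n);
       }
     }
   }
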
`examples/smpi/NAS/ep.c <https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/NAS/ep.c>`_
.............................
precious for that). Then, try to modify your model (of the platform,
of the collective operations) to reduce the most prominent differences.
``smpi/host-speed``: reduce it if your simulation runs faster than in
reality. If the error comes from the communication, then you need to
fiddle with your platform file.
Although SMPI is often used for :ref:`online simulation
<SMPI_online>`, where the application is executed for real, you can
SimGrid uses time-independent traces, in which each actor is given a
script of the actions to do sequentially. These trace files can
The produced trace is composed of a file ``LU.A.32`` and a folder
``LU.A.32_files``. The file names don't match with the MPI ranks, but