convert the SMPI doc to sphinx

[simgrid.git] / doc / doxygen / module-smpi.doc
diff --git a/doc/doxygen/module-smpi.doc b/doc/doxygen/module-smpi.doc

deleted file mode 100644 (file)

index cdaa5f2..0000000
--- a/doc/doxygen/module-smpi.doc
+++ /dev/null
@@ -1,701 +0,0 @@
-/** 
-@defgroup SMPI_API      SMPI: Simulate real MPI applications
-@brief Programming environment for the simulation of MPI applications
-
-@tableofcontents
-
-SMPI enables the study of MPI application by emulating them on top of
-the SimGrid simulator. This is particularly interesting to study
-existing MPI applications within the comfort of the simulator. The
-SMPI reference article is available at
-https://hal.inria.fr/hal-01415484. You should also read the 
-<a href="http://simgrid.org/tutorials/simgrid-smpi-101.pdf">SMPI
-introductory slides</a>.
-
-Our goal is to enable the study of **unmodified MPI applications**.
-Some constructs and features are still missing, but we can probably
-add them on demand.  If you already used MPI before, SMPI should sound
-very familiar to you: Use smpicc instead of mpicc, and smpirun instead
-of mpirun. The main difference is that smpirun takes a virtual
-platform as extra parameter (see @ref platform).
-
-If you are new to MPI, you should first take our online [SMPI
-CourseWare](https://simgrid.github.io/SMPI_CourseWare/). It consists
-in several projects that progressively introduce the MPI concepts. It
-proposes to use SimGrid and SMPI to run the experiments, but the
-learning objectives are centered on MPI itself. 
-
-For **further scalability**, you may modify your code to speed up your
-studies or save memory space.  Maximal **simulation accuracy**
-requires some specific care from you.
-
- - @ref SMPI_use
-   - @ref SMPI_use_compile
-   - @ref SMPI_use_exec
-   - @ref SMPI_use_debug
-   - @ref SMPI_use_colls
-     - @ref SMPI_use_colls_algos
-     - @ref SMPI_use_colls_tracing
- - @ref SMPI_what
-   - @ref SMPI_what_coverage
-   - @ref SMPI_what_globals
- - @ref SMPI_adapting
-   - @ref SMPI_adapting_size
-   - @ref SMPI_adapting_speed
- - @ref SMPI_accuracy
- - @ref SMPI_troubleshooting
-   - @ref SMPI_trouble_configure_refuses_smpicc
-   - @ref SMPI_trouble_configure_dont_find_smpicc
-   - @ref SMPI_trouble_useconds_t
-
-
-@section SMPI_use Using SMPI
-
-@subsection SMPI_use_compile Compiling your code
-
-If your application is in C, then simply use <tt>smpicc</tt> as a
-compiler just like you use mpicc with other MPI implementations. This
-script still calls your default compiler (gcc, clang, ...) and adds
-the right compilation flags along the way. If your application is in
-C++, Fortran 77 or Fortran 90, use respectively <tt>smpicxx</tt>,
-<tt>smpiff</tt> or <tt>smpif90</tt>.
-
-@subsection SMPI_use_exec Executing your code on the simulator
-
-Use the <tt>smpirun</tt> script as follows for that:
-
-@verbatim
-smpirun -hostfile my_hostfile.txt -platform my_platform.xml ./program -blah
-@endverbatim
-
- - <tt>my_hostfile.txt</tt> is a classical MPI hostfile (that is, this
-   file lists the machines on which the processes must be dispatched, one
-   per line)
- - <tt>my_platform.xml</tt> is a classical SimGrid platform file. Of
-   course, the hosts of the hostfile must exist in the provided
-   platform.
- - <tt>./program</tt> is the MPI program to simulate, that you
-   compiled with <tt>smpicc</tt>
- - <tt>-blah</tt> is a command-line parameter passed to this program.
-
-<tt>smpirun</tt> accepts other parameters, such as <tt>-np</tt> if you
-don't want to use all the hosts defined in the hostfile, <tt>-map</tt>
-to display on which host each rank gets mapped of <tt>-trace</tt> to
-activate the tracing during the simulation. You can get the full list
-by running
-
-@verbatim
-smpirun -help
-@endverbatim
-
-@subsection SMPI_use_debug Debugging your code on top of SMPI
-
-If you want to explore the automatic platform and deployment files
-that are generated by @c smpirun, add @c -keep-temps to the command
-line.
-
-You can also run your simulation within valgrind or gdb using the
-following commands. Once in GDB, each MPI ranks will be represented as
-a regular thread, and you can explore the state of each of them as
-usual.
-@verbatim
-smpirun -wrapper valgrind ...other args...
-smpirun -wrapper "gdb --args" --cfg=contexts/factory:thread ...other args...
-@endverbatim
-
-@subsection SMPI_use_colls Simulating collective operations
-
-MPI collective operations are crucial to the performance of MPI
-applications and must be carefully optimized according to many
-parameters. Every existing implementation provides several algorithms
-for each collective operation, and selects by default the best suited
-one, depending on the sizes sent, the number of nodes, the
-communicator, or the communication library being used.  These
-decisions are based on empirical results and theoretical complexity
-estimation, and are very different between MPI implementations. In
-most cases, the users can also manually tune the algorithm used for
-each collective operation.
-
-SMPI can simulate the behavior of several MPI implementations:
-OpenMPI, MPICH,
-<a href="http://star-mpi.sourceforge.net/">STAR-MPI</a>, and
-MVAPICH2. For that, it provides 115 collective algorithms and several
-selector algorithms, that were collected directly in the source code
-of the targeted MPI implementations.
-
-You can switch the automatic selector through the
-\c smpi/coll-selector configuration item. Possible values:
-
- - <b>ompi</b>: default selection logic of OpenMPI (version 3.1.2)
- - <b>mpich</b>: default selection logic of MPICH (version 3.3b)
- - <b>mvapich2</b>: selection logic of MVAPICH2 (version 1.9) tuned
-   on the Stampede cluster   
- - <b>impi</b>: preliminary version of an Intel MPI selector (version
-   4.1.3, also tuned for the Stampede cluster). Due the closed source
-   nature of Intel MPI, some of the algorithms described in the
-   documentation are not available, and are replaced by mvapich ones.   
- - <b>default</b>: legacy algorithms used in the earlier days of
-   SimGrid. Do not use for serious perform performance studies.
-
-
-@subsubsection SMPI_use_colls_algos Available algorithms
-
-You can also pick the algorithm used for each collective with the
-corresponding configuration item. For example, to use the pairwise
-alltoall algorithm, one should add \c --cfg=smpi/alltoall:pair to the
-line. This will override the selector (if any) for this algorithm.
-It means that the selected algorithm will be used 
-
-Warning: Some collective may require specific conditions to be
-executed correctly (for instance having a communicator with a power of
-two number of nodes only), which are currently not enforced by
-Simgrid. Some crashes can be expected while trying these algorithms
-with unusual sizes/parameters
-
-#### MPI_Alltoall
-
-Most of these are best described in <a href="http://www.cs.arizona.edu/~dkl/research/papers/ics06.pdf">STAR-MPI</a>
-
- - default: naive one, by default
- - ompi: use openmpi selector for the alltoall operations
- - mpich: use mpich selector for the alltoall operations
- - mvapich2: use mvapich2 selector for the alltoall operations
- - impi: use intel mpi selector for the alltoall operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - bruck: Described by Bruck et.al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">this paper</a>
- - 2dmesh: organizes the nodes as a two dimensional mesh, and perform allgather 
-   along the dimensions
- - 3dmesh: adds a third dimension to the previous algorithm
- - rdb: recursive doubling: extends the mesh to a nth dimension, each one 
-   containing two nodes
- - pair: pairwise exchange, only works for power of 2 procs, size-1 steps,
-   each process sends and receives from the same process at each step
- - pair_light_barrier: same, with small barriers between steps to avoid
-   contention
- - pair_mpi_barrier: same, with MPI_Barrier used
- - pair_one_barrier: only one barrier at the beginning
- - ring: size-1 steps, at each step a process send to process (n+i)%size, and receives from (n-i)%size
- - ring_light_barrier: same, with small barriers between some phases to avoid contention
- - ring_mpi_barrier: same, with MPI_Barrier used
- - ring_one_barrier: only one barrier at the beginning
- - basic_linear: posts all receives and all sends,
-starts the communications, and waits for all communication to finish
- - mvapich2_scatter_dest: isend/irecv with scattered destinations, posting only a few messages at the same time
-
-#### MPI_Alltoallv
-
- - default: naive one, by default
- - ompi: use openmpi selector for the alltoallv operations
- - mpich: use mpich selector for the alltoallv operations
- - mvapich2: use mvapich2 selector for the alltoallv operations
- - impi: use intel mpi selector for the alltoallv operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - bruck: same as alltoall
- - pair: same as alltoall
- - pair_light_barrier: same as alltoall
- - pair_mpi_barrier: same as alltoall
- - pair_one_barrier: same as alltoall
- - ring: same as alltoall
- - ring_light_barrier: same as alltoall
- - ring_mpi_barrier: same as alltoall
- - ring_one_barrier: same as alltoall
- - ompi_basic_linear: same as alltoall
-
-#### MPI_Gather
-
- - default: naive one, by default
- - ompi: use openmpi selector for the gather operations
- - mpich: use mpich selector for the gather operations
- - mvapich2: use mvapich2 selector for the gather operations
- - impi: use intel mpi selector for the gather operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
-which will iterate over all implemented versions and output the best
- - ompi_basic_linear: basic linear algorithm from openmpi, each process sends to the root
- - ompi_binomial: binomial tree algorithm
- - ompi_linear_sync: same as basic linear, but with a synchronization at the
- beginning and message cut into two segments.
- - mvapich2_two_level: SMP-aware version from MVAPICH. Gather first intra-node (defaults to mpich's gather), and then exchange with only one process/node. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
-
-#### MPI_Barrier
-
- - default: naive one, by default
- - ompi: use openmpi selector for the barrier operations
- - mpich: use mpich selector for the barrier operations
- - mvapich2: use mvapich2 selector for the barrier operations
- - impi: use intel mpi selector for the barrier operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - ompi_basic_linear: all processes send to root
- - ompi_two_procs: special case for two processes
- - ompi_bruck: nsteps = sqrt(size), at each step, exchange data with rank-2^k and rank+2^k
- - ompi_recursivedoubling: recursive doubling algorithm
- - ompi_tree: recursive doubling type algorithm, with tree structure
- - ompi_doublering: double ring algorithm
- - mvapich2_pair: pairwise algorithm
- - mpich_smp: barrier intra-node, then inter-node
-
-#### MPI_Scatter
-
- - default: naive one, by default
- - ompi: use openmpi selector for the scatter operations
- - mpich: use mpich selector for the scatter operations
- - mvapich2: use mvapich2 selector for the scatter operations
- - impi: use intel mpi selector for the scatter operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - ompi_basic_linear: basic linear scatter 
- - ompi_binomial: binomial tree scatter
- - mvapich2_two_level_direct: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a basic linear inter node stage. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster. 
- - mvapich2_two_level_binomial: SMP aware algorithm, with an intra-node stage (default set to mpich selector), and then a binomial phase. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
-
-#### MPI_Reduce
-
- - default: naive one, by default
- - ompi: use openmpi selector for the reduce operations
- - mpich: use mpich selector for the reduce operations
- - mvapich2: use mvapich2 selector for the reduce operations
- - impi: use intel mpi selector for the reduce operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - arrival_pattern_aware: root exchanges with the first process to arrive
- - binomial: uses a binomial tree
- - flat_tree: uses a flat tree
- - NTSL: Non-topology-specific pipelined linear-bcast function 
-   0->1, 1->2 ,2->3, ....., ->last node: in a pipeline fashion, with segments
- of 8192 bytes
- - scatter_gather: scatter then gather
- - ompi_chain: openmpi reduce algorithms are built on the same basis, but the
- topology is generated differently for each flavor
-chain = chain with spacing of size/2, and segment size of 64KB 
- - ompi_pipeline: same with pipeline (chain with spacing of 1), segment size 
-depends on the communicator size and the message size
- - ompi_binary: same with binary tree, segment size of 32KB
- - ompi_in_order_binary: same with binary tree, enforcing order on the 
-operations
- - ompi_binomial: same with binomial algo (redundant with default binomial 
-one in most cases)
- - ompi_basic_linear: basic algorithm, each process sends to root
- - mvapich2_knomial: k-nomial algorithm. Default factor is 4 (mvapich2 selector adapts it through tuning)
- - mvapich2_two_level: SMP-aware reduce, with default set to mpich both for intra and inter communicators. Use mvapich2 selector to change these to tuned algorithms for Stampede cluster.
- - rab: <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a>'s reduce algorithm 
-
-#### MPI_Allreduce
-
- - default: naive one, by default
- - ompi: use openmpi selector for the allreduce operations
- - mpich: use mpich selector for the allreduce operations
- - mvapich2: use mvapich2 selector for the allreduce operations
- - impi: use intel mpi selector for the allreduce operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - lr: logical ring reduce-scatter then logical ring allgather
- - rab1: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: reduce_scatter then allgather
- - rab2: variations of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: alltoall then allgather
- - rab_rsag: variation of the  <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> algorithm: recursive doubling 
-reduce_scatter then recursive doubling allgather 
- - rdb: recursive doubling
- - smp_binomial: binomial tree with smp: binomial intra 
-SMP reduce, inter reduce, inter broadcast then intra broadcast
- - smp_binomial_pipeline: same with segment size = 4096 bytes
- - smp_rdb: intra: binomial allreduce, inter: Recursive 
-doubling allreduce, intra: binomial broadcast
- - smp_rsag: intra: binomial allreduce, inter: reduce-scatter, 
-inter:allgather, intra: binomial broadcast
- - smp_rsag_lr: intra: binomial allreduce, inter: logical ring 
-reduce-scatter, logical ring inter:allgather, intra: binomial broadcast
- - smp_rsag_rab: intra: binomial allreduce, inter: rab
-reduce-scatter, rab inter:allgather, intra: binomial broadcast
- - redbcast: reduce then broadcast, using default or tuned algorithms if specified
- - ompi_ring_segmented: ring algorithm used by OpenMPI
- - mvapich2_rs: rdb for small messages, reduce-scatter then allgather else
- - mvapich2_two_level: SMP-aware algorithm, with mpich as intra algoritm, and rdb as inter (Change this behavior by using mvapich2 selector to use tuned values)
- - rab: default <a href="https://fs.hlrs.de/projects/par/mpi//myreduce.html">Rabenseifner</a> implementation
-
-#### MPI_Reduce_scatter
-
- - default: naive one, by default
- - ompi: use openmpi selector for the reduce_scatter operations
- - mpich: use mpich selector for the reduce_scatter operations
- - mvapich2: use mvapich2 selector for the reduce_scatter operations
- - impi: use intel mpi selector for the reduce_scatter operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - ompi_basic_recursivehalving: recursive halving version from OpenMPI
- - ompi_ring: ring version from OpenMPI
- - mpich_pair: pairwise exchange version from MPICH
- - mpich_rdb: recursive doubling version from MPICH
- - mpich_noncomm: only works for power of 2 procs, recursive doubling for noncommutative ops
-
-
-#### MPI_Allgather
-
- - default: naive one, by default
- - ompi: use openmpi selector for the allgather operations
- - mpich: use mpich selector for the allgather operations
- - mvapich2: use mvapich2 selector for the allgather operations
- - impi: use intel mpi selector for the allgather operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - 2dmesh: see alltoall
- - 3dmesh: see alltoall
- - bruck: Described by Bruck et.al. in <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=642949">
-Efficient algorithms for all-to-all communications in multiport message-passing systems</a> 
- - GB: Gather - Broadcast (uses tuned version if specified)
- - loosely_lr: Logical Ring with grouping by core (hardcoded, default 
-processes/node: 4)
- - NTSLR: Non Topology Specific Logical Ring
- - NTSLR_NB: Non Topology Specific Logical Ring, Non Blocking operations
- - pair: see alltoall
- - rdb: see alltoall
- - rhv: only power of 2 number of processes
- - ring: see alltoall
- - SMP_NTS: gather to root of each SMP, then every root of each SMP node 
-post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message, 
-using logical ring algorithm (hardcoded, default processes/SMP: 8)
- - smp_simple: gather to root of each SMP, then every root of each SMP node 
-post INTER-SMP Sendrecv, then do INTRA-SMP Bcast for each receiving message, 
-using simple algorithm (hardcoded, default processes/SMP: 8)
- - spreading_simple: from node i, order of communications is i -> i + 1, i ->
- i + 2, ..., i -> (i + p -1) % P
- - ompi_neighborexchange: Neighbor Exchange algorithm for allgather. 
-Described by Chen et.al. in  <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1592302">Performance Evaluation of Allgather Algorithms on Terascale Linux Cluster with Fast Ethernet</a>
- - mvapich2_smp: SMP aware algorithm, performing intra-node gather, inter-node allgather with one process/node, and bcast intra-node
-
-
-#### MPI_Allgatherv
-
- - default: naive one, by default
- - ompi: use openmpi selector for the allgatherv operations
- - mpich: use mpich selector for the allgatherv operations
- - mvapich2: use mvapich2 selector for the allgatherv operations
- - impi: use intel mpi selector for the allgatherv operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - GB: Gatherv - Broadcast (uses tuned version if specified, but only for 
-Bcast, gatherv is not tuned)
- - pair: see alltoall
- - ring: see alltoall
- - ompi_neighborexchange: see allgather
- - ompi_bruck: see allgather
- - mpich_rdb: recursive doubling algorithm from MPICH
- - mpich_ring: ring algorithm from MPICh - performs differently from the  one from STAR-MPI
-
-#### MPI_Bcast
-
- - default: naive one, by default
- - ompi: use openmpi selector for the bcast operations
- - mpich: use mpich selector for the bcast operations
- - mvapich2: use mvapich2 selector for the bcast operations
- - impi: use intel mpi selector for the bcast operations
- - automatic (experimental): use an automatic self-benchmarking algorithm 
- - arrival_pattern_aware: root exchanges with the first process to arrive
- - arrival_pattern_aware_wait: same with slight variation
- - binomial_tree: binomial tree exchange
- - flattree: flat tree exchange
- - flattree_pipeline: flat tree exchange, message split into 8192 bytes pieces
- - NTSB: Non-topology-specific pipelined binary tree with 8192 bytes pieces
- - NTSL: Non-topology-specific pipelined linear with 8192 bytes pieces
- - NTSL_Isend: Non-topology-specific pipelined linear with 8192 bytes pieces, asynchronous communications
- - scatter_LR_allgather: scatter followed by logical ring allgather
- - scatter_rdb_allgather: scatter followed by recursive doubling allgather
- - arrival_scatter: arrival pattern aware scatter-allgather
- - SMP_binary: binary tree algorithm with 8 cores/SMP
- - SMP_binomial: binomial tree algorithm with 8 cores/SMP
- - SMP_linear: linear algorithm with 8 cores/SMP
- - ompi_split_bintree: binary tree algorithm from OpenMPI, with message split in 8192 bytes pieces
- - ompi_pipeline: pipeline algorithm from OpenMPI, with message split in 128KB pieces
- - mvapich2_inter_node: Inter node default mvapich worker 
- - mvapich2_intra_node: Intra node default mvapich worker
- - mvapich2_knomial_intra_node:  k-nomial intra node default mvapich worker. default factor is 4.
-
-#### Automatic evaluation 
-
-(Warning: This is still very experimental)
-
-An automatic version is available for each collective (or even as a selector). This specific 
-version will loop over all other implemented algorithm for this particular collective, and apply 
-them while benchmarking the time taken for each process. It will then output the quickest for 
-each process, and the global quickest. This is still unstable, and a few algorithms which need 
-specific number of nodes may crash.
-
-#### Adding an algorithm
-
-To add a new algorithm, one should check in the src/smpi/colls folder how other algorithms 
-are coded. Using plain MPI code inside Simgrid can't be done, so algorithms have to be 
-changed to use smpi version of the calls instead (MPI_Send will become smpi_mpi_send). Some functions may have different signatures than their MPI counterpart, please check the other algorithms or contact us using <a href="http://lists.gforge.inria.fr/mailman/listinfo/simgrid-devel">SimGrid developers mailing list</a>.
-
-Example: adding a "pair" version of the Alltoall collective.
-
- - Implement it in a file called alltoall-pair.c in the src/smpi/colls folder. This file should include colls_private.hpp.
-
- - The name of the new algorithm function should be smpi_coll_tuned_alltoall_pair, with the same signature as MPI_Alltoall.
-
- - Once the adaptation to SMPI code is done, add a reference to the file ("src/smpi/colls/alltoall-pair.c") in the SMPI_SRC part of the DefinePackages.cmake file inside buildtools/cmake, to allow the file to be built and distributed.
-
- - To register the new version of the algorithm, simply add a line to the corresponding macro in src/smpi/colls/cools.h ( add a "COLL_APPLY(action, COLL_ALLTOALL_SIG, pair)" to the COLL_ALLTOALLS macro ). The algorithm should now be compiled and be selected when using --cfg=smpi/alltoall:pair at runtime.
-
- - To add a test for the algorithm inside Simgrid's test suite, juste add the new algorithm name in the ALLTOALL_COLL list found inside teshsuite/smpi/CMakeLists.txt . When running ctest, a test for the new algorithm should be generated and executed. If it does not pass, please check your code or contact us.
-
- - Please submit your patch for inclusion in SMPI, for example through a pull request on GitHub or directly per email.
-
-@subsubsection SMPI_use_colls_tracing Tracing of internal communications
-
-By default, the collective operations are traced as a unique operation
-because tracing all point-to-point communications composing them could
-result in overloaded, hard to interpret traces. If you want to debug
-and compare collective algorithms, you should set the
-\c tracing/smpi/internals configuration item to 1 instead of 0.
-
-Here are examples of two alltoall collective algorithms runs on 16 nodes, 
-the first one with a ring algorithm, the second with a pairwise one:
-
-@htmlonly
-<a href="smpi_simgrid_alltoall_ring_16.png" border=0><img src="smpi_simgrid_alltoall_ring_16.png" width="30%" border=0 align="center"></a>
-<a href="smpi_simgrid_alltoall_pair_16.png" border=0><img src="smpi_simgrid_alltoall_pair_16.png" width="30%" border=0 align="center"></a>
-<br/>
-@endhtmlonly
-
-@section SMPI_what What can run within SMPI?
-
-You can run unmodified MPI applications (both C/C++ and Fortran) within
-SMPI, provided that you only use MPI calls that we implemented. Global
-variables should be handled correctly on Linux systems.
-
-@subsection SMPI_what_coverage MPI coverage of SMPI
-
-Our coverage of the interface is very decent, but still incomplete;
-Given the size of the MPI standard, we may well never manage to 
-implement absolutely all existing primitives. Currently, we have
-almost no support for I/O primitives, but we still pass a very large
-amount of the MPICH coverage tests.
-
-The full list of not yet implemented functions is documented in the
-file @ref include/smpi/smpi.h, between two lines containing the
-<tt>FIXME</tt> marker. If you really miss a feature, please get in
-touch with us: we can guide you though the SimGrid code to help you
-implementing it, and we'd glad to integrate your contribution to the
-main project afterward.
-
-@subsection SMPI_what_globals Privatization of global variables
-
-Concerning the globals, the problem comes from the fact that usually,
-MPI processes run as real UNIX processes while they are all folded
-into threads of a unique system process in SMPI. Global variables are
-usually private to each MPI process while they become shared between
-the processes in SMPI.  The problem and some potential solutions are
-discussed in this article: "Automatic Handling of Global Variables for
-Multi-threaded MPI Programs", available at
-http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf (note that this
-article does not deal with SMPI but with a competing solution called
-AMPI that suffers of the same issue).  This point used to be
-problematic in SimGrid, but the problem should now be handled
-automatically on Linux.
-
-Older versions of SimGrid came with a script that automatically
-privatized the globals through static analysis of the source code. But
-our implementation was not robust enough to be used in production, so
-it was removed at some point. Currently, SMPI comes with two
-privatization mechanisms that you can @ref options_smpi_privatization
-"select at runtime". At the time of writing (v3.18), the dlopen
-approach is considered to be very fast (it's used by default) while
-the mmap approach is considered to be rather slow but very robust.
-
-With the <b>mmap approach</b>, SMPI duplicates and dynamically switch
-the \c .data and \c .bss segments of the ELF process when switching
-the MPI ranks. This allows each ranks to have its own copy of the
-global variables.  No copy actually occures as this mechanism uses \c
-mmap for efficiency. This mechanism is considered to be very robust on
-all systems supporting \c mmap (Linux and most BSDs). Its performance
-is questionable since each context switch between MPI ranks induces
-several syscalls to change the \c mmap that redirects the \c .data and
-\c .bss segments to the copies of the new rank. The code will also be
-copied several times in memory, inducing a slight increase of memory
-occupation.
-
-Another limitation is that SMPI only accounts for global variables
-defined in the executable. If the processes use external global
-variables from dynamic libraries, they won't be switched
-correctly. The easiest way to solve this is to statically link against
-the library with these globals. This way, each MPI rank will get its
-own copy of these libraries. Of course you should never statically
-link against the SimGrid library itself.
-
-With the <b>dlopen approach</b>, SMPI loads several copies of the same
-executable in memory as if it were a library, so that the global
-variables get naturally duplicated. It first requires the executable
-to be compiled as a relocatable binary, which is less common for
-programs than for libraries. But most distributions are now compiled
-this way for security reason as it allows to randomize the address
-space layout. It should thus be safe to compile most (any?) program
-this way.  The second trick is that the dynamic linker refuses to link
-the exact same file several times, be it a library or a relocatable
-executable. It makes perfectly sense in the general case, but we need
-to circumvent this rule of thumb in our case. To that extend, the
-binary is copied in a temporary file before being re-linked against.
-`dlmopen()` cannot be used as it only allows 256 contextes, and as it
-would also dupplicate simgrid itself.
-
-This approach greatly speeds up the context switching, down to about
-40 CPU cycles with our raw contextes, instead of requesting several
-syscalls with the \c mmap approach. Another advantage is that it
-permits to run the SMPI contexts in parallel, which is obviously not
-possible with the \c mmap approach. It was tricky to implement, but we
-are not aware of any flaws, so smpirun activates it by default.
-
-In the future, it may be possible to further reduce the memory and
-disk consumption. It seems that we could <a
-href="https://lwn.net/Articles/415889/">punch holes</a> in the files
-before dl-loading them to remove the code and constants, and mmap
-these area onto a unique copy. If done correctly, this would reduce
-the disk- and memory- usage to the bare minimum, and would also reduce
-the pressure on the CPU instruction cache. See 
-<a href="https://github.com/simgrid/simgrid/issues/137">the relevant
-bug</a> on github for implementation leads.\n
-
-Also, currently, only the binary is copied and dlopen-ed for each MPI
-rank. We could probably extend this to external dependencies, but for
-now, any external dependencies must be statically linked into your
-application. As usual, simgrid itself shall never be statically linked
-in your app. You don't want to give a copy of SimGrid to each MPI rank:
-that's ways too much for them to deal with.
-
-@section SMPI_adapting Adapting your MPI code for further scalability
-
-As detailed in the reference article (available at
-http://hal.inria.fr/hal-01415484), you may want to adapt your code
-to improve the simulation performance. But these tricks may seriously
-hinder the result quality (or even prevent the app to run) if used
-wrongly. We assume that if you want to simulate an HPC application,
-you know what you are doing. Don't prove us wrong!
-
-@subsection SMPI_adapting_size Reducing your memory footprint
-
-If you get short on memory (the whole app is executed on a single node when
-simulated), you should have a look at the SMPI_SHARED_MALLOC and
-SMPI_SHARED_FREE macros. It allows to share memory areas between processes: The
-purpose of these macro is that the same line malloc on each process will point
-to the exact same memory area. So if you have a malloc of 2M and you have 16
-processes, this macro will change your memory consumption from 2M*16 to 2M
-only. Only one block for all processes.
-
-If your program is ok with a block containing garbage value because all
-processes write and read to the same place without any kind of coordination,
-then this macro can dramatically shrink your memory consumption. For example,
-that will be very beneficial to a matrix multiplication code, as all blocks will
-be stored on the same area. Of course, the resulting computations will useless,
-but you can still study the application behavior this way. 
-
-Naturally, this won't work if your code is data-dependent. For example, a Jacobi
-iterative computation depends on the result computed by the code to detect
-convergence conditions, so turning them into garbage by sharing the same memory
-area between processes does not seem very wise. You cannot use the
-SMPI_SHARED_MALLOC macro in this case, sorry.
-
-This feature is demoed by the example file
-<tt>examples/smpi/NAS/dt.c</tt>
-
-@subsection SMPI_adapting_speed Toward faster simulations
-
-If your application is too slow, try using SMPI_SAMPLE_LOCAL,
-SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
-be sampled. Some of the loop iterations will be executed to measure
-their duration, and this duration will be used for the subsequent
-iterations. These samples are done per processor with
-SMPI_SAMPLE_LOCAL, and shared between all processors with
-SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
-time of your loop iteration are not stable.
-
-This feature is demoed by the example file 
-<tt>examples/smpi/NAS/ep.c</tt>
-
-@section SMPI_accuracy Ensuring accurate simulations
-
-Out of the box, SimGrid may give you fairly accurate results, but
-there is a plenty of factors that could go wrong and make your results
-inaccurate or even plainly wrong. Actually, you can only get accurate
-results of a nicely built model, including both the system hardware
-and your application. Such models are hard to pass over and reuse in
-other settings, because elements that are not relevant to an
-application (say, the latency of point-to-point communications,
-collective operation implementation details or CPU-network
-interaction) may be irrelevant to another application. The dream of
-the perfect model, encompassing every aspects is only a chimera, as
-the only perfect model of the reality is the reality. If you go for
-simulation, then you have to ignore some irrelevant aspects of the
-reality, but which aspects are irrelevant is actually
-application-dependent...
-
-The only way to assess whether your settings provide accurate results
-is to double-check these results. If possible, you should first run
-the same experiment in simulation and in real life, gathering as much
-information as you can. Try to understand the discrepancies in the
-results that you observe between both settings (visualization can be
-precious for that). Then, try to modify your model (of the platform,
-of the collective operations) to reduce the most preeminent differences.
-
-If the discrepancies come from the computing time, try adapting the \c
-smpi/host-speed: reduce it if your simulation runs faster than in
-reality. If the error come from the communication, then you need to
-fiddle with your platform file.
-
-Be inventive in your modeling. Don't be afraid if the names given by
-SimGrid does not match the real names: we got very good results by
-modeling multicore/GPU machines with a set of separate hosts
-interconnected with very fast networks (but don't trust your model
-because it has the right names in the right place either).
-
-Finally, you may want to check [this
-article](https://hal.inria.fr/hal-00907887) on the classical pitfalls
-in modeling distributed systems.
-
-@section SMPI_troubleshooting Troubleshooting with SMPI
-
-@subsection SMPI_trouble_configure_refuses_smpicc ./configure refuses to use smpicc
-
-If your <tt>./configure</tt> reports that the compiler is not
-functional or that you are cross-compiling, try to define the
-<tt>SMPI_PRETEND_CC</tt> environment variable before running the
-configuration.
-
-@verbatim
-SMPI_PRETEND_CC=1 ./configure # here come the configure parameters
-make
-@endverbatim
-
-Indeed, the programs compiled with <tt>smpicc</tt> cannot be executed
-without <tt>smpirun</tt> (they are shared libraries, and they do weird
-things on startup), while configure wants to test them directly.
-With <tt>SMPI_PRETEND_CC</tt> smpicc does not compile as shared,
-and the SMPI initialization stops and returns 0 before doing anything
-that would fail without <tt>smpirun</tt>.
-
-\warning 
-
-  Make sure that SMPI_PRETEND_CC is only set when calling ./configure,
-  not during the actual execution, or any program compiled with smpicc
-  will stop before starting.
-
-@subsection SMPI_trouble_configure_dont_find_smpicc ./configure does not pick smpicc as a compiler
-
-In addition to the previous answers, some projects also need to be
-explicitely told what compiler to use, as follows:
-
-@verbatim
-SMPI_PRETEND_CC=1 ./configure CC=smpicc # here come the other configure parameters
-make
-@endverbatim
-
-Maybe your configure is using another variable, such as <tt>cc</tt> or
-similar. Just check the logs.
-
-@subsection SMPI_trouble_useconds_t  error: unknown type name 'useconds_t'
-
-Try to add <tt>-D_GNU_SOURCE</tt> to your compilation line to get ride
-of that error.
-
-The reason is that SMPI provides its own version of <tt>usleep(3)</tt>
-to override it and to block in the simulation world, not in the real
-one. It needs the <tt>useconds_t</tt> type for that, which is declared
-only if you declare <tt>_GNU_SOURCE</tt> before including
-<tt>unistd.h</tt>. If your project includes that header file before
-SMPI, then you need to ensure that you pass the right configuration
-defines as advised above.
-
-
-*/
-
-
-/** @example include/smpi/smpi.h */