+ - Please submit your patch for inclusion in SMPI, for example through a pull request on GitHub or directly by email.
+
+@subsubsection SMPI_use_colls_tracing Tracing of internal communications
+
+By default, collective operations are traced as a single operation,
+because tracing all the point-to-point communications composing them
+could result in overloaded, hard-to-interpret traces. If you want to
+debug and compare collective algorithms, you should set the
+\c tracing/smpi/internals configuration item to 1 instead of 0.
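+
+For instance, assuming a platform file, a hostfile and an application
+binary with the placeholder names used below, the internal collective
+communications can be traced with an invocation along these lines:
+
+@verbatim
+smpirun -trace --cfg=tracing/smpi/internals:1 \
+        -np 16 -platform platform.xml -hostfile hostfile ./alltoall_test
+@endverbatim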
+
+Here are example traces of the alltoall collective run on 16 nodes,
+the first one obtained with a ring algorithm, the second with a pairwise one:
+
+@htmlonly
+<a href="smpi_simgrid_alltoall_ring_16.png" border=0><img src="smpi_simgrid_alltoall_ring_16.png" width="30%" border=0 align="center"></a>
+<a href="smpi_simgrid_alltoall_pair_16.png" border=0><img src="smpi_simgrid_alltoall_pair_16.png" width="30%" border=0 align="center"></a>
+<br/>
+@endhtmlonly
+
+@section SMPI_what What can run within SMPI?
+
+You can run unmodified MPI applications (both C/C++ and Fortran) within
+SMPI, provided that you only use MPI calls that we implemented. Global
+variables should be handled correctly on Linux systems.
+
+@subsection SMPI_what_coverage MPI coverage of SMPI
+
+Our coverage of the interface is very decent, but still incomplete;
+given the size of the MPI standard, we may well never manage to
+implement absolutely all existing primitives. We currently have
+almost no support for the I/O primitives, but we still pass a very
+large fraction of the MPICH coverage tests.
+
+The full list of not-yet-implemented functions is documented in the
+file @ref include/smpi/smpi.h, between two lines containing the
+<tt>FIXME</tt> marker. If you really miss a feature, please get in
+touch with us: we can guide you through the SimGrid code to help you
+implement it, and we would be glad to integrate your contribution into
+the main project afterward.
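+
+For instance, one simple (unofficial) way to display that block from a
+source checkout is:
+
+@verbatim
+sed -n '/FIXME/,/FIXME/p' include/smpi/smpi.h
+@endverbatim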
+
+@subsection SMPI_what_globals Privatization of global variables
+
+Concerning the global variables, the problem comes from the fact that
+MPI processes usually run as real UNIX processes, while in SMPI they
+are all folded into threads of a single system process. Global
+variables that are private to each MPI process in the usual setting
+thus become shared between the ranks in SMPI. The problem and some
+potential solutions are discussed in the article "Automatic Handling
+of Global Variables for Multi-threaded MPI Programs", available at
+http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf (note that this
+article does not deal with SMPI but with a competing solution called
+AMPI that suffers from the same issue). This point used to be
+problematic in SimGrid, but the problem should now be handled
+automatically on Linux.
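+
+As a minimal (hypothetical) illustration of the issue, consider the
+following fragment: with real UNIX processes, each rank has its own
+\c calls counter, but if the ranks are folded into threads of a single
+process without privatization, they all update the same variable.
+
+@code
+#include <mpi.h>
+#include <stdio.h>
+
+static int calls = 0; /* one copy per process in real MPI, but shared by
+                         all ranks if they are mere threads of one process */
+
+int main(int argc, char *argv[])
+{
+  int rank;
+  MPI_Init(&argc, &argv);
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+  calls++; /* each rank expects to read 1 here */
+  printf("rank %d sees calls=%d\n", rank, calls);
+
+  MPI_Finalize();
+  return 0;
+}
+@endcode
+
+With privatization enabled, every rank prints \c calls=1, as it would
+with regular MPI processes.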
+
+Older versions of SimGrid came with a script that automatically
+privatized the globals through static analysis of the source code, but
+our implementation was not robust enough to be used in production, so
+it was removed at some point. Currently, SMPI comes with two
+privatization mechanisms that you can @ref options_smpi_privatization
+"select at runtime". At the time of writing (v3.18), the dlopen
+approach is considered to be very fast (it is used by default) while
+the mmap approach is considered to be rather slow but very robust.
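+
+For instance, assuming that the configuration item is named
+\c smpi/privatization (see the page linked above for the authoritative
+name and accepted values), the mmap mechanism could be forced from the
+command line as follows (placeholder arguments):
+
+@verbatim
+smpirun --cfg=smpi/privatization:mmap \
+        -np 16 -platform platform.xml -hostfile hostfile ./my_app
+@endverbatim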
+
+With the <b>mmap approach</b>, SMPI duplicates the \c .data and
+\c .bss segments of the ELF process and dynamically switches between
+the copies when switching between MPI ranks. This allows each rank to
+have its own copy of the global variables. No actual copy occurs at
+each switch, as this mechanism relies on \c mmap for efficiency. It is
+considered to be very robust on all systems supporting \c mmap (Linux
+and most BSDs). Its performance is questionable, however, since each
+context switch between MPI ranks induces several system calls to
+change the memory mapping that redirects the \c .data and \c .bss
+segments to the copies of the new rank. The code is also copied
+several times in memory, inducing a slight increase in memory usage.
+
+Another limitation is that SMPI only accounts for the global variables
+defined in the executable. If the processes use external global
+variables from dynamic libraries, these will not be switched
+correctly. The easiest way to solve this is to statically link against
+the library that defines these globals; this way, each MPI rank will
+get its own copy of that library. Of course, you should never
+statically link against the SimGrid library itself.
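+
+For instance, assuming the dependency ships a static archive named
+\c libdeps.a (a placeholder name), you could pass that archive directly
+at link time so that its globals end up in the executable's own
+segments:
+
+@verbatim
+smpicc -o my_app my_app.c /path/to/libdeps.a
+@endverbatim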
+
+With the <b>dlopen approach</b>, SMPI loads several copies of the same
+executable in memory as if it were a library, so that the global
+variables get naturally duplicated. This first requires the executable
+to be compiled as a relocatable binary, which is less common for
+programs than for libraries. But most distributions now build
+everything this way for security reasons, as it allows the address
+space layout to be randomized. It should thus be safe to compile most
+(any?) program this way. The second trick is that the dynamic linker
+refuses to link the exact same file several times, be it a library or
+a relocatable executable. This makes perfect sense in the general
+case, but we need to circumvent this rule here. To that end, the
+binary is copied to a temporary file before being re-linked against.
+`dlmopen()` cannot be used, as it only allows 256 contexts and would
+also duplicate SimGrid itself.
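+
+The following sketch merely illustrates this copy-then-dlopen idea; it
+is not the actual SMPI code, and the helper name and file paths are
+made up (link with \c -ldl on glibc):
+
+@code
+#include <dlfcn.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+/* Load a private copy of the (relocatable) binary for a given rank. */
+static void *load_rank_copy(const char *exe, int rank)
+{
+  char copy[256], cmd[1024];
+
+  /* The dynamic linker refuses to load the very same file twice, so
+     give each rank its own on-disk copy of the binary. */
+  snprintf(copy, sizeof copy, "/tmp/smpi_binary_copy_%d", rank);
+  snprintf(cmd, sizeof cmd, "cp -- '%s' '%s'", exe, copy);
+  if (system(cmd) != 0)
+    return NULL;
+
+  /* Each dlopen() of a distinct file gets its own .data and .bss,
+     hence its own set of global variables. */
+  void *handle = dlopen(copy, RTLD_LAZY | RTLD_LOCAL);
+  if (handle == NULL)
+    fprintf(stderr, "dlopen(%s) failed: %s\n", copy, dlerror());
+  return handle;
+}
+
+int main(int argc, char *argv[])
+{
+  if (argc < 2)
+    return 1;
+  return load_rank_copy(argv[1], 0) == NULL;
+}
+@endcode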
+
+This approach greatly speeds up the context switching, down to about
+40 CPU cycles with our raw contexts, instead of the several system
+calls required by the \c mmap approach. Another advantage is that it
+permits running the SMPI contexts in parallel, which is obviously not
+possible with the \c mmap approach. It was tricky to implement, but we
+are not aware of any flaw, so smpirun activates it by default.
+
+In the future, it may be possible to further reduce the memory and
+disk consumption. It seems that we could <a
+href="https://lwn.net/Articles/415889/">punch holes</a> in the files
+before dl-loading them to remove the code and constants, and mmap
+these areas onto a unique copy. If done correctly, this would reduce
+the disk and memory usage to the bare minimum, and would also reduce
+the pressure on the CPU instruction cache. See
+<a href="https://github.com/simgrid/simgrid/issues/137">the relevant
+bug</a> on GitHub for implementation leads.
+
+Also, currently, only the binary is copied and dlopen-ed for each MPI
+rank. We could probably extend this to external dependencies, but for
+now any external dependencies must be statically linked into your
+application. As usual, SimGrid itself shall never be statically linked
+into your application: you don't want to give a copy of SimGrid to
+each MPI rank, that's way too much for them to deal with.
+
+@section SMPI_adapting Adapting your MPI code for further scalability
+
+As detailed in the reference article (available at
+http://hal.inria.fr/hal-01415484), you may want to adapt your code to
+improve the simulation performance. But these tricks may seriously
+hinder the quality of the results (or even prevent the application
+from running) if used wrongly. We assume that if you want to simulate
+an HPC application, you know what you are doing. Don't prove us wrong!
+
+@subsection SMPI_adapting_size Reducing your memory footprint