-Multi-threaded MPI Programs",
-available at http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf
-(note that this article does not deal with SMPI but with a competing
-solution called AMPI that suffers of the same issue).
-
-SimGrid can duplicate and dynamically switch the .data and .bss
-segments of the ELF process when switching the MPI ranks, allowing
-each ranks to have its own copy of the global variables. This feature
-is expected to work correctly on Linux and BSD, so smpirun activates
-it by default. As no copy is involved, performance should not be
-altered (but memory occupation will be higher).
-
-If you want to turn it off, pass \c -no-privatize to smpirun. This may
-be necessary if your application uses dynamic libraries as the global
-variables of these libraries will not be privatized. You can fix this
-by linking statically with these libraries (but NOT with libsimgrid,
-as we need SimGrid's own global variables).
+Multi-threaded MPI Programs", available at
+http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf (note that this
+article does not deal with SMPI but with a competing solution called
+AMPI that suffers from the same issue). This point used to be
+problematic in SimGrid, but the problem should now be handled
+automatically on Linux.
+
+Older versions of SimGrid came with a script that automatically
+privatized the globals through static analysis of the source code. But
+our implementation was not robust enough to be used in production, so
+it was removed at some point. Currently, SMPI comes with two
+privatization mechanisms that you can @ref options_smpi_privatization
+"select at runtime". At the time of writing (v3.18), the dlopen
+approach is considered to be very fast (it's used by default) while
+the mmap approach is considered to be rather slow but very robust.
+
+With the <b>mmap approach</b>, SMPI duplicates and dynamically
+switches the \c .data and \c .bss segments of the ELF process when
+switching the MPI ranks, so that each rank has its own copy of the
+global variables. No data is copied at switch time: for efficiency,
+the segments are simply remapped with \c mmap onto the backing copy
+of the new rank. This mechanism is considered to be very robust on
+all systems supporting \c mmap (Linux and most BSDs). Its performance
+is questionable, however, since each context switch between MPI ranks
+induces several syscalls to redirect the \c .data and \c .bss
+segments to the copies of the new rank. These segments are also
+duplicated once per rank in memory, inducing a slight increase in
+memory occupation.
+
+Another limitation is that SMPI only accounts for global variables
+defined in the executable. If the processes use external global
+variables from dynamic libraries, they won't be switched
+correctly. The easiest way to solve this is to statically link against
+the library with these globals. This way, each MPI rank will get its
+own copy of these libraries. Of course you should never statically
+link against the SimGrid library itself.
+
+With the <b>dlopen approach</b>, SMPI loads several copies of the same
+executable in memory as if it were a library, so that the global
+variables get naturally duplicated. This first requires the executable
+to be compiled as a relocatable binary, which is less common for
+programs than for libraries. But most distributions now compile
+everything this way for security reasons, as it enables address space
+layout randomization, so it should be safe to compile most (any?)
+program like this. The second trick is that the dynamic linker refuses
+to load the exact same file several times, be it a library or a
+relocatable executable. This behavior makes perfect sense in the
+general case, but we need to circumvent it here. To that end, the
+binary is copied to a temporary file before being dlopen-ed again.
+`dlmopen()` cannot be used instead, as glibc only provides 16 link-map
+namespaces, and as it would also duplicate SimGrid itself.
+
+This approach greatly speeds up context switching, down to about
+40 CPU cycles with our raw contexts, instead of the several syscalls
+required by the \c mmap approach. Another advantage is that it allows
+running the SMPI contexts in parallel, which is obviously not possible
+with the \c mmap approach. It was tricky to implement, but we are not
+aware of any remaining flaws, so smpirun activates it by default.
+
+In the future, it may be possible to further reduce the memory and
+disk consumption. It seems that we could <a
+href="https://lwn.net/Articles/415889/">punch holes</a> in the files
+before dl-loading them to remove the code and constants, and mmap
+these areas onto a unique copy. If done correctly, this would reduce
+the disk and memory usage to the bare minimum, and would also reduce
+the pressure on the CPU instruction cache. See
+<a href="https://github.com/simgrid/simgrid/issues/137">the relevant
+bug</a> on github for implementation leads.
+
+Also, currently, only the binary itself is copied and dlopen-ed for
+each MPI rank. We could probably extend this to external dependencies,
+but for now, any external dependencies must be statically linked into
+your application. As usual, SimGrid itself must never be statically
+linked into your application: you don't want to give a private copy of
+SimGrid to each MPI rank, as the simulator's own global state must
+remain shared.