rework the SMPI documentation quite a bit

[simgrid.git] / doc / doxygen / module-smpi.doc
diff --git a/doc/doxygen/module-smpi.doc b/doc/doxygen/module-smpi.doc

index 71e68a0..77abc9b 100644 (file)
--- a/doc/doxygen/module-smpi.doc
+++ b/doc/doxygen/module-smpi.doc
@@ -1,64 +1,81 @@
  /** @addtogroup SMPI_API
  
-This programming environment permits to study existing MPI application
-by emulating them on top of the SimGrid simulator. In other words, it
-will constitute an emulation solution for parallel codes. You don't
-even have to modify your code for that, although that may help, as
-detailed below.
+This programming environment enables the study of MPI application by
+emulating them on top of the SimGrid simulator. This is particularly
+interesting to study existing MPI applications within the comfort of
+the simulator. The motivation for this work is detailed in the
+reference article (available at http://hal.inria.fr/inria-00527150).
+
+Our goal is to enable the study of *unmodified* MPI applications,
+although we are not quite there yet (see @ref SMPI_what). In
+addition, you can modify your code to speed up your studies or
+otherwise increase their scalability (see @ref SMPI_adapting).
  
  \section SMPI_who Who should use SMPI (and who shouldn't)
  
-You should use this programming environment of the SimGrid suite if
-you want to study existing MPI applications.  If you want to study
-some heuristics for a given problem (and if your goal is to produce
-theorems and publications, not code), have a look at the \ref MSG_API
-environment, or the \ref SD_API one if you need to use DAGs. If none
-of those programming environments fits your needs, you may consider
-implementing your own directly on top of \ref SURF_API (but you
-probably want to contact us before).
+SMPI is now considered as stable and you can use it in production. You
+may probably want to read the scientific publications that detail the
+models used and their limits, but this should not be absolutely
+necessary.
+
+If you already fluently write and use MPI applications, SMPI should
+sound very familiar to you. Use smpicc instead of mpicc, and smpirun
+instead of mpirun. Some more information are given below on this page.
+
+Of course, if you don't know what MPI is, the documentation of SMPI
+will seem a bit terse to you. You should pick up a good MPI tutorial
+on the Internet (or a course in your university) and come back to SMPI
+once you know a bit more about MPI. Alternatively, you may want to
+turn to the other SimGrid interface such as the \ref MSG_API
+environment, or the \ref SD_API one.
  
  \section SMPI_what What can run within SMPI?
  
  You can run unmodified MPI applications (both C and Fortran) within
-SMPI, provided you only use MPI calls that we implemented in MPI. Our
-coverage of the interface is not bad, but will probably never be
-complete. One sided communications and I/O primitives are not targeted
-for now. The full list of not yet implemented functions is available
-in file <tt>include/smpi/smpi.h</tt> of the archive, between two lines
+SMPI, provided that (1) you only use MPI calls that we implemented in
+MPI and (2) you don't use any globals in your application.
+
+\subsection SMPI_what_coverage MPI coverage of SMPI
+
+Our coverage of the interface is very decent, but still incomplete;
+Given the size of the MPI standard, it may well be that we never
+implement absolutely all existing primitives. One sided communications
+and I/O primitives are not targeted for now. Our current state is
+still very decent: we pass most of the MPICH coverage tests.
+
+The full list of not yet implemented functions is documented in the
+file <tt>include/smpi/smpi.h</tt> of the archive, between two lines
  containing the <tt>FIXME</tt> marker. If you really need a missing
  feature, please get in touch with us: we can guide you though the
  SimGrid code to help you implementing it, and we'd glad to integrate
  it in the main project afterward if you contribute them back.
  
-\section SMPI_adapting Adapting your MPI code to the use of SMPI
-
-As detailed in the reference article (available at
-http://hal.inria.fr/inria-00527150), you may want to adapt your code
-to improve the simulation performance. But these tricks may seriously
-hinder the result qualtity (or even prevent the app to run) if used
-wrongly. We assume that if you want to simulate an HPC application,
-you know what you are doing. Don't prove us wrong!
-
-If you get short on memory (the whole app is executed on a single node
-when simulated), you should have a look at the SMPI_SHARED_MALLOC and
-SMPI_SHARED_FREE macros. It allows to share memory areas between
-processes. For example, matrix multiplication code may want to store
-the blocks on the same area. Of course, the resulting computations
-will useless, but you can still study the application behavior this
-way. Of course, if your code is data-dependent, this won't work.
-
-If your application is too slow, try using SMPI_SAMPLE_LOCAL,
-SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
-be sampled. Some of the loop iterations will be executed to measure
-their duration, and this duration will be used for the subsequent
-iterations. These samples are done per processor with
-SMPI_SAMPLE_LOCAL, and shared between all processors with
-SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
-time of your loop iteration are not stable.
-
-Yes, that's right, these macros are not documented yet, but we'll fix
-it as soon as time permits. Sorry about that -- patch welcomed!
-Meanwhile, grep for them on the examples for more information.
+\subsection SMPI_what_globals Issues with the globals
+
+Concerning the globals, the problem comes from the fact that usually,
+MPI processes run as real UNIX processes while they are all folded
+into threads of a unique system process in SMPI. Global variables are
+usually private to each MPI process while they become shared between
+the processes in SMPI. This point is rather problematic, and currently
+forces to modify your application to privatize the global variables.
+
+We tried several techniques to work this around. We used to have a
+script that privatized automatically the globals through static
+analysis of the source code, but it was not robust enough to be used
+in production. This issue, as well as several potential solutions, is
+discussed in this article: "Automatic Handling of Global Variables for
+Multi-threaded MPI Programs",
+available at http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf
+(note that this article does not deal with SMPI but with a concurrent
+solution called AMPI that suffers of the same issue). 
+
+Currently, we have no solution to offer you, because all proposed solutions will
+modify the performance of your application (in the computational
+sections). Sacrificing realism for usability is not very satisfying, so we did
+not implement them yet. You will thus have to modify your application if it uses
+global variables. We are working on another solution, leveraging distributed
+simulation to keep each MPI process within a separate system process, but this
+is far from being ready at the moment.
  
  \section SMPI_compiling Compiling your code
  
@@ -93,4 +110,54 @@ by running
  smpirun -help
  \endverbatim
  
+\section SMPI_adapting Adapting your MPI code to the use of SMPI
+
+As detailed in the reference article (available at
+http://hal.inria.fr/inria-00527150), you may want to adapt your code
+to improve the simulation performance. But these tricks may seriously
+hinder the result quality (or even prevent the app to run) if used
+wrongly. We assume that if you want to simulate an HPC application,
+you know what you are doing. Don't prove us wrong!
+
+\section SMPI_adapting_size Reducing your memory footprint
+
+If you get short on memory (the whole app is executed on a single node when
+simulated), you should have a look at the SMPI_SHARED_MALLOC and
+SMPI_SHARED_FREE macros. It allows to share memory areas between processes: The
+purpose of these macro is that the same line malloc on each process will point
+to the exact same memory area. So if you have a malloc of 2M and you have 16
+processes, this macro will change your memory consumption from 2M*16 to 2M
+only. Only one block for all processes.
+
+If your program is ok with a block containing garbage value because all
+processes write and read to the same place without any kind of coordination,
+then this macro can dramatically shrink your memory consumption. For example,
+that will be very beneficial to a matrix multiplication code, as all blocks will
+be stored on the same area. Of course, the resulting computations will useless,
+but you can still study the application behavior this way. 
+
+Naturally, this won't work if your code is data-dependent. For example, a Jacobi
+iterative computation depends on the result computed by the code to detect
+convergence conditions, so turning them into garbage by sharing the same memory
+area between processes does not seem very wise. You cannot use the
+SMPI_SHARED_MALLOC macro in this case, sorry.
+
+This feature is demoed by the example file
+<tt>examples/smpi/NAS/DT-folding/dt.c</tt>
+
+\section SMPI_adapting_speed Toward faster simulations
+
+If your application is too slow, try using SMPI_SAMPLE_LOCAL,
+SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
+be sampled. Some of the loop iterations will be executed to measure
+their duration, and this duration will be used for the subsequent
+iterations. These samples are done per processor with
+SMPI_SAMPLE_LOCAL, and shared between all processors with
+SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
+time of your loop iteration are not stable.
+
+This feature is demoed by the example file 
+<tt>examples/smpi/NAS/EP-sampling/ep.c</tt>
+
+
  */
 \ No newline at end of file