+\section SMPI_adapting Adapting your MPI code to the use of SMPI
+
+As detailed in the reference article (available at
+http://hal.inria.fr/inria-00527150), you may want to adapt your code
+to improve the simulation performance. But these tricks may seriously
+hinder the result quality (or even prevent the app to run) if used
+wrongly. We assume that if you want to simulate an HPC application,
+you know what you are doing. Don't prove us wrong!
+
+\section SMPI_adapting_size Reducing your memory footprint
+
+If you get short on memory (the whole app is executed on a single node when
+simulated), you should have a look at the SMPI_SHARED_MALLOC and
+SMPI_SHARED_FREE macros. It allows to share memory areas between processes: The
+purpose of these macro is that the same line malloc on each process will point
+to the exact same memory area. So if you have a malloc of 2M and you have 16
+processes, this macro will change your memory consumption from 2M*16 to 2M
+only. Only one block for all processes.
+
+If your program is ok with a block containing garbage value because all
+processes write and read to the same place without any kind of coordination,
+then this macro can dramatically shrink your memory consumption. For example,
+that will be very beneficial to a matrix multiplication code, as all blocks will
+be stored on the same area. Of course, the resulting computations will useless,
+but you can still study the application behavior this way.
+
+Naturally, this won't work if your code is data-dependent. For example, a Jacobi
+iterative computation depends on the result computed by the code to detect
+convergence conditions, so turning them into garbage by sharing the same memory
+area between processes does not seem very wise. You cannot use the
+SMPI_SHARED_MALLOC macro in this case, sorry.
+
+This feature is demoed by the example file
+<tt>examples/smpi/NAS/DT-folding/dt.c</tt>
+
+\section SMPI_adapting_speed Toward faster simulations
+
+If your application is too slow, try using SMPI_SAMPLE_LOCAL,
+SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
+be sampled. Some of the loop iterations will be executed to measure
+their duration, and this duration will be used for the subsequent
+iterations. These samples are done per processor with
+SMPI_SAMPLE_LOCAL, and shared between all processors with
+SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
+time of your loop iteration are not stable.
+
+This feature is demoed by the example file
+<tt>examples/smpi/NAS/EP-sampling/ep.c</tt>
+
+