+.. literalinclude:: ../../examples/platforms/cluster_fat_tree.xml
+ :language: xml
+ :lines: 1-3,10-
+
+
+Dragonfly Cluster
+.................
+
+This topology was introduced to further reduce the amount of links
+while maintaining a high bandwidth for local communications. To model
+this in SimGrid, pass a ``topology="DRAGONFLY"`` attribute to your
+cluster.
+
+.. literalinclude:: ../../examples/platforms/cluster_dragonfly.xml
+ :language: xml
+
+.. todo::
+
+ Add the image, and the documentation of the topo_parameters.
+
+Final Word
+..........
+
+We have only glanced over the abilities offered by SimGrid to
+describe the platform topology. Other networking zones model non-HPC
+platforms (such as wide area networks, ISP networks comprising
+set-top boxes, or even your own routing schema). You can interconnect
+several networking zones in your platform to form a tree of zones,
+which is both a time- and memory-efficient representation of
+distributed platforms. Please head to the dedicated
+:ref:`documentation <platform>` for more information.
+
+Hands-on!
+---------
+
+It is time to start using SMPI yourself. For that, you first need to
+install it somehow, and then you will need an MPI application to play with.
+
+Using Docker
+............
+
+The easiest way to take the tutorial is to use the dedicated Docker
+image. Once you have `installed Docker itself
+<https://docs.docker.com/install/>`_, simply do the following:
+
+.. code-block:: shell
+
+ docker pull simgrid/tuto-smpi
+ docker run -it --rm --name simgrid --volume ~/smpi-tutorial:/source/tutorial simgrid/tuto-smpi bash
+
+This will start a new container with all you need to take this
+tutorial, and create a ``smpi-tutorial`` directory in your home
+directory on your host machine that will be visible as
+``/source/tutorial`` within the container. You can then edit the
+files you want with your favorite editor in ``~/smpi-tutorial``, and
+compile them within the container to enjoy the provided dependencies.
+
+.. warning::
+
+ Any change to the container out of ``/source/tutorial`` will be lost
+ when you log out of the container, so don't edit the other files!
+
+All the needed dependencies are already installed in this container
+(SimGrid, the C/C++/Fortran compilers, make, pajeng, and R). Since
+Vite is only optional in this tutorial, it is not installed, to keep
+the image size down.
+
+The container also includes the example platform files from the
+previous section as well as the source code of the NAS Parallel
+Benchmarks. These files are available under
+``/source/simgrid-template-smpi`` in the image. You should copy them
+to your working directory when you first log in:
+
+.. code-block:: shell
+
+ cp -r /source/simgrid-template-smpi/* /source/tutorial
+ cd /source/tutorial
+
+Using your Computer Natively
+............................
+
+To take the tutorial on your machine, you first need to :ref:`install
+SimGrid <install>`, the C/C++/Fortran compilers and also ``pajeng`` to
+visualize the traces. You may want to install `Vite
+<http://vite.gforge.inria.fr/>`_ to get a first glance at the
+traces. The provided code template requires make to compile. On
+Debian and Ubuntu for example, you can get them as follows:
+
+.. code-block:: shell
+
+ sudo apt install simgrid pajeng make gcc g++ gfortran vite
+
+To take this tutorial, you will also need the platform files from the
+previous section as well as the source code of the NAS Parallel
+Benchmarks. Just clone `this repository
+<https://framagit.org/simgrid/simgrid-template-smpi>`_ to get them all:
+
+.. code-block:: shell
+
+ git clone git@framagit.org:simgrid/simgrid-template-smpi.git
+ cd simgrid-template-smpi/
+
+If you struggle with the compilation, then you should double-check
+your :ref:`SimGrid installation <install>`. If needed, please refer to
+the :ref:`Troubleshooting your Project Setup
+<install_yours_troubleshooting>` section.
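+
+As a quick sanity check, you can verify that the SMPI wrappers are
+reachable from your shell. This sketch only reports whether the tools
+are on your ``PATH``; it does not run any simulation:
+
+.. code-block:: shell
+
+ # report whether the SMPI compiler and launcher wrappers are on the PATH
+ for tool in smpicc smpirun; do
+   command -v "$tool" || echo "$tool not found: check your SimGrid installation"
+ done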
+
+Lab 0: Hello World
+------------------
+
+It is time to simulate your first MPI program. Use the simplistic
+example `roundtrip.c
+<https://framagit.org/simgrid/simgrid-template-smpi/raw/master/roundtrip.c?inline=false>`_
+that comes with the template.
+
+.. literalinclude:: /tuto_smpi/roundtrip.c
+ :language: c
+
+Compiling and Executing
+.......................
+
+Compiling the program is straightforward (double check your
+:ref:`SimGrid installation <install>` if you get an error message):
+
+
+.. code-block:: shell
+
+ $ smpicc -O3 roundtrip.c -o roundtrip
+
+
+Once compiled, you can simulate the execution of this program on 16
+nodes from the ``cluster_crossbar.xml`` platform as follows:
+
+.. code-block:: shell
+
+ $ smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile ./roundtrip
+
+- The ``-np 16`` option, just like in regular MPI, specifies the
+ number of MPI processes to use.
+- The ``-hostfile cluster_hostfile`` option, just like in regular
+ MPI, specifies the host file. If you omit this option, ``smpirun``
+ will deploy the application on the first machines of your platform.
+- The ``-platform cluster_crossbar.xml`` option, **which doesn't exist
+ in regular MPI**, specifies the platform configuration to be
+ simulated.
+- At the end of the line, one finds the executable name and
+ command-line arguments (if any -- roundtrip does not expect any arguments).
+
+Feel free to tweak the content of the XML platform file and the
+program to see the effect on the simulated execution time. It may be
+easier to compare the executions with the extra option
+``--cfg=smpi/display_timing:yes``. Note that the simulation accounts
+for realistic network protocol effects and MPI implementation
+effects. As a result, you may see "unexpected behavior" like in the
+real world (e.g., sending a message that is 1 byte larger may lead to
+a significantly higher execution time).
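+
+For instance, you can repeat the previous run with timing reports
+enabled; the platform and hostfile are the same as above, and the
+extra flag is the only change:
+
+.. code-block:: shell
+
+ $ smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile --cfg=smpi/display_timing:yes ./roundtrip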
+
+Lab 1: Visualizing LU
+---------------------
+
+We will now simulate a larger application: the LU benchmark of the NAS
+suite. The version provided in the code template was modified to
+compile with SMPI instead of the regular MPI. Compare the original
+``config/make.def.template`` with the ``config/make.def`` that was
+adapted to SMPI. We use ``smpiff`` and ``smpicc`` as compilers, and
+don't pass any additional libraries.
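+
+The change essentially boils down to swapping in the compiler
+wrappers. Here is a hedged sketch of the idea; the exact variable
+names come from the NAS build system and may differ slightly in your
+template:
+
+.. code-block:: make
+
+ # use the SMPI wrappers instead of the regular MPI compilers
+ MPIF77 = smpiff
+ MPICC  = smpicc
+ # no additional MPI libraries or include paths are needed
+ FMPI_LIB =
+ CMPI_LIB =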
+
+Now compile and execute the LU benchmark, class S (i.e., for `small
+data size
+<https://www.nas.nasa.gov/publications/npb_problem_sizes.html>`_) with
+4 nodes.
+
+.. code-block:: shell
+
+ $ make lu NPROCS=4 CLASS=S
+ (compilation logs)
+ $ smpirun -np 4 -platform ../cluster_backbone.xml bin/lu.S.4
+ (execution logs)
+
+To get a better understanding of what is going on, activate the
+visualization tracing, and convert the produced trace for later
+use:
+
+.. code-block:: shell
+
+ smpirun -np 4 -platform ../cluster_backbone.xml -trace --cfg=tracing/filename:lu.S.4.trace bin/lu.S.4
+ pj_dump --ignore-incomplete-links lu.S.4.trace | grep State > lu.S.4.state.csv
+
+You can then produce a Gantt chart with the following R chunk. You can
+either copy/paste it into an R session, or `turn it into an executable
+Rscript
+<https://swcarpentry.github.io/r-novice-inflammation/05-cmdline/>`_ to
+run it again and again.
+
+.. code-block:: R
+
+ library(ggplot2)
+
+ # Read the data
+ df_state = read.csv("lu.S.4.state.csv", header=F, strip.white=T)
+ names(df_state) = c("Type", "Rank", "Container", "Start", "End", "Duration", "Level", "State");
+ df_state = df_state[!(names(df_state) %in% c("Type","Container","Level"))]
+ df_state$Rank = as.numeric(gsub("rank-","",df_state$Rank))
+
+ # Draw the Gantt Chart
+ gc = ggplot(data=df_state) + geom_rect(aes(xmin=Start, xmax=End, ymin=Rank, ymax=Rank+1,fill=State))
+
+ # Produce the output
+ plot(gc)
+ dev.off()
+
+This produces a file called ``Rplots.pdf`` with the following
+content. You can find more visualization examples `online
+<http://simgrid.gforge.inria.fr/contrib/R_visualization.html>`_.
+
+.. image:: /tuto_smpi/img/lu.S.4.png
+ :align: center
+
+Lab 2: Tracing and Replay of LU
+-------------------------------
+
+Now compile and execute the LU benchmark, class A, with 32 nodes.
+
+.. code-block:: shell
+
+ $ make lu NPROCS=32 CLASS=A
+
+This takes several minutes to simulate, because all the code of all
+processes must really be executed, and everything is serialized.
+
+SMPI provides several methods to speed things up. One of them is to
+capture a time-independent trace of the running application, and
+replay it on a different platform with the same number of nodes. The
+replay is much faster than live simulation, as the computations are
+skipped (the application must be network-dependent for this to work).
+
+You can even generate the trace during a live simulation, as follows:
+
+.. code-block:: shell
+
+ $ smpirun -trace-ti --cfg=tracing/filename:LU.A.32 -np 32 -platform ../cluster_backbone.xml bin/lu.A.32
+
+The produced trace is composed of a file ``LU.A.32`` and a folder
+``LU.A.32_files``. To replay this with SMPI, you first need to compile
+the provided ``smpi_replay.cpp`` file, which comes from
+`simgrid/examples/smpi/replay
+<https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/replay>`_.
+
+.. code-block:: shell
+
+ $ smpicxx ../replay.cpp -O3 -o ../smpi_replay
+
+Afterward, you can replay your trace in SMPI as follows:
+
+.. code-block:: shell
+
+ $ smpirun -np 32 -platform ../cluster_torus.xml -ext smpi_replay ../smpi_replay LU.A.32
+
+All the outputs are gone, as the application is not really simulated
+here: its trace is simply replayed. But if you visualize both the live
+simulation and the replay, you will see that the behavior is
+unchanged. The simulation does not run much faster on this small
+example, but replay becomes very interesting when your application
+is computationally hungry.
+
+.. todo:: smpi_replay should be installed by SimGrid, and smpirun interface could be simplified here.
+
+Lab 3: Execution Sampling on EP
+-------------------------------
+
+The second method to speed up simulations is to sample the
+computation parts of the code. This means that the person doing the
+simulation needs to know the application and identify the parts that
+are compute-intensive and take time, while being regular enough not
+to ruin the simulation accuracy. Furthermore, such parts must not
+contain any MPI calls.
+
+Use the EP benchmark, class B, 16 processes.
+
+.. todo:: write this section, and the following ones.
+
+Further Readings
+----------------
+
+We regularly use SimGrid in our teachings on MPI. This way, our
+students can experiment with platforms that they do not have access
+to, and the associated visualization tools help them to understand
+their work. The whole material is available online, in a separate
+project: the `SMPI CourseWare <https://simgrid.github.io/SMPI_CourseWare/>`_.
+