tuto smpi: start lab 3 (Execution Sampling)

[simgrid.git] / docs / source / tuto_smpi.rst
diff --git a/docs/source/tuto_smpi.rst b/docs/source/tuto_smpi.rst

index 4599f92..7bc1e13 100644 (file)
--- a/docs/source/tuto_smpi.rst
+++ b/docs/source/tuto_smpi.rst
@@ -37,7 +37,7 @@ only plan to debug your application in a reproducible setup, without
  any performance-related analysis.
  
  How does it work?
-^^^^^^^^^^^^^^^^^
+.................
  
  In SMPI, communications are simulated while computations are
  emulated. This means that while computations occur as they would in
@@ -61,9 +61,8 @@ to predict the time taken by each communications. Any computations
  occuring between two MPI calls are benchmarked, and the corresponding
  time is reported into the simulator.
  
-.. image:: /tuto_smpi/img/big-picture.png
-   :align: center
-
+.. image:: /tuto_smpi/img/big-picture.svg
+   :align: center         
  
  Describing Your Platform
  ------------------------
@@ -71,44 +70,403 @@ Describing Your Platform
  As a SMPI user, you are supposed to provide a description of your
  virtual platform, that is mostly a set of simulated hosts and network
  links with some performance characteristics. SimGrid provides a plenty
-of :ref:`documentation <platform>`_ and examples (in the
+of :ref:`documentation <platform>` and examples (in the
  `examples/platforms <https://framagit.org/simgrid/simgrid/tree/master/examples/platforms>`_
  source directory), and this section only shows a small set of introductory
  examples.
  
+Feel free to skip this section if you want to jump right away to usage
+examples.
+
  Simple Example with 3 hosts
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
+...........................
  
  At the most basic level, you can describe your simulated platform as a
  graph of hosts and network links. For instance:
  
-.. image:: /tuto_smpi/img/3hosts.png
+.. image:: /tuto_smpi/3hosts.png
+   :align: center
+
+.. literalinclude:: /tuto_smpi/3hosts.xml
+   :language: xml
+
+Note the way in which hosts, links, and routes are defined in
+this XML. All hosts are defined with a speed (in Gflops), and links
+with a latency (in us) and bandwidth (in MBytes per second). Other
+units are possible and written as expected. Routes specify the list of
+links encountered from one route to another. Routes are symmetrical by
+default.
+
+Cluster with a Crossbar
+.......................
+
+A very common parallel computing platform is a homogeneous cluster in
+which hosts are interconnected via a crossbar switch with as many
+ports as hosts, so that any disjoint pairs of hosts can communicate
+concurrently at full speed. For instance:
+
+.. literalinclude:: ../../examples/platforms/cluster_crossbar.xml
+   :language: xml
+   :lines: 1-3,18-
+
+One specifies a name prefix and suffix for each host, and then give an
+integer range. In the example the cluster contains 65535 hosts (!),
+named ``node-0.simgrid.org`` to ``node-65534.simgrid.org``. All hosts
+have the same power (1 Gflop/sec) and are connected to the switch via
+links with same bandwidth (125 MBytes/sec) and latency (50
+microseconds).
+
+.. todo::
+
+   Add the picture.
+
+Cluster with a Shared Backbone
+..............................
+
+Another popular model for a parallel platform is that of a set of
+homogeneous hosts connected to a shared communication medium, a
+backbone, with some finite bandwidth capacity and on which
+communicating host pairs can experience contention. For instance:
+
+
+.. literalinclude:: ../../examples/platforms/cluster_backbone.xml
+   :language: xml
+   :lines: 1-3,18-
+
+The only differences with the crossbar cluster above are the ``bb_bw``
+and ``bb_lat`` attributes that specify the backbone characteristics
+(here, a 500 microseconds latency and a 2.25 GByte/sec
+bandwidth). This link is used for every communication within the
+cluster. The route from ``node-0.simgrid.org`` to ``node-1.simgrid.org``
+counts 3 links: the private link of ``node-0.simgrid.org``, the backbone
+and the private link of ``node-1.simgrid.org``.
+          
+.. todo::
+
+   Add the picture.
+
+Torus Cluster
+.............
+
+Many HPC facilities use torus clusters to reduce sharing and
+performance loss on concurrent internal communications. Modeling this
+in SimGrid is very easy. Simply add a ``topology="TORUS"`` attribute
+to your cluster. Configure it with the ``topo_parameters="X,Y,Z"``
+attribute, where ``X``, ``Y`` and ``Z`` are the dimension of your
+torus.
+
+.. image:: ../../examples/platforms/cluster_torus.svg
+   :align: center
+
+.. literalinclude:: ../../examples/platforms/cluster_torus.xml
+   :language: xml
+
+Note that in this example, we used ``loopback_bw`` and
+``loopback_lat`` to specify the characteristics of the loopback link
+of each node (i.e., the link allowing each node to communicate with
+itself). We could have done so in previous example too. When no
+loopback is given, the communication from a node to itself is handled
+as if it were two distinct nodes: it goes twice through the private
+link and through the backbone (if any).
+
+Fat-Tree Cluster
+................
+
+This topology was introduced to reduce the amount of links in the
+cluster (and thus reduce its price) while maintaining a high bisection
+bandwidth and a relatively low diameter. To model this in SimGrid,
+pass a ``topology="FAT_TREE"`` attribute to your cluster. The
+``topo_parameters=#levels;#downlinks;#uplinks;link count`` follows the
+semantic introduced in the `Figure 1B of this article
+<http://webee.eedev.technion.ac.il/wp-content/uploads/2014/08/publication_574.pdf>`_.
+
+Here is the meaning of this example: ``2 ; 4,4 ; 1,2 ; 1,2``
+
+- That's a two-level cluster (thus the initial ``2``).
+- Routers are connected to 4 elements below them, regardless of its
+  level. Thus the ``4,4`` component that is used as
+  ``#downlinks``. This means that the hosts are grouped by 4 on a
+  given router, and that there is 4 level-1 routers (in the middle of
+  the figure).
+- Hosts are connected to only 1 router above them, while these routers
+  are connected to 2 routers above them (thus the ``1,2`` used as
+  ``#uplink``).
+- Hosts have only one link to their router while every path between a
+  level-1 routers and level-2 routers use 2 parallel links. Thus the
+  ``1,2`` that is used as ``link count``.
+
+.. image:: ../../examples/platforms/cluster_fat_tree.svg
+   :align: center
+
+.. literalinclude:: ../../examples/platforms/cluster_fat_tree.xml
+   :language: xml
+   :lines: 1-3,10-
+
+
+Dragonfly Cluster
+.................
+
+This topology was introduced to further reduce the amount of links
+while maintaining a high bandwidth for local communications. To model
+this in SimGrid, pass a ``topology="DRAGONFLY"`` attribute to your
+cluster.
+
+.. literalinclude:: ../../examples/platforms/cluster_dragonfly.xml
+   :language: xml
+
+.. todo::
+
+   Add the image, and the documuentation of the topo_parameters.
+
+Final Word
+..........
+
+We only glanced over the abilities offered by SimGrid to describe the
+platform topology. Other networking zones model non-HPC platforms
+(such as wide area networks, ISP network comprising set-top boxes, or
+even your own routing schema). You can interconnect several networking
+zones in your platform to form a tree of zones, that is both a time-
+and memory-efficient representation of distributed platforms. Please
+head to the dedicated :ref:`documentation <platform>` for more
+information.
+
+Hands-on!
+---------
+
+It is time to start using SMPI yourself. For that, you first need to
+install it somehow, and then you will need a MPI application to play with.
+
+Using Docker
+............
+
+The easiest way to take the tutorial is to use the dedicated Docker
+image. Once you `installed Docker itself
+<https://docs.docker.com/install/>`_, simply do the following:
+
+.. code-block:: shell
+
+   docker pull simgrid/tuto-smpi
+   docker run -it --rm --name simgrid --volume ~/smpi-tutorial:/source/tutorial simgrid/tuto-smpi bash
+
+This will start a new container with all you need to take this
+tutorial, and create a ``smpi-tutorial`` directory in your home on
+your host machine that will be visible as ``/source/tutorial`` within the
+container.  You can then edit the files you want with your favorite
+editor in ``~/smpi-tutorial``, and compile them within the
+container to enjoy the provided dependencies.
+
+.. warning::
+
+   Any change to the container out of ``/source/tutorial`` will be lost
+   when you log out of the container, so don't edit the other files!
+
+All needed dependencies are already installed in this container
+(SimGrid, the C/C++/Fortran compilers, make, pajeng and R). Vite being
+only optional in this tutorial, it is not installed to reduce the
+image size. 
+
+The container also include the example platform files from the
+previous section as well as the source code of the NAS Parallel
+Benchmarks. These files are available under
+``/source/simgrid-template-smpi`` in the image. You should copy it to
+your working directory when you first log in:
+
+.. code-block:: shell
+
+   cp -r /source/simgrid-template-smpi/* /source/tutorial
+   cd /source/tutorial
+
+Using your Computer Natively
+............................
+
+To take the tutorial on your machine, you first need to :ref:`install
+SimGrid <install>`, the C/C++/Fortran compilers and also ``pajeng`` to
+visualize the traces. You may want to install `Vite
+<http://vite.gforge.inria.fr/>`_ to get a first glance at the
+traces. The provided code template requires make to compile. On
+Debian and Ubuntu for example, you can get them as follows:
+
+.. code-block:: shell
+
+   sudo apt install simgrid pajeng make gcc g++ gfortran vite
+
+To take this tutorial, you will also need the platform files from the
+previous section as well as the source code of the NAS Parallel
+Benchmarks. Just  clone `this repository
+<https://framagit.org/simgrid/simgrid-template-smpi>`_  to get them all:
+
+.. code-block:: shell
+
+   git clone git@framagit.org:simgrid/simgrid-template-smpi.git
+   cd simgrid-template-smpi/
+
+If you struggle with the compilation, then you should double check
+your :ref:`SimGrid installation <install>`.  On need, please refer to
+the :ref:`Troubleshooting your Project Setup
+<install_yours_troubleshooting>` section.
+
+Lab 0: Hello World
+------------------
+
+It is time to simulate your first MPI program. Use the simplistic
+example `roundtrip.c
+<https://framagit.org/simgrid/simgrid-template-smpi/raw/master/roundtrip.c?inline=false>`_
+that comes with the template.
+
+.. literalinclude:: /tuto_smpi/roundtrip.c
+   :language: c
+
+Compiling and Executing
+.......................
+             
+Compiling the program is straightforward (double check your
+:ref:`SimGrid installation <install>` if you get an error message):
+
+
+.. code-block:: shell
+               
+  $ smpicc -O3 roundtrip.c -o roundtrip
+
+
+Once compiled, you can simulate the execution of this program on 16
+nodes from the ``cluster_crossbar.xml`` platform as follows:
+
+.. code-block:: shell
+
+   $ smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile ./roundtrip
+
+- The ``-np 16`` option, just like in regular MPI, specifies the
+  number of MPI processes to use. 
+- The ``-hostfile cluster_hostfile`` option, just like in regular
+  MPI, specifies the host file. If you omit this option, ``smpirun``
+  will deploy the application on the first machines of your platform.
+- The ``-platform cluster_crossbar.xml`` option, **which doesn't exist
+  in regular MPI**, specifies the platform configuration to be
+  simulated. 
+- At the end of the line, one finds the executable name and
+  command-line arguments (if any -- roundtrip does not expect any arguments).
+
+Feel free to tweak the content of the XML platform file and the
+prorgam to see the effect on the simulated execution time. Note that
+the simulation accounts for realistic network protocol effects and MPI
+implementation effects. As a result, you may see "unexpected behavior"
+like in the real world (e.g., sending a message 1 byte larger may lead
+to significant higher execution time).
+
+Lab 1: Visualizing LU
+---------------------
+
+We will now simulate a larger application: the LU benchmark of the NAS
+suite. The version provided in the code template was modified to
+compile with SMPI instead of the regular MPI. Compare the difference
+between the original ``config/make.def.template`` and the
+``config/make.def`` that was adapted to SMPI. We use ``smpiff`` and
+``smpicc`` as compilers, and don't pass any additional library.
+
+Now compile and execute the LU benchmark, class A (i.e., for small
+data size) with 4 nodes.
+
+.. code-block:: shell
+
+   $ make lu NPROCS=4 CLASS=A
+   (compilation logs)
+   $ smpirun -np 4 -platform ../cluster_backbone.xml bin/lu.A.4
+   (execution logs)
+
+To get a better understanding of what is going on, activate the
+vizualization tracing, and convert the produced trace for later
+use:
+
+.. code-block:: shell
+
+   smpirun -np 4 -platform ../cluster_backbone.xml -trace --cfg=tracing/filename:lu.A.4.trace bin/lu.A.4
+   pj_dump --ignore-incomplete-links lu.A.4.trace | grep State > lu.A.4.state.csv
+
+You can then produce a Gantt Chart with the following R chunk. You can
+either copy/paste it in a R session, or `turn it into a Rscript executable
+<https://swcarpentry.github.io/r-novice-inflammation/05-cmdline/>`_ to
+run it again and again.
+
+.. code-block:: R
+
+   library(ggplot2)
+
+   # Read the data
+   df_state = read.csv("lu.A.4.state.csv", header=F, strip.white=T)
+   names(df_state) = c("Type", "Rank", "Container", "Start", "End", "Duration", "Level", "State");
+   df_state = df_state[!(names(df_state) %in% c("Type","Container","Level"))]
+   df_state$Rank = as.numeric(gsub("rank-","",df_state$Rank))
+
+   # Draw the Gantt Chart
+   gc = ggplot(data=df_state) + geom_rect(aes(xmin=Start, xmax=End, ymin=Rank, ymax=Rank+1,fill=State))
+
+   # Produce the output
+   plot(gc)
+   dev.off()
+
+This produces a file called ``Rplots.pdf`` with the following
+content. You can find more examples of visualization in the `SimGrid
+documentation <http://simgrid.gforge.inria.fr/contrib/R_visualization.html>`_.
+
+.. image:: /tuto_smpi/img/lu.A.4.png
     :align: center
  
-.. hidden-code-block:: xml
-    :starthidden: True
-    :label: See the XML platform description file...
-
-    <?xml version='1.0'?>
-    <!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid/simgrid.dtd">
-    <platform version="4.1">
-      <zone id="AS0" routing="Full">
-        <host id="host0" speed="1Gf"/>
-        <host id="host1" speed="2Gf"/>
-        <host id="host2" speed="40Gf"/>
-        <link id="link0" bandwidth="125MBps" latency="100us"/>
-        <link id="link1" bandwidth="50MBps" latency="150us"/>
-        <link id="link2" bandwidth="250MBps" latency="50us"/>
-        <route src="host0" dst="host1"><link_ctn id="link0"/><link_ctn id="link1"/></route>
-        <route src="host1" dst="host2"><link_ctn id="link1"/><link_ctn id="link2"/></route>
-        <route src="host0" dst="host2"><link_ctn id="link0"/><link_ctn id="link2"/></route>
-      </zone>
-    </platform>
-
-In this XML, note the way in which hosts, links, and routes are
-defined. All hosts are defined with a power (i.e., compute speed in
-Gflops), and links with a latency (in us) and bandwidth (in MBytes per
-second). Other units are possible and written as expected. By default,
-routes are symmetrical.
-       
+Lab 2: Tracing and Replay of LU
+-------------------------------
+
+Now compile and execute the LU benchmark, class A, with 32 nodes.
+
+.. code-block:: shell
+
+   $ make lu NPROCS=32 CLASS=A
+
+This takes several minutes to to simulate, because all code from all
+processes has to be really executed, and everything is serialized.
+
+SMPI provides several methods to speed things up. One of them is to
+capture a time independent trace of the running application, and
+replay it on a different platform with the same amount of nodes. The
+replay is much faster than live simulation, as the computations are
+skipped (the application must be network-dependent for this to work).
+
+You can even generate the trace during as live simulation, as follows:
+
+.. code-block:: shell
+
+   $ smpirun -trace-ti --cfg=tracing/filename:LU.A.32 -np 32 -platform ../cluster_backbone.xml bin/lu.A.32 
+
+The produced trace is composed of a file ``LU.A.32`` and a folder
+``LU.A.32_files``. To replay this with SMPI, you need to first compile
+the provided ``smpi_replay.cpp`` file, that comes from
+`simgrid/examples/smpi/replay
+<https://framagit.org/simgrid/simgrid/tree/master/examples/smpi/replay>`_.
+
+.. code-block:: shell
+
+   $ smpicxx ../replay.cpp -O3 -o ../smpi_replay
+
+Afterward, you can replay your trace in SMPI as follows:
+
+   $ smpirun -np 32 -platform ../cluster_torus.xml -ext smpi_replay ../smpi_replay LU.A.32
+
+All the outputs are gone, as the application is not really simulated
+here. Its trace is simply replayed. But if you visualize the live
+simulation and the replay, you will see that the behavior is
+unchanged. The simulation does not run much faster on this very
+example, but this becomes very interesting when your application
+is computationally hungry.
+
+.. todo:: smpi_replay should be installed by SimGrid, and smpirun interface could be simplified here.
+
+Lab 3: Execution Sampling on EP
+-------------------------------
+
+The second method to speed up simulations is to sample the computation parts in the code.
+This means that the person doing the simulation needs to know the application and identify
+parts that are compute intensive and take time, while being regular enough not to ruin
+simulation accuracy. Furthermore there should not be any MPI calls inside such parts of the
+code.
+
+Use the EP benchmark, class B, 16 processes.
+
  ..  LocalWords:  SimGrid