X-Git-Url: http://info.iut-bm.univ-fcomte.fr/pub/gitweb/simgrid.git/blobdiff_plain/7c0e9b3b86d5978343a5aa37dc37ea484b9214ae..3af3a09062f663d730af98444b314581212a78b4:/docs/source/tuto_smpi.rst diff --git a/docs/source/tuto_smpi.rst b/docs/source/tuto_smpi.rst index 4599f92040..f886d6a060 100644 --- a/docs/source/tuto_smpi.rst +++ b/docs/source/tuto_smpi.rst @@ -37,7 +37,7 @@ only plan to debug your application in a reproducible setup, without any performance-related analysis. How does it work? -^^^^^^^^^^^^^^^^^ +................. In SMPI, communications are simulated while computations are emulated. This means that while computations occur as they would in @@ -61,9 +61,8 @@ to predict the time taken by each communication. Any computations occurring between two MPI calls are benchmarked, and the corresponding time is reported into the simulator. -.. image:: /tuto_smpi/img/big-picture.png - :align: center - +.. image:: /tuto_smpi/img/big-picture.svg + :align: center Describing Your Platform ------------------------ @@ -71,44 +70,392 @@ Describing Your Platform As an SMPI user, you are supposed to provide a description of your virtual platform, that is mostly a set of simulated hosts and network links with some performance characteristics. SimGrid provides plenty -of :ref:`documentation `_ and examples (in the +of :ref:`documentation ` and examples (in the `examples/platforms `_ source directory), and this section only shows a small set of introductory examples. +Feel free to skip this section if you want to jump right away to usage +examples. + Simple Example with 3 hosts -^^^^^^^^^^^^^^^^^^^^^^^^^^^ +........................... At the most basic level, you can describe your simulated platform as a graph of hosts and network links. For instance: -.. image:: /tuto_smpi/img/3hosts.png +.. image:: /tuto_smpi/3hosts.png :align: center -.. hidden-code-block:: xml - :starthidden: True - :label: See the XML platform description file... 
- - - - - - - - - - - - - - - - - -In this XML, note the way in which hosts, links, and routes are -defined. All hosts are defined with a power (i.e., compute speed in -Gflops), and links with a latency (in us) and bandwidth (in MBytes per -second). Other units are possible and written as expected. By default, -routes are symmetrical. - +.. literalinclude:: /tuto_smpi/3hosts.xml + :language: xml + +Note the way in which hosts, links, and routes are defined in +this XML. All hosts are defined with a speed (in Gflops), and links +with a latency (in us) and bandwidth (in MBytes per second). Other +units are possible and written as expected. Routes specify the list of +links encountered from one host to another. Routes are symmetrical by +default. + +Cluster with a Crossbar +....................... + +A very common parallel computing platform is a homogeneous cluster in +which hosts are interconnected via a crossbar switch with as many +ports as hosts, so that any disjoint pairs of hosts can communicate +concurrently at full speed. For instance: + +.. literalinclude:: ../../examples/platforms/cluster_crossbar.xml + :language: xml + :lines: 1-3,18- + +One specifies a name prefix and suffix for each host, and then gives an +integer range. In the example the cluster contains 65535 hosts (!), +named ``node-0.simgrid.org`` to ``node-65534.simgrid.org``. All hosts +have the same power (1 Gflop/sec) and are connected to the switch via +links with the same bandwidth (125 MBytes/sec) and latency (50 +microseconds). + +.. todo:: + + Add the picture. + +Cluster with a Shared Backbone +.............................. + +Another popular model for a parallel platform is that of a set of +homogeneous hosts connected to a shared communication medium, a +backbone, with some finite bandwidth capacity and on which +communicating host pairs can experience contention. For instance: + + +.. 
literalinclude:: ../../examples/platforms/cluster_backbone.xml + :language: xml + :lines: 1-3,18- + +The only differences from the crossbar cluster above are the ``bb_bw`` +and ``bb_lat`` attributes that specify the backbone characteristics +(here, a 500-microsecond latency and a 2.25 GByte/sec +bandwidth). This link is used for every communication within the +cluster. The route from ``node-0.simgrid.org`` to ``node-1.simgrid.org`` +counts 3 links: the private link of ``node-0.simgrid.org``, the backbone +and the private link of ``node-1.simgrid.org``. + +.. todo:: + + Add the picture. + +Torus Cluster +............. + +Many HPC facilities use torus clusters to reduce sharing and +performance loss on concurrent internal communications. Modeling this +in SimGrid is very easy. Simply add a ``topology="TORUS"`` attribute +to your cluster. Configure it with the ``topo_parameters="X,Y,Z"`` +attribute, where ``X``, ``Y`` and ``Z`` are the dimensions of your +torus. + +.. image:: ../../examples/platforms/cluster_torus.svg + :align: center + +.. literalinclude:: ../../examples/platforms/cluster_torus.xml + :language: xml + +Note that in this example, we used ``loopback_bw`` and +``loopback_lat`` to specify the characteristics of the loopback link +of each node (i.e., the link allowing each node to communicate with +itself). We could have done so in the previous examples too. When no +loopback is given, the communication from a node to itself is handled +as if it were between two distinct nodes: it goes twice through the private +link and through the backbone (if any). + +Fat-Tree Cluster +................ + +This topology was introduced to reduce the number of links in the +cluster (and thus reduce its price) while maintaining a high bisection +bandwidth and a relatively low diameter. To model this in SimGrid, +pass a ``topology="FAT_TREE"`` attribute to your cluster.
The +``topo_parameters=#levels;#downlinks;#uplinks;link count`` follows the +semantics introduced in `Figure 1B of this article +`_. + +Here is the meaning of this example: ``2 ; 4,4 ; 1,2 ; 1,2`` + +- That's a two-level cluster (thus the initial ``2``). +- Routers are connected to 4 elements below them, regardless of their + level. Thus the ``4,4`` component that is used as + ``#downlinks``. This means that the hosts are grouped by 4 on a + given router, and that there are 4 level-1 routers (in the middle of + the figure). +- Hosts are connected to only 1 router above them, while these routers + are connected to 2 routers above them (thus the ``1,2`` used as + ``#uplinks``). +- Hosts have only one link to their router while every path between a + level-1 router and a level-2 router uses 2 parallel links. Thus the + ``1,2`` that is used as ``link count``. + +.. image:: ../../examples/platforms/cluster_fat_tree.svg + :align: center + +.. literalinclude:: ../../examples/platforms/cluster_fat_tree.xml + :language: xml + :lines: 1-3,10- + + +Dragonfly Cluster +................. + +This topology was introduced to further reduce the number of links +while maintaining a high bandwidth for local communications. To model +this in SimGrid, pass a ``topology="DRAGONFLY"`` attribute to your +cluster. + +.. literalinclude:: ../../examples/platforms/cluster_dragonfly.xml + :language: xml + +.. todo:: + + Add the image, and the documentation of the topo_parameters. + +Final Word +.......... + +We only glanced over the abilities offered by SimGrid to describe the +platform topology. Other networking zones model non-HPC platforms +(such as wide area networks, ISP networks comprising set-top boxes, or +even your own routing schema). You can interconnect several networking +zones in your platform to form a tree of zones, that is both a time- +and memory-efficient representation of distributed platforms. Please +head to the dedicated :ref:`documentation ` for more +information.
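
To make the cluster attributes discussed in this section more concrete,
here is a minimal sketch of what a small torus cluster description could
look like. It is an illustration only, not the content of the shipped
``cluster_torus.xml``: the cluster ``id``, the ``prefix``, ``suffix``,
``radical``, ``speed``, ``bw`` and ``lat`` attribute names, the DTD version
and every numeric value are assumptions that should be checked against the
example files and the platform documentation.

.. code-block:: xml

   <?xml version='1.0'?>
   <!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
   <platform version="4.1">
     <zone id="world" routing="Full">
       <!-- Hypothetical 3x2x2 torus (12 hosts): node-0.simgrid.org to node-11.simgrid.org.
            All names and values below are illustrative assumptions. -->
       <cluster id="my_torus" topology="TORUS" topo_parameters="3,2,2"
                prefix="node-" radical="0-11" suffix=".simgrid.org"
                speed="1Gf" bw="125MBps" lat="50us"
                loopback_bw="100MBps" loopback_lat="0"/>
     </zone>
   </platform>

With ``topo_parameters="3,2,2"``, the host range has to cover 3 x 2 x 2 = 12
hosts, hence the range 0 to 11 used in this sketch.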
+ +Hands-on! +--------- + +It is time to start using SMPI yourself. For that, you first need to +install it somehow, and then you will need an MPI application to play with. + +Using Docker +............ + +The easiest way to take the tutorial is to use the dedicated Docker +image. Once you have `installed Docker itself +`_, simply do the following: + +.. code-block:: shell + + docker pull simgrid/tuto-smpi + docker run -it --rm --name simgrid --volume ~/smpi-tutorial:/source/tutorial simgrid/tuto-smpi bash + +This will start a new container with all you need to take this +tutorial, and create a ``smpi-tutorial`` directory in your home directory on +your host machine that will be visible as ``/source/tutorial`` within the +container. You can then edit the files you want with your favorite +editor in ``~/smpi-tutorial``, and compile them within the +container to enjoy the provided dependencies. + +.. warning:: + + Any change to the container outside of ``/source/tutorial`` will be lost + when you log out of the container, so don't edit the other files! + +All needed dependencies are already installed in this container +(SimGrid, the C/C++/Fortran compilers, make, pajeng and R). Vite being +only optional in this tutorial, it is not installed to reduce the +image size. + +The container also includes the example platform files from the +previous section as well as the source code of the NAS Parallel +Benchmarks. These files are available under +``/source/simgrid-template-smpi`` in the image. You should copy them to +your working directory when you first log in: + +.. code-block:: shell + + cp -r /source/simgrid-template-smpi/* /source/tutorial + cd /source/tutorial + +Using your Computer Natively +............................ + +To take the tutorial on your machine, you first need to :ref:`install +SimGrid `, the C/C++/Fortran compilers and also ``pajeng`` to +visualize the traces. You may want to install `Vite +`_ to get a first glance at the +traces.
The provided code template requires make to compile. On +Debian and Ubuntu for example, you can get them as follows: + +.. code-block:: shell + + sudo apt install simgrid pajeng make gcc g++ gfortran vite + +To take this tutorial, you will also need the platform files from the +previous section as well as the source code of the NAS Parallel +Benchmarks. Just clone `this repository +`_ to get them all: + +.. code-block:: shell + + git clone git@framagit.org:simgrid/simgrid-template-smpi.git + cd simgrid-template-smpi/ + +If you struggle with the compilation, then you should double-check +your :ref:`SimGrid installation `. If needed, please refer to +the :ref:`Troubleshooting your Project Setup +` section. + +Lab 0: Hello World +------------------ + +It is time to simulate your first MPI program. Use the simplistic +example `roundtrip.c +`_ +that comes with the template. + +.. literalinclude:: /tuto_smpi/roundtrip.c + :language: c + +Compiling and Executing +....................... + +Compiling the program is straightforward (double-check your +:ref:`SimGrid installation ` if you get an error message): + + +.. code-block:: shell + + $ smpicc -O3 roundtrip.c -o roundtrip + + +Once compiled, you can simulate the execution of this program on 16 +nodes from the ``cluster_crossbar.xml`` platform as follows: + +.. code-block:: shell + + $ smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile ./roundtrip + +- The ``-np 16`` option, just like in regular MPI, specifies the + number of MPI processes to use. +- The ``-hostfile cluster_hostfile`` option, just like in regular + MPI, specifies the host file. If you omit this option, ``smpirun`` + will deploy the application on the first machines of your platform. +- The ``-platform cluster_crossbar.xml`` option, **which doesn't exist + in regular MPI**, specifies the platform configuration to be + simulated.
+- At the end of the line, one finds the executable name and + command-line arguments (if any -- roundtrip does not expect any arguments). + +Feel free to tweak the content of the XML platform file and the +program to see the effect on the simulated execution time. Note that +the simulation accounts for realistic network protocol effects and MPI +implementation effects. As a result, you may see "unexpected behavior" +like in the real world (e.g., sending a message 1 byte larger may lead +to a significantly higher execution time). + +Lab 1: Visualizing LU +--------------------- + +We will now simulate a larger application: the LU benchmark of the NAS +suite. The version provided in the code template was modified to +compile with SMPI instead of the regular MPI. Compare the difference +between the original ``config/make.def.template`` and the +``config/make.def`` that was adapted to SMPI. We use ``smpiff`` and +``smpicc`` as compilers, and don't pass any additional library. + +Now compile and execute the LU benchmark, class A (i.e., for small +data size) with 4 nodes. + +.. code-block:: shell + + $ make lu NPROCS=4 CLASS=A + (compilation logs) + $ smpirun -np 4 -platform ../cluster_backbone.xml bin/lu.A.4 + (execution logs) + +To get a better understanding of what is going on, activate the +visualization tracing, and convert the produced trace for later +use: + +.. code-block:: shell + + smpirun -np 4 -platform ../cluster_backbone.xml -trace --cfg=tracing/filename:lu.A.4.trace bin/lu.A.4 + pj_dump --ignore-incomplete-links lu.A.4.trace | grep State > lu.A.4.state.csv + +You can then produce a Gantt Chart with the following R chunk. You can +either copy/paste it in an R session, or `turn it into an Rscript executable +`_ to +run it again and again. + +.. 
code-block:: R + + library(ggplot2) + + # Read the data + df_state = read.csv("lu.A.4.state.csv", header=F, strip.white=T) + names(df_state) = c("Type", "Rank", "Container", "Start", "End", "Duration", "Level", "State"); + df_state = df_state[!(names(df_state) %in% c("Type","Container","Level"))] + df_state$Rank = as.numeric(gsub("rank-","",df_state$Rank)) + + # Draw the Gantt Chart + gc = ggplot(data=df_state) + geom_rect(aes(xmin=Start, xmax=End, ymin=Rank, ymax=Rank+1,fill=State)) + + # Produce the output + plot(gc) + dev.off() + +This produces a file called ``Rplots.pdf`` with the following +content. You can find more examples of visualization in the `SimGrid +documentation `_. + +.. image:: /tuto_smpi/img/lu.A.4.png + :align: center + +Lab 2: Tracing and Replay of LU +------------------------------- + +Now compile and execute the LU benchmark, class A, with 32 nodes. + +.. code-block:: shell + + $ make lu NPROCS=32 CLASS=A + +This takes several minutes to simulate, because all code from all +processes really has to be executed, and everything is serialized. + +SMPI provides several methods to speed things up. One of them is to +capture a time-independent trace of the running application, and +replay it on a different platform with the same number of nodes. The +replay is much faster than live simulation, as the computations are +skipped (the application must be network-dependent for this to work). + +You can even generate the trace during a live simulation, as follows: + +.. code-block:: shell + + $ smpirun -trace-ti --cfg=tracing/filename:LU.A.32 -np 32 -platform ../cluster_backbone.xml bin/lu.A.32 + +The produced trace is composed of a file ``LU.A.32`` and a folder +``LU.A.32_files``. To replay this with SMPI, you need to first compile +the provided ``smpi_replay.cpp`` file, which comes from +`simgrid/examples/smpi/replay +`_. + +.. 
code-block:: shell + + $ smpicxx ../replay.cpp -O3 -o ../smpi_replay + +Afterward, you can replay your trace in SMPI as follows: + +.. code-block:: shell + + $ smpirun -np 32 -platform ../cluster_torus.xml -ext smpi_replay ../smpi_replay LU.A.32 + +All the outputs are gone, as the application is not really simulated +here. Its trace is simply replayed. But if you visualize the live +simulation and the replay, you will see that the behavior is +unchanged. The simulation does not run much faster on this very +example, but this becomes very interesting when your application +is computationally hungry. + +.. todo:: smpi_replay should be installed by SimGrid, and smpirun interface could be simplified here. + .. LocalWords: SimGrid