image. Once you have `installed Docker itself
<https://docs.docker.com/install/>`_, simply do the following:
.. code-block:: console

   $ docker pull simgrid/tuto-smpi
   $ docker run -it --rm --name simgrid --volume ~/smpi-tutorial:/source/tutorial simgrid/tuto-smpi bash
This will start a new container with all you need to take this
tutorial, and create a ``smpi-tutorial`` directory in your home on
your host machine. The tutorial material is in
``/source/simgrid-template-smpi`` in the image; you should copy it to
your working directory when you first log in:
.. code-block:: console

   $ cp -r /source/simgrid-template-smpi/* /source/tutorial
   $ cd /source/tutorial
Using your Computer Natively
............................
traces. The provided code template requires ``make`` to compile. On
Debian and Ubuntu, you can get them as follows:
.. code-block:: console

   $ sudo apt install simgrid pajeng make gcc g++ gfortran python3 vite
For R analysis of the produced traces, you may want to install R
and the `pajengr <https://github.com/schnorr/pajengr#installation/>`_ package.
.. code-block:: console

   # install R and necessary packages
   $ sudo apt install r-base r-cran-devtools r-cran-tidyverse
   # install pajengr dependencies
   $ sudo apt install git cmake flex bison
   # install the pajengr R package
   $ Rscript -e "library(devtools); install_github('schnorr/pajengr');"
To take this tutorial, you will also need the platform files from the
previous section as well as the source code of the NAS Parallel
Benchmarks. Just clone `this repository
<https://framagit.org/simgrid/simgrid-template-smpi>`_ to get them all:
.. code-block:: console

   $ git clone https://framagit.org/simgrid/simgrid-template-smpi.git
   $ cd simgrid-template-smpi/
If you struggle with the compilation, then you should double-check
your :ref:`SimGrid installation <install>`:
.. code-block:: console

   $ smpicc -O3 roundtrip.c -o roundtrip
Once compiled, you can simulate the execution of this program on 16
nodes from the ``cluster_crossbar.xml`` platform as follows:
.. code-block:: console

   $ smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile ./roundtrip
<https://www.nas.nasa.gov/publications/npb_problem_sizes.html>`_) with
4 nodes.
.. code-block:: console

   $ make lu NPROCS=4 CLASS=S
(compilation logs)
visualization tracing, and convert the produced trace for later
use:
.. code-block:: console

   $ smpirun -np 4 -platform ../cluster_backbone.xml -trace --cfg=tracing/filename:lu.S.4.trace bin/lu.S.4
You can then produce a Gantt Chart with the following R chunk. You can
either copy/paste it into an R session, or `turn it into an Rscript executable
.. code-block:: R

   library(tidyverse)
   library(pajengr)

   # Read the data
   dta <- pajeng_read("lu.S.4.trace")

   # Manipulate the data
   dta$state %>%
      # Remove some unnecessary columns for this example
      select(-Type, -Imbrication) %>%
      # Create the nice MPI rank and operations identifiers
      mutate(Container = as.integer(gsub("rank-", "", Container)),
             Value = gsub("^PMPI_", "MPI_", Value)) %>%
      # Rename some columns so it can better fit MPI terminology
      rename(Rank = Container,
             Operation = Value) -> df.states

   # Draw the Gantt Chart
   df.states %>%
      ggplot() +
      # Each MPI operation becomes a rectangle
      geom_rect(aes(xmin=Start, xmax=End,
                    ymin=Rank, ymax=Rank + 1,
                    fill=Operation)) +
      # Cosmetics
      xlab("Time [seconds]") +
      ylab("Rank [count]") +
      theme_bw(base_size=14) +
      theme(
         plot.margin = unit(c(0,0,0,0), "cm"),
         legend.margin = margin(t = 0, unit='cm'),
         panel.grid = element_blank(),
         legend.position = "top",
         legend.justification = "left",
         legend.box.spacing = unit(0, "pt"),
         legend.box.margin = margin(0,0,0,0),
         legend.title = element_text(size=10)) -> plot

   # Save the plot in a PNG file (dimensions in inches)
   ggsave("smpi.png",
          plot,
          width = 10,
          height = 3)

This produces a file called ``smpi.png`` with the following
content. You can find more visualization examples `online
<https://simgrid.org/contrib/R_visualization.html>`_.
Now compile and execute the LU benchmark, class A, with 32 nodes.
.. code-block:: console

   $ make lu NPROCS=32 CLASS=A
You can even generate the trace during the live simulation as follows:
.. code-block:: console

   $ smpirun -trace-ti --cfg=tracing/filename:LU.A.32 -np 32 -platform ../cluster_backbone.xml bin/lu.A.32
``LU.A.32_files``. You can replay this trace with SMPI thanks to ``smpirun``.
For example, the following command replays the trace on a different platform:
.. code-block:: console

   $ smpirun -np 32 -platform ../cluster_crossbar.xml -hostfile ../cluster_hostfile -replay LU.A.32
.. literalinclude:: /tuto_smpi/gemm_mpi.cpp
:language: cpp
   :lines: 9-24
.. code-block:: console

   $ smpicxx -O3 gemm_mpi.cpp -o gemm
   $ time smpirun -np 16 -platform cluster_crossbar.xml -hostfile cluster_hostfile --cfg=smpi/display-timing:yes --cfg=smpi/host-speed:1000000000 ./gemm
The ``--cfg=smpi/display-timing`` option gives more details about execution
and advises using sampling if the time spent in computing loops seems too high.
The ``--cfg=smpi/host-speed:1000000000`` option sets the speed of the processor used for
running the simulation. Here we say that its speed is the same as one of the
processors we are simulating (1Gf), so that 1 second of computation is injected
as 1 second in the simulation.
.. code-block:: console

   [5.568556] [smpi_kernel/INFO] Simulated time: 5.56856 seconds.
Now run the code again with various sizes and parameters and check the time taken for the
simulation, as well as the resulting simulated time.
.. code-block:: console

   [5.575691] [smpi_kernel/INFO] Simulated time: 5.57569 seconds.
   The simulation took 1.23698 seconds (after parsing and platform setup)
Once done, you can now run
.. code-block:: console

   $ make dt NPROCS=85 CLASS=C
(compilation logs)
and use specific memory for the important parts.
It can be freed afterward with ``SMPI_SHARED_FREE``.
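As a minimal sketch of the pattern (the buffer name and size are made up for illustration; the program must be compiled with ``smpicc`` against SimGrid and run through ``smpirun``):

```c
#include <mpi.h>
#include <smpi/smpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* A large computation buffer whose content does not influence the
     * control flow of the application: all simulated ranks can safely
     * be backed by the same physical memory on the simulation host. */
    size_t n = 1000 * 1000;
    double *buffer = SMPI_SHARED_MALLOC(n * sizeof(double));

    /* ... computations reading and writing buffer ... */

    SMPI_SHARED_FREE(buffer);
    MPI_Finalize();
    return 0;
}
```

Because the memory is shared across ranks, the results computed in such a buffer are meaningless; this is only valid for buffers whose content does not drive the application's behavior.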
If allocations are performed with ``malloc`` or ``calloc``, SMPI (from version 3.25) provides the option
``--cfg=smpi/auto-shared-malloc-thresh:n``, which replaces all allocations above n bytes with
shared allocations. The value has to be selected carefully, so that smaller control arrays,
which contain data necessary for the completion of the run, are not shared.
Try to run the (non-modified) DT example again with values going from 10 to 100,000 to see that
values that are too small can cause crashes.

A useful option to identify the largest allocations in the code is ``--cfg=smpi/display-allocs:yes`` (from version 3.27).
At the end of a (successful) run, it displays the largest allocations and their locations, helping you pinpoint the
targets for sharing or set the threshold for automatic sharing.
For DT, the process would be to run a smaller class of the problem:

.. code-block:: console

   $ make dt NPROCS=21 CLASS=A
   $ smpirun --cfg=smpi/display-allocs:yes -np 21 -platform ../cluster_backbone.xml bin/dt.A.x BH

which should output:

.. code-block:: console

   [smpi_utils/INFO] Memory Usage: Simulated application allocated 198533192 bytes during its lifetime through malloc/calloc calls.
   Largest allocation at once from a single process was 3553184 bytes, at dt.c:388. It was called 3 times during the whole simulation.
   If this is too much, consider sharing allocations for computation buffers.
   This can be done automatically by setting --cfg=smpi/auto-shared-malloc-thresh to the minimum size wanted size (this can alter execution if data content is necessary)
From there, we can identify ``dt.c:388`` as the main allocation, and the best target to convert to
shared mallocs for larger simulations. Furthermore, with 21 processes, we see that this particular
allocation and size was only reached 3 times, which means that other processes likely allocate
less memory here (imbalance). Using 3553184 as the threshold value might therefore be unwise, as most
processes would not share memory, so a lower threshold is advisable.

Further Readings
----------------