docs/source/intro_concepts.rst

   1 .. First introduction
   2
   3 What is SimGrid
   4 ===============
   5
   6 SimGrid is a framework to simulate distributed computer systems.
   7
   8 It can be used either to assess abstract algorithms or to profile and
   9 debug real distributed applications.  SimGrid enables studies in the
  10 domains of (data-)Grids, IaaS Clouds, Clusters, High Performance
  11 Computing, Volunteer Computing, and Peer-to-Peer systems.
  12
  13 Technically speaking, SimGrid is a library. It is neither a graphical
  14 interface nor a command-line simulator running user scripts. You
  15 interact with SimGrid by writing programs that call its exposed
  16 functions to build your own simulator.
  17
  18 SimGrid is free software distributed under the LGPLv3 license. You are
  19 thus welcome to use it as you wish or even to modify and distribute
  20 your version (provided that your version is as free as ours). It also
  21 means that SimGrid is developed by a vibrant community of users and
  22 developers. We hope that you will come and join us!
  23
  24 SimGrid is the result of almost 20 years of research from several
  25 groups, both in France and in the U.S.A. It has received funding
  26 from various research bodies, including the ANR, Inria, CNRS,
  27 University of Lorraine, University of Hawai'i at Manoa, ENS Rennes, and
  28 many others. Many thanks to our generous sponsors!
  29
  30 SimGrid is a powerful tool, but its learning curve can be rather
  31 steep. This manual will hopefully help and guide you to the features
  32 you want to use. Please report any issue that you see in this manual,
  33 including typos or unclear elements. You can even propose changes by
  34 clicking on the "Edit on GitLab" button at the top of every page.
  35
  36 Typical Study based on SimGrid
  37 ------------------------------
  38
  39 .. raw:: html
  40
  41    <object data="graphical-toc.svg" width="100%" type="image/svg+xml"></object>
  42
  43
  44 Any SimGrid study entails the following components:
  45
  46  - The studied **Application**. This can be either a distributed
  47    algorithm described in our simple APIs, or a full-featured real
  48    parallel application using, for example, the MPI interface
  49    :ref:`(more info) <application>`.
  50
  51  - The **Simulated Platform**. This is a description of a given
  52    distributed system (machines, links, disks, clusters, etc.). Most of
  53    the platform files are written in XML, although a Lua interface is
  54    under development.  SimGrid makes it easy to augment the Simulated
  55    Platform with a Dynamic Scenario where, for example, the links are
  56    slowed down (because of external usage) or the machines fail. You
  57    can even specify the application workload that you want
  58    to feed to your application
  59    :ref:`(more info) <platform>`.
  60
  61  - The application's **Deployment Description**. In SimGrid
  62    terminology, the application is an inert set of source files and
  63    binaries. To make it run, you have to describe how your application
  64    should be deployed on the simulated platform. You need to specify
  65    which process is mapped on which machine, along with their parameters
  66    :ref:`(more info) <scenario>`.
  67
  68  - The **Platform Models**. They describe how the simulated platform
  69    reacts to the actions of the application. For example, they compute
  70    the time taken by a given communication on the simulated platform.
  71    These models are already included in SimGrid, and you only need to
  72    pick one and maybe tweak its configuration to get your results
  73    :ref:`(more info) <models>`.
  74
  75 These components are put together to run a **simulation**, that is, an
  76 experiment or a probe. The results of one or many simulations provide
  77 an **outcome** (logs, visualization, or statistical analysis) that helps
  78 answer the **question** targeted by this study.
  79
  80 Here are some questions for which SimGrid is particularly relevant:
  81
  82  - **Compare an Application to another**. This is the classical use
  83    case for scientists, who use SimGrid to test how the solution that
  84    they contribute to compares to the existing solutions from the
  85    literature.
  86
  87  - **Design the best [Simulated] Platform for a given Application.**
  88    Tweaking the platform file is much easier than building a new real
  89    platform for testing purposes. SimGrid also allows for the co-design
  90    of the platform and the application by modifying both of them.
  91
  92  - **Debug Real Applications**. With real systems, it is sometimes
  93    difficult to reproduce the exact run leading to the bug that you
  94    are tracking. With SimGrid, you are *clairvoyant* about your
  95    *reproducible experiments*: you can explore every part of the
  96    system, and your probe will not change the simulated state. It also
  97    makes it easy to mock some parts of the real system that are not
  98    under study.
  99
 100 Depending on the context, you may see some parts of this process as
 101 less important, but you should pay close attention if you want to be
 102 confident in the results coming out of your simulations. In
 103 particular, you should not blindly trust your results but always
 104 strive to double-check them. Likewise, you should question the realism
 105 of your input configuration, and we even encourage you to doubt (and
 106 check) the provided performance models.
 107
 108 To ease such questioning, you really should logically separate these
 109 parts in your experimental setup. It is seen as a very bad practice to
 110 merge the application, the platform, and the deployment all together.
 111 SimGrid is versatile and your mileage may vary, but you should start
 112 with your Application specified as a C++ or Java program, using one of
 113 the provided XML platform files, and with your deployment in a separate
 114 XML file.
 115
 116 SimGrid Execution Modes
 117 -----------------------
 118
 119 Depending on the intended study, SimGrid can be run in several execution modes.
 120
 121 **Simulation Mode**. This is the most common execution mode, where you want
 122 to study how your application behaves on the simulated platform under
 123 the experimental scenario.
 124
 125 In this mode, SimGrid can provide information about the time taken by
 126 your application, the amount of energy dissipated by the platform to
 127 run your application, and the detailed usage of each resource.
 128
 129 **Model-Checking Mode**. This can be seen as a sort of exhaustive
 130 testing mode, where every possible outcome of your application is
 131 explored. In some sense, this mode tests your application for all
 132 possible platforms that you could imagine (and more).
 133
 134 You just provide the application and its deployment (number of
 135 processes and parameters), and the model-checker will literally
 136 explore all possible outcomes by testing all possible message
 137 interleavings: if at some point a given process can either receive the
 138 message A first or the message B depending on the platform
 139 characteristics, the model-checker will explore the scenario where A
 140 arrives first, and then rewind to the same point to explore the
 141 scenario where B arrives first.
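
The exploration described above can be pictured with a small toy sketch.
It is not SimGrid's model checker and uses none of its APIs: the
``explore`` function and the ``first_is_a`` property are hypothetical names
introduced only for illustration, and no reduction technique is applied.

```python
from typing import List, Tuple

Order = Tuple[str, ...]


def explore(pending: Order, delivered: Order = ()) -> List[Order]:
    """Enumerate every delivery order of the pending messages.

    Toy depth-first exploration: pick one pending message, explore
    everything that can follow, then come back to the same point and
    pick the next one (the "rewind" step).
    """
    if not pending:
        return [delivered]
    orders: List[Order] = []
    for i, message in enumerate(pending):
        remaining = pending[:i] + pending[i + 1:]
        orders.extend(explore(remaining, delivered + (message,)))
    return orders


def first_is_a(order: Order) -> bool:
    """Hypothetical safety property: message A is always handled first."""
    return order[0] == "A"


orders = explore(("A", "B"))
violations = [order for order in orders if not first_is_a(order)]
print(orders)      # [('A', 'B'), ('B', 'A')]
print(violations)  # [('B', 'A')]
```

With three pending messages the same function returns six orders, which
hints at how quickly the number of scenarios grows.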
 142
 143 This is a very powerful mode, where you can evaluate the correctness of
 144 your application. It can verify either **safety properties** (asserts)
 145 or **liveness properties** stating for example that if a given event
 146 occurs, then another given event will occur in a finite number of
 147 steps. This mode is not only usable with the abstract algorithms
 148 developed on top of the SimGrid APIs, but also with real MPI
 149 applications (to some extent).
 150
 151 The main limit of Model Checking lies in the huge number of scenarios
 152 to explore. SimGrid tries to explore only non-redundant scenarios
 153 thanks to classical reduction techniques (such as DPOR and stateful
 154 exploration) but the exploration may well never finish if you don't
 155 carefully adapt your application to this mode.
 156
 157 A classical trap is that the Model Checker can only verify whether
 158 your application fits the provided properties, which is useless if you
 159 have a bug in your property. Remember also that one way for your
 160 application to never violate a given assert is to not start at all
 161 because of a stupid bug.
 162
 163 Another limit of this mode is that it does not use the performance
 164 models of the simulation mode. Time becomes discrete: You can say for
 165 example that the application took 42 steps to run, but there is no way
 166 to know how much time it took or how much energy was dissipated.
 167
 168 Finally, the model checker only explores the interleavings of
 169 computations and communications. Other factors such as thread
 170 execution interleaving are not considered by the SimGrid model
 171 checker.
 172
 173 The model checker may well miss existing issues, as it computes the
 174 possible outcomes *from a given initial situation*. There is no way to
 175 prove the correctness of your application in all generality with this
 176 tool.
 177
 178 **Benchmark Recording Mode**. During debug sessions, continuous
 179 integration testing, and other similar use cases, you are often only
 180 interested in the control flow. If your application applies filters to
 181 huge images split in small blocks, the filtered image is probably not
 182 what you are interested in. You are probably looking for a way to run
 183 each computation kernel only once, save on disk the time it takes and
 184 some other metadata. This code block can then be skipped in simulation
 185 and replaced by a synthetic block using the cached information. The
 186 simulated platform will take this block into account without requesting
 187 the real hosting machine to benchmark it.
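
The caching idea behind this mode can be sketched as follows. This is only
an illustration under stated assumptions, not SimGrid's implementation or
API: ``run_or_replay``, the kernel name, and the JSON cache file are
hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict


def run_or_replay(name: str, kernel: Callable[[], None], cache_file: Path) -> float:
    """Return the duration of a kernel, running it at most once.

    On the first call the kernel is executed and timed, and the duration
    is saved on disk. Later calls skip the kernel and return the cached
    duration, which a simulator could then inject as a synthetic block.
    """
    cache: Dict[str, float] = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text())
    if name in cache:
        return cache[name]  # replay: the kernel is not executed again
    start = time.perf_counter()
    kernel()
    cache[name] = time.perf_counter() - start
    cache_file.write_text(json.dumps(cache))
    return cache[name]
```

A caller would wrap each computation kernel, for example
``run_or_replay("blur_block", blur, Path("bench.json"))``, where ``blur`` is
whatever function applies the filter to one block.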
 188
 189 SimGrid Limits
 190 --------------
 191
 192 This framework is by no means the perfect holy grail able to solve
 193 every problem on earth.
 194
 195 **SimGrid's scope is limited to distributed systems.** Real-time
 196 multi-threaded systems are out of scope. You could probably tweak
 197 SimGrid for such studies (or the framework could possibly be extended
 198 in this direction), but another framework specifically targeting such a
 199 use case would probably be better suited.
 200
 201 **There is currently no support for wireless networks**.
 202 The framework could certainly be improved in this direction, but this
 203 still has to be done.
 204
 205 **There is no perfect model, only models adapted to your study.**
 206 The SimGrid models target fast, large-scale studies that still require
 207 realistic results. In particular, our models abstract away parameters
 208 and phenomena that are often irrelevant to the realism in our
 209 context.
 210
 211 SimGrid is simply not intended for any study that depends on the
 212 abstracted phenomena. Here are some **studies that you should not do
 213 with SimGrid**:
 214
 215  - Studying L3 vs. L2 cache effects on your application
 216  - Comparing kernel schedulers and policies
 217  - Comparing variants of TCP
 218  - Exploring pathological cases where TCP breaks down, resulting in
 219    abnormal executions.
 220  - Studying security aspects of your application, in presence of
 221    malicious agents.
 222
 223 SimGrid Success Stories
 224 -----------------------
 225
 226 SimGrid has been cited in over 1,500 scientific papers (according to Google
 227 Scholar). Among them,
 228 `over 200 publications <https://simgrid.org/Usages.html>`_
 229 (written by about 300 individuals) use SimGrid as a scientific
 230 instrument to conduct their experimental evaluation. These
 231 numbers do not include the articles contributing to SimGrid.
 232 This instrument was used in many research communities, such as
 233 `High-Performance Computing <https://hal.inria.fr/inria-00580599/>`_,
 234 `Cloud Computing <http://dx.doi.org/10.1109/CLOUD.2015.125>`_,
 235 `Workflow Scheduling <http://dl.acm.org/citation.cfm?id=2310096.2310195>`_,
 236 `Big Data <https://hal.inria.fr/hal-01199200/>`_ and
 237 `MapReduce <http://dx.doi.org/10.1109/WSCAD-SSC.2012.18>`_,
 238 `Data Grid <http://ieeexplore.ieee.org/document/7515695/>`_,
 239 `Volunteer Computing <http://www.sciencedirect.com/science/article/pii/S1569190X17301028>`_,
 240 `Peer-to-Peer Computing <https://hal.archives-ouvertes.fr/hal-01152469/>`_,
 241 `Network Architecture <http://dx.doi.org/10.1109/TPDS.2016.2613043>`_,
 242 `Fog Computing <http://ieeexplore.ieee.org/document/7946412/>`_, or
 243 `Batch Scheduling <https://hal.archives-ouvertes.fr/hal-01333471>`_
 244 `(more info) <https://simgrid.org/Usages.html>`_.
 245
 246 If your platform description is accurate enough (see
 247 `here <http://hal.inria.fr/hal-00907887>`_ or
 248 `there <https://hal.inria.fr/hal-01523608>`_),
 249 SimGrid can provide high-quality performance predictions. For example,
 250 we determined the speedup achieved by the Tibidabo ARM-based
 251 cluster before its construction
 252 (`paper <http://hal.inria.fr/hal-00919507>`_). In this case,
 253 some differences between the prediction and the real timings were due to
 254 misconfiguration or other problems with the real platform. To some extent,
 255 SimGrid could even be used to debug the real platform :)
 256
 257 SimGrid is also used to debug, improve, and tune several large
 258 applications, including
 259 `BigDFT <http://bigdft.org>`_ (a massively parallel code
 260 computing the electronic structure of chemical elements developed by
 261 the CEA), `StarPU <http://starpu.gforge.inria.fr/>`_ (a
 262 Unified Runtime System for Heterogeneous Multicore Architectures
 263 developed by Inria Bordeaux), and
 264 `TomP2P <https://tomp2p.net/dev/simgrid/>`_ (a high-performance
 265 key-value pair storage library developed at the University of Zurich).
 266 Some of these applications enjoy large user communities themselves.
 267
 268 ..  LocalWords:  SimGrid