docs/source/main_concepts.rst

.. First introduction

What is SimGrid
===============

SimGrid is a framework to simulate distributed computer systems.

It can be used either to assess abstract algorithms or to profile and
debug real distributed applications. SimGrid enables studies in the
domains of (data-)Grids, IaaS Clouds, Clusters, High Performance
Computing, Volunteer Computing and Peer-to-Peer systems.

Technically speaking, SimGrid is a library. It is neither a graphical
interface nor a command-line simulator running user scripts. You
interact with SimGrid by writing programs that use its exposed
functions to build your own simulator.

SimGrid offers many features, many options and many possibilities. This
documentation aims at smoothing the learning curve. But nothing is
perfect, and this documentation is no exception. Please help us
improve it by reporting any issue that you see and by proposing the
content that is still missing.

SimGrid is Free Software distributed under the LGPL licence. You are
thus welcome to use it as you wish, or even to modify and distribute
your version (as long as your version is as free as ours). It also
means that SimGrid is developed by a vivid community of users and
developers. We hope that you will come and join us!

SimGrid is the result of almost 20 years of research from several
groups, both in France and in the USA. It benefited from the funding
of many research institutions and agencies, including the ANR, Inria,
CNRS, University of Lorraine, University of Hawai'i at Manoa, ENS
Rennes and many others. Many thanks to our generous sponsors!

Typical Study based on SimGrid
------------------------------

Any SimGrid study entails the following components:

 - The studied **Application**. This can be either a distributed
   algorithm described in our simple APIs, or a full-featured real
   parallel application using, for example, the MPI interface
   :ref:`(more info) <application>`.

 - The **Virtual Platform**. This is a description of a given
   distributed system (machines, links, disks, clusters, etc.). Most
   platform files are written in XML, although a Lua interface is
   under development. SimGrid makes it easy to augment the Virtual
   Platform with a Dynamic Scenario where, for example, the links are
   slowed down (because of external usage) or the machines fail. You
   can even specify the applicative workload that you want to feed
   to your application :ref:`(more info) <platform>`.

 - The application's **Deployment Description**. In SimGrid
   terminology, the application is an inert set of source files and
   binaries. To make it run, you have to describe how your application
   should be deployed on the virtual platform. You need to specify
   which process is mapped onto which host, along with its parameters
   :ref:`(more info) <deployment>`.

 - The **Platform Models**. They describe how the virtual platform
   reacts to the actions of the application. For example, they compute
   the time taken by a given communication on the virtual platform.
   These models are already included in SimGrid, and you only need to
   pick one and maybe tweak its configuration to get your results
   :ref:`(more info) <models>`.

These components are put together to run a **simulation**, that is an
experiment or a probe. The result of one or many simulations provides
an **outcome** (logs, visualization, statistical analysis) that helps
answer the **question** targeted by the study.

The questions that SimGrid can help answer include the following:

 - **Compare an Application to another**. This is the classical use
   case for scientists, who use SimGrid to test how the solution that
   they contribute compares to the existing solutions from the
   literature.

 - **Design the best Virtual Platform for a given Application**.
   Tweaking the platform file is much easier than building a new real
   platform for testing purposes. SimGrid also allows the co-design of
   the platform and the application by modifying both of them.

 - **Debug Real Applications**. With real systems, it is sometimes
   difficult to reproduce the exact run leading to the bug that you
   are tracking. SimGrid gives you experimental reproducibility and
   clairvoyance (you can explore every part of the system, and your
   probe will not change the simulated state). It also makes it easy
   to mock some parts of the real system that are not under study.

Depending on the context, you may see some parts of this process as
less important, but you should pay close attention to all of them if
you want to be confident in the results coming out of your
simulations. In particular, you should not blindly trust your results,
but always strive to double-check them. Likewise, you should question
the realism of your input configuration, and we even encourage you to
doubt (and check) the provided performance models.

To ease such questioning, you should logically separate these parts in
your experimental setup. Merging the application, the platform and the
deployment all together is considered very bad practice. SimGrid is
versatile and your mileage may vary, but you should start with your
Application specified as a C++ or Java program, using one of the
provided XML platform files, and with your deployment in a separate
XML file.

SimGrid Execution Modes
-----------------------

Depending on the intended study, SimGrid can be run in several execution modes.

**Simulation Mode**. This is the most common execution mode, where you
want to study how your application behaves on the virtual platform
under the experimental scenario.

In this mode, SimGrid can provide information about the time taken by
your application, the amount of energy dissipated by the platform to
run your application, and the detailed usage of each resource.

**Model-Checking Mode**. This can be seen as a sort of exhaustive
testing mode, where every possible outcome of your application is
explored. In some sense, this mode tests your application for all
possible platforms that you could imagine (and more).

You just provide the application and its deployment (number of
processes and their parameters), and the model checker will literally
explore all possible outcomes by testing all possible message
interleavings: if at some point a given process can receive either
message A or message B first, depending on the platform
characteristics, the model checker will explore the scenario where A
arrives first, and then rewind to the same point to explore the
scenario where B arrives first.

This is a very powerful mode, where you can evaluate the correctness
of your application. It can verify either **safety properties**
(asserts) or **liveness properties** stating, for example, that if a
given event occurs, then another given event will occur in a finite
number of steps. This mode is usable not only with the abstract
algorithms developed on top of the SimGrid APIs, but also (to some
extent) with real MPI applications.

The main limit of Model Checking lies in the huge number of scenarios
to explore. SimGrid tries to explore only non-redundant scenarios
thanks to classical reduction techniques (such as DPOR and stateful
exploration), but the exploration may well never finish if you don't
carefully adapt your application to this mode.

A classical trap is that the model checker can only verify whether
your application fits the provided properties, which is useless if you
have a bug in your property. Remember also that one way for your
application to never violate a given assert is to not start at all
because of a stupid bug.

Another limit of this mode is that it does not use the performance
models of the simulation mode. Time becomes discrete: you can say, for
example, that the application took 42 steps to run, but there is no
way to know how many seconds it took or how much energy it dissipated.

Finally, the model checker only explores the interleavings of
computations and communications. Other factors, such as thread
execution interleavings, are not considered by the SimGrid model
checker.

The model checker may well miss existing issues, as it computes the
possible outcomes *from a given initial situation*. There is no way to
prove the correctness of your application in all generality with this
tool.

**Benchmark Recording Mode**. During debug sessions, continuous
integration testing and other similar use cases, you are often only
interested in the control flow. If your application applies filters to
huge images split into small blocks, the filtered image is probably not
what you are interested in. You are probably looking for a way to run
each computation kernel only once, and to save on disk the time it
takes along with some other metadata. This code block can then be
skipped in simulation and replaced by a synthetic block using the
cached information. The virtual platform will take this block into
account without requesting the real hosting machine to benchmark it.

SimGrid Limits
--------------

This framework is by no means the perfect holy grail able to solve
every problem on earth.

**SimGrid scope is limited to distributed systems.** Real-time
multithreaded systems are not in the scope. You could probably tweak
SimGrid for such studies (or the framework could possibly be extended
in this direction), but another framework specifically targeting this
use case would probably be more suited.

**There is currently no support for IoT studies and wireless networks.**
The framework could certainly be improved in this direction, but this
is still to be done.

**There is no perfect model, only models adapted to your study.**
The SimGrid models target fast, large studies that still require
realistic results. In particular, our models abstract away parameters
and phenomena that are often irrelevant to the realism in our context.

SimGrid is simply not intended for any study that would mandate the
abstracted phenomena. Here are some **studies that you should not do
with SimGrid**:

 - Studying the impact of L3 vs L2 caches on your application.
 - Comparing variants of TCP.
 - Exploring pathological cases where TCP breaks down, resulting in
   abnormal executions.
 - Studying security aspects of your application, in presence of
   malicious agents.

SimGrid Success Stories
-----------------------

SimGrid was cited in over 1,500 scientific papers (according to Google
Scholar). Among them,
`over 200 publications <http://simgrid.gforge.inria.fr/Usages.php>`_
(written by about 300 individuals) use SimGrid as a scientific
instrument to conduct their experimental evaluation. These
numbers do not count the articles contributing to SimGrid.
This instrument was used in many research communities, such as
`High-Performance Computing <https://hal.inria.fr/inria-00580599/>`_,
`Cloud Computing <http://dx.doi.org/10.1109/CLOUD.2015.125>`_,
`Workflow Scheduling <http://dl.acm.org/citation.cfm?id=2310096.2310195>`_,
`Big Data <https://hal.inria.fr/hal-01199200/>`_ and
`MapReduce <http://dx.doi.org/10.1109/WSCAD-SSC.2012.18>`_,
`Data Grid <http://ieeexplore.ieee.org/document/7515695/>`_,
`Volunteer Computing <http://www.sciencedirect.com/science/article/pii/S1569190X17301028>`_,
`Peer-to-Peer Computing <https://hal.archives-ouvertes.fr/hal-01152469/>`_,
`Network Architecture <http://dx.doi.org/10.1109/TPDS.2016.2613043>`_,
`Fog Computing <http://ieeexplore.ieee.org/document/7946412/>`_, or
`Batch Scheduling <https://hal.archives-ouvertes.fr/hal-01333471>`_
`(more info) <http://simgrid.gforge.inria.fr/Usages.php>`_.

If your platform description is accurate enough (see
`here <http://hal.inria.fr/hal-00907887>`_ or
`there <https://hal.inria.fr/hal-01523608>`_),
SimGrid can provide high-quality performance predictions. For example,
we determined the speedup achieved by the Tibidabo Arm-based
cluster before its construction
(`paper <http://hal.inria.fr/hal-00919507>`_). In this case,
some differences between the prediction and the real timings were due
to misconfiguration or other problems with the real platform. To some
extent, SimGrid could even be used to debug the real platform :)

SimGrid is also used to debug, improve and tune several large
applications, such as
`BigDFT <http://bigdft.org>`_ (a massively parallel code
computing the electronic structure of chemical elements, developed by
the CEA), `StarPU <http://starpu.gforge.inria.fr/>`_ (a
Unified Runtime System for Heterogeneous Multicore Architectures
developed by Inria Bordeaux) and
`TomP2P <https://tomp2p.net/dev/simgrid/>`_ (a high-performance
key-value pair storage library developed at the University of Zurich).
Some of these applications enjoy large user communities themselves.

Where to proceed next?
----------------------

Now that you know about the basic concepts of SimGrid, you can give it
a try. If it's not done yet, first :ref:`install it <install>`. Then,
proceed to the section on :ref:`describing the application <application>`
that you want to study.
 266 proceed to the section on @ref application "describing the application" that
 267 you want to study.