doc/gtut-introduction.doc

   1 /**
   2 @page GRAS_tut_intro Introduction to the GRAS framework
   3
   4 \htmlinclude .gtut-introduction.doc.toc
   5
   6 \section GRAS_tut_intro_toc What will you find here
   7
   8  - The section \ref GRAS_tut_intro_what explains what the GRAS framework and how it
   9    relates to other existing solutions.
  10  - The section \ref GRAS_tut_intro_model presents somehow formaly the programmation
  11    model used in GRAS.
  12
  13 \section GRAS_tut_intro_further Further readings
  14
  15 After this page, you may find these one interesting:
  16 \ref GRAS_howto_design. If you're new to GRAS, you may want to read the
  17 initiatic tour first, begining with \ref GRAS_tut_tour_install or
  18 \ref GRAS_tut_tour_setup.
  19
  20 <hr>
  21
  22 \section GRAS_tut_intro_what What is GRAS
  23
  24 GRAS is a framework to implement and study distributed algorithms. It
  25 provides a simple communication API to allow several processes to
  26 interoperate through the exchange of messages. This is quite classical, and
  27 GRAS differs from the other existing messaging API by several points:
  28
  29   - \ref GRAS_tut_intro_what_2ways
  30   - \ref GRAS_tut_intro_what_dist
  31   - \ref GRAS_tut_intro_what_grid
  32   - \ref GRAS_tut_intro_what_target
  33   - \ref GRAS_tut_intro_what_simple
  34
  35 We now detail each of these points.
  36
  37 \subsection GRAS_tut_intro_what_2ways GRAS allows you to run both in simulation mode and on real platforms
  38
  39 We wrote two implementations of the interface: the first one is built on top
  40 of the SimGrid simulator, allowing you to run your application in a
  41 controled environment, which reveals precious to debug and study algorithms.
  42 Everyone who tried to run even simple tests on more than 100 real machines
  43 will consider a simulator as a nirvana.
  44
  45 The experiments can be reproduced in the exact same conditions (which is
  46 somehow hard in real settings), allowing for example to reproduce a bug as
  47 many times as you want while debugging. You can also test your algorithm
  48 under experimental conditions which you couldn't achieve on a real platform
  49 (like a network topology and/or size you do don't have access to). Under
  50 some conditions, SimGrid simulations are also much faster than real
  51 executions, allowing you to run more experiments in less time.
  52
  53 Once you assessed the quality of your algorithm in the simulator, you can
  54 deploy it on real platforms using the second implementation of the library.
  55 Usually, taking an algorithm out of a simulator implies an almost complete
  56 rewrite. There is no need to modify your program for this in GRAS. You don't
  57 even need to recompile it, but simply to relink your program it against the
  58 right library.
  59
  60 GRAS applications running on real hardware deliver high performance.
  61 The sequential parts of your code are not mediated by GRAS or slowed down
  62 anyhow. The communications use advanced data exchange and conversion
  63 mecanism ensuring that you are likely to get performance at least comparable
  64 to other communication solutions (FIXME: cite the paper once it gets
  65 accepted).
  66
  67 GRAS applications are portable on several operating systems (Linux, MacOS X,
  68 Solaris, IRIX, AIX and soon Windows) and several processor architectures
  69 (x86, amd64, ppc, sparc, etc). Moreover, GRAS processes can interoperate
  70 efficiently even when deployed on differing material. You can for example
  71 have a process deployed on ppc/MacOS X interacting transparently with
  72 another one deployed on alpha/Linux.
  73
  74 The simulation mode of GRAS is called usually SG (for SimGrid) while the in
  75 situ execution mode is called RL (for Real Life).
  76
  77 \subsection GRAS_tut_intro_what_dist GRAS was designed for distributed computing, not parallel computing
  78
  79 In GRAS, you build your algorithm as a set of independent processes
  80 interacting through messages. This is the well known MPMD model (multiple
  81 program, multiple data). It contrasts to the SPMD model (simple program,
  82 multiple data) and communications solutions such as MPI or PVM, where you
  83 build an uniq program with conditionnals here and there specifying what each
  84 processes should do (something like "If I'm process number 0, then send data
  85 to the others, else get the data sent to me").
  86
  87 None of these models are inherently better than the other, and there is a
  88 plenty of algorithms betterly expressed in the SPMD paradigm. If your
  89 program falls into that category, then GRAS may not be the right tool for
  90 you. We think however that most non-sequential algorithms can be expressed
  91 gracefully in a MPMD way where some are really difficult to express in a
  92 SPMD way.
  93
  94 There is no parallelism in GRAS, and it is discouraged to introduce threads
  95 in GRAS (althrough it should be possible in some months). This is an explict
  96 choice since threads are so hard to use (see the section \ref
  97 GRAS_tut_intro_what_simple below). The framework itself do use threads to
  98 achieve good performances, but I don't want to impose this to users (FIXME:
  99 actually, GRAS is not multi-threaded yet internally, but I plan to do so
 100 really soon).
 101
 102 \subsection GRAS_tut_intro_what_grid GRAS was designed for large scale computing
 103
 104 Another difference to the MPI communication libraries is that GRAS was not
 105 designed for static small-sized platforms such as clusters, but to dynamic
 106 larger-scale platforms such as grids. That is why GRAS do include static
 107 membership solutions such as the MPI channels. Support for fault-tolerance
 108 is also provided through the timeouts on communication primitives and
 109 through an exception mecanism.
 110
 111 GRAS also comes with a sister library called AMOK containing several usefull
 112 building block for large scale network aware applications. The most
 113 proheminent one allows to assess the network availabilities through active
 114 testing, just like the classical NWS tool in the grid research community. We
 115 are actively working on a network topology discovery mecanism and a
 116 distributed locking solution. Some other modules are planned, such as
 117 reliable broacasting in open environments.
 118
 119 \subsection GRAS_tut_intro_what_target GRAS targets at applicative overlay rather than end-user application
 120
 121 The application class targeted by GRAS is constituted of so called overlays.
 122 They do not constitute a complete application by themselves, but can be seen
 123 as a "distributed library", a thing offering offering a service to another
 124 application through a set of physically distributed entities. An example of
 125 such overlay could be a monitoring system allowing you to retrieve the
 126 available bandwidth between two remote hosts. It could be used in a
 127 network-aware parallel matrix multiplication library assigning more work to
 128 well interconnected nodes. I wouldn't advice to build a physical or
 129 biological compututation program on top of GRAS, even if it would be
 130 possible in theory.
 131
 132 In other words, GRAS is not a grid middleware in the common understanding of
 133 the world, but rather a tool to constitute the building bricks of such a
 134 middleware. GRAS is thus a sort of "underware" ;)
 135
 136 \subsection GRAS_tut_intro_what_simple GRAS tries to remain simple to use
 137
 138 A lot of effort was put into the framework so that it remains simple to the
 139 users. For example, you can exchange structured data (any kind of C data
 140 structure) just by passing its address, and the framework will create the
 141 exact same structure on the receiver side.
 142
 143 There is no threads like the pthread ones in GRAS, and it is not planned to
 144 introduce this in the future. This is an explicit choice since I consider
 145 multi-threading as too complicated for usual users. There is too much
 146 non-determinism, too many race conditions, and too few language-level
 147 constructs to keep yourself from screwing up. This idea is well expressed
 148 by John Ousterhout in <i>Why Threads Are a Bad Idea (for most purposes)</i>,
 149 published at USENIX'96. See section \ref GRAS_tut_intro_what_dist for
 150 platform performance consideration.
 151
 152 For the user code, I plan to allow the co-existance of several "gras
 153 processes" within the same regular unix process. The communication semantic
 154 will still be message-oriented, even if implemented using the shared memory
 155 for efficiency.
 156
 157 Likewise, there is no interuption mecanism in GRAS which could break the
 158 user code execution flow. When you write a function, you can be absolutely
 159 sure that nothing will happen between each lines of it. This assumption
 160 considerably simplify the code written in GRAS. The main use of of
 161 interruptions in a distributed application is to timeout communications when
 162 they fail. GRAS communication calls allow to setup a timeout value, and
 163 handle it internally (see below).
 164
 165 The only interruption mecanism used is constituted by exceptions, just like
 166 in C++ or Java (but implemented directly in C). They are propagated from the
 167 point where they are raised to a point where they will be trapped, if any,
 168 or abort the execution when not trapped. You can still be certain that
 169 nothing will happen between two lines of your code, but the second line may
 170 never be executed if the first one raises an exception ;)
 171
 172 This exception mecanism was introduced because without it, user code has to
 173 be loaded by tons of non-functional code to check whether an operation was
 174 properly performed or whether you have to pass the error condition to your
 175 caller.
 176
 177 <hr>
 178
 179 \section GRAS_tut_intro_model The model provided by GRAS
 180
 181 From a more formal point of view, GRAS overlays (=applications) can be seen
 182 as a set of state machines mainly interacting with messages. Because of the
 183 distributed setting of overlays, the internal state of each process cannot
 184 be accessed or modified directly by other processes. Even when it would be
 185 possible pratically (like in SG), it is forbidden by the model. This makes
 186 it difficult to gain a complete knowledge on the global system state. This
 187 global system state can still be defined by agregating the states of each
 188 processes, but this remains theoretical and impratical because of the
 189 probable combinatorial explosion.
 190
 191  - \ref GRAS_tut_intro_model_events
 192  - \ref GRAS_tut_intro_model_commmodel
 193  - \ref GRAS_tut_intro_model_timing_policy
 194  - \ref GRAS_tut_intro_model_exception
 195  - \ref GRAS_tut_intro_model_rpc
 196
 197 \subsection GRAS_tut_intro_model_events Event types
 198
 199 Two main types of events may change the internal state of a given process:
 200
 201  - <b>Incomming messages</b>. Messages are somehow strongly typed: a message
 202    type is described by its name (a string), and the C datatype of its
 203    payload. Any message of the same type will convey the same datatype, but
 204    of course the actual content of the payload may change from message to
 205    message of the same type.\n
 206    \n
 207    Processes may attach <b>callback functions</b> to the arrival of messages
 208    of a given type. They describe the action to achieve to handle the
 209    messages during the transition associated to this event.\n
 210    \n
 211    Incoming messages are not handled as soon as they arrive, but only when
 212    the process declares to be ready to accept incomming events (using \ref
 213    gras_msg_handle or related functions). It ensures that the treatment of a
 214    given message won't run in parallel to any other callback, so that
 215    process globals (its state) can be accessed and modified without
 216    locking.\n
 217    \n
 218    Messages received when the process is not ready to consume them are
 219    queued, and will be processed in order in the subsequent calls to \ref
 220    gras_msg_handle.\n
 221    \n
 222    Processes can also wait explicitely for incoming messages matching some
 223    given criterions (using \ref gras_msg_wait). Any messages received before the
 224    one matching the criterions will be added to the incomming messages'
 225    queue for further use. This may breaks the message delivery order.
 226    Moreover, there is no restriction on when this can be done. So, a
 227    callback to a given message can consume messages of other types. There is
 228    also no restriction on the criterion: you can specify a function in charge
 229    of examinating the messages either incoming or already in the queue and
 230    decide based on their meta-data (sender and message type) or their actual
 231    content whether they match your criterions.\n
 232    \n
 233    It is even possible to program processes so that they only explicitely
 234    wait for messages without using \ref gras_msg_handle to accept messages
 235    and start the callbacks associated to them. GRAS thus supports both the
 236    pure event-based programming model and the more classical message passing
 237    model.\n
 238
 239  - <b>Internal timers</b>. There is two types of timers: delayed actions and
 240    repetitive actions. The former happen only once when the delay expires
 241    while the second happen regularly each time that a period expires.\n
 242    \n
 243    Like incoming messages, timer treatments are not prehemptive. Ie, the
 244    function attached to a given timer will not start as soon as the period
 245    expires, but only when the process declares to be ready to accept
 246    incoming events. This also done in the \ref gras_msg_handle function, and
 247    expired timers are prioritaire with regard to incoming messages.
 248
 249 Messages are sent using the \ref gras_msg_send function. You should specify
 250 the receiver, the message type and the actual payload. This operation can
 251 happen at any time of your program. Message sending is not considered as a
 252 process state change, but rather as a reaction to an incoming event. It
 253 changes the state of another process, though. Trying to send messages to
 254 yourself will deadlock (althrough it may change in the future).
 255
 256 \subsection GRAS_tut_intro_model_commmodel Communication model
 257
 258 Send operations are <b>as synchronous as possible pratically</b>. They block
 259 the process until the message actually gets delivered to the receiving
 260 process. An acknoledgment is awaited in SG, and we consider the fact that RL
 261 does not the same as a bug to be fixed one day. We thus have an <b>1-port model
 262 in emission</b>. This limitation allows the framework to signal error condition
 263 to the user code in the section which asked for the transmission, without
 264 having to rely on an interuption mecanism to signal errors asynchronously.
 265 This communication model is not completely synchronous in that sense that the
 266 receiver cannot be sure that the acknoledgment has been delivered (this is the
 267 classical byzantin generals problem). Pratically, the acknoledgment is so small
 268 that there is a good probability that the message where delivered. If you need
 269 more guaranty, you will need to implement better solutions in the user space.
 270
 271 As in SimGrid v3.3, receive operations are done in a separated thread, but they
 272 are done sequentially by this thread. The model is thus <b>1-port in
 273 reception</b>, but something like 2-port in general. Moreover, the messages not
 274 matching the criterion in explicite receive (see for example \ref
 275 gras_msg_wait) are queued for further use. Thanks to this specific
 276 thread, the emission and reception are completely decorelated. Ie, the
 277 main thread can perfectly send a message while the listener is
 278 receiving something. We thus have a classical <b>1-port model</b>.
 279
 280 Here is a graphical representation of a scenario involving two processes A and
 281 B.  Both are naturally composed of two threads: the one running user code, and
 282 the listener in charge of listening incoming messages from the network. Both
 283 processes also have a queue for the communication between the two threads, even
 284 if only the queue of process B is depicted in the graph.
 285
 286 The experimental scenario is as follows: <ul>
 287
 288 <li>Process A sends a first message (depicted in red) with gras_msg_send(), do
 289     some more computation, and then send another message (depicted in
 290     yellow). Then, this process handles any incomming message with
 291     gras_msg_handle(). Since no message is already queued in process A at this
 292     point, this is a blocking call until the third message (depicted in
 293     magenta) arrives from the other process.</li>
 294
 295 <li>On its side, the process B explicitely wait for the second message with
 296     gras_msg_wait(), do some computation with it, and then call
 297     gras_msg_handle() to handle any incomming message. This will pop the red
 298     message from the queue, and start the callback attached to that kind of
 299     messages. This callback sends back a new message (depicted in magenta) back
 300     to process A.</li>
 301 </ul>
 302
 303 <img src="gras_comm.png">
 304
 305 This figure is a bit dense, and there is several point to detail here:<ul>
 306
 307 <li>The timings associated to a given data exchange are detailed for the first
 308 message. The time (1) corresponds to the network latency. That is the time to
 309 reach the machine on which B is running from the machine running on A. The time
 310 (2) is mainly given by the network bandwidth. This is the time for all bytes of
 311 the messages to travel from one machine to the other. Please note that the
 312 models used by SimGrid are a bit more complicated to keep realistic, as
 313 explained in <a href="http://www.loria.fr/~quinson/articles/simgrid-tutorial.pdf">the
 314 tutorial slides</a>, but this not that important here. The time (3) is mainly
 315 found in the SG version and not in RL (and that's a bug). This is the time to
 316 make sure that message were received on machine B. In real life, some buffering
 317 at system and network level may give the illusion to machine A that the message
 318 were already received before it's actually delivered to the listener of machine
 319 B (this would reduce the time (3)). To circumvent this, machine B should send a
 320 little acknoledgment message when it's done, but this is not implemented yet.</li>
 321
 322 <li>As you can see on the figure, sending is blocking until the message is
 323 received by the listener on the other side, but the main thread of the receiver
 324 side is not involved in this operation. Sender will get released from its send
 325 even if the main thread of receiver is occuped elsewhere.</li>
 326
 327 <li>Incomming messages not matching the expectations of a gras_msg_wait() (such
 328 as the red one) are queued for further use. The next message receiving
 329 operation will explore this queue in order, and if empty, block on the
 330 network. The order of unexpected messages and subsequent ones is thus preserved
 331 from the receiver point of view.</li>
 332
 333 <li>gras_msg_wait() and gras_msg_handle() accept timeouts as argument to
 334 specify how long you are willing to wait at most for incomming messages. These
 335 were ignored here to not complexify the example any further. It is worth
 336 mentionning that the send operation cannot be timeouted. The existance of the
 337 listener should make it useless.</li>
 338
 339 </ul>
 340
 341 \subsection GRAS_tut_intro_model_timing_policy Timing policy
 342
 343 All communication primitives allow 3 timout policies: one can only poll for
 344 incomming events (using timeout=0), wait endlessly for the communication to
 345 be performed (using timeout<0) or specify a maximal delay to wait for the
 346 communication to proceed (using timeout>0, being a number of seconds).
 347
 348 Again, this describes the targeted model. The current implementation does
 349 not allow to specify a delay for the outgoing communication. In SG, the
 350 delay is then hardcoded to 60 seconds while outgoing communication wait for
 351 ever to proceed in RL.
 352
 353 Another timing policy we plan to implement in the future is "adaptative
 354 timeouts", where the timeout is computed automatically by the framework
 355 according to performance of previous communications. This was demonstrated
 356 for example in the NWS tool.
 357
 358 \subsection GRAS_tut_intro_model_exception Error handling through exceptions
 359
 360 As explained in section \ref GRAS_tut_intro_what_simple, any function may
 361 raise exceptions breaking their execution. No support is provided by the
 362 framework to ensure that the internal state remains consistent when
 363 exceptions are raised. Changing this would imply that we are able to
 364 checkpoint the internal state to provide a transaction service, which seems
 365 quite difficult to achieve efficiently.
 366
 367 \subsection GRAS_tut_intro_model_rpc RPC messaging
 368
 369 In addition to the one-way messages described above, GRAS supports RPC
 370 communication. Using this, a client process asks for the execution of a
 371 callback on a server process. RPC types are close to regular message types:
 372 they are described by a type (a string), a payload type for the request, but
 373 in addition, they also have a payload type for the answer from the server to
 374 the client.
 375
 376 RPC can be either synchronous (the function blocks until an answer is
 377 received) or asynchronous (you send the request and wait later for the
 378 anwer). They accept the same timing policies than regular messages.
 379
 380 If the callback raises an exception on the server side, this exception will
 381 be trapped by the framework on the server side, sent back to the client
 382 side, and revived on the client side. So, if the client calls a RPC which
 383 raises an error, it will have to deal with the exception itself. No
 384 provision is given concerning the state consistency on the server side when
 385 an exception arise. The <tt>host</tt> fields of the exception structure
 386 indicates the name of the host on which it was raised.
 387
 388 The callback performing the treatment associated to a RPC can perform any
 389 kind of communication itself, including RPC. In the case where A calls a RPC
 390 on B, leading to B calling a RPC on C (ie, A->B->C), if an exception is
 391 raised on C, it will be forwarded back to A. The <tt>host</tt> field will
 392 indicate C.
 393
 394 <hr>
 395
 396 \section GRAS_tut_intro_next What's next?
 397
 398 Now that you know what GRAS is and the communication model used, it is time
 399 to move to the \ref GRAS_tut_tour section. There, you will build
 400 incrementally a full-featured GRAS application demonstrating most of the
 401 aspects of the framework.
 402
 403 */