Extend the description of the IB configuration

diff --git a/docs/source/Configuring_SimGrid.rst b/docs/source/Configuring_SimGrid.rst
index 08e9d63..a5497a8 100644
--- a/docs/source/Configuring_SimGrid.rst
+++ b/docs/source/Configuring_SimGrid.rst
@@ -347,7 +347,6 @@ and you should use the last one, which is the maximal size.
     cat /proc/sys/net/ipv4/tcp_rmem # gives the receiver window
     cat /proc/sys/net/ipv4/tcp_wmem # gives the sender window
  
-.. _cfg=smpi/IB-penalty-factors:
  .. _cfg=network/bandwidth-factor:
  .. _cfg=network/latency-factor:
  .. _cfg=network/weight-S:
@@ -370,15 +369,63 @@ exchange.  By default SMPI uses factors computed on the Stampede
  Supercomputer at TACC, with optimal deployment of processes on
  nodes. Again, only hardcore experts should bother about this fact.
  
-InfiniBand network behavior can be modeled through 3 parameters
-``smpi/IB-penalty-factors:"βe;βs;γs"``, as explained in `this PhD
-thesis
-<http://mescal.imag.fr/membres/jean-marc.vincent/index.html/PhD/Vienne.pdf>`_.
  
  .. todo:: This section should be rewritten, and actually explain the
           options network/bandwidth-factor, network/latency-factor,
           network/weight-S.
  
+.. _cfg=smpi/IB-penalty-factors:
+
+InfiniBand model
+^^^^^^^^^^^^^^^^
+
+InfiniBand network behavior can be modeled through 3 parameters
+``smpi/IB-penalty-factors:"βe;βs;γs"``, as explained in `this PhD
+thesis
+<http://mescal.imag.fr/membres/jean-marc.vincent/index.html/PhD/Vienne.pdf>`_ (in French)
+or more concisely in `this paper <https://hal.inria.fr/hal-00953618/document>`_,
+even though that paper only describes models for Myrinet and Ethernet.
+Its Fig. 2 nevertheless shows some results for InfiniBand. This model
+may be outdated for modern InfiniBand hardware, so a new
+validation would be welcome.
+
+The three parameters are defined as follows:
+
+- βs: penalty factor for outgoing messages. It is computed by sending from one
+  node to two nodes at once, measuring the slowdown compared to a single send
+  to one node, and dividing that slowdown by 2.
+- βe: penalty factor for incoming messages. It is computed the same way, but
+  with one node receiving several messages.
+- γr (written γs in the option string above): slowdown factor when the
+  communication buffer memory is saturated. Computing it requires a more
+  complicated communication pattern (5.3 in the thesis, page 107). The
+  resulting formula is γr = time(c)/(3×βe×time(ref)), where time(ref) is the
+  time of a single communication with no contention.
+
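As a rough illustration of the calibration arithmetic described above, here is a minimal Python sketch. The function names, the timing arguments and the choice of dividing the incoming slowdown by the number of simultaneous messages are assumptions made for this example; they are not part of SimGrid or of the thesis tooling, and the timings are made up.

```python
def beta_s(time_two_sends, time_ref):
    """Outgoing penalty: slowdown of one node sending to two nodes, divided by 2."""
    return (time_two_sends / time_ref) / 2


def beta_e(time_n_recvs, time_ref, n_messages=2):
    """Incoming penalty, assuming the slowdown is divided by the number of messages."""
    return (time_n_recvs / time_ref) / n_messages


def gamma_r(time_c, beta_e_value, time_ref):
    """Saturation factor: time(c) / (3 x beta_e x time(ref))."""
    return time_c / (3 * beta_e_value * time_ref)


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    t_ref = 1.0
    bs = beta_s(1.85, t_ref)
    be = beta_e(1.93, t_ref)
    gr = gamma_r(3.9, be, t_ref)
    # Values are printed in the order expected by the option: βe;βs;γs
    print(f'smpi/IB-penalty-factors:"{be:.3f};{bs:.3f};{gr:.3f}"')
```

The three measured times would come from real benchmark runs on the target cluster; the sketch only shows how they combine into the three factors.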
+Once these values are computed, a penalty is assessed for each message (this is
+the part implemented in the simulator) as shown on page 106 of the thesis. Here is
+a simple translation of this text. First, some notation:
+
+- ∆e(e): the incoming degree of node e, that is, the number of
+  communications whose destination is node e.
+- ∆s(s): the outgoing degree of node s, that is, the number of
+  communications sent by node s.
+- Φ(e): the number of communications destined for node e but coming from a
+  different node.
+- Ω(s, e): the number of messages going from node s to node e. If node e only
+  receives communications from different nodes, then Φ(e) = ∆e(e). On the
+  other hand, if for example three messages go from node s to node e,
+  then Φ(e) ≠ ∆e(e) and Ω(s, e) = 3.
+
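To make the notation concrete, the following Python sketch counts the four quantities for a list of active communications. It is only an illustration: the name ``ib_degrees`` and the representation of a communication as a ``(sender, receiver)`` tuple are hypothetical, and Φ(e) is read here as the number of distinct nodes sending to e, which is an interpretation consistent with the three-message example above.

```python
from collections import Counter


def ib_degrees(comms):
    """Count the notation quantities for a list of (sender, receiver) tuples."""
    delta_s = Counter(s for s, _ in comms)    # ∆s(s): communications sent by s
    delta_e = Counter(e for _, e in comms)    # ∆e(e): communications received by e
    omega = Counter(comms)                    # Ω(s, e): messages from s to e
    phi = Counter(e for _, e in set(comms))   # Φ(e): distinct senders (assumed reading)
    return delta_s, delta_e, phi, omega


if __name__ == "__main__":
    # Three messages from s to e, plus one message from a to e.
    comms = [("s", "e")] * 3 + [("a", "e")]
    delta_s, delta_e, phi, omega = ib_degrees(comms)
    print(delta_e["e"], phi["e"], omega[("s", "e")])
```

With these inputs, Φ(e) differs from ∆e(e) because node s contributes several messages to the same destination.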
+To determine the penalty for a communication, two values need to be calculated.
+The first is the penalty caused by the conflict in transmission, denoted ps:
+
+- if ∆s(i) = 1 then ps = 1
+- if ∆s(i) ≥ 2 and ∆e(i) ≥ 3 then ps = ∆s(i) × βs × γr
+- else, ps = ∆s(i) × βs
+
+Then, the penalty caused by the conflict in reception, denoted pe, is computed as follows:
+
+- if ∆e(i) = 1 then pe = 1
+- else, pe = Φ(e) × βe × Ω(s, e)
+
+Finally, the penalty associated with the communication is:
+p = max (ps ∈ s, pe)
+
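The rules above can be sketched in Python as follows. This is a hedged illustration of the text, not the simulator's code: the function name ``ib_penalties`` and the tuple representation are hypothetical, ∆s is taken at the sending node and ∆e at the receiving node of each communication, Φ(e) is read as the number of distinct senders to e, and "ps ∈ s" is read as the largest ps among all communications sent by s. All four readings are assumptions.

```python
from collections import Counter


def ib_penalties(comms, beta_e, beta_s, gamma_r):
    """Return one penalty per (sender, receiver) tuple, in the order of comms."""
    delta_s = Counter(s for s, _ in comms)    # outgoing degree per sender
    delta_e = Counter(e for _, e in comms)    # incoming degree per receiver
    omega = Counter(comms)                    # messages per (sender, receiver) pair
    phi = Counter(e for _, e in set(comms))   # distinct senders per receiver (assumed)

    def p_s(s, e):
        if delta_s[s] == 1:
            return 1.0
        if delta_e[e] >= 3:                   # delta_s[s] >= 2 holds here
            return delta_s[s] * beta_s * gamma_r
        return delta_s[s] * beta_s

    def p_e(s, e):
        if delta_e[e] == 1:
            return 1.0
        return phi[e] * beta_e * omega[(s, e)]

    # Largest transmission penalty among the communications of each sender.
    max_ps = {}
    for s, e in comms:
        max_ps[s] = max(max_ps.get(s, 0.0), p_s(s, e))

    return [max(max_ps[s], p_e(s, e)) for s, e in comms]


if __name__ == "__main__":
    # Illustrative factor values, given in the option's order βe;βs;γs.
    be, bs, gs = (float(x) for x in "0.965;0.925;1.35".split(";"))
    print(ib_penalties([("a", "b"), ("a", "c")], be, bs, gs))
```

A penalty of 1 means the communication is not slowed down; larger values indicate more contention under this model.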
  .. _cfg=network/crosstraffic:
  
  Simulating Cross-Traffic
@@ -1082,6 +1129,7 @@ https://framagit.org/simgrid/platform-calibration/
  https://simgrid.org/contrib/smpi-saturation-doc.html
  
  .. _cfg=smpi/display-timing:
+
  Reporting Simulation Time
  .........................
  
@@ -1099,8 +1147,9 @@ in application code and in SMPI internals, to provide hints about the
  need to use sampling to reduce simulation time.
  
  .. _cfg=smpi/display-allocs:
+
  Reporting memory allocations
-.........................
+............................
  
  **Option** ``smpi/display-allocs`` **Default:** 0 (false)
  
@@ -1457,8 +1506,9 @@ Then, you can pass the option
  actually activate the huge page support in shared mallocs.
  
  .. _cfg=smpi/auto-shared-malloc-thresh:
+
  Automatically share allocations
-.........................
+...............................
  
  **Option** ``smpi/auto-shared-malloc-thresh:`` **Default:** 0 (false)
     This value in bytes represents the size above which all allocations