diff --git a/doc/doxygen/module-smpi.doc b/doc/doxygen/module-smpi.doc
index 71e68a0b6d..b51f5dfe04 100644
--- a/doc/doxygen/module-smpi.doc
+++ b/doc/doxygen/module-smpi.doc
@@ -1,64 +1,79 @@
 /** @addtogroup SMPI_API
 
-This programming environment permits to study existing MPI application
-by emulating them on top of the SimGrid simulator. In other words, it
-will constitute an emulation solution for parallel codes. You don't
-even have to modify your code for that, although that may help, as
-detailed below.
+This programming environment enables the study of MPI applications by
+emulating them on top of the SimGrid simulator. This is particularly
+interesting for studying existing MPI applications within the comfort
+of the simulator. The motivation for this work is detailed in the
+reference article (available at http://hal.inria.fr/inria-00527150).
+
+Our goal is to enable the study of *unmodified* MPI applications,
+although we are not quite there yet (see @ref SMPI_what). In
+addition, you can modify your code to speed up your studies or
+otherwise increase their scalability (see @ref SMPI_adapting).
 
 \section SMPI_who Who should use SMPI (and who shouldn't)
 
-You should use this programming environment of the SimGrid suite if
-you want to study existing MPI applications. If you want to study
-some heuristics for a given problem (and if your goal is to produce
-theorems and publications, not code), have a look at the \ref MSG_API
-environment, or the \ref SD_API one if you need to use DAGs. If none
-of those programming environments fits your needs, you may consider
-implementing your own directly on top of \ref SURF_API (but you
-probably want to contact us before).
+SMPI is now considered stable and you can use it in production. You
+will probably want to read the scientific publications that detail
+the models used and their limits, but this should not be absolutely
+necessary. If you already fluently write and use MPI applications,
+SMPI should sound very familiar to you. Use smpicc instead of mpicc,
+and smpirun instead of mpirun (see below for more details).
+
+Of course, if you don't know what MPI is, the documentation of SMPI
+will seem a bit terse to you. You should pick up a good MPI tutorial
+on the Internet (or a course in your favorite university) and come
+back to SMPI once you know a bit more about MPI. Alternatively, you
+may want to turn to the other SimGrid interfaces, such as the
+\ref MSG_API environment or the \ref SD_API one.
 
 \section SMPI_what What can run within SMPI?
 
 You can run unmodified MPI applications (both C and Fortran) within
-SMPI, provided you only use MPI calls that we implemented in MPI. Our
-coverage of the interface is not bad, but will probably never be
-complete. One sided communications and I/O primitives are not targeted
-for now. The full list of not yet implemented functions is available
-in file include/smpi/smpi.h of the archive, between two lines
+SMPI, provided that (1) you only use MPI calls that we have
+implemented and (2) you don't use any global variables in your
+application.
+
+\subsection SMPI_what_coverage MPI coverage of SMPI
+
+Our coverage of the interface is very decent, but still incomplete.
+Given the size of the MPI standard, it may well be that we never
+implement absolutely all existing primitives. One-sided communications
+and I/O primitives are not targeted for now. Still, our current state
+is very decent: we pass most of the MPICH coverage tests.
+
+The full list of not yet implemented functions is documented in the
+file include/smpi/smpi.h of the archive, between two lines
 containing the FIXME marker. If you really need a missing feature,
 please get in touch with us: we can guide you through the SimGrid
 code to help you implement it, and we'd be glad to integrate it in the
 main project afterward if you contribute it back.
 
-\section SMPI_adapting Adapting your MPI code to the use of SMPI
-
-As detailed in the reference article (available at
-http://hal.inria.fr/inria-00527150), you may want to adapt your code
-to improve the simulation performance. But these tricks may seriously
-hinder the result qualtity (or even prevent the app to run) if used
-wrongly. We assume that if you want to simulate an HPC application,
-you know what you are doing. Don't prove us wrong!
+\subsection SMPI_what_globals Issues with global variables
 
-If you get short on memory (the whole app is executed on a single node
-when simulated), you should have a look at the SMPI_SHARED_MALLOC and
-SMPI_SHARED_FREE macros. It allows to share memory areas between
-processes. For example, matrix multiplication code may want to store
-the blocks on the same area. Of course, the resulting computations
-will useless, but you can still study the application behavior this
-way. Of course, if your code is data-dependent, this won't work.
+Concerning global variables, the problem comes from the fact that MPI
+processes usually run as real UNIX processes, while in SMPI they are
+all folded into threads of a single system process. Global variables,
+which are private to each MPI process in the usual setting, thus
+become shared between the processes in SMPI. This point is rather
+problematic, and it currently forces you to modify your application to
+privatize the global variables.
 
-If your application is too slow, try using SMPI_SAMPLE_LOCAL,
-SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
-be sampled. Some of the loop iterations will be executed to measure
-their duration, and this duration will be used for the subsequent
-iterations. These samples are done per processor with
-SMPI_SAMPLE_LOCAL, and shared between all processors with
-SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
-time of your loop iteration are not stable.
+We tried several techniques to work around this. We used to have a
+script that automatically privatized the globals through static
+analysis of the source code, but it was not robust enough to be used
+in production. This issue, as well as several potential solutions, is
+discussed in this article: "Automatic Handling of Global Variables for
+Multi-threaded MPI Programs",
+available at http://charm.cs.illinois.edu/newPapers/11-23/paper.pdf
+(note that this article does not deal with SMPI but with a competing
+solution called AMPI that suffers from the same issue).
 
-Yes, that's right, these macros are not documented yet, but we'll fix
-it as soon as time permits. Sorry about that -- patch welcomed!
-Meanwhile, grep for them on the examples for more information.
+Currently, we have no solution to offer you, because all the proposed
+solutions would modify the performance of your application (in the
+computational sections). Sacrificing realism for usability is not very
+satisfying, so we did not implement them yet. You will thus have to
+modify your application if it uses global variables.
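+
+To see why, consider the following minimal sketch (hypothetical: the
+variable name and the printed values are ours, not taken from a real
+application):
+
+\verbatim
+#include <mpi.h>
+#include <stdio.h>
+
+int calls = 0; // one copy per process under mpirun,
+               // but a single shared copy under smpirun
+
+int main(int argc, char *argv[])
+{
+  int rank;
+  MPI_Init(&argc, &argv);
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  calls++;  // under SMPI, every simulated process increments the same variable
+  printf("rank %d: calls = %d\n", rank, calls);
+  MPI_Finalize();
+  return 0;
+}
+\endverbatim
+
+With real MPI processes, every rank prints "calls = 1"; under SMPI, the
+ranks see each other's increments. Privatizing the variable (for
+example, moving it into a structure allocated in main and passed
+around) restores the expected behavior.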
+We are working on another solution, leveraging distributed
+simulation to keep each MPI process within a separate system process,
+but this is far from being ready at the moment.
 
 \section SMPI_compiling Compiling your code
 
@@ -93,4 +108,353 @@ by running
 smpirun -help
 \endverbatim
 
-*/
\ No newline at end of file
+\section SMPI_adapting Adapting your MPI code to the use of SMPI
+
+As detailed in the reference article (available at
+http://hal.inria.fr/inria-00527150), you may want to adapt your code
+to improve the simulation performance. But these tricks may seriously
+hinder the result quality (or even prevent the app from running) if
+used wrongly. We assume that if you want to simulate an HPC
+application, you know what you are doing. Don't prove us wrong!
+
+\section SMPI_adapting_size Reducing your memory footprint
+
+If you get short on memory (the whole app is executed on a single node
+when simulated), you should have a look at the SMPI_SHARED_MALLOC and
+SMPI_SHARED_FREE macros. They allow you to share memory areas between
+processes: the same malloc line in each process will return a pointer
+to the exact same memory area. So if you have a 2MB malloc and 16
+processes, these macros reduce your memory consumption from 32MB to
+2MB: a single block for all processes.
+
+If your program is OK with a block containing garbage values, because
+all processes read from and write to the same place without any kind
+of coordination, then this macro can dramatically shrink your memory
+consumption. For example, this is very beneficial to a matrix
+multiplication code, as all blocks will be stored in the same area. Of
+course, the resulting computations will be useless, but you can still
+study the application behavior this way.
+
+Naturally, this won't work if your code is data-dependent. For
+example, a Jacobi iterative computation depends on the results
+computed by the code to detect convergence, so turning them into
+garbage by sharing the same memory area between processes does not
+seem very wise. You cannot use the SMPI_SHARED_MALLOC macro in this
+case, sorry.
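+
+Here is a minimal sketch of the intended use (the block size and the
+function names are ours; SMPI_SHARED_MALLOC and SMPI_SHARED_FREE are
+declared in include/smpi/smpi.h):
+
+\verbatim
+#include <smpi/smpi.h>
+
+#define BLOCK_BYTES (2 * 1024 * 1024)
+
+// Allocate one 2MB matrix block per process. With 16 processes, a
+// regular malloc would consume 32MB inside the simulator, while the
+// shared version consumes 2MB in total: all processes get the same area.
+double *allocate_block(void)
+{
+  return (double *)SMPI_SHARED_MALLOC(BLOCK_BYTES);
+}
+
+void release_block(double *block)
+{
+  SMPI_SHARED_FREE(block);  // must match SMPI_SHARED_MALLOC, not free()
+}
+\endverbatim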
+
+This feature is demonstrated by the example file
+examples/smpi/NAS/DT-folding/dt.c
+
+\section SMPI_adapting_speed Toward faster simulations
+
+If your application is too slow, try using SMPI_SAMPLE_LOCAL,
+SMPI_SAMPLE_GLOBAL and friends to indicate which computation loops can
+be sampled. Some of the loop iterations will be executed to measure
+their duration, and this duration will be used for the subsequent
+iterations. These samples are done per processor with
+SMPI_SAMPLE_LOCAL, and shared between all processors with
+SMPI_SAMPLE_GLOBAL. Of course, none of this will work if the execution
+times of your loop iterations are not stable.
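+
+Here is a minimal sketch of the intended use, assuming that
+SMPI_SAMPLE_LOCAL(sample_count, threshold) wraps the block to be
+sampled (the exact macro parameters are an assumption: check
+include/smpi/smpi.h and the example file below):
+
+\verbatim
+#include <smpi/smpi.h>
+
+void compute_kernel(double *data);  // hypothetical: compute-intensive and stable
+
+void iterate(double *data, int max_iter)
+{
+  int iter;
+  for (iter = 0; iter < max_iter; iter++) {
+    // Actually run (and time) only the first iterations, then inject
+    // the measured duration into the simulated time for the others.
+    SMPI_SAMPLE_LOCAL(10, 0.05) {
+      compute_kernel(data);
+    }
+  }
+}
+\endverbatim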
+
+This feature is demonstrated by the example file
+examples/smpi/NAS/EP-sampling/ep.c
+
+
+\section SMPI_collective_algorithms Simulating collective operations
+
+MPI collective operations can be implemented very differently from one
+library to another. Actually, all existing libraries implement several
+algorithms for each collective operation, and by default select at
+runtime which one should be used for the current operation, depending
+on the sizes sent, the number of nodes, the communicator, or the
+communication library being used. These decisions are based on
+empirical results and theoretical complexity estimations, but they can
+sometimes be suboptimal. Manual selection is possible in these cases,
+to allow the user to tune the library and use a better collective if
+the default one is not good enough.
+
+SMPI tries to apply the same logic, regrouping algorithms from the
+OpenMPI and MPICH libraries, and from StarMPI (STAR-MPI).
+This collection of more than a hundred algorithms allows a simple and
+effective comparison of their behavior and performance, making SMPI a
+tool of choice for the development of such algorithms.
+
+\subsection Tracing_internals Tracing of internal communications
+
+For each collective, default tracing only outputs global data.
+Internal communication operations are not traced, to avoid outputting
+too much data to the trace. To debug and compare algorithms, this can
+be changed with the item \b tracing/smpi/internals, whose default
+value is 0. As an illustration, the traces of two alltoall collective
+runs on 16 nodes, the first with a ring algorithm and the second with
+a pairwise one, show clearly different communication patterns.
+
+\subsection Selectors Selectors
+
+The selection logic implemented by default in OpenMPI (version 1.7)
+and MPICH (version 3.0.4) has been replicated and can be used by
+setting the \b smpi/coll_selector item to either ompi or mpich. The
+code and details for each selector can be found in the
+src/smpi/colls/smpi_(openmpi/mpich)_selector.c file. As this is still
+in development, we do not guarantee that all algorithms are correctly
+replicated or that they will behave exactly like the real ones. If you
+notice a difference, please contact the SimGrid developers mailing
+list.
+
+The default selector uses the legacy algorithms of SimGrid versions
+prior to 3.10. They should not be used for performance studies, and
+may be removed in the future, with a different selector becoming the
+default.
+
+\subsection algos Available algorithms
+
+For each of the listed algorithms, several versions are available,
+coming either from the STAR-MPI, MPICH or OpenMPI implementations.
+Details can be found in the code, or in the STAR-MPI paper for the
+STAR-MPI algorithms.
+
+Each collective can be selected using the corresponding configuration
+item. For example, to use the pairwise alltoall algorithm, one should
+add \b --cfg=smpi/alltoall:pair to the command line. This will
+override the selector (for this algorithm only) if one is provided,
+allowing better flexibility.
+
+Warning: some collectives may require specific conditions to be
+executed correctly (for instance, having a communicator with a
+power-of-two number of nodes only), which are currently not enforced
+by SimGrid. Some crashes can be expected while trying these algorithms
+with unusual sizes/parameters.
+
+\subsubsection MPI_Alltoall
+
+Most of these are best described in the STAR-MPI paper; a sketch of
+the pairwise exchange pattern is also given after this list.
+
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the alltoall operations
+ - mpich : use the mpich selector for the alltoall operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - 2dmesh : organizes the nodes as a two-dimensional mesh, and performs
+allgather along the dimensions
+ - 3dmesh : adds a third dimension to the previous algorithm
+ - rdb : recursive doubling : extends the mesh to an n-th dimension,
+each one containing two nodes
+ - pair : pairwise exchange, only works for a power-of-two number of
+processes; size-1 steps; at each step, each process sends to and
+receives from the same peer
+ - pair_light_barrier : same, with small barriers between steps to avoid contention
+ - pair_mpi_barrier : same, with MPI_Barrier used
+ - pair_one_barrier : only one barrier at the beginning
+ - ring : size-1 steps; at step i, a process sends to process (n+i)%size
+and receives from (n-i)%size
+ - ring_light_barrier : same, with small barriers between some phases to avoid contention
+ - ring_mpi_barrier : same, with MPI_Barrier used
+ - ring_one_barrier : only one barrier at the beginning
+ - basic_linear : posts all receives and all sends, starts the
+communications, and waits for all of them to finish
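+
+Here is a sketch of the pairwise exchange pattern of the pair
+algorithm in plain MPI (not SMPI's actual code; a power-of-two number
+of processes is assumed, and the buffers are treated as raw bytes):
+
+\verbatim
+#include <string.h>
+#include <mpi.h>
+
+// At step i, rank r exchanges one block with peer r XOR i; the local
+// block is copied directly, then the size-1 exchange steps follow.
+void alltoall_pair_sketch(char *sendbuf, char *recvbuf, int block,
+                          MPI_Comm comm)
+{
+  int rank, size, step;
+  MPI_Comm_rank(comm, &rank);
+  MPI_Comm_size(comm, &size);
+  memcpy(recvbuf + rank * block, sendbuf + rank * block, block);
+  for (step = 1; step < size; step++) {
+    int peer = rank ^ step;  // symmetric: the peer's peer is this rank
+    MPI_Sendrecv(sendbuf + peer * block, block, MPI_CHAR, peer, 0,
+                 recvbuf + peer * block, block, MPI_CHAR, peer, 0,
+                 comm, MPI_STATUS_IGNORE);
+  }
+}
+\endverbatim
+
+The light_barrier, mpi_barrier and one_barrier variants only differ in
+how they separate these steps.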
+
+\subsubsection MPI_Alltoallv
+
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the alltoallv operations
+ - mpich : use the mpich selector for the alltoallv operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - bruck : same as alltoall
+ - pair : same as alltoall
+ - pair_light_barrier : same as alltoall
+ - pair_mpi_barrier : same as alltoall
+ - pair_one_barrier : same as alltoall
+ - ring : same as alltoall
+ - ring_light_barrier : same as alltoall
+ - ring_mpi_barrier : same as alltoall
+ - ring_one_barrier : same as alltoall
+ - ompi_basic_linear : same as alltoall
+
+
+\subsubsection MPI_Gather
+
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the gather operations
+ - mpich : use the mpich selector for the gather operations
+ - automatic (experimental) : use an automatic self-benchmarking
+algorithm, which will iterate over all implemented versions and output
+the best
+ - ompi_basic_linear : basic linear algorithm from openmpi, each process sends to the root
+ - ompi_binomial : binomial tree algorithm
+ - ompi_linear_sync : same as basic linear, but with a synchronization
+at the beginning and the message cut into two segments
+
+\subsubsection MPI_Barrier
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the barrier operations
+ - mpich : use the mpich selector for the barrier operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - ompi_basic_linear : all processes send to root
+ - ompi_two_procs : special case for two processes
+ - ompi_bruck : nsteps = sqrt(size); at each step, exchange data with rank-2^k and rank+2^k
+ - ompi_recursivedoubling : recursive doubling algorithm
+ - ompi_tree : recursive doubling type algorithm, with tree structure
+ - ompi_doublering : double ring algorithm
+
+
+\subsubsection MPI_Scatter
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the scatter operations
+ - mpich : use the mpich selector for the scatter operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - ompi_basic_linear : basic linear scatter
+ - ompi_binomial : binomial tree scatter
+
+
+\subsubsection MPI_Reduce
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the reduce operations
+ - mpich : use the mpich selector for the reduce operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - arrival_pattern_aware : root exchanges with the first process to arrive
+ - binomial : uses a binomial tree
+ - flat_tree : uses a flat tree
+ - NTSL : non-topology-specific pipelined linear function (0->1, 1->2,
+2->3, ..., up to the last node), in a pipeline fashion, with segments
+of 8192 bytes
+ - scatter_gather : scatter then gather
+ - ompi_chain : the openmpi reduce algorithms are built on the same
+basis, but the topology is generated differently for each flavor;
+chain = chain with spacing of size/2, and segment size of 64KB
+ - ompi_pipeline : same with pipeline (chain with spacing of 1),
+segment size depends on the communicator size and the message size
+ - ompi_binary : same with binary tree, segment size of 32KB
+ - ompi_in_order_binary : same with binary tree, enforcing order on the
+operations
+ - ompi_binomial : same with binomial algorithm (redundant with the
+default binomial one in most cases)
+ - ompi_basic_linear : basic algorithm, each process sends to root
+
+\subsubsection MPI_Allreduce
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the allreduce operations
+ - mpich : use the mpich selector for the allreduce operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - lr : logical ring reduce-scatter then logical ring allgather
+ - rab1 : variation of the Rabenseifner algorithm : reduce_scatter then allgather
+ - rab2 : variation of the Rabenseifner algorithm : alltoall then allgather
+ - rab_rsag : variation of the Rabenseifner algorithm : recursive
+doubling reduce_scatter then recursive doubling allgather
+ - rdb : recursive doubling (the pattern is sketched after this list)
+ - smp_binomial : binomial tree with SMP : 8 cores/SMP, binomial
+intra-SMP reduce, inter-SMP reduce, inter-SMP broadcast, then
+intra-SMP broadcast
+ - smp_binomial_pipeline : same with segment size = 4096 bytes
+ - smp_rdb : 8 cores/SMP, intra : binomial allreduce, inter : recursive
+doubling allreduce, intra : binomial broadcast
+ - smp_rsag : 8 cores/SMP, intra : binomial allreduce, inter :
+reduce-scatter then allgather, intra : binomial broadcast
+ - smp_rsag_lr : 8 cores/SMP, intra : binomial allreduce, inter :
+logical ring reduce-scatter then logical ring allgather, intra :
+binomial broadcast
+ - smp_rsag_rab : 8 cores/SMP, intra : binomial allreduce, inter : rab
+reduce-scatter then rab allgather, intra : binomial broadcast
+ - redbcast : reduce then broadcast, using default or tuned algorithms if specified
+ - ompi_ring_segmented : ring algorithm used by OpenMPI
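+
+As an illustration of the rdb entry above, here is a sketch of the
+recursive doubling pattern in plain MPI (not SMPI's actual code; a
+power-of-two number of processes and a commutative operation are
+assumed):
+
+\verbatim
+#include <mpi.h>
+
+// Every rank holds a partial value; after log2(size) exchange-and-combine
+// steps with peer rank XOR 2^k, every rank holds the full sum.
+void allreduce_rdb_sketch(double *value, MPI_Comm comm)
+{
+  int rank, size, mask;
+  MPI_Comm_rank(comm, &rank);
+  MPI_Comm_size(comm, &size);
+  for (mask = 1; mask < size; mask <<= 1) {
+    int peer = rank ^ mask;
+    double remote;
+    MPI_Sendrecv(value, 1, MPI_DOUBLE, peer, 0,
+                 &remote, 1, MPI_DOUBLE, peer, 0,
+                 comm, MPI_STATUS_IGNORE);
+    *value += remote;  // commutative operation, so the order does not matter
+  }
+}
+\endverbatim
+
+Real implementations also handle non-power-of-two sizes and
+non-commutative operations, which this sketch ignores.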
+
+\subsubsection MPI_Reduce_scatter
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the reduce_scatter operations
+ - mpich : use the mpich selector for the reduce_scatter operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - ompi_basic_recursivehalving : recursive halving version from OpenMPI
+ - ompi_ring : ring version from OpenMPI
+ - mpich_pair : pairwise exchange version from MPICH
+ - mpich_rdb : recursive doubling version from MPICH
+ - mpich_noncomm : only works for a power-of-two number of processes,
+recursive doubling for noncommutative ops
+
+
+\subsubsection MPI_Allgather
+
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the allgather operations
+ - mpich : use the mpich selector for the allgather operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - 2dmesh : see alltoall
+ - 3dmesh : see alltoall
+ - bruck : described by Bruck et al. in "Efficient algorithms for
+all-to-all communications in multiport message-passing systems"
+ - GB : Gather - Broadcast (uses tuned version if specified)
+ - loosely_lr : Logical Ring with grouping by core (hardcoded, default
+processes/node: 4)
+ - NTSLR : Non Topology Specific Logical Ring
+ - NTSLR_NB : Non Topology Specific Logical Ring, Non-Blocking operations
+ - pair : see alltoall
+ - rdb : see alltoall
+ - rhv : only for a power-of-two number of processes
+ - ring : see alltoall
+ - SMP_NTS : gather to the root of each SMP node, then every root posts
+an inter-SMP Sendrecv, then an intra-SMP Bcast is done for each
+received message, using a logical ring algorithm (hardcoded, default
+processes/SMP: 8)
+ - smp_simple : gather to the root of each SMP node, then every root
+posts an inter-SMP Sendrecv, then an intra-SMP Bcast is done for each
+received message, using a simple algorithm (hardcoded, default
+processes/SMP: 8)
+ - spreading_simple : from node i, the order of communications is
+i -> i+1, i -> i+2, ..., i -> (i+p-1) % P
+ - ompi_neighborexchange : Neighbor Exchange algorithm for allgather,
+described by Chen et al. in "Performance Evaluation of Allgather
+Algorithms on Terascale Linux Cluster with Fast Ethernet"
+
+
+\subsubsection MPI_Allgatherv
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the allgatherv operations
+ - mpich : use the mpich selector for the allgatherv operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - GB : Gatherv - Broadcast (uses tuned version if specified, but only
+for Bcast; gatherv is not tuned)
+ - pair : see alltoall
+ - ring : see alltoall
+ - ompi_neighborexchange : see allgather
+ - ompi_bruck : see allgather
+ - mpich_rdb : recursive doubling algorithm from MPICH
+ - mpich_ring : ring algorithm from MPICH - performs differently from
+the STAR-MPI one
+
+\subsubsection MPI_Bcast
+ - default : naive one, by default
+ - ompi : use the openmpi selector for the bcast operations
+ - mpich : use the mpich selector for the bcast operations
+ - automatic (experimental) : use an automatic self-benchmarking algorithm
+ - arrival_pattern_aware : root exchanges with the first process to arrive
+ - arrival_pattern_aware_wait : same with slight variation
+ - binomial_tree : binomial tree exchange (the pattern is sketched
+after this list)
+ - flattree : flat tree exchange
+ - flattree_pipeline : flat tree exchange, message split into 8192-byte pieces
+ - NTSB : Non-topology-specific pipelined binary tree with 8192-byte pieces
+ - NTSL : Non-topology-specific pipelined linear with 8192-byte pieces
+ - NTSL_Isend : Non-topology-specific pipelined linear with 8192-byte
+pieces, asynchronous communications
+ - scatter_LR_allgather : scatter followed by logical ring allgather
+ - scatter_rdb_allgather : scatter followed by recursive doubling allgather
+ - arrival_scatter : arrival pattern aware scatter-allgather
+ - SMP_binary : binary tree algorithm with 8 cores/SMP
+ - SMP_binomial : binomial tree algorithm with 8 cores/SMP
+ - SMP_linear : linear algorithm with 8 cores/SMP
+ - ompi_split_bintree : binary tree algorithm from OpenMPI, with
+message split in 8192-byte pieces
+ - ompi_pipeline : pipeline algorithm from OpenMPI, with message split
+in 128KB pieces
+
+
+\subsection auto Automatic evaluation
+
+(Warning: this is experimental and may crash or be removed at any time.)
+
+An automatic version is available for each collective (or even as a
+selector). This specific version will loop over all other implemented
+algorithms for this particular collective, applying them while
+benchmarking the time taken for each process. It will then output the
+quickest for each process, and the globally quickest one. This is
+still unstable, and a few algorithms which need a specific number of
+nodes may crash.
+
+
+\subsection add Add an algorithm
+
+To add a new algorithm, one should check in the src/smpi/colls folder
+how other algorithms are coded. Plain MPI code cannot be used inside
+SimGrid, so algorithms have to be changed to use the SMPI versions of
+the calls instead (MPI_Send becomes smpi_mpi_send). Some functions may
+have different signatures than their MPI counterparts; please check
+the other algorithms or contact us through the SimGrid developers
+mailing list.
+
+Example: adding a "pair" version of the Alltoall collective.
+
+ - Implement it in a file called alltoall-pair.c in the src/smpi/colls
+folder. This file should include colls_private.h.
+
+ - The name of the new algorithm function should be
+smpi_coll_tuned_alltoall_pair, with the same signature as MPI_Alltoall
+(a hypothetical skeleton is sketched after this list).
+
+ - Once the adaptation to SMPI code is done, add a reference to the
+file ("src/smpi/colls/alltoall-pair.c") in the SMPI_SRC part of the
+DefinePackages.cmake file inside buildtools/cmake, to allow the file
+to be built and distributed.
+
+ - To register the new version of the algorithm, simply add a line to
+the corresponding macro in src/smpi/colls/colls.h (add a
+"COLL_APPLY(action, COLL_ALLTOALL_SIG, pair)" to the COLL_ALLTOALLS
+macro). The algorithm should now be compiled and selected when using
+--cfg=smpi/alltoall:pair at runtime.
+
+ - To add a test for the algorithm inside SimGrid's test suite, just
+add the new algorithm name in the ALLTOALL_COLL list found inside
+buildtools/cmake/AddTests.cmake. When running ctest, a test for the
+new algorithm should be generated and executed. If it does not pass,
+please check your code or contact us.
+
+ - Feel free to push this new algorithm to the SMPI repository using Git.
+
+
+*/