CSERD


  • userhome
  • catalog
  • resources
  • help

Basic MPI Tutorial


Shodor > CSERD > Resources > Tutorials > Basic MPI Tutorial

  


Basic MPI

It is often said that there are two views of MPI. One view is that MPI is a lightweight protocol with only 6 commands. The other view is that it is a in depth protocol with hundreds of specialized commands.

This document is for the 6 command people.

The 6 Commands

  • MPI_Init
  • MPI_Comm_size
  • MPI_Comm_rank
  • MPI_Send
  • MPI_Recv
  • MPI_Finalize

In short, set up an MPI program, get the number of processes participating in the program, determine which of those processes corresponds to the one calling the command, send messages, receive messages, and stop participating in a parallel program.

MPI_Init(int *argc, char ***argv)

Takes the command line arguments to a program, checks for any MPI options, and passes remaining command line arguments to the main program.

MPI_Comm_size( MPI_Comm comm, int *size )

Determines the size of a given MPI Communicator. A communicator is a set of processes that work together. For typical programs this is the default MPI_COMM_WORLD, which is the communicator for all processes available to an MPI program.

MPI_Comm_rank( MPI_Comm comm, int *rank )

Determine the rank of the current process within a communicator. Typically, if a MPI program is being run on N processes, the communicator would be MPI_COMM_WORLD, and the rank would be an integer from 0 to N-1.

MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )

Send the contents of buf, which contains count elements of type datatype to a process of rank dest in the communicator comm, flagged with the message tag. Typically, the communicator is MPI_COMM_WORLD.

MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )

Read into buf count values of type datatype from process source in communicator comm if a message is sent flagged with tag. Also receive information about the transfer into status.

MPI_Finalize()

Handles anything that the current MPI protocol will need to do before exiting a program. Typically should be the final or near final line of a program.

Deadlock - what happens if you don't match your sends and receives?

Suppose you have an MPI code where two processes are going to "trade" some information. Each process wants to send a message to the other, who will receive it. If you have each process send and then receive, it may happen that the send fills the computers' buffers before the send is complete. Without a receive in place to start clearing the computers' buffers, the program simply stops running. It typically will not crash, but just hang.

If you receive first, then send, then it is very likely that the program will hang when trying to receive information that has not yet been sent.

This assumes you are using what are called "blocking" sends and receives, that is, sends and receives that will not let the program proceed until the message has bent sent or received. ( Look here for a description of non-blocking communications.)

The phenomenon of a "stopped" program waiting for a send or receive that will never happen is known as "deadlock".

Click the following link for an example code which can be used to demonstrate deadlock.

Putting it into practice

As an example of how to implement MPI in a simple program, consider the following example. You have a code to simulate the spreading of a fire in a large forest. The simulation is a Monte Carlo simulation, and every run is different. You want to compile average results for a large number of runs per input parameter, for a large number of input parameters, and you want to parallelize this across N machines. (For an example of a serial code with visualization, try Interactivate's Fire Applet.)

The Model

Using the serial code fire.c, you can run a single instance of a forest fire simulation in which a forest is modeled as an NxN grid of trees. One tree starts to smolder, and each iteration nearby trees have some chance of catching fire. The model follows the following rules:

  • Burning trees burn down.
  • Smoldering trees catch fire.
  • Unburnt trees next to (N, S, E, W) a burning tree catch fire with some constant probability.
  • Repeat until fire burns out.

The main input parameter for the model is the chance of the fire spreading from a burning tree to a nearby unburnt tree.

The main output for the model is the percentage of additional trees burned beyond the first tree.

The desired outcome of the parallel version is to produce a plot of average percent burns as a function of probability of spreading, as quickly and as accurately as possible. This should take into account that the probability of the fire spreading will affect not only how long it takes for the fire to burn out but also the number of iterations required to reached an accurate representation of the average.

One Approach

DISCLAIMER: This is not presented as the optimum solution to this problem, but an example. It is left for the reader to attempt to find more efficient ways of parallelizing the problem

The most straight forward, but not the most efficient approach, would be to assume that Niter iterations could be run on Nprob probabilities, and that each process would perform a calculation for an equal number of probabilities.

Each process would run the exact same program, and each process would determine which subset of the range of probabilities to use as input based on its rank and the size of MPI_COMM_WORLD. The process with rank 0 would compile all results and output them to the screen.

The pseudocode for the program would be

  • Start program
  • Choose subset of work
  • For prob = min to max do
    • burn forest
    • sum+=Percent burned
  • average = sum / n_prob
  • if (rank = 0) then
    • receive averages
    • print output
  • else
    • send averages

The main() routine

int main() {
    // initial conditions and variable definitions
    int forest_size=20;
    double prob_spread=0.5;
    int **forest;
    int i;

    // setup problem
    forest=allocate_forest(forest_size);
    initialize_forest(forest_size,forest);
    light_tree(forest_size,forest,10,10);

    // burn until fire is gone
    while(forest_is_burning(forest_size,forest)) {
        forest_burns(forest_size,forest,prob_spread);
    }

    // print output and clean up
    print_forest(forest_size,forest);
    delete_forest(forest_size,forest);

}

Looking at the main routine, what will need to be changed to parallelize this? Notice that each process will need to allocate and free memory. No process will need to print a picture of the forest being burned. Each process will have to initialize the array, light a tree on fire, and run through the while loop multiple times. Each process will need to compute an average of those runs. Each process will need to do this for multiple probabilities.

One natural step might be to put the process of initializing data and calculating a single run into a subroutine (fire_2.c).

The next step might be to put a loop for a number of trials around that subroutine (fire_3.c).

The third step might be to introduce a loop over probability (fire_4.c).

All of the previous steps have been setting up the code to make it ready to be parallelized. The code has been structured so that the main loop consists of processes which can be run concurrently. What remains is to put in our standard start and finish routines, to determine the starting and finishing probability for each process, and to have the rank 0 process collect and output the data (fire_mpi.c).

On a test cluster using 600 trials and 100 probabilities, on 3 2GHz Pentium 4 Dells running off of the Bootable Cluster CD, a speedup of 1.8 was seen for 3 machines, giving an efficiency of 62%.

This is not the optimum solution.

The low efficiency of this solution lies in the difference in running times for models which burn out quickly and models which burn slowly. In this case, one process gets almost all probabilities which burn out immediately, one process gets almost all processes which burn for a long period of time, and one process gets almost all processes which burn for an intermediate period of time.

What improvements could you suggest for improving the efficiency of this code?

Non-mpi fire code with Makefile and X-Windows graph


Suggestion from Kay Kussman, McNeese State University, Math, Computer Science, and Statistics at NCSI PVAMU Workshop, July 2003: One could keep the outer loop over probability, but have processes take each in a "round-robin". For three processes, process 0 would perform probability 0, 3, 6, ..., process 1 would perform probability 1, 4, 7, ..., and process 2 would compute probability 2, 5, 8, ..., and so on. This was implemented on the same 3 node cluster as above, with the same parameters. Speedup for this method was 2.8, and efficiency 92% (fire_mpi_2.c).

Solution discussed at the NCSI workshop at Oklahoma University, August 2004: One could also split the job up so that each of the nodes do some of the trials, but for every probability. An implementation of that can be found at fire_mpi_3.c. A FORTRAN version can be found at fire_mpi.f and fire.inc.


©1994-2014 Shodor