Profiling CABLE with IPM

by Scott Wales

A recent request to the CMS helpdesk climate_help@nci.org.au asked us to look at why the new Soil-Litter-Iso (SLI) scheme in the CABLE land surface model was performing slowly, and to see what we could do to improve its performance.

Investigating this gives us a great chance to explore some of the profiling tools that are available to everyone at NCI. These tools allow us to measure and reason about why a model is running slowly and what steps we might be able to take to get it to run faster.

If you'd like to follow along and have access to the CABLE repository, check out the branch `https://trac.nci.org.au/svn/cable/branches/Users/saw562/profile`. Run `make` in the top directory to build the model in a couple of test configurations; the `run` directory has run scripts for a variety of profiling tools.
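
As a rough sketch, the checkout and build steps look something like this (the local directory name is just an example):

```
# Check out the profiling branch (requires access to the CABLE repository)
svn checkout https://trac.nci.org.au/svn/cable/branches/Users/saw562/profile cable-profile
cd cable-profile

# Build the test configurations from the top directory
make

# The run scripts for the various profiling tools live here
ls run/
```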

When profiling a model there are two objectives you need to keep in mind.

  • Firstly, the profiled run needs to be representative of the real run - for example, you don't want to be measuring just the model start-up time and a single time-step.
  • Secondly, you need to be able to respond to measurements quickly, since we'll be doing quite a few runs - you don't want a configuration that takes a day to run.

Ideally you want a run that covers all the functions of your production run, including the input and output. It should be compiled with optimisations, and the total run time should be on the order of five minutes.

For our example we'll start by running CABLE with MPI enabled for a year of model time, driven by GSWP input data. With 8 CPUs this takes a bit under 2 minutes to run with SLI disabled.

The first profiling tool we normally use is the Integrated Performance Monitoring (IPM) tool. This is simple to use and provides basic but extremely valuable information on an MPI program. However, it only works with models compiled with Open MPI.

IPM gives a high-level overview of MPI communications in a model run. It shows how much time is spent in calculation and communication for each rank, the percentage of communication time spent in each MPI function (recv, bcast, etc.), and which processors are talking to each other.

Using IPM is a simple matter of loading the `ipm/2.0.5` module before calling `mpirun`.
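
For example, a minimal job script fragment might look like the following (the executable name and rank count here are placeholders - the real commands are in the `run-ipm.sh` script described below):

```
# Load Open MPI and the IPM module before launching the model
module load openmpi
module load ipm/2.0.5

# Run the MPI build of CABLE as normal; IPM gathers its statistics automatically
mpirun -np 8 ./cable-mpi
```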

If following along, try submitting the `run-ipm.sh` script from within the `run` directory. This will create a file starting with your user name, followed by a string of numbers, with the extension `.ipm.xml`.

The `ipm_view` command run on this output file allows you to see graphs of aggregate MPI statistics, such as how much the communication varies across ranks and the amount of time spent in each MPI function. The communication load can vary depending on the model decomposition, so it's a good idea to use the same number of processors as a normal run.
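
For example (the file name here is illustrative - use whatever your run produced):

```
# Open the IPM viewer on the XML output from the profiled run
module load ipm/2.0.5
ipm_view abc123.xxxxxxxxxx.xxxxxx.ipm.xml
```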

For CABLE with eight processors, IPM shows that most of the ranks are roughly equal, with rank 0 doing less computation and spending more time in system calls (e.g. reading and writing files) than the other ranks. About half of the total communication time is spent in the blocking `MPI_Send()` function, and a quarter in `MPI_Recv()`.

When the SLI scheme is enabled, by setting `cable_user%SOIL_STRUC` to `'sli'` in the namelist file `cable.nml`, we need to use a larger number of MPI ranks to get a result in a reasonable amount of time. Unlike the non-SLI case - where all the ranks were in the same ballpark even though rank 0 did less computation - with SLI enabled rank 0 has barely any work to do compared to the other ranks.
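
The change to `cable.nml` is a single setting (a sketch - the rest of the namelist stays as it is in your configuration):

```
! in cable.nml, inside the existing namelist group:
cable_user%SOIL_STRUC = 'sli'   ! switch from the default soil scheme to Soil-Litter-Iso
```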

A lot more time is spent in `MPI_Recv()` as well - around 80% of the total communication time. Perhaps this is where rank 0 is spending most of its time, waiting for data from the other nodes.

Let’s see if we can solve the imbalance.

In CABLE, rank 0 handles all of the input and output, while the other ranks do the computation. Since the IO rank is sitting around waiting, perhaps adding more compute ranks will help.

By performing successive runs, each time doubling the number of processors, we find that all the ranks are balanced at around 128 processors. This is also the point where the run's walltime stops decreasing linearly with the number of processors, so we know that the compute ranks must now be spending some of their time waiting for rank 0.
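
A simple way to do this scaling test is to submit one job per processor count, doubling each time (a sketch - it assumes the run script picks up the rank count from the PBS `ncpus` resource, which may not match how `run-ipm.sh` is actually set up):

```
# Submit one IPM-instrumented run for each processor count
for ncpus in 8 16 32 64 128 256; do
    qsub -l ncpus=$ncpus run-ipm.sh
done
```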

Blocking sends and receives are taking up the vast majority of communications time, so it's likely that some of the communications algorithms can be improved. To go much further we'll need to see a timeline of all of the MPI functions, which we can get using an MPI profiler like ITAC or Vampir.

After running the model through IPM we now know that while the SLI scheme is much more computationally expensive than the default soil scheme, it scales much better to higher numbers of processors (try running 128 processors with SLI disabled and check the IPM graph). This is because CABLE uses a multiple-program, multiple-data (MPMD) MPI model, with rank 0 performing a separate function from the other ranks. Having two different programs running means the load between them needs to be balanced for maximum efficiency - we see this in coupled models as well.
