Parallelization in Octopus

Octopus can run calculations in parallel to leverage the computing power of clusters, and it can also utilize GPUs, which is covered in another tutorial. Octopus employs a hybrid parallelization of MPI and OpenMP to speed up calculations. Using MPI means that the code follows the "distributed-memory" paradigm: the data the code works on (in our case the Kohn-Sham wavefunctions) are distributed over the processes, and each process works on its local part of the data. The OpenMP parallelization follows the "shared-memory" paradigm, where work is distributed among threads that can all access the same data.

In addition to the parallelization, Octopus employs vectorization, which is the capability of modern CPUs to execute several floating-point operations (additions and multiplications) in one instruction. How the data structures are designed to make use of vectorization is part of an advanced tutorial.

Parallelization strategies

To distribute the data and the computations with MPI, Octopus follows different strategies that are orthogonal to each other and that can be combined:

- Parallelization in domains: the real-space grid is divided into domains and each process works on the points of its own domain; points near the domain boundaries (ghost points) have to be exchanged with neighbouring domains.
- Parallelization in states: the Kohn-Sham states are distributed over the processes.
- Parallelization in k-points: for periodic systems, the k-points can be distributed over the processes.

In addition to the MPI strategies, OpenMP can be used to exploit ‘‘shared memory’’ parallelization. When using OpenMP, each MPI process spawns a certain number of OpenMP threads to parallelize computations and all threads have access to the same memory of the process. In Octopus, this is used to parallelize loops over the grid points, so it can be an alternative or addition to the parallelization in domains.

First parallel runs

Ground state

For our first tests with parallel runs, we use the following input file, similar to the time-dependent propagation tutorial:


CalculationMode = gs
UnitsOutput = eV_Angstrom
FromScratch = yes

Radius = 3.5*angstrom
Spacing = 0.18*angstrom

CH = 1.097*angstrom
%Coordinates
  "C" |           0 |          0 |           0
  "H" |  CH/sqrt(3) | CH/sqrt(3) |  CH/sqrt(3)
  "H" | -CH/sqrt(3) |-CH/sqrt(3) |  CH/sqrt(3)
  "H" |  CH/sqrt(3) |-CH/sqrt(3) | -CH/sqrt(3)
  "H" | -CH/sqrt(3) | CH/sqrt(3) | -CH/sqrt(3)
%

Please use the batch script at the bottom of the tutorial to run octopus on one core. You will get the output as described in the basic tutorials. Now modify the batch script and submit it again to run on two cores. In case you run this on your local PC, simply start octopus with mpiexec -np <N> octopus, where <N> is the number of cores to be used.
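For reference, a run on a local machine without a batch system could look like the following; the launcher name (mpiexec, mpirun, srun) depends on your MPI installation, and the log file names are just examples:

# serial run on one core
octopus > out_1proc.log 2>&1

# parallel run on two cores
mpiexec -np 2 octopus > out_2procs.log 2>&1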

In the output, you will now see a new section on parallelization:


************************** Parallelization ***************************
Info: Octopus will run in *parallel*

      Number of processes           :       2
      Number of threads per process :       1

Info: Number of nodes in ParDomains  group:     2 (       55241)
Info: Octopus will waste at least  0.00% of computer time.
**********************************************************************

As you can see, Octopus tells us that it was run with two processes and that it used parallelization in domains (“ParDomains”). Below that, there is some information on the parallelization in ions and then, more importantly, further details on the domain parallelization:


Info: Using ParMETIS multilevel k-way algorithm to partition the mesh.
Info: Mesh Partition restart information will be written to 'restart/partition'.
Info: Finished writing information to 'restart/partition'.
Info: Mesh partition:

      Partition quality:    0.924702E-08

                 Neighbours         Ghost points
      Average  :          1                 7340
      Minimum  :          1                 7303
      Maximum  :          1                 7377

      Nodes in domain-group      1
        Neighbours     :         1        Local points    :     27584
        Ghost points   :      7377        Boundary points :     14781
      Nodes in domain-group      2
        Neighbours     :         1        Local points    :     27657
        Ghost points   :      7303        Boundary points :     15649

First, the output tells you how the partitioning of the mesh was done, in this case using ParMETIS, the parallel version of METIS. Most importantly, it reports the mesh partition statistics: for each process (“node”), the number of neighbouring domains and the number of local, ghost, and boundary points.

This information is useful for judging the efficiency of the domain parallelization: as a rule of thumb, once the number of ghost points approaches 25% of the number of local points, the scheme usually becomes inefficient. In the example above, each process holds about 7300 ghost points for roughly 27600 local points, a ratio of about 27%, so for this small grid two domains are already at the limit of what is worthwhile.

The number of boundary points only depends on the size of the stencil and usually cannot be changed. For finite systems, their impact is small; for periodic systems, they need to be updated during the calculation, which can impact the performance.

Time-dependent calculation

Now run a time-dependent calculation for the same system using the following input file:


CalculationMode = td
UnitsOutput = eV_Angstrom
FromScratch = yes

Radius = 3.5*angstrom
Spacing = 0.18*angstrom

CH = 1.097*angstrom
%Coordinates
  "C" |           0 |          0 |           0
  "H" |  CH/sqrt(3) | CH/sqrt(3) |  CH/sqrt(3)
  "H" | -CH/sqrt(3) |-CH/sqrt(3) |  CH/sqrt(3)
  "H" |  CH/sqrt(3) |-CH/sqrt(3) | -CH/sqrt(3)
  "H" | -CH/sqrt(3) | CH/sqrt(3) | -CH/sqrt(3)
%

You can use the same batch script as for the ground-state run to execute octopus on two cores.

When you look at the output, you should see the parallelization section as follows:


************************** Parallelization ***************************
Info: Octopus will run in *parallel*

      Number of processes           :       2
      Number of threads per process :       1

Info: Number of nodes in ParStates   group:     2 (           4)
Info: Octopus will waste at least  0.00% of computer time.
**********************************************************************

As you can see, octopus has used 2 processes as expected. Both processes are used for parallelization in states (“ParStates”). Slightly further down you see how the states are distributed:


Info: Parallelization in atoms
Info: Node in group    0 will manage      3 atoms:     1 -      3
Info: Node in group    1 will manage      2 atoms:     4 -      5
Info: Parallelization in Ions
Info: Node in group    0 will manage      3 Ions:     1 -      3
Info: Node in group    1 will manage      2 Ions:     4 -      5
Info: Parallelization in states
Info: Node in group    0 will manage      2 states:     1 -      2
Info: Node in group    1 will manage      2 states:     3 -      4

This means that each of the two processes manages 2 states.

Because this system is quite small, you will probably not see a big change in run time for the different parallelization options.

Control parallelization strategies

By default, Octopus chooses a certain combination of parallelization strategies. However, this can (and should) be adjusted by the user, who knows best both the system being calculated and the machine on which the code is executed.

The following variables control the parallelization strategies:

- ParDomains: number of processes used for the domain parallelization
- ParStates: number of processes used for the states parallelization
- ParKPoints: number of processes used for the k-point parallelization

When you set the number of processes for the different parallelization strategies, their product must be equal to the total number of processes used in the computation. This is due to the fact that the strategies are orthogonal to each other: using different strategies corresponds to dividing a hypercube along different dimensions.
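As a purely illustrative sketch (the variable names are the real input variables, but the numbers are arbitrary): a job started with 8 MPI processes could set

# 2 x 4 = 8 must match the number of MPI processes of the job
ParStates  = 2
ParDomains = 4

whereas setting both variables to 4 would request 16 processes and thus not be consistent with the 8 available ones.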

Let’s run the TD example from above again using two processes, but this time with domain parallelization. For this, you need to add ParDomains = 2 to the input file and submit the job script to run on two processes. The parallelization section in the output will then look like


************************** Parallelization ***************************
Info: Octopus will run in *parallel*

      Number of processes           :       2
      Number of threads per process :       1

Info: Number of nodes in ParDomains  group:     2 (       55241)
Info: Octopus will waste at least  0.00% of computer time.
**********************************************************************

This shows you that indeed the parallelization in domains has been used. A bit further down you can again check how the mesh is distributed (which should be the same as in the GS run).

Now you can combine the parallelization in states and domains: add ParStates = 2 to the input file (keeping ParDomains = 2) and change the Slurm batch script to start 4 processes (i.e. use --ntasks=4). Then submit the job. The relevant part of the output will be:


Info: Parallelization in atoms
Info: Node in group    0 will manage      3 atoms:     1 -      3
Info: Node in group    1 will manage      2 atoms:     4 -      5
Info: Parallelization in Ions
Info: Node in group    0 will manage      2 Ions:     1 -      2
Info: Node in group    1 will manage      1 Ions:     3 -      3
Info: Node in group    2 will manage      1 Ions:     4 -      4
Info: Node in group    3 will manage      1 Ions:     5 -      5
Info: Parallelization in states
Info: Node in group    0 will manage      2 states:     1 -      2
Info: Node in group    1 will manage      2 states:     3 -      4
Info: Mesh Partition restart information will be read from 'restart/partition'.
Info: Finished reading information from 'restart/partition'.
Info: Using ParMETIS multilevel k-way algorithm to partition the mesh.
Info: Mesh Partition restart information will be written to 'restart/partition'.
Info: Finished writing information to 'restart/partition'.
Info: Mesh partition:

      Partition quality:    0.924702E-08

                 Neighbours         Ghost points
      Average  :          1                 7340
      Minimum  :          1                 7303
      Maximum  :          1                 7377

      Nodes in domain-group      1
        Neighbours     :         1        Local points    :     27584
        Ghost points   :      7377        Boundary points :     14781
      Nodes in domain-group      2
        Neighbours     :         1        Local points    :     27657
        Ghost points   :      7303        Boundary points :     15649

From this output you can see that octopus used 2 processes for states parallelization and 2 processes for domain parallelization. This means that each process will handle 2 states and about half of the grid.
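To summarize this combined run: besides the system definition, the input file now contains the two parallelization variables shown below, and the batch script requests --ntasks=4, since 2 x 2 = 4.

ParStates  = 2    # the 4 states are distributed over 2 processes
ParDomains = 2    # the grid is split into 2 domains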

OpenMP parallelization

To try out the OpenMP parallelization, you can use the same input file as before, but delete all the parallelization variables. To use OpenMP, get the corresponding batch script (see the end of the tutorial), which sets the OMP_NUM_THREADS environment variable. If you run Octopus on your laptop, you need to set this variable to the number of OpenMP threads you would like to start.
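On a laptop, without a batch system, a minimal sketch could look like this (the thread count of 2 is just an example):

# one MPI process with two OpenMP threads
export OMP_NUM_THREADS=2
mpiexec -np 1 octopus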

If you run this, you will see the following information on the parallelization:


************************** Parallelization ***************************
Info: Octopus will run in *parallel*

      Number of processes           :       1
      Number of threads per process :       2

**********************************************************************

This indicates that one MPI process is used with two OpenMP threads per process.

The OpenMP parallelization can be combined with the MPI parallelization: the product of the number of MPI processes and the number of OpenMP threads per process must be equal to the total number of CPU cores used by the batch job.

Now you can repeat the last example from the previous section, but in addition use 2 OpenMP threads (so add again the Par* variables, set --ntasks=4, and set --cpus-per-task=2 as in the OpenMP batch script). The output will show:


************************** Parallelization ***************************
Info: Octopus will run in *parallel*

      Number of processes           :       4
      Number of threads per process :       2

Info: Number of nodes in ParStates   group:     2 (           4)
Info: Number of nodes in ParDomains  group:     2 (       55241)
Info: Octopus will waste at least  0.00% of computer time.
**********************************************************************

As expected, there are 4 MPI processes, split into 2 for states and 2 for domain parallelization, and there are 2 OpenMP threads per MPI process. In total, this run uses 8 CPU cores.
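In terms of the Slurm scripts given at the end of this tutorial, this hybrid run corresponds to a resource request along the following lines (only the relevant lines are shown):

#SBATCH --ntasks=4          # MPI processes (ParStates = 2 times ParDomains = 2)
#SBATCH --cpus-per-task=2   # OpenMP threads per MPI process

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun octopus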

Compare performance of different strategies

Let us compare the performance of the different strategies for a slightly larger molecule. Save the following as the inp file:


CalculationMode = gs
FromScratch = yes

XYZCoordinates = "1ala.xyz"

Spacing = 0.2*angstrom
Radius = 4.0*angstrom

TDPropagator = aetrs
TDMaxSteps = 20
TDTimeStep = 0.05

and the following as “1ala.xyz”:



  23
 units: A
      N                   -1.801560    0.333315   -1.308298
      C                   -1.692266    1.069227    0.012602
      C                   -0.217974    1.151372    0.425809
      O                    0.256888    2.203152    0.823267
      C                   -2.459655    0.319513    1.077471
      H                   -1.269452    0.827043   -2.046396
      H                   -1.440148   -0.634968   -1.234255
      H                   -2.791116    0.267637   -1.602373
      H                   -2.104621    2.114111   -0.129280
      H                   -2.391340    0.844513    2.046396
      H                   -2.090378   -0.708889    1.234538
      H                   -3.530691    0.246022    0.830204
      N                    0.476130   -0.012872    0.356408
      C                    1.893957   -0.046600    0.735408
      C                    2.681281    0.990593   -0.107455
      O                    3.486946    1.702127    0.516523
      O                    2.498931    1.021922   -1.333241
      C                    2.474208   -1.425485    0.459844
      H                    0.072921   -0.880981    0.005916
      H                    1.975132    0.211691    1.824463
      H                    1.936591   -2.203152    1.019733
      H                    3.530691   -1.461320    0.761975
      H                    2.422706   -1.683153   -0.610313

Then, run the ground state calculation.

Now, run the TD calculation (change the CalculationMode variable to td) using one process. After the initialization you see something like the following output:


********************* Time-Dependent Simulation **********************
  Iter           Time        Energy   SC Steps    Elapsed Time
***********************************************************************
      1       0.050000   -109.463446         1         1.087
      2       0.100000   -109.463446         1         1.076
      3       0.150000   -109.463446         1         1.077
      4       0.200000   -109.463446         1         1.089
      5       0.250000   -109.463446         1         1.082
      6       0.300000   -109.463446         1         1.064
      7       0.350000   -109.463446         1         1.070
      8       0.400000   -109.463446         1         1.100
      9       0.450000   -109.463446         1         1.086
     10       0.500000   -109.463446         1         1.186
[...]

The last column indicates how long one timestep took in real time, in this case about 1.1 s.
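For the comparison runs below it is convenient to script the mode switch and the extraction of the timings. A minimal sketch, assuming GNU sed, an input file called inp, and that the Octopus output is redirected to a log file (the file name is just an example):

# switch the calculation mode from gs to td in the input file
sed -i 's/CalculationMode = gs/CalculationMode = td/' inp

# run on one core and keep the output
mpiexec -np 1 octopus > td_serial.log 2>&1

# average the elapsed time per step (last column of the iteration lines)
awk '/^ *[0-9]+ +[0-9]/ {sum += $NF; n++} END {if (n > 0) printf "%.3f s per step\n", sum/n}' td_serial.log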

To assess the performance of the different parallelization strategies, use your knowledge from the previous sections and run the TD calculation:

- on two processes with states parallelization (ParStates = 2)
- on two processes with domain parallelization (ParDomains = 2)
- on one process with 2 OpenMP threads

Compare the time per step of each of those runs with the serial one. How big is the speed-up? The speed-up is defined as the time for the serial run divided by the time for the run on two processes, so ideally this number should be two. For example, if the serial run takes 1.1 s per step and a two-process run takes 0.65 s per step, the speed-up is 1.1/0.65 ≈ 1.7. How far away from the ideal are your runs? Which mode is the most efficient in this case?

For the domain-parallel run: what is the ratio of ghost points to local points? Is it still ok?

For the states-parallel run: how many states per process do you have? Is that still ok? Is the distribution balanced?

If you still have time, compare the following runs on 4 cores in the same way:

- ParStates = 4
- ParDomains = 4
- ParStates = 2 combined with ParDomains = 2
- 2 MPI processes with 2 OpenMP threads each

Tips and tricks

If you run on a cluster, make sure to read its documentation to find out how many cores per node it has. Try to use all cores available on the nodes you request. Do not use hyperthreading; Octopus will not profit from it.

For k-point parallelization, you can go down to one k-point per process. For states parallelization, you should still have at least 4 states per process to make effective use of vectorization. For both of these modes, the best distribution is a balanced one, i.e., each process has the same number of k-points or states.

For domain parallelization, the number of ghost points should usually be at most 25% of the number of local points. For large grids, it can be more efficient to use up to about 10 OpenMP threads instead of additional processes for the domain parallelization. As an example, if you would otherwise use ParDomains = 32, you can instead try ParDomains = 8 and set OMP_NUM_THREADS=4 to run with 4 OpenMP threads per process. Be aware that the number of OpenMP threads should be smaller than the number of cores per socket of a node to avoid problems with non-uniform memory access (NUMA). For small grids (less than about 10^5 points), it is usually not worthwhile to use either OpenMP or domain parallelization.
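For the ParDomains = 8 example with 4 threads per process, the corresponding Slurm request would look roughly like this (a sketch following the scripts below; 8 x 4 = 32 cores in total):

#SBATCH --ntasks=8          # MPI processes, with ParDomains = 8 in the input file
#SBATCH --cpus-per-task=4   # 4 OpenMP threads per MPI process

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores     # pin the threads, as in the scripts below
srun octopus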

It is often worth trying to find a good factorization of the total number of processes to use the different parallelization strategies efficiently; for example, 16 processes could be split as ParStates = 4 and ParDomains = 4, or as ParStates = 8 and ParDomains = 2, depending on the number of states and the size of the grid.

Slurm batch scripts for MPCDF systems

To run on one core of the reserved resources, please use the following batch script:


#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tjob.out.%j
#SBATCH -e ./tjob.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J octopus_course
#
# Reservation:
#SBATCH --reservation=mpsd_course
#
# Number of MPI Tasks, e.g. 1:
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1
# Memory usage [MB] of the job is required, 2200 MB per task:
#SBATCH --mem=2200
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit:
#SBATCH --time=00:05:00

# Run the program:
module purge
module load octopus/12
srun octopus

To run on two cores, please replace --ntasks=1 by --ntasks=2.

To utilize OpenMP, please use the following batch script:


#!/bin/bash -l
# Standard output and error:
#SBATCH -o ./tjob.out.%j
#SBATCH -e ./tjob.err.%j
# Initial working directory:
#SBATCH -D ./
# Job Name:
#SBATCH -J octopus_course
#
# Reservation:
#SBATCH --reservation=mpsd_course
#
# Number of MPI Tasks, e.g. 1:
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=2
# Memory usage [MB] of the job is required, 2200 MB per task:
#SBATCH --mem=2200
#
#SBATCH --mail-type=none
#SBATCH --mail-user=userid@example.mpg.de
#
# Wall clock limit:
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# For pinning threads correctly:
export OMP_PLACES=cores

# Run the program:
module purge
module load octopus/12
srun octopus

This will run with one MPI rank and 2 OpenMP threads. To change the number of OpenMP threads per MPI rank, adapt --cpus-per-task accordingly.