In this tutorial we will see how to run a relatively big system on the Hopper supercomputer (at NERSC in California), and how to measure its performance. There are a few key things you need to know about how to interact with the machine. To log in, run ssh trainX@hopper.nersc.gov in your terminal, substituting the actual name of your training account. Be aware that since this machine is far away, you should not try running X-Windows programs! You submit jobs with the qsub command, e.g. qsub job.scr, which puts them in the queue for execution when there is free space. You can see what jobs you currently have in the queue with qstat -u $USER, so you can tell when your job finishes. A status code will be shown: Q = waiting in the queue, R = running, C = complete. You can cancel a job with qdel followed by the job number, as reported by qstat. The job script (e.g. job.scr) tells the PBS/Torque queuing system how many cores to use, what commands to run, etc.

# Running the ground state

We will need the input file, job submission script, coordinates file, and a pseudopotential for Mg (for the other elements we will use the default ones that come with Octopus). The pseudopotential is available in the Octopus directory at jube/input/Mg.fhi. You can copy the files on Hopper directly from /global/homes/d/dstrubbe/octopus_tutorial to your scratch directory as follows:

cd $SCRATCH
cp -r /global/homes/d/dstrubbe/octopus_tutorial .


The input file (inp):


CalculationMode = gs

#### System size and parameters

Spacing = 0.20

Units = ev_angstrom
XYZCoordinates = "xyz"
%Species
"Mg" | 24.305 | spec_ps_fhi | 12 | 3 | 2
%
ExcessCharge = 0

XCFunctional = gga_x_pbe + gga_c_pbe

ExtraStates = 18
Eigensolver = rmmdiis
LCAOAlternative = yes
SmearingFunction = fermi_dirac
Smearing = 0.1
Mixing = 0.15

#### GS
MaximumIter = 300
EigensolverTolerance = 1e-8
ConvRelDens = 5e-8

#### Saving memory

SymmetriesCompute = no
PartitionPrint = no
MeshPartitionPackage = metis

ExperimentalFeatures = yes


Submission script job.scr, using 24 cores:

 #!/bin/bash
#PBS -q regular
#PBS -l mppwidth=24
#PBS -l walltime=0:30:00
#PBS -N testing_chl
#PBS -V

cd $PBS_O_WORKDIR
aprun -n 24 octopus_mpi &> output_gs_24

To run:

qsub job.scr

Coordinates file xyz. Take a look at it (on your local machine) with visualization software such as xcrysden to see what kind of molecule we are dealing with.

When your job finishes, take a look at the output to see what happened and make sure it completed successfully. Then we can do time-propagation.

# Running the time-dependent profiling

We change the input file accordingly. Change the CalculationMode from gs to td, and add the following lines:

#### TD
T = 18
dt = 0.003
TDPropagator = aetrs
TDTimeStep = dt

# Profiling
ProfilingMode = prof_memory
TDMaxSteps = 30
FromScratch = yes

Now it is time to do exactly the same TD run, changing the number of cores. Change XXX to powers of 2 (2^x): start at 64 (which will be the fastest) and divide by 2 in steps, down to 4. (Running on 2 or 1 cores may not work.)

#!/bin/bash
#PBS -q regular
#PBS -l mppwidth=XXX
#PBS -l advres=benasque.348
#PBS -l walltime=0:30:00
#PBS -N testing_chl
#PBS -V

module load octopus/4.1.2
cd $PBS_O_WORKDIR
aprun -n XXX octopus_mpi &> output_td_XXX
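As a sanity check on these values: a full propagation to T = 18 with dt = 0.003 would take T/dt = 6000 steps, so TDMaxSteps = 30 truncates the run to keep each benchmark short. A quick check of the arithmetic (plain awk, not part of the tutorial scripts):

```shell
# Steps a full propagation would need, versus the capped benchmark length.
# T, dt and max_steps are the values from the input file above.
awk 'BEGIN {
  T = 18; dt = 0.003; max_steps = 30
  full = T / dt                      # steps needed to reach t = T
  printf "full run: %d steps\n", full
  printf "benchmark: %d steps (%.1f%% of the full run)\n", max_steps, 100 * max_steps / full
}'
```

This prints "full run: 6000 steps" and "benchmark: 30 steps (0.5% of the full run)": each benchmark job only samples a tiny slice of the real propagation.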


Different profiling.000xxx folders will be created by each execution. We need to process them, mainly to be able to plot the information they contain. For that we can run the following script. It runs fine without any arguments, but we can control which files it processes by passing arguments, e.g. “analyze.sh 64 000004 2”. The first argument is the largest number of cores to consider. The second (optional) argument is the number of the reference file to use. The third is the starting value, i.e. the smallest number of cores to consider.
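To see what the script is parsing, here is a sketch with a made-up time.000000 fragment; the layout (subroutine name, then per-column statistics, with cumulative time assumed to be in the third column) and the values are invented for illustration, but the grep/awk pipeline is the same one analyze.sh uses to pull out the time for a given subroutine. The real file also starts with a few header entries, which is presumably why the script skips the first four tokens.

```shell
# Mock profiling file; assumed layout: subroutine, calls, total time, ...
mkdir -p profiling.000004
cat > profiling.000004/time.000000 << 'EOF'
TIME_STEP        30   12.345   0.411
POISSON_SOLVER   30    4.210   0.140
EOF

# Same extraction analyze.sh performs: anchor the subroutine name at the
# start of the line and print the third column.
grep "^TIME_STEP " profiling.000004/time.000000 | awk '{print $3}'
```

With this mock input the pipeline prints 12.345, the value analyze.sh would paste into the TIME_STEP column of the profile file.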

 #!/bin/bash

## Copyright (C) 2012,2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## analyze.sh

# Define the biggest number of processors.
if [ -z "$1" ]; then
  last_proc=64
else
  last_proc=$1
fi

# Define the reference file/folder
if [ -z "$2" ]; then
  ref=000004
else
  ref=$2
fi

# Define the starting value
if [ -z "$3" ]; then
  start=1
else
  start=$3
fi

# Initialise the output file
echo "-  " > profile_$start
for ((num=$start;num<=$last_proc;num*=2)); do
  echo $num >> profile_$start
done
rm -f tmp1

# Analyze all profiling.XXXXXX/time.000000 to get the time per subroutine
count=0
for function_name in $(less profiling.$ref/time.000000 | awk '{print $1}')
do
  if [ $count -lt 4 ]; then
    count=$((count + 1))
  else
    echo $function_name >> tmp1
    # Iterate over the power-of-two profilings
    for ((num=$start;num<=$last_proc;num*=2)); do
      folder=$(printf 'profiling.%06d' $num)
      x=$(less $folder/time.000000 | grep "^$function_name " | awk '{print $3}')
      zero=_"$x"_
      if [ "$zero" != "__" ]; then
        echo $x >> tmp1
      else
        echo "0" >> tmp1
      fi
    done
    paste profile_$start tmp1 > tmp2
    rm tmp1
    cp tmp2 profile_$start
  fi
done
echo "The result is in the \"profile_$start\" file"


At this point we should run “analyze.sh 64 000004 2”, which will create a file named “profile_2”. You can take a look at the following columns in the profiling data:

• TIME_STEP: the time per iteration. It scales well.
• COMPLETE_DATASET: the total execution time. In general it decreases with more processes; this is more obvious in a real execution, where the initialization time stays the same while the propagation takes much longer.
• SYSTEM_INIT: initialization time. The formerly increasing cost has been eliminated, and it is now almost constant, independent of the number of processes.
• POISSON_SOLVER: execution time for the Poisson solver. It is roughly constant in this case, but this changes with the other solvers and with domain parallelization.
• RESTART_WRITE: time for writing the restart files. It depends more on the system status than on the number of running processes. It could be heavily decreased by writing to the local drive.
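From the TIME_STEP column you can also compute speedup and parallel efficiency by hand. A minimal sketch with invented timings, assuming a profile_2-style layout (header row, core counts in column 1, times in column 2); the file name and numbers here are made up for illustration:

```shell
# Invented timings in the same shape analyze.sh produces.
cat > profile_example << 'EOF'
-    TIME_STEP
4    10.0
8     5.4
16    3.1
EOF

# Speedup relative to the smallest core count p0, and the matching efficiency:
# S(p) = T(p0)/T(p), E(p) = S(p) * p0 / p.
awk 'NR == 1 { next }                # skip the header row
     NR == 2 { p0 = $1; t0 = $2 }    # baseline: smallest core count
     { printf "%3d cores: speedup %.2f, efficiency %.2f\n", $1, t0/$2, (t0/$2)*p0/$1 }' profile_example
```

With these numbers the 16-core row comes out as speedup 3.23 and efficiency 0.81, i.e. doubling the cores twice falls somewhat short of a 4x gain.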

Now we can plot it using the following script:

 #!/bin/bash

## Copyright (C) 2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## plot_function.sh

if [ $# -eq 0 ]; then
  function="TIME_STEP"
else
  function=$1
fi
echo $function

column_number=$(awk -v fun=$function '
{
  for(i=1;i<=NF;i++){
    if ($i == fun)
      {print i+1}
  }
}' profile_2)

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
sed "s/REF/$column_number/g" $script_dir/plot_ref > plot_base
sed -i "s/FUNCTION/$function/g" plot_base
gnuplot plot_base


We also need this auxiliary file:

 ## Copyright (C) 2014 J. Alberdi-Rodriguez
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
## GNU General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this program; if not, write to the Free Software
## Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.
##
## plot_ref

set t postscript eps enhanced color solid
set output "gnuplot.eps"

set xlabel "MPI processes"
set ylabel "t (s)"
set logscale yx 2

plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x"


Something else you can try is 12, 24, 48, and 96 cores, since each node has 24 cores. In this case, you would need “analyze.sh 96 000003 3” to make “profile_3”, and then, in the plotting script, use

 plot "profile_2" u 1:REF w linespoint t "FUNCTION 2^x", "profile_3" u 1:REF w lp t "FUNCTION 3·(2^x)"


## Parallelization in domains vs states

We can divide up the work among the processors in different ways, by dividing up the points into domains for each processor, or dividing the states into groups for each processor, or a combination of both. Try out different combinations by adding to your input file


ParStates = 2
ParDomains = 12


and run on 24 cores, with different values of these two variables whose product is the total number of processors (e.g. 6 x 4, 3 x 8, …).
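The valid combinations are simply the factor pairs of the total core count. A throwaway helper (not part of Octopus) to enumerate them for 24 cores:

```shell
# List every ParStates x ParDomains split whose product is 24.
total=24
for ((s = 1; s <= total; s++)); do
  if (( total % s == 0 )); then
    echo "ParStates = $s  ParDomains = $((total / s))"
  fi
done
```

This prints eight combinations, one per divisor of 24 (1 x 24, 2 x 12, 3 x 8, 4 x 6, 6 x 4, 8 x 3, 12 x 2, 24 x 1).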

## PFFT Poisson solver

Another thing you can try is to compare the PFFT (parallel FFT) Poisson solver against the one we were using before (look in the output file to see which one it was). You will need to use this aprun line in your job script instead of the previous one:

 aprun -n XXX /global/homes/j/joseba/octopus/bin/octopus_mpi &> output_td_XXX


in order to use a different Octopus compilation that includes that library, and add these lines to your input file:


PoissonSolver = fft
FFTLibrary = pfft


Compare some runs against one on a similar number of processors that you did previously. How does the time for this solver compare? You can also try ordinary FFTW (not parallel) with


FFTLibrary = fftw


## Parallelization of the ground state

We can also try different parameters and algorithms to see their effect on the speed of the ground-state calculation, for 24, 48, or 96 processors. Look each up in the variable reference to see what they mean, and see which of the options you were using in the previous runs.

• parallelization in domains vs states (as above)
• Eigensolver = rmmdiis, plan, cg, cg_new, lobpcg.
• StatesOrthogonalization = cholesky_serial, cholesky_parallel, mgs, qr
• SubspaceDiagonalization = standard, scalapack
• linear combination of atomic orbitals (LCAO) for the initial guess: LCAOAlternative = yes, no. In this case, add MaximumIter = 0 to do just the LCAO rather than the whole calculation.