HPC systems provide large complex shared resources to many users in parallel. Therefore, central workload management is mandatory for users and admins. This is typically achieved by using a resource manager and a job scheduler. The resource manager knows about all the resources available on a system and monitors the availability and load of each node. It basically manages all resources like CPUs, memory, and GPUs in a cluster. The job scheduler assigns compute tasks to resources and manages queues of compute tasks and their priority. It is typically configured for best possible utilization of the resources by given typical workloads.
Slurm is a resource manager and job scheduler that is used by the majority of the TOP 500 HPC systems and that is used on all HPC systems and clusters at MPCDF. It is open source software with commercial support (Documentation: https://slurm.schedmd.com, MPCDF HPC documentation: https://docs.mpcdf.mpg.de/doc/computing/).
Some slurm terminology:
- Job: Reservation of resources on the system to run job steps
- Job step: program/command to be run within a job, initiated via
- Node: physical multi-core shared-memory computer, a cluster is composed of many nodes
- CPU: single processing unit (core), a node contains multiple CPUs
- Task: process (i.e. instance of a program being executed), may use one or more CPU up to all CPUs available a node, a job step may run multiple tasks in parallel over several nodes
- Partition: a “queue”, where to run jobs, defines specific resource limits or access control
Important slurm commands are:
sinfo: show state of partitions and resources managed by slurm
sbatch job_script.sh: submit a job script for later execution, obtain job_id
scancel job_id: cancel a job, or send signals to tasks
squeue: show state of jobs or job steps in priority order
srun executable: Initiate job step, launch executable (typically used in job scripts)
sacct: show information for finished jobs
You can get a list of waiting and running jobs of yourself with
You can display a concise list of partitions with
sinfo -s (A/I/O/T means
Slurm jobs are submitted by users from the login node and then scheduled by slurm to be executed on the compute nodes of the cluster.
Any slurm job requires:
- Specification of the resources – “what does the job need?”
- Number of CPUs
- Amount of memory
- other resources or constraints
- Definition of the job steps – “what should the job do?”
- commands/programs to be executed via
- typically, first some module are loaded, then the program is executed using srun
- commands/programs to be executed via
All this information is bundled in job scripts that are submitted to slurm
Submitting a first job script
Let’s create a job script to run a simple octopus calculation.
First, generate the input file called inp with a text editor (the same as in the very first tutorial):
CalculationMode = gs %Coordinates 'H' | 0 | 0 | 0 % Spacing = 0.25 * angstrom Radius = 4.0 * angstrom
Second, create a job script file called job_script.sh :
#!/bin/bash -l # Standard output and error: #SBATCH -o ./tjob.out.%j #SBATCH -e ./tjob.err.%j # Initial working directory: #SBATCH -D ./ # Job Name: #SBATCH -J octopus_course # # Reservation: #SBATCH --reservation=mpsd_course # # Number of MPI Tasks, e.g. 1: #SBATCH --ntasks=1 #SBATCH --ntasks-per-core=1 # Memory usage [MB] of the job is required, 2200 MB per task: #SBATCH --mem=2200 # #SBATCH --mail-type=none #SBATCH --firstname.lastname@example.org # # Wall clock limit: #SBATCH --time=00:01:00 # Run the program: module purge module octopus/main srun octopus
For this job, output will be written to tjob.out.XXX
(XXX is the job id), error
output to tjob.err.XXX
. The job will run using one MPI task on one core,
requesting 2200 MB of memory. You can change the
--mail-type option to
--mail-user option to your email address to get email notifications
about changes in the job status (i.e. when the job starts and ends). The job
requests a time of one minute (
--time option). In the script, a octopus
module is loaded and the octopus executable is started with srun. The
--reservation option is only needed for this course to use dedicated
resources reserved for us.
Now submit the job with
sbatch job_script.sh. You should see an output like
Submitted batch job XXX
where XXX is the job id. You can check the status by running
Once the calculation has finished, you can check the output by opening the file tjob.out.XXX . Moreover, you should see the folders and files that octopus has created, as in the first tutorial.
More job script examples
To submit a parallel job on a few cores, but still on one node (cobra has 40
cores per node), you can use the options
--mem=17600 to run
on 8 cores, for example.
To run octopus on a full node (or several full nodes) in pure MPI mode, please use the following job script:
#!/bin/bash -l # Standard output and error: #SBATCH -o ./tjob.out.%j #SBATCH -e ./tjob.err.%j # Initial working directory: #SBATCH -D ./ # Job Name: #SBATCH -J octopus_course # # Reservation: #SBATCH --reservation=mpsd_course # # Number of nodes and MPI tasks per node: #SBATCH --nodes=1 #SBATCH --ntasks-per-node=40 # #SBATCH --mail-type=none #SBATCH --email@example.com # # Wall clock limit: #SBATCH --time=00:01:00 # Run the program: module purge module load octopus/main srun octopus
Save this file as job_script_mpi.sh
and submit it with
sbatch job_script_mpi.sh. This will run octopus on all 40 cores of one node. To run
on multiple nodes, adapt the
--nodes option accordingly.
To run octopus in hybrid mode (MPI + OpenMP), which is suitable for large grids, you can employ the following script:
#!/bin/bash -l # Standard output and error: #SBATCH -o ./tjob.out.%j #SBATCH -e ./tjob.err.%j # Initial working directory: #SBATCH -D ./ # Job Name: #SBATCH -J octopus_course # # Reservation: #SBATCH --reservation=mpsd_course # # # Number of nodes and MPI tasks per node: #SBATCH --nodes=1 #SBATCH --ntasks-per-node=10 # for OpenMP: #SBATCH --cpus-per-task=4 # #SBATCH --mail-type=none #SBATCH --firstname.lastname@example.org # # Wall clock limit: #SBATCH --time=00:01:00 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # For pinning threads correctly: export OMP_PLACES=cores # Run the program: module purge module load octopus/main srun octopus
This will run octopus on one full node, using 10 MPI ranks with 4 OpenMP threads each. Exporting the environment variables is necessary to ensure correct pinning of all processes and threads.