
Job script examples

Adapted from https://hpc-uit.readthedocs.io/en/latest/jobs/examples.html and customized for our cluster.

Basic examples

General blueprint for a jobscript

You can save the following example to a file (e.g. run.sh) on the CIIRC cluster. Comment out the two cp commands that are only there for illustration (see the comments below for an explanation of the scratch space) and change the SBATCH directives where applicable. You can then run the script by typing:

$ sbatch run.sh

Please note that all values that you define with SBATCH directives are hard limits. When you, for example, ask for 6000 MB of memory (--mem=6000MB) and your job uses more than that, the job will be automatically killed by the scheduler.
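If a job is killed for exceeding its memory request, you can inspect its peak memory usage afterwards with Slurm's accounting tools (assuming sacct accounting is enabled on the cluster; replace 1234 with your job ID):

$ sacct -j 1234 --format=JobID,JobName,State,Elapsed,MaxRSS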

#!/bin/bash -l

##############################
#       Job blueprint        #
##############################

# Give your job a name, so you can recognize it in the queue overview
#SBATCH --job-name=example

# Define how many nodes you need. Here, we ask for 1 node.
# 
# See 'System configuration' part of this manual for information about
# available cores.
#SBATCH --nodes=1
# You can further define the number of tasks with --ntasks-per-*
# See "man sbatch" for details. e.g. --ntasks=4 will ask for 4 cpus.

# Define how long the job will run in real time. This is a hard limit, meaning
# that if the job runs longer than what is written here, it will be
# force-stopped by the server. If you make the expected time too long, it will
# take longer for the job to start. Here, we say the job will take 5 minutes.
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# Define the partition on which the job shall run. May be omitted.
# See 'System configuration' for info about available partitions
#SBATCH --partition gpu

# How much memory you need.
# --mem will define memory per node and
# --mem-per-cpu will define memory per CPU/core. Choose one of those.
#SBATCH --mem-per-cpu=1500MB
##SBATCH --mem=5GB    # this one is not in effect, due to the double hash

# Turn on mail notification. There are many possible self-explanatory values:
# NONE, BEGIN, END, FAIL, ALL (includes all of the above)
# For more values, check "man sbatch"
#SBATCH --mail-type=END,FAIL

# You may not place any commands before the last SBATCH directive

# Define and create a unique scratch directory for this job
# /lscratch is a local SSD disk on the particular node, which is faster
# than your network home directory
SCRATCH_DIRECTORY=/lscratch/${USER}/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

# You can copy everything you need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was
# submitted from (usually in your network home dir)
cp ${SLURM_SUBMIT_DIR}/myfiles*.txt ${SCRATCH_DIRECTORY}

# This is where the actual work is done. In this case, the script only waits.
# The time command is optional, but it may give you a hint on how long the
# command took to run
time sleep 10

# After the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}

# In addition to the copied files, you will also find a file called
# slurm-1234.out in the submit directory. This file will contain all output that
# was produced during runtime, i.e. stdout and stderr.

# After everything is saved to the home directory, delete the work directory to
# save space on /lscratch
# old files in /lscratch will be deleted automatically after some time
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# Finish the script
exit 0
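After submitting the script you can monitor it with the usual Slurm commands (1234 stands for the job ID printed by sbatch):

$ squeue -u $USER          # list your pending and running jobs
$ scontrol show job 1234   # detailed information about a specific job
$ scancel 1234             # cancel the job if needed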

Running many sequential jobs in parallel using job arrays

In this example we wish to run many similar sequential jobs in parallel using job arrays. We use Python as an example, but the choice of language does not matter for job arrays:

#!/usr/bin/env python

import time

print('start at ' + time.strftime('%H:%M:%S'))

print('sleep for 10 seconds ...')
time.sleep(10)

print('stop at ' + time.strftime('%H:%M:%S'))

Save this to a file called “test.py” and try it out:

$ python test.py

start at 15:23:48
sleep for 10 seconds ...
stop at 15:23:58

Good. Now we would like to run this script 16 times at the same time. For this we use the following script:

#!/bin/bash -l

#####################
# job-array example #
#####################

#SBATCH --job-name=example

# 16 jobs will run in this array at the same time
#SBATCH --array=1-16

# run for five minutes
#              d-hh:mm:ss
#SBATCH --time=0-00:05:00

# 500MB memory per core
# this is a hard limit
#SBATCH --mem-per-cpu=500MB

# you may not place bash commands before the last SBATCH directive

# define and create a unique scratch directory
SCRATCH_DIRECTORY=/lscratch/${USER}/job-array-example/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}
cd ${SCRATCH_DIRECTORY}

cp ${SLURM_SUBMIT_DIR}/test.py ${SCRATCH_DIRECTORY}

# each job will see a different ${SLURM_ARRAY_TASK_ID}
echo "now processing task id:: " ${SLURM_ARRAY_TASK_ID}
python test.py > output_${SLURM_ARRAY_TASK_ID}.txt

# after the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp output_${SLURM_ARRAY_TASK_ID}.txt ${SLURM_SUBMIT_DIR}

# we step out of the scratch directory and remove it
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}

# happy end
exit 0

Submit the script and after a short while you should see 16 output files in your submit directory:

$ ls -l output*.txt

-rw------- 1 user user 60 Oct 14 14:44 output_1.txt
-rw------- 1 user user 60 Oct 14 14:44 output_10.txt
-rw------- 1 user user 60 Oct 14 14:44 output_11.txt
-rw------- 1 user user 60 Oct 14 14:44 output_12.txt
-rw------- 1 user user 60 Oct 14 14:44 output_13.txt
-rw------- 1 user user 60 Oct 14 14:44 output_14.txt
-rw------- 1 user user 60 Oct 14 14:44 output_15.txt
-rw------- 1 user user 60 Oct 14 14:44 output_16.txt
-rw------- 1 user user 60 Oct 14 14:44 output_2.txt
-rw------- 1 user user 60 Oct 14 14:44 output_3.txt
-rw------- 1 user user 60 Oct 14 14:44 output_4.txt
-rw------- 1 user user 60 Oct 14 14:44 output_5.txt
-rw------- 1 user user 60 Oct 14 14:44 output_6.txt
-rw------- 1 user user 60 Oct 14 14:44 output_7.txt
-rw------- 1 user user 60 Oct 14 14:44 output_8.txt
-rw------- 1 user user 60 Oct 14 14:44 output_9.txt
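If you do not want all 16 tasks to run at the same time, Slurm lets you throttle the array with a % suffix on the --array range; for example, this variant of the directive above allows at most 4 tasks to run simultaneously:

#SBATCH --array=1-16%4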

Packaging smaller parallel jobs into one large parallel job

There are several ways to package smaller parallel jobs into one large parallel job. The preferred way is to use job arrays (see above); many examples can be found on the web. Here we present a more pedestrian alternative which can give a lot of flexibility.

In this example we imagine that we wish to run 5 MPI jobs at the same time, each using 4 tasks, thus totalling 20 tasks. Once they finish, we wish to do a post-processing step and then run another set of 5 jobs with 4 tasks each:

#!/bin/bash

#SBATCH --job-name=example
#SBATCH --ntasks=20
#SBATCH --time=0-00:05:00
#SBATCH --mem-per-cpu=500MB

cd ${SLURM_SUBMIT_DIR}

# first set of parallel runs
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &

wait

# here a post-processing step
# ...

# another set of parallel runs
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &
mpirun -n 4 ./my-binary &

wait

exit 0

The wait commands are important here: the script will only continue once all commands started with & have completed.
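If the number of runs grows, the same pattern is easier to maintain as a loop; a minimal sketch of the first set above (the run_${i}.log file names are only illustrative):

# first set of parallel runs, written as a loop
for i in $(seq 1 5); do
    mpirun -n 4 ./my-binary > run_${i}.log &
done
wait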

GPU jobs

Tensor Flow

Prepare the environment:

module load Anaconda3/5.0.1
. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh
conda create -y -n tf_gpu_testing tensorflow-gpu python=3.6

Job declaration:

#!/bin/bash
#SBATCH --job-name=tf_test
#SBATCH --output=train_nn_%A.log
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:1
#SBATCH --mem=51G
#SBATCH --time=12:00:00
#SBATCH --partition=gpu

module purge
module load CUDA/9.0.176-GCC-6.4.0-2.28
module load cuDNN/7.1.4.18-fosscuda-2018b
module load Anaconda3/5.0.1

. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh
conda activate tf_gpu_testing

nvidia-smi

python tf_test.py

echo finish
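Before submitting a long run it can be worth checking interactively that TensorFlow actually sees a GPU from inside the environment; a minimal sketch, assuming the tf_gpu_testing environment created above and the TensorFlow 1.x API:

# request an interactive shell on a GPU node (adjust partition/GRES to your needs)
srun -p gpu --gres=gpu:1 --pty bash -l

# inside the interactive shell: load the same modules and environment as in the job script
module load CUDA/9.0.176-GCC-6.4.0-2.28 cuDNN/7.1.4.18-fosscuda-2018b Anaconda3/5.0.1
. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh
conda activate tf_gpu_testing

# prints True if TensorFlow (1.x API) can see a GPU
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"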

Running Jupyter on the CIIRC cluster

Assuming that you have a conda environment jupyter-env with jupyter-lab installed, you can use the following template to run jupyter-lab through SLURM:

#!/bin/bash
#SBATCH --job-name=jupyterlab
#SBATCH -o jupyter-lab-%j.log
#SBATCH --ntasks=6
#SBATCH --mem=16G
#SBATCH --time=5-00:00:00

# get tunneling info
XDG_RUNTIME_DIR=""
port=$(shuf -i8000-9999 -n1)
node=$(hostname -s)
user=$(whoami)
cluster=$(hostname -f | awk -F"." '{print $2}')

echo -e "

Terminal command to create your ssh tunnel
ssh -N -L ${port}:${node}:${port} ${user}@${cluster}.ciirc.cvut.cz

Use a Browser on your local machine to go to:
localhost:${port}  (prefix w/ https:// if using password)
"

module purge
module load Anaconda3/5.0.1
. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh
conda activate jupyter-env

jupyter-lab --no-browser --port=${port} --ip=${node}

After submitting the job with sbatch, wait a few seconds and then run cat jupyter-lab-<jobid>.log (the %j in the output file name is replaced by the actual job ID). First, you will see the ssh command needed to forward the connection from the CIIRC cluster to your local machine; for example, it might look like this:

ssh -N -L 8253:node-04:8253 username@cluster.ciirc.cvut.cz

Then you will need to copy the following link into your browser:

http://127.0.0.1:8253/?token=f5a695a83d976bca5496504f22b82224ccf6cfe56276a64d

And you are good to go.
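jupyter-lab prints the access URL (including the token) into the same log file, so you can recover it at any time, and since the job keeps running until its time limit you should cancel it once you are done. Replace <jobid> with the job ID reported by sbatch:

$ grep token jupyter-lab-<jobid>.log     # re-print the access URL with the token
$ scancel <jobid>                        # stop the Jupyter job when finished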

Running multi-node training

If you want to run multi-node training, the first thing to do is to ensure that your code supports it. Generally, there are two options at the moment: using the built-in functionality of your library of choice, or using Horovod. If you use PyTorch, you can find some information here; for Horovod, take a look at the Horovod docs to make sure that your code is appropriately modified.

Pure PyTorch multi-node training

For example, let's consider the PyTorch ImageNet script. To run it in the multi-node regime, you can use the following script:

#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=multinode.out
#SBATCH --time=05:00:00
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --gres=gpu:Volta100:8
#SBATCH --mem=256G

module purge
module load Anaconda3/5.0.1 NCCL

. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh

conda activate pytorch

declare -a nodes=()
declare -a ips=()
while read line; do
    nodes+=( $line )
    ips+=( $(getent hosts $line | cut -d' ' -f1) )
done <<< $(scontrol show hostnames $SLURM_JOB_NODELIST)

PORT=$(shuf -i30000-60000 -n1)

srun -N1 -n1 --gres=gpu:Volta100:8 python main.py -a resnet50 \
                                  --dist-url "tcp://${ips[0]}:$PORT" \
                                  --dist-backend 'nccl' \
                                  --multiprocessing-distributed \
                                  --rank 0 \
                                  --world-size 2 \
                                  /nfs/datasets/imagenet/imagenet12/ &


srun -N1 -n1 --gres=gpu:Volta100:8 python main.py -a resnet50 \
                                  --dist-url "tcp://${ips[0]}:$PORT" \
                                  --dist-backend 'nccl' \
                                  --multiprocessing-distributed \
                                  --rank 1 \
                                  --world-size 2 \
                                  /nfs/datasets/imagenet/imagenet12/ &

wait 
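The while-read loop at the top of the script uses scontrol show hostnames to expand Slurm's compressed node list into one hostname per line, and getent hosts to resolve each hostname to an IP address for --dist-url. You can try the expansion yourself on the login node (the node range below is only an example):

$ scontrol show hostnames "node-[04-05]"
node-04
node-05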

If you would like to use Horovod instead, you can use the following template together with the Horovod ImageNet script (replace the hard-coded python path with the python from your own conda environment):

#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=multinode.out
#SBATCH --time=05:00:00
#SBATCH --partition=gpu
#SBATCH --nodes=4 # Number of nodes
#SBATCH --ntasks-per-node=8 # Number of MPI process per node
#SBATCH --gres=gpu:Volta100:8 # Number of GPUs per node
#SBATCH --mem=256G

module purge
module load NCCL OpenMPI Anaconda3
. /opt/apps/software/Anaconda3/5.0.1/etc/profile.d/conda.sh
conda activate pytorch

declare -a nodes=()
while read line; do
    nodes+=( $line )
done <<< $(scontrol show hostnames $SLURM_JOB_NODELIST)

NODES=""
for i in ${!nodes[@]}; do
    if [ "$i" -eq "0" ]; then
        NODES+="${nodes[$i]}:8"
    else
        NODES+=",${nodes[$i]}:8"
    fi
done

horovodrun -np $SLURM_NTASKS --mpi -H $NODES \
    /home/ponimgeo/.conda/envs/pytorch/bin/python pytorch_imagenet_resnet50.py \
    --train-dir /nfs/datasets/imagenet/imagenet12/train \
    --val-dir /nfs/datasets/imagenet/imagenet12/val \
    --batch-size 256
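The loop above builds the host list in the host:slots format expected by horovodrun's -H option; for a 4-node allocation it ends up looking like this (node names are illustrative):

$ echo $NODES
node-01:8,node-02:8,node-03:8,node-04:8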