Frequently Use{d|ful} Cluster Commands
A subset of useful commands for interacting with (SLURM/Linux-based) HPC systems. Examples originate from man pages, documentation, emails, and so forth; credit is due to the original sources.
Loading software #
Software tends to be precompiled to leverage the cluster, and provided in multiple versions using the module system:
The first command any session should run; it resets the environment to the bare essentials.
module purge
Loading a specific version
module load X/version.y
The answer to ‘What do I need to do to make X load?’
module spider X
Filesystems #
On clusters you typically have two kinds of storage:
- Glacial, robust, safe: for storing big datasets archived in compressed form, with legally binding backup policies. Extremely slow, because it is distributed and optimized for large files; it can hold 100+ terabytes with ease.
- Fast, ephemeral: local to the compute node, very fast for any file, but limited in size. Access to the filesystem dies with your job.
In essence, the first one is the one you'll curse at the most when you're using it wrong. To review files on a system like that quickly, do not use the standard tools; instead I recommend the async filesystem browser written in Rust: broot
broot --sizes .
This will interactively build a list of the contents as results come in. I rarely need to know everything, so I don't need the hour-long wait that an “ls -alsht .” would take.
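If broot is not installed cluster-wide, it can usually be built in user space with a Rust toolchain. A minimal sketch; the module name is an assumption, check module spider rust first:
module load rust                      # assumed module name
cargo install broot                   # installs into ~/.cargo/bin by default
export PATH="$HOME/.cargo/bin:$PATH"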
Lustre #
When the cluster storage uses Lustre, you can use lfs to get a more optimized version of find
lfs find . -group $USER
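A couple more lfs examples I find handy; the mount point below is an example, and exact flags can vary slightly between Lustre versions:
lfs find . --type f --size +10G       # files over 10 GB under the current directory
lfs quota -u $USER /lustre            # your usage and limits on a Lustre mount (path is an example)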
Scheduler #
View queue lengths
partition-stats
Check quota
diskusage_report
Interactive mode #
Requesting interactive compute time
salloc --time=3:0:0 --ntasks=1 --cpus-per-task=8 --mem=32G --account=<lab>
A lot of users confuse ‘tasks’ with ‘cpus-per-task’. The first refers to the number of processes to run in parallel, the second to the number of cores/threads a single process can use.
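To make the distinction concrete, a sketch (the resource numbers are arbitrary): an MPI-style program wants many tasks, a multithreaded (e.g. OpenMP) program wants one task with many CPUs.
salloc --ntasks=8 --cpus-per-task=1 --mem=32G --account=<lab>   # 8 single-threaded processes (e.g. MPI ranks)
salloc --ntasks=1 --cpus-per-task=8 --mem=32G --account=<lab>   # 1 process using 8 cores/threads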
Useful optional options #
Oversubscribe
-s | --oversubscribe
Minimum local disk
--tmp=x{G|T|K}
Get notified near the end of a job. If you have a job that will exceed its time limit, a heads-up before that happens is useful so you can save working state. You can ask SLURM to send you a signal of your choosing sig_time seconds (up to 60 seconds earlier than requested) ahead of the deadline.
--signal=<sig>[@<sig_time>]
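In a batch script this pairs well with a trap, so the job can checkpoint itself before being killed. A minimal sketch; the signal choice, the 300-second lead time, and my_long_job are assumptions:
#SBATCH --signal=B:USR1@300      # send SIGUSR1 to the batch shell ~300 s before the time limit
trap 'echo "Time limit approaching, saving state..."' USR1
my_long_job &                    # hypothetical long-running command, run in the background
wait                             # so the shell can run the trap as soon as the signal arrives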
Request specific nodes
-w | --nodelist=na,nb,...
--nodelist=nod[1-2,9]   # ranges work too
-x | --exclude=...      # the negative version: avoid these nodes
Batch mode #
An example sbatch (submit and schedule) script
#!/bin/bash
#SBATCH --account=<LAB>
#SBATCH --mem=120G
#SBATCH --cpus-per-task=6
#SBATCH --time=18:00:00
#SBATCH --mail-user=<YOU>
#SBATCH --mail-type=ALL              # or a subset, e.g. BEGIN,END,FAIL,REQUEUE
#SBATCH --array=1-X%N
## Array batch script for use on clusters.
## As an example: reads X directory paths from file.txt and lists the contents of each.
## In real jobs, each directory would hold data to process
set -euo pipefail
echo "Starting task $SLURM_ARRAY_TASK_ID"
IDIR=$(sed -n "${SLURM_ARRAY_TASK_ID}p" file.txt)
ls -alsht "$IDIR"
Highlighting a few key points:
The lifesaver for shell scripts. Shell scripts are executed line by line, and by default you can get away with a lot of things that in other computing contexts would be fatal errors (and for good reason).
set -euo pipefail
- -e : Stop immediately on the first error. On a scheduled job you really want this on: if your 2-week job fails on the first line but keeps churning out garbage for 2 straight weeks, you will not be very productive, and you will burn compute quota.
- -u : Any undefined variable is an error. If you think you don’t need this, consider what happens to “rm -rf $DIR” when $DIR, due to a typo, was never set. For fun, imagine the script runs with write access to a $2e6 dataset.
- -o pipefail : Linux pipes are very powerful, but when one stage fails you do not want its corrupt output or error messages propagating silently through the rest of the pipe; with pipefail the whole pipeline reports the failure.
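To make these concrete, a tiny sketch; the commands and variable names are illustrative only:
set -euo pipefail
cp input.tar.gz "$SLURM_TMPDIR"/   # -e: if the copy fails, the script stops here instead of carrying on
rm -rf "${WORKDIR}/tmp"            # -u: an unset WORKDIR is an error, not an accidental 'rm -rf /tmp'
zcat input.vcf.gz | wc -l          # pipefail: a failing zcat fails the whole pipeline, not just its own stage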
Array support (optional) replaces most “for x in huge dataset” loops and independent parameter sweeps.
#SBATCH --array=1-X%N
This asks the scheduler to run an array of X jobs, where the array index is available as $SLURM_ARRAY_TASK_ID. Each job gets the full resources you requested, not the request divided across jobs. At most N will run in parallel (the %N part is optional).
Schedule it
sbatch myscript.sh
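If the array size depends on file.txt, you can also compute it at submission time, since command-line options override the #SBATCH directives in the script. A sketch, with an assumed data path:
find /path/to/datasets -mindepth 1 -maxdepth 1 -type d > file.txt    # one directory per line
sbatch --array=1-$(wc -l < file.txt)%10 myscript.sh                  # %10: at most 10 tasks at once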
By the way, if you want to check that your script has no typos or stupid bugs that would crash it in the first 5 minutes, and you want to avoid waiting 3 days in the queue to find that out, do
chmod u+x myscript.sh
salloc --time=0:10:0 --ntasks=1 --cpus-per-task=8 --mem=32G --account=<lab>
./myscript.sh
<CTRL+C> #once you're certain it runs as expected
This uses the fact that sbatch scripts are shell scripts, so they can run as-is. Run interactively, the #SBATCH entries are just comments.
Find out which of your jobs are in the queue (or running)
squeue -u $USER
sq   # shorthand wrapper available on some clusters
Find which job failed in an array (and define it as a function)
findfailed () { sacct -X -j "$1" -o JobID,State,ExitCode,DerivedExitCode | grep -v COMPLETED ; }
That will give you the JobID; then you can do
less slurm-<JID>.out
to get the trail of crumbs leading you to the cause of the unexpected failure.
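Once you know which array indices failed, you can resubmit only those, since --array also accepts a comma-separated list; the indices below are illustrative:
sbatch --array=7,23,105 myscript.sh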
Cluster access works by quota, usually with annealing and amortized assigned usage, meaning for a financial period you’ll get K CPU-years. In theory that means you could, on average, use K years of CPU non-stop. In reality it means no such thing, because your usage is probably highly variable: you and all your competitor researchers will have spiking workloads right before major conferences, and fewer compute needs while you’re still sketching out the algorithm.

When you use resources, your account’s fair-share score decreases, and that score determines your priority in the queue. Say users A and B have scores of 0.2 and 1.1. If the cluster is only somewhat busy, B’s jobs will go in first, but because there isn’t that much activity, the difference in queue time will be negligible. Now suppose we’re a week before a major machine learning conference deadline and the cluster is overloaded: A and B schedule the same amount of work, and while A is indeed way behind B in the queue, B will still be waiting a week or more, simply because of the extreme overload.
Long story short, to find a partial answer to ‘How long will I have to wait?’
sshare -l -A ACCOUNT
Look for LevelFS; that score is the fair-share weight that drives queue priority.
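For a (rough) per-job estimate, squeue can also report the scheduler’s predicted start times for your pending jobs:
squeue -u $USER --start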
Singularity #
By default Singularity writes its cache and local objects to $HOME/.singularity; on HPC, $HOME is the very last filesystem you want heavy writes on, so tell Singularity not to do that.
$ mkdir -p /<FASTLOCALDISK>/singularity/{cache,tmp}   # e.g. /scratch/$USER on many clusters
$ export SINGULARITY_CACHEDIR="/<FASTLOCALDISK>/singularity/cache"
$ export SINGULARITY_TMPDIR="/<FASTLOCALDISK>/singularity/tmp"
Setting environment variables inside container #
echo MYVAR="MYVALUE" > envfile
singularity exec --env-file envfile image.sif <command>
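A quick way to confirm the variable is visible inside the container, assuming printenv exists in the image:
singularity exec --env-file envfile image.sif printenv MYVAR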
Using CUDA #
module load cuda
singularity exec --nv myimage.sif <command>
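To check that the GPU is actually visible inside the container (on a node with a GPU allocated, assuming nvidia-smi is present):
singularity exec --nv myimage.sif nvidia-smi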
Bind mounts #
singularity exec -B <outside>:<inside> myimage.sif
The -B flag makes a host path (<outside>) visible inside the container at <inside>; multiple bind pairs can be passed as a comma-separated list.
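For example, a sketch that exposes node-local scratch as /data inside the container ($SLURM_TMPDIR as the fast local disk is an assumption):
singularity exec -B $SLURM_TMPDIR:/data myimage.sif ls /data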
Singularity instances #
singularity instance start image.sif <sessionname>
singularity exec instance://<sessionname> <command>
singularity instance list
singularity instance stop <sessionname>
Modify images (NOT recommended) #
Create a sandbox #
sudo singularity build -s <sandboxdirectory> image.sif
#or
singularity build --sandbox IMAGE_NAME/ <source>   # e.g. a library:// or docker:// URI
Modify image #
sudo singularity shell -w <sandboxdirectory>
Save to image #
sudo singularity build image.sif <sandboxdirectory>