SLURM
SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager that is widely used in high-performance computing (HPC) environments, such as clusters and supercomputers. It manages and schedules jobs (tasks or processes) on these systems, ensuring that resources like CPU, memory, and GPUs are allocated efficiently and fairly among users.
Create a script
Running a script on a SLURM cluster involves several steps, including preparing your script, creating a SLURM job submission script, and submitting it to the SLURM workload manager. Here’s a step-by-step guide:
First, make sure you have the script you want to run. This could be a Python script, a shell script, or any other executable script. For example, let’s assume you have a Python script named my_script.py.
To run your script on a SLURM cluster, you need to create a job submission script, which is a simple shell script containing SLURM directives that specify how your job should be executed.
Here’s an example of a SLURM job submission script for my_script.py:
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --output=output_%j.txt # Standard output and error log (%j will be replaced by job ID)
#SBATCH --ntasks=1 # Run a single task (useful for serial jobs)
#SBATCH --time=01:00:00 # Time limit hrs:min:sec (1 hour)
#SBATCH --mem=4GB # Memory required per node (4 GB)
#SBATCH --partition=normal # Partition (queue) to submit to
# Load any necessary modules
module load python/3.8
# Run your Python script
python my_script.py
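This example requests CPU and memory only. If your job needs a GPU and your cluster exposes GPUs as SLURM generic resources (the exact resource name and count syntax are site-specific, so check with your administrator), the request might look like this additional directive:
#SBATCH --gres=gpu:1 # request one GPU (resource name and count depend on your cluster)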
Note
Ask your system administrator for the partition name you should use and the names of the available modules.
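If you have shell access to the cluster, you can often look these up yourself. These commands are standard on clusters running SLURM together with an environment-modules system, but their availability may vary:
$ sinfo # list partitions and the state of their nodes
$ module avail # list the software modules installed on the cluster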
Run a script
Once you’ve created your job submission script (e.g., submit.sh), you can submit it to the SLURM scheduler using the sbatch command:
$ sbatch submit.sh
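If the submission succeeds, sbatch prints the ID assigned to your job; the output looks like this (the number shown is just an illustration):
Submitted batch job 123456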
Inspect progress of your jobs
After submitting, SLURM will assign your job a unique job ID and place it in a queue. The job will start running when the required resources are available. You can monitor the status of your job using the squeue command:
$ squeue -u your_username
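The output will look roughly like the following (all values here are illustrative):
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 123456    normal   my_job your_use  R       5:23      1 node001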
If you are interested in all jobs running on a specific partition, you can list them with the following command:
$ squeue -p partition_name
This can be helpful if you want to see how much capacity is left on the partition. For example, if you want to train a deep learning model on GPUs, you can check how many jobs are currently running on those GPUs.
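Another way to gauge remaining capacity is sinfo, which reports how many nodes in a partition are idle, allocated, or down:
$ sinfo -p partition_name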
The squeue output shows the state of each of your jobs. The main states are:
- PD (Pending): Waiting in the queue.
- R (Running): Currently executing.
- CG (Completing): Finishing up.
- CD (Completed): Finished successfully.
While your job runs and after it completes, SLURM writes your script’s standard output and any errors to the file you specified with the --output directive (output_%j.txt, with %j replaced by the job ID). You can inspect this file to see the results of your script.
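For example, assuming your job was assigned ID 123456 (an illustrative number), you can print the log or follow it while the job is still running:
$ cat output_123456.txt
$ tail -f output_123456.txt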
Cancel a job
If you need to cancel a job, use the scancel command followed by the job ID:
$ scancel <job_id>
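scancel also accepts a few other useful selectors, for example cancelling all of your own jobs or jobs by name:
$ scancel -u your_username # cancel all of your jobs
$ scancel --name=my_job # cancel jobs with a given name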