SLURM
Overview
The Slurm Workload Manager (originally Simple Linux Utility for Resource Management), or Slurm for short, is a free and open-source job scheduler for Linux.
It runs on many of the world’s supercomputers and computer clusters.
Slurm has three key functions.
First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates contention for resources by managing a queue of pending work.
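To make the allocation and queueing roles concrete, here is a minimal Python sketch of a toy first-in, first-out scheduler. It is not Slurm's scheduling algorithm: the `Job` class, the `simulate` function, and the discrete time steps are hypothetical. Real Slurm scheduling also weighs job priorities and can backfill smaller jobs onto idle nodes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int      # number of compute nodes requested (hypothetical model)
    duration: int   # time steps the allocation lasts, like a time limit


def simulate(jobs, total_nodes):
    """Toy strict-FIFO scheduler; returns {job name: start time step}.

    Conceptual sketch only, not how Slurm actually schedules.
    """
    for job in jobs:
        if job.nodes > total_nodes:
            raise ValueError(f"{job.name} requests more nodes than exist")
    pending = deque(jobs)   # the queue of pending work
    running = []            # (end_time, nodes) for each active allocation
    free = total_nodes
    start_times = {}
    t = 0
    while pending or running:
        # Release nodes whose allocations have expired at this step.
        for end, n in [r for r in running if r[0] <= t]:
            running.remove((end, n))
            free += n
        # Allocate nodes to jobs at the head of the queue while they fit;
        # a job that does not fit blocks every job behind it.
        while pending and pending[0].nodes <= free:
            job = pending.popleft()
            free -= job.nodes
            running.append((t + job.duration, job.nodes))
            start_times[job.name] = t
        t += 1
    return start_times
```

With four nodes and the jobs `Job("A", 2, 3)`, `Job("B", 2, 1)`, `Job("C", 3, 2)`, `Job("D", 1, 1)`, the call returns `{'A': 0, 'B': 0, 'C': 3, 'D': 3}`. Job `D` waits until step 3 even though two nodes are idle from step 1, because strict FIFO never lets a job jump ahead of the queue head; a backfill scheduler could start `D` at step 1, since it finishes before `C` could start anyway.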
Architecture
Slurm has a centralized manager, slurmctld, to monitor resources and work.
Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work.
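The controller/daemon loop described above can be modeled in a few lines of Python. This is a minimal sketch under loose assumptions: threads and `queue.Queue` objects stand in for daemons running on separate machines, dispatch is plain round-robin, and the names `node_daemon` and `central_manager` are hypothetical, not part of Slurm's code or protocol.

```python
import queue
import threading


def node_daemon(node_name, work_q, status_q):
    """Hypothetical slurmd-like loop: wait for work, run it, report, repeat."""
    while True:
        task = work_q.get()            # block until the manager sends work
        if task is None:               # sentinel: shut this daemon down
            break
        job_id, func = task
        try:
            result = func()
            status_q.put((node_name, job_id, "COMPLETED", result))
        except Exception as exc:       # report failures instead of dying
            status_q.put((node_name, job_id, "FAILED", repr(exc)))


def central_manager(jobs, node_names):
    """Hypothetical slurmctld-like manager: dispatch jobs, collect statuses."""
    status_q = queue.Queue()
    work_qs = {n: queue.Queue() for n in node_names}
    threads = [
        threading.Thread(target=node_daemon, args=(n, work_qs[n], status_q))
        for n in node_names
    ]
    for t in threads:
        t.start()
    # Round-robin dispatch of (job_id, callable) pairs to the node daemons.
    for i, (job_id, func) in enumerate(jobs):
        work_qs[node_names[i % len(node_names)]].put((job_id, func))
    # Collect exactly one status report per job.
    statuses = {}
    for _ in jobs:
        node, job_id, state, detail = status_q.get()
        statuses[job_id] = (node, state, detail)
    # Tell every daemon to stop, then wait for the threads to exit.
    for n in node_names:
        work_qs[n].put(None)
    for t in threads:
        t.join()
    return statuses
```

For example, submitting three jobs to `["node1", "node2"]` sends the first and third jobs to `node1` and the second to `node2`. A job whose callable raises is reported as `FAILED` rather than stopping its daemon, which then goes back to waiting for more work.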
The following image shows Slurm's structure: