Source: https://www.hpc.caltech.edu/documentation/slurm-commands

SLURM Commands | HPC Center

For a general introduction to using SLURM, watch the video tutorial that BYU put together. 

Here's a useful cheatsheet of many of the most common Slurm commands.

Example submission scripts are available at our Git repository. 

https://bitbucket.org/caltechimss/central-hpc-public/src/master/slurm-scripts/

To pull down the extended example script, run the following from a cluster login node.

wget https://bitbucket.org/caltechimss/central-hpc-public/raw/master/slurm-scripts/extended-slurm-submission

Job Submission

Use the Script Generator to check for syntax. Each #SBATCH line contains a parameter that you can use on the command-line (e.g. --time=1:00:00).

sbatch is used to submit batch (non-interactive) jobs. By default, the output is written to a file in the submission directory named slurm-$SLURM_JOB_ID.out.

Most of your jobs will be submitted this way:

sbatch -A accounting_group your_batch_script

salloc is used to obtain a job allocation that can then be used to run commands interactively within.
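For example, a typical interactive workflow looks like the following (the account name is a placeholder; substitute your own):

```shell
# Request an allocation of one node for one hour
salloc --nodes=1 --time=1:00:00 -A accounting_group

# Inside the allocation, launch tasks on the allocated node(s)
srun hostname

# Type exit to release the allocation when finished
exit
```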

srun is used to obtain a job allocation if needed and execute an application.  It can also be used to distribute MPI processes in your job.

Environment Variables:
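Slurm sets a number of environment variables inside every job that your scripts can use. A few of the most commonly used ones (a partial, illustrative list):

```shell
#!/bin/bash
# A few of the environment variables Slurm sets inside each job:
echo "Job ID:           $SLURM_JOB_ID"
echo "Job name:         $SLURM_JOB_NAME"
echo "Node list:        $SLURM_JOB_NODELIST"
echo "Number of nodes:  $SLURM_JOB_NUM_NODES"
echo "Total tasks:      $SLURM_NTASKS"
echo "CPUs per task:    $SLURM_CPUS_PER_TASK"
echo "Submit directory: $SLURM_SUBMIT_DIR"
```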

Resource Requests

To run your job, you will need to specify the resources it needs: memory, cores, nodes, GPUs, etc.  The scheduler offers a lot of flexibility for requesting exactly the resources you need.

Examples:

Request a single node with 2 P100 GPUs

#SBATCH --nodes=1

#SBATCH --gres=gpu:2

#SBATCH --partition=gpu

Request a single node with 1 V100 GPU. (Either 16GB or 32GB V100)

#SBATCH --nodes=1

#SBATCH --gres=gpu:v100:1

#SBATCH --partition=gpu

Request a single node with 1 V100 GPU, specifically a 32GB V100. (Because the four 32GB V100 GPUs are on a Cascade Lake node, we need to constrain the job to that node type.)

#SBATCH --nodes=1

#SBATCH --gres=gpu:v100:1

#SBATCH --constraint="cascadelake"

#SBATCH --partition=gpu

Request that your job only runs on Skylake or Cascade Lake CPUs.

#SBATCH --constraint="skylake|cascadelake"

Important Notes on Job Submission:

Your jobs must specify a wall clock time using the "-t" option when submitted.  If this time is exceeded, your job will be killed.  At first, it is best to set this near the maximum time allowed to get an idea of how long the job actually runs.  Once you know that, give it a more realistic limit: a reasonable time limit increases your chance of starting quickly under the scheduler's backfill algorithm.
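For example, to set a four-hour limit (the script and account names are placeholders):

```shell
# On the command line:
sbatch -t 4:00:00 -A accounting_group your_batch_script

# Or as a directive inside the batch script:
#SBATCH --time=4:00:00
```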

Your job will be charged to the account specified. We do not force you to set an account, since many users belong to just one.  If you are in more than one group, make sure you specify the group you want to charge the job to, using the "-A" option when submitting the job.

You can see the accounts you are in using:

sacctmgr show user myusername accounts

You can change your default account using:

sacctmgr modify user myusername set defaultaccount=account

Note:  Please choose your job's wall time wisely. As a matter of cluster policy, we do not typically increase a running job's wall time, as doing so is both unfair to other users and can alter the predicted start times of jobs already in the queue. If you are unfamiliar with your code's performance, we strongly recommend padding the wall time at first and then working backwards.

Job/Queue Management

squeue is used to show the queue. By default it shows all jobs, regardless of state.
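A few useful invocations (the username and job ID are placeholders):

```shell
squeue                   # all jobs, all users
squeue -u myusername     # only jobs belonging to one user
squeue --state=PENDING   # only pending jobs
squeue -j 1234           # one specific job
```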

scancel is used to cancel (i.e. kill) a job.  Here are some options to use:
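Commonly used options include (the username, job ID, and job name below are placeholders):

```shell
scancel 1234                            # cancel a specific job by ID
scancel -u myusername                   # cancel all of your jobs
scancel --state=pending -u myusername   # cancel only your pending jobs
scancel --name=myjobname                # cancel jobs by name
```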

You can stack these options to select a particular set of jobs.  For example, "scancel -u foo --state=pending" will kill all pending jobs for user "foo".

scontrol show job is used to display job information for pending and running jobs, such as holds, resource requests, and resource allocations.  This is a great first step in checking a job.

scontrol hold holds a job. Pass it a job ID (e.g. "scontrol hold 1234").

scontrol release releases a held job. Pass it a job ID (e.g. "scontrol release 1234").

Checking Usage

sreport is a good option for showing historical job usage by username or group.

To obtain usage for an entire group:

sreport -T gres/gpu,cpu cluster accountutilizationbyuser start=01/01/18-00:00:00 end=now -t hours account=<group-account-name>

To obtain usage for a single user:

sreport -T gres/gpu,cpu cluster accountutilizationbyuser start=01/01/18-00:00:00 end=now -t hours user=<username>

sacct shows current and historical job information in more detail than sreport. Important options:
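Commonly used options include (the values shown are placeholders):

```shell
sacct -u myusername      # jobs for a given user
sacct -j 1234            # a specific job
sacct -S 2024-01-01      # jobs starting on or after a date
sacct --format JobID,JobName,State,Elapsed,MaxRSS   # choose output fields
```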

Example

sacct  -u <username> -S 0101 --format JobId,AllocCPUs,UserCPU

Launching Tasks Within a Job

MPI Jobs

mpirun

Both OpenMPI and Intel MPI support the Slurm scheduler, so it should take no special effort to run your job under it.  They look for the environment variables Slurm sets when your job is allocated, and use those to start processes on the correct number of nodes and on the specific hosts:

mpirun executable options

srun

srun is the task launcher for Slurm.  It is built with PMI support, so it is a great way to start processes on the nodes for your MPI workflow.  srun launches processes more efficiently and faster than mpirun.  All processes launched by srun are consolidated into one job step, which makes it easier to see where time was spent in a job; with mpirun, each process appears as its own step.

Typically you can just use srun as you would mpirun, since it is aware of MPI and of the allocations for your job:

srun executable options

srun will run processes on all nodes and task processors allocated to the job.  You can specify otherwise if you prefer.

Embarrassingly Parallel Jobs

"Embarrassingly parallel" is a term for jobs that can run independently of each other but benefit from being run a large number of times; it is not a term of derision.  Monte Carlo simulations fall into this category and are a very common use case in high-throughput computing.  Depending on your use case, you may use srun, Slurm job arrays, GNU Parallel, or some other framework to launch the jobs.
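As one illustration, a Slurm job array launches many independent copies of the same script; the job name, array size, and program below are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=mc-array
#SBATCH --array=0-99        # 100 independent array tasks
#SBATCH --ntasks=1
#SBATCH --time=1:00:00

# Each array task receives its own index in SLURM_ARRAY_TASK_ID,
# which can select an input file or seed a random number generator.
./my_simulation --seed "$SLURM_ARRAY_TASK_ID"
```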

To get a shell on a compute node with allocated resources for interactive use, you can use the following command, specifying the information needed, such as queue, time, nodes, and tasks:
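A typical form of such a command (the partition, time, and resource values are examples; substitute your own):

```shell
srun --pty -p gpu -t 1:00:00 --nodes=1 --ntasks-per-node=1 /bin/bash -l
```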

Keep in mind that this is likely to be slow, and the session will end if the SSH connection is terminated. A more robust solution is to use FastX; see the FastX tutorial.

