Partitions & Queues
Understanding SLURM partitions and priority queues on Prometheus
Overview
The Prometheus cluster has two partitions, each with its own set of priority queues (QoS) that control resource limits and scheduling priority. All limits are applied per group, and the default time limit is 4 hours on both partitions.
Partition Architecture
defq Partition (Default)
- Nodes: 8 compute nodes (gpu[01-08])
- GPU Type: NVIDIA A5000 (24GB VRAM each)
- Total GPUs: 64 (8 GPUs per node)
- Default partition: Jobs submitted without specifying a partition go here
a6000 Partition
- Nodes: 1 compute node (gpu09)
- GPU Type: NVIDIA RTX A6000 Ada Generation (48GB VRAM each)
- Total GPUs: 4
- Use case: High-memory GPU workloads
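If you are unsure which GPUs a partition exposes, sinfo can report the generic resources (GRES) configured on each node. A quick check might look like this; the exact GRES strings in the output depend on how the cluster is configured:
# Show partition, node list, and configured GPUs (GRES) for both partitions
sinfo -p defq,a6000 -o "%P %N %G"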
Priority Queues (QoS)
Resource Limits
All resource limits are applied per group, not per user. Coordinate with your group members to avoid conflicts.
defq Partition Queues
| Priority Queue | Time Limit | Max CPUs | Max GPUs | Max RAM | Max Jobs | Priority |
|---|---|---|---|---|---|---|
| normal | 1 day | 384 | 48 | 3TB | 30 | High |
| long | 7 days | 384 | 48 | 3TB | 20 | Medium |
| preemptive | Infinite | All* | All* | All* | 10 | Low |
a6000 Partition Queues
| Priority Queue | Time Limit | Max CPUs | Max GPUs | Max RAM | Max Jobs | Priority |
|---|---|---|---|---|---|---|
| normal-a6000 | 1 day | 48 | 3 | 384GB | 6 | High |
| long-a6000 | 7 days | 48 | 3 | 384GB | 4 | Medium |
| preemptive-a6000 | Infinite | All* | All* | All* | 2 | Low |
* Preemptive queues can use all available resources but jobs may be automatically terminated when higher-priority jobs need resources.
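Not every account is necessarily allowed to submit to every QoS. One way to check which partitions and QoS values are associated with your account (assuming the accounting database is configured this way) is:
# List the partitions and QoS values associated with your user account
sacctmgr show assoc user=$USER format=User,Account,Partition,QOS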
Queue Selection Guidelines
Use normal or normal-a6000 for:
- Interactive development and testing
- Short training runs (< 24 hours)
- Production jobs that need guaranteed completion
- Debugging and experimentation
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=12:00:00
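For the interactive development and testing case listed above, an interactive shell on a compute node can also be requested with srun, using the same partition and QoS. The resource amounts here are only an example sketch; adjust them to your workload:
# Interactive session with one GPU on the default partition
srun --partition=defq --qos=normal --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=02:00:00 --pty bash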
Use long or long-a6000 for:
- Extended training (1-7 days)
- Large model training requiring multiple days
- Parameter sweeps with many iterations
- Production workloads with longer time requirements
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --time=3-00:00:00 # 3 days
Use preemptive queues sparingly for:
- Low-priority background jobs
- Opportunistic computing when cluster is idle
- Jobs that can handle interruption (with checkpointing)
- Testing with unlimited time
Warning
Preemptive jobs can be automatically terminated at any time! Use the --requeue option and implement checkpointing.
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --requeue # Automatically resubmit if preempted
#SBATCH --time=7-00:00:00
Choosing the Right Partition
Use defq partition when:
- Your models fit in 24GB GPU memory
- You need multiple GPUs (up to 8 per node)
- Running distributed training across nodes
- Working with standard deep learning models
Use a6000 partition when:
- Your models require > 24GB GPU memory
- Training large language models (70B+ parameters)
- Working with high-resolution images or long sequences
- You need maximum GPU memory per device
Example Job Submissions
Standard Training Job (defq/normal)
#!/bin/bash
#SBATCH -J standard_training
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem=64000
#SBATCH --time=18:00:00
# Your training code here
python train_model.py --gpus 2
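Assuming the script is saved as standard_training.slurm (the filename is just an example), it can be submitted and then tracked by job name:
# Submit the batch script and check its status by job name
sbatch standard_training.slurm
squeue -u $USER -n standard_training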
Long Training Job (defq/long)
#!/bin/bash
#SBATCH -J long_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=5-00:00:00 # 5 days
# Long-running training with checkpointing
python train_model.py --gpus 4 --checkpoint-freq 1000
Large Model Training (a6000/normal-a6000)
#!/bin/bash
#SBATCH -J large_model
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2
#SBATCH --mem=256000
#SBATCH --time=20:00:00
# Large model requiring high GPU memory
python train_llm.py --model-size 70B --gpus 2
Preemptive Job with Checkpointing
#!/bin/bash
#SBATCH -J preemptive_job
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --mem=32000
#SBATCH --time=7-00:00:00
#SBATCH --requeue
#SBATCH --signal=SIGUSR1@90 # Signal 90 seconds before termination
# Handle preemption gracefully
trap 'echo "Job preempted, saving checkpoint..."; python save_checkpoint.py' SIGUSR1
python train_model.py --resume-from-checkpoint
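Before relying on the trap, you can verify that it actually fires by sending the same signal to a running job yourself; replace <job_id> with the real job ID:
# Manually deliver SIGUSR1 to a running job to test the checkpoint handler
scancel --signal=USR1 <job_id>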
Monitoring Queue Status
Check Partition Information
# View all partitions and their status
sinfo
# Example output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# defq* up infinite 6 idle~ gpu[03-08]
# defq* up infinite 1 mix gpu02
# defq* up infinite 1 idle gpu01
# a6000 up infinite 1 mix gpu09
Check Queue Status
# View current job queue
squeue
# View jobs by partition
squeue -p defq
squeue -p a6000
# View jobs by QoS
squeue --qos=normal
squeue --qos=long
Check Your Resource Usage
# View your running jobs
squeue -u $USER
# Check job priorities
sprio -u $USER
# View resource limits
sacctmgr show qos format=Name,MaxWall,MaxTRES,MaxJobs
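Because the limits in the tables above are enforced per group, the group-level (Grp*) fields of the QoS records are often more informative than the per-job ones; the exact fields that are populated depend on how the administrators configured each QoS:
# Show group-level limits and priorities for each QoS
sacctmgr show qos format=Name,Priority,GrpTRES,GrpJobs,MaxWall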
Resource Planning
Calculate Resource Needs
Before submitting jobs, consider:
GPU Memory Requirements:
- Small models (< 1B params): A5000 (24GB)
- Large models (> 10B params): A6000 (48GB)
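As a rough rule of thumb (an approximation, not a cluster policy), mixed-precision training with Adam needs on the order of 16 bytes of GPU memory per parameter for weights, gradients, and optimizer states, before counting activations. For example:
# ~16 bytes/parameter: a 1.3B-parameter model needs roughly 21 GB just for model state,
# which already approaches the 24GB of an A5000
python3 -c "print(round(1.3e9 * 16 / 1e9), 'GB')"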
Training Time Estimates:
- Quick experiments: normal queue (< 1 day)
- Full training: long queue (1-7 days)
Number of GPUs:
- Single GPU: Any node
- Multi-GPU: Consider node topology
- Distributed: Multiple nodes in defq
Group Coordination
Since limits are per group:
- Communicate with group members
- Check current group usage: squeue -A your-group-name
- Plan resource allocation to avoid conflicts
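A slightly more detailed view of what your group is running (user, job name, CPUs, memory, time limit, and state) can be obtained with a custom squeue format; replace your-group-name with the actual account:
# Per-user breakdown of the group's running and pending jobs
squeue -A your-group-name -o "%u %j %C %m %l %T"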
Best Practices
Queue Selection Strategy
- Start with normal queues for development
- Use long queues only when necessary
- Avoid preemptive queues unless jobs can handle interruption
- Test on smaller resources before scaling up
Resource Efficiency
Don’t over-allocate resources:
# Bad: Requesting 8 GPUs for single-GPU code
#SBATCH --gres=gpu:8
# Good: Request what you actually use
#SBATCH --gres=gpu:1
Use appropriate memory:
# Calculate actual memory needs
#SBATCH --mem=32000  # 32GB, not 500GB
Estimate time accurately:
# Add buffer but don't overestimate
#SBATCH --time=18:00:00  # 18 hours, not 7 days
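After a job finishes, comparing what it actually used against what was requested helps calibrate future requests; sacct is part of SLURM, and seff may also be available depending on what the administrators installed:
# Compare requested vs. consumed resources for a finished job
sacct -j <job_id> --format=JobID,Elapsed,TotalCPU,MaxRSS,ReqMem,AllocTRES
# One-page efficiency summary, if the seff utility is installed
seff <job_id>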
Troubleshooting
Job Stuck in Queue
# Check why job is pending
scontrol show job <job_id> | grep Reason
# Common reasons:
# - Resources: Requesting more than available
# - Priority: Lower priority than other jobs
# - QoSMaxJobsPerUser: Too many jobs running
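For pending jobs, squeue can also report the scheduler's estimated start time; the estimate only appears once the backfill scheduler has considered the job:
# Show expected start times for your pending jobs
squeue -u $USER --start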
Resource Limit Exceeded
# Check current group usage
squeue -A your-group
# Reduce resource requests or wait for jobs to complete
Wrong Partition Choice
# Cancel and resubmit with correct partition
scancel <job_id>
# Edit script and resubmit
sbatch corrected_script.slurm
Next Steps
- Learn about storage systems: Storage Guide
- Set up your environment: Environment Modules
- Configure VS Code: VS Code Setup