Prometheus Cluster Documentation
Complete guide to using the Prometheus deep learning cluster at CYENS
This section contains comprehensive documentation for the Prometheus cluster - a high-performance computing environment for deep learning research at CYENS.
Overview
The Prometheus cluster is a state-of-the-art deep learning computing facility featuring:
- 64 NVIDIA A5000 GPUs (24GB each) across 8 compute nodes
- 4 NVIDIA A6000 Ada GPUs (48GB each) on a dedicated node
- 1.7TB total GPU memory for large-scale model training
- High-performance Lustre storage with 305TB capacity
- SLURM job scheduler for efficient resource management
This documentation will guide you through:
Quick Start
- Generate SSH keys and request cluster access
- Connect via SSH to prometheus.cyens.org.cy
- Submit your first job using SLURM (a minimal example follows this list)
- Set up development environment with modules or containers
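For orientation, a minimal first job script might look like the sketch below (partition and QoS names are the cluster defaults described later in this guide); save it as hello.slurm and submit it with sbatch hello.slurm:
#!/bin/bash
#SBATCH -J hello            # job name
#SBATCH --partition=defq    # default A5000 partition
#SBATCH --qos=normal        # default priority queue
#SBATCH --gres=gpu:1        # one GPU
#SBATCH --time=00:10:00     # 10 minutes

nvidia-smi                  # print the GPU allocated to this job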
Cluster Specifications
Compute Resources
- 9 compute nodes total
- GPU nodes gpu[01-08]: 8×A5000 GPUs each (64 GPUs total)
- GPU node gpu09: 4×A6000 Ada GPUs (48GB VRAM each)
- 512GB RAM per compute node
- 32 CPU cores per node (AMD EPYC 7313)
Storage
- Home directories: 20GB SSD per user
- Shared storage: 30TB Lustre filesystem per group
- Local storage: 1TB NVMe SSD per compute node
Networking Infrastructure
- Management Network: Netgear M4300-52G switch with 48×1G ports plus 2×10GBASE-T and 2×SFP+
- High-Performance Interconnect: Mellanox HDR InfiniBand switch with 40×QSFP56 ports
- InfiniBand Speed: 200Gb/s HDR connectivity with hybrid copper cables
- Low Latency: Sub-microsecond messaging for distributed computing workloads
Software Environment
- Rocky Linux 8.5 operating system
- SLURM workload manager
- Lmod environment modules
- CUDA 11.3+ with deep learning frameworks
Support & Resources
- System administrators: Contact your MRG leader
- Documentation: This site and /opt/cluster/docs/
- Cluster status: Monitor with sinfo and squeue
For detailed instructions, start with the Getting Started guide.
1 - Getting Started
SSH access and initial setup for the Prometheus cluster
Prerequisites
Before accessing the Prometheus cluster, you need:
- A valid cluster account (contact your MRG leader)
- SSH client installed on your local machine
- Basic familiarity with Linux command line
Generate SSH Keys
The Prometheus cluster uses RSA key authentication for secure access. You need to generate a public/private key pair:
Step 1: Create SSH Key Pair
Open your terminal and run:
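# generate an RSA key pair (defaults to ~/.ssh/id_rsa)
ssh-keygen -t rsa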
Follow the on-screen instructions. This creates:
- Private key: ~/.ssh/id_rsa (keep this secure!)
- Public key: ~/.ssh/id_rsa.pub (share this with administrators)
Step 2: Secure Your Private Key
For Linux/Mac users, set proper permissions:
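# restrict access to your SSH directory and private key
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa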
Step 3: Request Cluster Access
- Send your public key to your MRG leader
- Request a Prometheus account
- Wait for account confirmation
Step 4: Add Passphrase (Optional but Recommended)
For additional security, add a passphrase to your key:
ssh-keygen -p -f ~/.ssh/id_rsa
Connect to Prometheus
Create or edit ~/.ssh/config file with the following content:
Host prometheus
Hostname prometheus.cyens.org.cy
User <your-username>
IdentityFile ~/.ssh/id_rsa
Replace <your-username> with your actual cluster username.
Connect via SSH
Once your account is activated, connect using:
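ssh prometheus
# or, without the SSH config entry above:
ssh <your-username>@prometheus.cyens.org.cy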
You should now be logged into the Prometheus head node!
First Login Setup
Check Your Environment
# Check current directory
pwd
# List available partitions
sinfo
# Check your groups
groups
# View your home directory quota
quota -us
Understand the File System
# Your home directory (20GB limit)
ls -la /trinity/home/$USER
# Shared group storage (30TB per group)
ls -la /lustreFS/data/
# Check group quota
lfs quota -gh <group-name> /lustreFS/
Cluster Architecture Overview
The Prometheus cluster consists of:
Head Node
- Login and job submission point
- DO NOT run compute jobs here
- Used for file management and job scheduling
Compute Nodes
- gpu[01-08]: 8 nodes with A5000 GPUs (8 GPUs each)
- gpu09: 1 node with A6000 Ada GPUs (4 GPUs)
- 512GB RAM and 32 CPU cores per node
Storage Systems
- /trinity/home/: Personal home directories (SSD, 20GB limit)
- /lustreFS/data/: Shared group storage (305TB Lustre filesystem)
- Local storage: 1TB NVMe on each compute node
Important Usage Rules
Critical
NEVER run compute jobs directly on the head node! Always use SLURM to submit jobs to compute nodes.
Checking Cluster Status
# View partition information
sinfo
# Check job queue
squeue
# Check your running jobs
squeue -u $USER
# View detailed node information
scontrol show nodes
Example sinfo output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
defq* up infinite 6 idle~ gpu[03-08]
defq* up infinite 1 mix gpu02
defq* up infinite 1 idle gpu01
a6000 up infinite 1 mix gpu09
Next Steps
Now that you’re connected to Prometheus:
- Set up your development environment - See Environment Modules
- Learn about partitions and queues - Read Partitions & Queues
- Submit your first job - Follow Job Submission
- Configure VS Code (optional) - See VS Code Setup
Getting Help
- Cluster status: Use sinfo and squeue commands
- Documentation: Check /opt/cluster/docs/ on the cluster
- Support: Contact your MRG leader
- System issues: Report to cluster administrators
Common First-Time Issues
“Permission denied (publickey)”
- Verify your public key was added to the cluster
- Check your SSH config file syntax
- Ensure private key permissions are correct (chmod 600)
“Connection refused”
- Verify the hostname: prometheus.cyens.org.cy
- Check that you’re connected to the internet
- Confirm your account is activated
“Quota exceeded”
- Home directory has a 20GB limit
- Use group storage in /lustreFS/data/ for large files
- Check usage with quota -us
2 - Hardware Specifications
Detailed hardware specifications of the Prometheus cluster
Cluster Overview
The Prometheus cluster features a modern architecture optimized for deep learning workloads with high-performance GPUs, abundant memory, and fast storage systems.
Head Node
Management and login node for the cluster
Hardware Configuration
- Chassis: GIGABYTE R182-Z90-00
- Motherboard: GIGABYTE MZ92-FS0-00
- CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
- Total CPU Cores: 32 cores / 64 threads
- RAM: 16× 32GB Samsung M393A4K40EB3-CWE
- Total RAM: 512GB DDR4
- Storage: 2× 1.92TB Intel SSDSC2KB019T8 SSD
- File System: /trinity/home (400GB allocated)
Purpose
- SSH login and job submission
- File management and transfers
- SLURM job scheduling
- Not for compute workloads
Compute Nodes
GPU Nodes gpu[01-08] (8 nodes)
Primary compute nodes with NVIDIA A5000 GPUs
Hardware Configuration
- Chassis: Supermicro AS-4124GS-TNR
- Motherboard: Supermicro H12DSG-O-CPU
- CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
- Total CPU Cores: 32 cores / 64 threads per node
- RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
- Total RAM: 512GB DDR4 per node
- Local Storage: 1× 1TB Samsung SSD 980 NVMe
GPU Specifications
- GPU Model: NVIDIA A5000
- GPU Count: 8 GPUs per node (64 total across all nodes)
- GPU Memory: 24GB GDDR6 per GPU
- CUDA Cores: 8,192 per GPU
- Tensor Cores: 256 (3rd gen)
- RT Cores: 64 (2nd gen)
- Peak Performance: 27.8 TFLOPS FP32 per GPU
- Memory Bandwidth: 768 GB/s per GPU
Total Resources (gpu[01-08])
- Total GPUs: 64× NVIDIA A5000
- Total GPU Memory: 1,536GB (1.5TB)
- Total CPU Cores: 256 cores / 512 threads
- Total System RAM: 4TB DDR4
GPU Node gpu09 (1 node)
High-memory GPU node with NVIDIA A6000 Ada
Hardware Configuration
- Chassis: ASUS RS720A-E11-RS12
- Motherboard: ASUS KMPP-D32
- CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
- Total CPU Cores: 32 cores / 64 threads
- RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
- Total RAM: 512GB DDR4
- Local Storage: 1× 1TB Samsung SSD 980 NVMe
GPU Specifications
- GPU Model: NVIDIA RTX A6000 Ada Generation
- GPU Count: 4 GPUs
- GPU Memory: 48GB GDDR6 per GPU
- CUDA Cores: 18,176 per GPU
- Tensor Cores: 568 (4th gen)
- Peak Performance: 91.06 TFLOPS FP32 per GPU
- Memory Bandwidth: 960 GB/s per GPU
Total Resources (gpu09)
- Total GPUs: 4× NVIDIA A6000 Ada
- Total GPU Memory: 192GB
- Total CPU Cores: 32 cores / 64 threads
- Total System RAM: 512GB DDR4
Storage Nodes
Storage Architecture
High-performance parallel file system
Hardware Configuration (2 storage nodes)
- Chassis: Supermicro Super Server
- Motherboard: Supermicro H12SSL-i
- CPU: 1× AMD EPYC 7302P (16 cores/32 threads)
- RAM: 8× 16GB Samsung M393A2K40DB3-CWE
- Total RAM: 256GB DDR4 per node
- OS Storage: 2× 240GB Intel SSDSC2KB240G7 SSD
- Data Storage: 24× 7.68TB Samsung MZILT7T6HALA/007 NVMe SSD
Storage Specifications
- File System: Lustre parallel file system
- Mount Point: /lustreFS
- Raw Capacity: ~184TB per storage node (24 × 7.68TB)
- Total Raw Capacity: ~368TB across both nodes
- Usable Capacity: ~305TB (after RAID and file system overhead)
- Performance: High-throughput parallel I/O
Software Environment
Operating System
- Distribution: Rocky Linux 8.5 (Green Obsidian)
- Kernel Version: 4.18.0-348.23.1.el8_5.x86_64
- Architecture: x86_64
Management Software
- Job Scheduler: SLURM Workload Manager
- Module System: Lmod (Lua-based Environment Modules)
- File System: Lustre for parallel storage
- CUDA Toolkit: 11.3+ with cuDNN
- Compilers: GCC, Intel, NVCC
- MPI: OpenMPI, MPICH
- Python: Multiple versions with conda/pip
- Deep Learning: PyTorch, TensorFlow, JAX
- Containers: Singularity/Apptainer support
Network Architecture
Interconnect
- Compute Network: High-speed Ethernet
- Storage Network: Dedicated Lustre network
- Management Network: Separate administrative network
Bandwidth
- Node-to-Node: High-bandwidth for distributed training
- Storage Access: Optimized for parallel I/O workloads
- External Access: Internet connectivity for downloads
- Total GPU Performance:
- A5000 nodes: 1,779 TFLOPS FP32 (64 × 27.8)
- A6000 node: 364 TFLOPS FP32 (4 × 91.06)
- Combined: ~2,143 TFLOPS FP32
- Memory Bandwidth:
- A5000 total: 49,152 GB/s
- A6000 total: 3,840 GB/s
- Combined: ~53TB/s GPU memory bandwidth
- Lustre File System: High-throughput parallel I/O
- Local NVMe: High IOPS for temporary data
- Home Directories: SSD-backed for fast access
Resource Allocation
Per-Node Resources
- CPU Cores: 32 physical / 64 logical per node
- System Memory: 512GB DDR4 per node
- GPU Memory:
- A5000 nodes: 192GB per node (8 × 24GB)
- A6000 node: 192GB (4 × 48GB)
- Local Storage: 1TB NVMe SSD per compute node
Total Cluster Resources
- Compute Nodes: 9 total
- CPU Cores: 288 physical / 576 logical
- System Memory: 4.5TB DDR4
- GPUs: 68 total (64 A5000 + 4 A6000 Ada)
- GPU Memory: 1.728TB total
- Shared Storage: 305TB Lustre + local NVMe
Use Cases and Workloads
Optimized For
- Large Language Models: High GPU memory for transformer models
- Computer Vision: Parallel training on multiple GPUs
- Distributed Training: Multi-node deep learning
- High-throughput Computing: Batch processing workflows
- Interactive Development: Jupyter notebooks and VS Code
- Memory-bound workloads: Benefit from A6000’s 48GB VRAM
- Compute-intensive tasks: Leverage A5000’s efficiency
- Data-intensive jobs: Utilize high-performance Lustre storage
- Multi-GPU training: Scale across nodes with SLURM
3 - Environment Setup
Configure your development environment on the Prometheus cluster
Development Environment Options
The Prometheus cluster supports multiple development environments:
- Container-based (Recommended)
- Module-based (Traditional HPC)
- Custom Python environments
Container-Based Setup
Using Pre-built Containers
The cluster provides optimized containers for common deep learning frameworks:
# List available containers
ls /shared/containers/
# Use PyTorch container
singularity shell --nv /shared/containers/pytorch-gpu.sif
# Use TensorFlow container
singularity shell --nv /shared/containers/tensorflow-gpu.sif
Building Custom Containers
Create a definition file (pytorch-custom.def):
Bootstrap: docker
From: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
%post
apt-get update && apt-get install -y \
git \
vim \
htop \
tmux
pip install \
transformers \
datasets \
wandb \
jupyter \
matplotlib \
seaborn
%environment
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTHONPATH=/opt/code:$PYTHONPATH
%runscript
exec "$@"
Build the container:
singularity build pytorch-custom.sif pytorch-custom.def
Python Environment Setup
Using Conda
# Load conda module
module load conda
# Create environment
conda create -n myenv python=3.9
# Activate environment
conda activate myenv
# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge jupyter matplotlib pandas
Using pip with virtual environments
# Load Python module
module load python/3.9
# Create virtual environment
python -m venv ~/venvs/deeplearning
source ~/venvs/deeplearning/bin/activate
# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install jupyter notebook jupyterlab
pip install transformers datasets wandb
GPU Environment Configuration
Checking GPU Availability
# Check available GPUs
nvidia-smi
# Check CUDA version
nvcc --version
# Test PyTorch GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
Setting GPU Visibility
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1
# Use all available GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Jupyter Notebook Setup
Local Jupyter on Compute Node
Request an interactive session:
srun --partition=gpu --gres=gpu:1 --time=4:00:00 --pty bash
Start Jupyter:
module load python/3.9
source ~/venvs/deeplearning/bin/activate
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
Set up SSH tunnel (from your local machine):
ssh -L 8888:<compute-node>:8888 <your-username>@prometheus.cyens.org.cy
JupyterHub Access
If available, access JupyterHub directly:
https://jupyter.prometheus-cluster.example.com
VS Code Remote Development
- Install VS Code with Remote-SSH extension
- Configure SSH connection in VS Code
- Connect to cluster and open your project folder
tmux for Session Management
# Start new session
tmux new-session -s training
# Detach session (Ctrl+b, then d)
# Reattach session
tmux attach-session -t training
# List sessions
tmux list-sessions
Storage and Data Access
Home Directory Setup
# Create project structure
mkdir -p ~/projects/{experiments,datasets,models,scripts}
mkdir -p ~/logs
Using Shared Storage
# Link shared datasets
ln -s /shared/datasets ~/datasets
# Copy models to your space
cp -r /shared/models/pretrained ~/models/
# Use scratch space for temporary files
export TMPDIR=/scratch/$USER
mkdir -p $TMPDIR
Environment Variables
Create ~/.cluster_env:
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_CACHE_PATH=/scratch/$USER/cuda_cache
# Python settings
export PYTHONPATH=$HOME/projects:$PYTHONPATH
export JUPYTER_CONFIG_DIR=$HOME/.jupyter
# Weights & Biases
export WANDB_DIR=$HOME/logs/wandb
export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache
# Hugging Face
export HF_DATASETS_CACHE=/scratch/$USER/hf_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache
Source it in your .bashrc:
echo 'source ~/.cluster_env' >> ~/.bashrc
Troubleshooting
Common Issues
CUDA out of memory:
# Reset the GPU (requires admin privileges; usually it is enough to kill the process holding the memory)
nvidia-smi --gpu-reset
# Monitor GPU usage
watch -n 1 nvidia-smi
Module not found:
# Check loaded modules
module list
# Reload environment
source ~/.bashrc
Permission denied:
# Check file permissions
ls -la
# Fix permissions
chmod 755 script.py
Next Steps
4 - Job Submission
Submit and manage jobs using SLURM on the Prometheus cluster
SLURM Job Scheduler
The Prometheus cluster uses SLURM (Simple Linux Utility for Resource Management) for job scheduling and resource allocation. SLURM ensures fair resource sharing and efficient cluster utilization.
Important
NEVER run compute jobs directly on the head node! Always use SLURM to submit jobs to compute nodes.
Interactive Jobs
Interactive jobs are perfect for development, testing, and debugging. Use srun to request resources immediately.
Basic Interactive Session
# Request 1 GPU, 2 CPUs, 1GB RAM for 1 hour
srun -c 2 -n 1 -p defq --qos=normal --mem=1000 --gres=gpu:1 -t 1:00:00 --pty /bin/bash
Interactive Session Parameters
# Request specific resources
srun --partition=defq \
--qos=normal \
--cpus-per-task=4 \
--gres=gpu:2 \
--mem=20000 \
--time=04:00:00 \
--pty /bin/bash
A6000 Interactive Session
# Request A6000 GPU with high memory
srun --partition=a6000 \
--qos=normal-a6000 \
--cpus-per-task=8 \
--gres=gpu:1 \
--mem=50000 \
--time=02:00:00 \
--pty /bin/bash
Batch Jobs
Batch jobs run unattended and are ideal for training models, parameter sweeps, and long-running experiments.
Basic Batch Script
Create a file train_model.slurm:
#!/bin/bash
#SBATCH -o res_%j.txt # output file
#SBATCH -e res_%j.err # error file
#SBATCH -J my_training # job name
#SBATCH --partition=defq # partition
#SBATCH --qos=normal # priority queue
#SBATCH --ntasks=1 # number of tasks
#SBATCH --cpus-per-task=4 # CPU cores per task
#SBATCH --gres=gpu:2 # number of GPUs
#SBATCH --mem=32000 # memory in MB
#SBATCH --time=1-12:00 # time limit (1 day, 12 hours)
# Load required modules
module load CUDA/11.3.1
module load Python/3.9.5
# Activate your environment
source ~/anaconda3/bin/activate
conda activate myenv
# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1
# Run your training script
cd /lustreFS/data/mygroup/myproject
python train_model.py --epochs 100 --batch-size 64
Submit the Batch Job
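Submit the script with sbatch; the job ID it prints is what %j expands to in the output and error file names:
sbatch train_model.slurm
# confirm that the job is queued or running
squeue -u $USER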
SLURM Parameters Reference
Common SBATCH Directives
| Parameter | Description | Example |
|---|---|---|
| -J, --job-name | Job name | #SBATCH -J training_job |
| -o, --output | Output file | #SBATCH -o output_%j.txt |
| -e, --error | Error file | #SBATCH -e error_%j.err |
| -p, --partition | Partition | #SBATCH --partition=defq |
| --qos | Priority queue | #SBATCH --qos=normal |
| -n, --ntasks | Number of tasks | #SBATCH --ntasks=1 |
| -c, --cpus-per-task | CPUs per task | #SBATCH --cpus-per-task=8 |
| --gres | Generic resources | #SBATCH --gres=gpu:4 |
| --mem | Memory (MB) | #SBATCH --mem=64000 |
| -t, --time | Time limit | #SBATCH --time=2-00:00 |
Time limit format examples:
# 30 minutes
#SBATCH --time=00:30:00
# 4 hours
#SBATCH --time=04:00:00
# 1 day, 12 hours
#SBATCH --time=1-12:00:00
# 7 days (maximum for long queue)
#SBATCH --time=7-00:00:00
GPU Allocation Examples
# Request any available GPU
#SBATCH --gres=gpu:1
# Request multiple GPUs
#SBATCH --gres=gpu:4
# All GPUs on A5000 node (8 GPUs)
#SBATCH --gres=gpu:8
# A6000 GPU specifically
#SBATCH --partition=a6000 --gres=gpu:1
Advanced Job Examples
Multi-GPU Training Script
#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=2-00:00:00
#SBATCH -o logs/multi_gpu_%j.out
#SBATCH -e logs/multi_gpu_%j.err
# Ensure logs directory exists
mkdir -p logs
# Load modules
module load CUDA/11.3.1 Python/3.9.5
# Activate environment
conda activate pytorch-env
# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4
# Run distributed training
python -m torch.distributed.launch \
--nproc_per_node=4 \
--master_port=12355 \
train_distributed.py \
--config config.yaml \
--output-dir /lustreFS/data/mygroup/results
Parameter Sweep with Job Arrays
#!/bin/bash
#SBATCH -J param_sweep
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --array=1-20%5 # 20 jobs, max 5 running
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --time=08:00:00
#SBATCH -o logs/sweep_%A_%a.out
#SBATCH -e logs/sweep_%A_%a.err
# Parameter arrays
learning_rates=(0.001 0.01 0.1 0.2)
batch_sizes=(16 32 64 128 256)
# Calculate parameters for this task
lr_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) / ${#batch_sizes[@]} ))
bs_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) % ${#batch_sizes[@]} ))
lr=${learning_rates[$lr_idx]}
bs=${batch_sizes[$bs_idx]}
# Load environment
module load Python/3.9.5
conda activate myenv
# Run experiment
python train.py \
--learning-rate $lr \
--batch-size $bs \
--experiment-name "sweep_${SLURM_ARRAY_TASK_ID}" \
--output-dir /lustreFS/data/mygroup/sweep_results
A6000 High-Memory Job
#!/bin/bash
#SBATCH -J large_model_training
#SBATCH --partition=a6000
#SBATCH --qos=long-a6000
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:3 # Use 3 of 4 A6000 GPUs
#SBATCH --mem=256000 # 256GB RAM
#SBATCH --time=5-00:00:00 # 5 days
#SBATCH -o logs/large_model_%j.out
#SBATCH -e logs/large_model_%j.err
# Load modules
module load CUDA/11.3.1
# Activate environment with large model libraries
conda activate large-models
# Set memory optimization
export CUDA_VISIBLE_DEVICES=0,1,2
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Train large language model
python train_llm.py \
--model-size 70B \
--gradient-checkpointing \
--fp16 \
--data-dir /lustreFS/data/mygroup/datasets \
--output-dir /lustreFS/data/mygroup/llm_checkpoints
Job Management Commands
Monitoring Jobs
# Check job queue
squeue
# Check your jobs only
squeue -u $USER
# Detailed job information
scontrol show job <job_id>
# Job history
sacct -u $USER --starttime=2024-01-01
# Job efficiency statistics
seff <job_id>
Job Control
# Cancel a job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
# Cancel jobs by name
scancel --name=training_job
# Hold a job (prevent it from running)
scontrol hold <job_id>
# Release a held job
scontrol release <job_id>
# Show partition information
sinfo
# Show detailed node information
scontrol show nodes
# Show QoS information
sacctmgr show qos
# Show your job priorities
sprio -u $USER
Resource Monitoring
During Job Execution
# Monitor GPU usage on your job
srun --jobid=<job_id> nvidia-smi
# Check memory usage
srun --jobid=<job_id> free -h
# Monitor CPU usage
srun --jobid=<job_id> htop
# Job efficiency report
seff <job_id>
# Detailed job accounting
sacct -j <job_id> --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,Elapsed,MaxRSS,MaxVMSize
Best Practices
Resource Allocation
Request appropriate resources:
# Don't over-allocate
#SBATCH --cpus-per-task=4 # Not 32 if you only use 4
#SBATCH --mem=16000 # Not 500000 if you only need 16GB
Use job arrays for parameter sweeps:
#SBATCH --array=1-100%10 # Limit concurrent jobs
Choose appropriate partitions:
- Use defq for most workloads
- Use a6000 only when you need >24GB GPU memory
Data Management
Use appropriate storage:
# Large datasets and results
cd /lustreFS/data/mygroup
# Temporary files during job
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
Clean up after jobs:
# Add to end of script
rm -rf /tmp/$SLURM_JOB_ID
Debugging
Test interactively first:
srun -p defq --qos=normal --gres=gpu:1 --time=1:00:00 --pty /bin/bash
Use smaller datasets for debugging:
python train.py --debug --max-samples 1000
Check logs regularly:
tail -f logs/training_12345.out
Troubleshooting
Common Issues
Job pending forever:
# Check why job is pending
squeue -u $USER -t PENDING
scontrol show job <job_id> | grep Reason
Out of memory errors:
# Reduce batch size or request more memory
#SBATCH --mem=64000 # Increase memory
CUDA out of memory:
# In your script, add:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
Job killed by time limit:
# Request more time or use checkpointing
#SBATCH --time=3-00:00:00
Cannot access files:
# Check file permissions and paths
ls -la /lustreFS/data/mygroup/
Next Steps
5 - Partitions & Queues
Understanding SLURM partitions and priority queues on Prometheus
Overview
The Prometheus cluster has two partitions with different priority queues (QoS) that control resource limits and scheduling priority. All limits are applied per group, and the default time limit is 4 hours for all partitions.
Partition Architecture
defq Partition (Default)
- Nodes: 8 compute nodes (gpu[01-08])
- GPU Type: NVIDIA A5000 (24GB VRAM each)
- Total GPUs: 64 (8 GPUs per node)
- Default partition: Jobs submitted without specifying partition go here
a6000 Partition
- Nodes: 1 compute node (gpu09)
- GPU Type: NVIDIA RTX A6000 Ada Generation (48GB VRAM each)
- Total GPUs: 4
- Use case: High-memory GPU workloads
Priority Queues (QoS)
Resource Limits
All resource limits are applied per group, not per user. Coordinate with your group members to avoid conflicts.
defq Partition Queues
| Priority Queue | Time Limit | Max CPUs | Max GPUs | Max RAM | Max Jobs | Priority |
|---|---|---|---|---|---|---|
| normal | 1 day | 384 | 48 | 3TB | 30 | High |
| long | 7 days | 384 | 48 | 3TB | 20 | Medium |
| preemptive | Infinite | All* | All* | All* | 10 | Low |
a6000 Partition Queues
| Priority Queue | Time Limit | Max CPUs | Max GPUs | Max RAM | Max Jobs | Priority |
|---|---|---|---|---|---|---|
| normal-a6000 | 1 day | 48 | 3 | 384GB | 6 | High |
| long-a6000 | 7 days | 48 | 3 | 384GB | 4 | Medium |
| preemptive-a6000 | Infinite | All* | All* | All* | 2 | Low |
* Preemptive queues can use all available resources but jobs may be automatically terminated when higher-priority jobs need resources.
Queue Selection Guidelines
Use normal or normal-a6000 for:
- Interactive development and testing
- Short training runs (< 24 hours)
- Production jobs that need guaranteed completion
- Debugging and experimentation
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=12:00:00
Use long or long-a6000 for:
- Extended training (1-7 days)
- Large model training requiring multiple days
- Parameter sweeps with many iterations
- Production workloads with longer time requirements
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --time=3-00:00:00 # 3 days
Use preemptive queues sparingly for:
- Low-priority background jobs
- Opportunistic computing when cluster is idle
- Jobs that can handle interruption (with checkpointing)
- Testing with unlimited time
Warning
Preemptive jobs can be automatically terminated at any time! Use the requeue option and implement checkpointing.
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --requeue # Automatically resubmit if preempted
#SBATCH --time=7-00:00:00
Choosing the Right Partition
Use defq partition when:
- Your models fit in 24GB GPU memory
- You need multiple GPUs (up to 8 per node)
- Running distributed training across nodes
- Working with standard deep learning models
Use a6000 partition when:
- Your models require > 24GB GPU memory
- Training large language models (70B+ parameters)
- Working with high-resolution images or long sequences
- Need maximum GPU memory per device
Example Job Submissions
Standard Training Job (defq/normal)
#!/bin/bash
#SBATCH -J standard_training
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem=64000
#SBATCH --time=18:00:00
# Your training code here
python train_model.py --gpus 2
Long Training Job (defq/long)
#!/bin/bash
#SBATCH -J long_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=5-00:00:00 # 5 days
# Long-running training with checkpointing
python train_model.py --gpus 4 --checkpoint-freq 1000
Large Model Training (a6000/normal-a6000)
#!/bin/bash
#SBATCH -J large_model
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2
#SBATCH --mem=256000
#SBATCH --time=20:00:00
# Large model requiring high GPU memory
python train_llm.py --model-size 70B --gpus 2
Preemptive Job with Checkpointing
#!/bin/bash
#SBATCH -J preemptive_job
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --mem=32000
#SBATCH --time=7-00:00:00
#SBATCH --requeue
#SBATCH --signal=SIGUSR1@90 # Signal 90 seconds before termination
# Handle preemption gracefully
trap 'echo "Job preempted, saving checkpoint..."; python save_checkpoint.py' SIGUSR1
python train_model.py --resume-from-checkpoint
Monitoring Queue Status
# View all partitions and their status
sinfo
# Example output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# defq* up infinite 6 idle~ gpu[03-08]
# defq* up infinite 1 mix gpu02
# defq* up infinite 1 idle gpu01
# a6000 up infinite 1 mix gpu09
Check Queue Status
# View current job queue
squeue
# View jobs by partition
squeue -p defq
squeue -p a6000
# View jobs by QoS
squeue --qos=normal
squeue --qos=long
Check Your Resource Usage
# View your running jobs
squeue -u $USER
# Check job priorities
sprio -u $USER
# View resource limits
sacctmgr show qos format=Name,MaxWall,MaxTRES,MaxJobs
Resource Planning
Calculate Resource Needs
Before submitting jobs, consider:
GPU Memory Requirements (a rough sizing sketch follows this list):
- Small models (< 1B params): A5000 (24GB)
- Large models (> 10B params): A6000 (48GB)
Training Time Estimates:
- Quick experiments: normal queue (< 1 day)
- Full training: long queue (1-7 days)
Number of GPUs:
- Single GPU: Any node
- Multi-GPU: Consider node topology
- Distributed: Multiple nodes in defq
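As a rough rule of thumb (an illustrative estimate, not a cluster policy), fp32 training with Adam needs about 16 bytes per parameter for weights, gradients, and optimizer states, before counting activations:
# Back-of-envelope GPU memory estimate for fp32 training with Adam (~16 bytes/parameter)
python -c "gb = lambda p: p * 16 / 2**30; print(f'1B params ~ {gb(1e9):.0f} GB, 3B params ~ {gb(3e9):.0f} GB')"
# 1B params ~ 15 GB -> fits a 24GB A5000; 3B params ~ 45 GB -> needs a 48GB A6000 or model sharding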
Group Coordination
Since limits are per group:
- Communicate with group members
- Check current group usage: squeue -A your-group-name
- Plan resource allocation to avoid conflicts
Best Practices
Queue Selection Strategy
- Start with normal queues for development
- Use long queues only when necessary
- Avoid preemptive queues unless jobs can handle interruption
- Test on smaller resources before scaling up
Resource Efficiency
Don’t over-allocate resources:
# Bad: Requesting 8 GPUs for single-GPU code
#SBATCH --gres=gpu:8
# Good: Request what you actually use
#SBATCH --gres=gpu:1
Use appropriate memory:
# Calculate actual memory needs
#SBATCH --mem=32000 # 32GB, not 500GB
Estimate time accurately:
# Add buffer but don't overestimate
#SBATCH --time=18:00:00 # 18 hours, not 7 days
Troubleshooting
Job Stuck in Queue
# Check why job is pending
scontrol show job <job_id> | grep Reason
# Common reasons:
# - Resources: Requesting more than available
# - Priority: Lower priority than other jobs
# - QoSMaxJobsPerUser: Too many jobs running
Resource Limit Exceeded
# Check current group usage
squeue -A your-group
# Reduce resource requests or wait for jobs to complete
Wrong Partition Choice
# Cancel and resubmit with correct partition
scancel <job_id>
# Edit script and resubmit
sbatch corrected_script.slurm
Next Steps
6 - Storage Systems
File systems, quotas, and data management on the Prometheus cluster
Storage Overview
The Prometheus cluster provides multiple storage systems optimized for different use cases, from personal files to high-performance parallel computing workloads.
Storage Architecture
Home Directories (/trinity/home/)
- Type: SSD-backed storage
- Mount point: /trinity/home/<username>
- Quota: 20GB per user
- Purpose: Personal configuration files, small scripts
- Backup: Regular backups maintained
- Performance: Fast random I/O, moderate capacity
Shared Group Storage (/lustreFS/data/)
- Type: Lustre parallel file system
- Mount point: /lustreFS/data/<group-name>
- Quota: 30TB per group (or 20,971,520 files)
- Purpose: Primary workspace for research data and results
- Performance: High-throughput parallel I/O
- Shared: All group members have access
Local Node Storage
- Type: NVMe SSD (1TB per compute node)
- Purpose: Temporary files during job execution
- Access: Only available during allocated jobs
- Performance: Highest IOPS for temporary data
File System Details
Home Directory Usage
Quota Limit
Home directories have a strict 20GB limit. Use them only for configuration files, not data or models.
# Check your home directory quota
quota -us
# View home directory contents
ls -la /trinity/home/$USER
# Typical home directory structure
/trinity/home/username/
├── .bashrc # Shell configuration
├── .ssh/ # SSH keys and config
├── .jupyter/ # Jupyter configuration
├── .conda/ # Conda configuration
├── scripts/ # Small utility scripts
└── .local/ # Local Python packages
Best practices for home directories:
- Store only configuration files and small scripts
- Link to shared storage for data access
- Use symbolic links to avoid quota issues
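For example, symbolic links keep large data under the group quota rather than your 20GB home quota (the group paths below are illustrative; use your group's actual directory):
ln -s /lustreFS/data/mygroup/datasets ~/datasets
ln -s /lustreFS/data/mygroup/experiments/$USER ~/experiments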
Shared Group Storage
The /lustreFS/data/ directory provides high-performance storage for your research work:
# Access your group's shared storage
cd /lustreFS/data/<group-name>
# Check group quota
lfs quota -gh <group-name> /lustreFS/
# Example group directory structure
/lustreFS/data/mygroup/
├── datasets/ # Shared datasets
├── models/ # Pre-trained and trained models
├── experiments/ # Individual user experiments
│ ├── user1/
│ ├── user2/
│ └── shared/
├── code/ # Shared code repositories
├── results/ # Experiment results
└── tmp/ # Temporary files
Quota information:
- Space limit: 30TB per group
- File limit: 20,971,520 files per group
- Shared: All group members can read/write
Local Node Storage
Each compute node has local NVMe storage for temporary files:
# During a SLURM job, use local storage for temporary files
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Example usage in job script
#SBATCH --job-name=training
#SBATCH --gres=gpu:1
# Create temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Copy data to local storage for faster I/O
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz
# Run training with local data
python train.py --data-dir $TMPDIR/dataset
# Copy results back to shared storage
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/
Quota Management
Checking Quotas
# Check your home directory quota
quota -us
# Check group quota on Lustre filesystem
lfs quota -gh <group-name> /lustreFS/
# Example quota output:
# Disk quotas for group mygroup (gid 1001):
# Filesystem used quota limit grace files quota limit grace
# /lustreFS 15.2T 30T 30T - 1234567 20971520 20971520 -
Understanding Quota Output
- used: Current usage
- quota: Soft limit (warning threshold)
- limit: Hard limit (cannot exceed)
- grace: Time allowed to exceed soft quota
- files: Number of files/inodes used
Managing Quota Issues
When approaching quota limits:
Clean up temporary files:
# Find large files
find /lustreFS/data/mygroup -type f -size +1G -ls
# Find old temporary files
find /lustreFS/data/mygroup -name "*.tmp" -mtime +7 -delete
Archive old data:
# Compress old experiments
tar -czf old_experiments.tar.gz experiments/2023/
rm -rf experiments/2023/
Use efficient storage:
# Use compressed formats for datasets
# Store checkpoints selectively
# Remove duplicate files
Data Management Best Practices
Directory Organization
Organize your group’s shared storage efficiently:
# Recommended structure
/lustreFS/data/mygroup/
├── datasets/
│ ├── imagenet/ # Large shared datasets
│ ├── coco/
│ └── custom/
├── models/
│ ├── pretrained/ # Downloaded pre-trained models
│ └── checkpoints/ # Training checkpoints
├── experiments/
│ ├── user1/
│ │ ├── project_a/
│ │ └── project_b/
│ └── user2/
├── code/
│ ├── shared_utils/ # Shared code libraries
│ └── experiments/ # Experiment code
└── results/
├── papers/ # Results for publications
└── ongoing/ # Current experiment results
File Permissions
Set appropriate permissions for shared access:
# Make directories group-writable
chmod g+w /lustreFS/data/mygroup/datasets/
# Set default permissions for new files
umask 002
# Change group ownership if needed
chgrp -R mygroup /lustreFS/data/mygroup/shared/
Data Transfer
Small Files (< 1GB)
# Copy from local machine using scp
scp large_dataset.tar.gz prometheus:/lustreFS/data/mygroup/datasets/
# Copy between directories on cluster
cp -r /lustreFS/data/mygroup/datasets/source /lustreFS/data/mygroup/experiments/
Large Files (> 1GB)
# Use rsync for large transfers with progress
rsync -avP large_dataset/ prometheus:/lustreFS/data/mygroup/datasets/
# Parallel compression for large datasets
tar -cf - dataset/ | pigz > dataset.tar.gz
Download Datasets
# Download directly to shared storage
cd /lustreFS/data/mygroup/datasets/
wget https://example.com/large_dataset.tar.gz
# Use aria2 for faster parallel downloads
aria2c -x 8 -s 8 https://example.com/dataset.tar.gz
Backup Strategies
While the cluster provides reliable storage, implement your own backup strategy:
- Important results: Copy to external storage
- Code: Use git repositories
- Large datasets: Document download sources for re-acquisition
- Models: Keep important checkpoints on external storage
Lustre File System Tips
Use parallel I/O for large files:
# PyTorch DataLoader with multiple workers
dataloader = DataLoader(dataset, batch_size=64, num_workers=8)
Avoid small random writes:
# Bad: Many small writes
for i in {1..1000}; do echo $i >> file.txt; done
# Good: Batch writes
seq 1 1000 > file.txt
Use appropriate stripe settings for large files:
# Set stripe count for large files (> 1GB)
lfs setstripe -c 4 /lustreFS/data/mygroup/large_dataset/
- Copy frequently accessed data to local storage during jobs
- Use local storage for temporary files and intermediate results
- Copy final results back to shared storage
# Example job with local storage optimization
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
# Set up local temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Copy dataset to local storage
echo "Copying dataset to local storage..."
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz
# Run training with local data (much faster I/O)
python train.py --data-dir $TMPDIR/dataset --output-dir $TMPDIR/results
# Copy results back to shared storage
echo "Copying results back..."
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/
Environment Variables
Set up useful environment variables for data management:
# Add to your ~/.bashrc
export GROUP_DATA="/lustreFS/data/mygroup"
export DATASETS="$GROUP_DATA/datasets"
export MODELS="$GROUP_DATA/models"
export EXPERIMENTS="$GROUP_DATA/experiments/$USER"
export RESULTS="$GROUP_DATA/results"
# Create your experiment directory
mkdir -p $EXPERIMENTS
Common Storage Issues
Quota Exceeded
# Error: "Disk quota exceeded"
# Solution: Check and clean up usage
lfs quota -gh mygroup /lustreFS/
find $GROUP_DATA -type f -size +1G -ls
Permission Denied
# Error: "Permission denied"
# Solution: Check file permissions and group membership
ls -la /lustreFS/data/mygroup/
groups # Check your group membership
# Solutions:
# 1. Use local storage for temporary files
# 2. Reduce number of small files
# 3. Use parallel I/O libraries
# 4. Check stripe settings for large files
lfs getstripe /lustreFS/data/mygroup/large_file
File System Full
# Check available space
df -h /lustreFS
# If file system is full, clean up:
# 1. Remove temporary files
# 2. Compress old data
# 3. Archive completed experiments
Next Steps
7 - Environment Modules
Using Lmod environment modules to manage software on Prometheus
Overview
The Prometheus cluster uses Lmod (Lua-based Environment Modules) to manage software packages and their dependencies. This system allows you to easily load and unload different software versions without conflicts.
Module System Basics
Environment modules modify your shell environment to provide access to specific software packages. When you load a module, it typically:
- Adds software to your PATH
- Sets environment variables
- Loads required dependencies
- Configures library paths
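You can see this in action by inspecting your environment after a load (CUDA/11.3.1 is used as the example module, matching the versions referenced elsewhere in this guide):
module purge
module load CUDA/11.3.1
which nvcc        # now resolves inside the CUDA installation added to PATH
echo $CUDA_HOME   # environment variable set by the module
module list       # shows CUDA plus any dependencies it pulled in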
Basic Module Commands
List Available Modules
# Show all available modules
module available
module avail
module av
ml av
# Search for specific modules
module avail gcc
module avail python
module avail cuda
Load Modules
# Load a specific module
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
# Short form using 'ml'
ml GCC/10.3.0 CUDA/11.3.1 Python/3.9.5
Check Loaded Modules
# List currently loaded modules
module list
ml list
Unload Modules
# Unload a specific module
module unload GCC/10.3.0
# Unload all modules
module purge
ml purge
Common Software Modules
Compilers
# GNU Compiler Collection
module load GCC/10.3.0
module load GCC/11.2.0
# Intel Compilers (if available)
module load intel/2021.4.0
CUDA and GPU Development
# CUDA Toolkit
module load CUDA/11.3.1
module load CUDA/11.7.0
module load CUDA/12.0.0
# Check CUDA after loading
nvcc --version
nvidia-smi
Python Environments
# Python interpreter
module load Python/3.9.5
module load Python/3.10.8
# Python with scientific libraries
module load Python/3.9.5-GCCcore-10.3.0
Deep Learning Frameworks
# PyTorch (if pre-installed as module)
module load PyTorch/1.12.1-foss-2022a-CUDA-11.7.0
# TensorFlow (if pre-installed as module)
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
Development Tools
# Git (usually available by default)
module load git/2.36.0
# CMake
module load CMake/3.24.3
# HDF5 for data storage
module load HDF5/1.12.2
Module Dependencies
Lmod automatically handles dependencies. When you load a module, it loads required dependencies:
# Loading Python might automatically load GCC
module load Python/3.9.5
# Check what was loaded
module list
# Might show:
# GCCcore/10.3.0
# Python/3.9.5-GCCcore-10.3.0
Setting Up Your Environment
Create a Module Loading Script
Create ~/load_modules.sh for consistent environment setup:
#!/bin/bash
# ~/load_modules.sh - Load standard development environment
# Clear any existing modules
module purge
# Load core development tools
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
# Optional: Load additional tools
# module load git/2.36.0
# module load CMake/3.24.3
echo "Development environment loaded:"
module list
Make it executable and use it:
chmod +x ~/load_modules.sh
source ~/load_modules.sh
Add to Your Shell Configuration
Add common modules to your ~/.bashrc:
# Add to ~/.bashrc
# Load standard modules at login
if [ -f ~/load_modules.sh ]; then
source ~/load_modules.sh
fi
SLURM Job Scripts with Modules
Basic Job with Modules
#!/bin/bash
#SBATCH -J module_job
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00
# Load required modules
module purge
module load CUDA/11.3.1
module load Python/3.9.5
# Verify modules are loaded
echo "Loaded modules:"
module list
# Check CUDA availability
echo "CUDA version:"
nvcc --version
# Activate your conda environment
source ~/anaconda3/bin/activate
conda activate myenv
# Run your script
python train.py
Multiple GPU Job with Modules
#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --gres=gpu:4
#SBATCH --time=1-00:00:00
# Load modules for CUDA development
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
# Load MPI for distributed computing (if available)
# module load OpenMPI/4.1.4-GCC-10.3.0
# Set CUDA environment
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Activate environment
conda activate pytorch-gpu
# Run distributed training
python -m torch.distributed.launch \
--nproc_per_node=4 \
train_distributed.py
Python Package Management
Using Conda with Modules
# Load Python module
module load Python/3.9.5
# Install conda (if not already available)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
# Add conda to path
echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Create environment
conda create -n myenv python=3.9
conda activate myenv
# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Using pip with Modules
# Load Python module
module load Python/3.9.5
# Create virtual environment
python -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate
# Upgrade pip
pip install --upgrade pip
# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install jupyter numpy pandas matplotlib
Module Collections
Save frequently used module combinations:
# Save current modules as a collection
module save my_collection
# List saved collections
module savelist
# Restore a collection
module restore my_collection
Custom Module Paths
If your group has custom modules:
# Add custom module path
module use /lustreFS/data/mygroup/modules
# Check module search paths
echo $MODULEPATH
Troubleshooting Modules
Common Issues
Module not found:
# Check available modules
module avail | grep -i package_name
# Check if you have access to the module path
ls -la /opt/modules/
Conflicting modules:
# Clear all modules and start fresh
module purge
module load GCC/10.3.0 CUDA/11.3.1
CUDA not found after loading:
# Verify CUDA module is loaded
module list | grep -i cuda
# Check CUDA environment
echo $CUDA_HOME
echo $CUDA_PATH
which nvcc
Python packages not found:
# Ensure Python module is loaded before using pip/conda
module load Python/3.9.5
which python
python --version
# Show detailed module information
module show CUDA/11.3.1
module help CUDA/11.3.1
# See what a module does before loading
module display CUDA/11.3.1
Best Practices
For Interactive Development
- Create a standard environment script
- Use module collections for frequently used combinations
- Load modules before activating conda/venv
For SLURM Jobs
- Always start with module purge
- Load modules explicitly in job scripts
- Verify modules are loaded with module list
- Document module requirements in your scripts
For Reproducibility
Pin module versions in scripts:
module load CUDA/11.3.1 # Not just 'CUDA'
Document module requirements:
# Required modules:
# - GCC/10.3.0
# - CUDA/11.3.1
# - Python/3.9.5
Use environment files for complex setups
Example Workflows
Deep Learning Setup
# Standard deep learning environment
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
# Activate conda environment
conda activate pytorch-env
# Verify setup
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
Development Environment
# Development tools
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
module load git/2.36.0
module load CMake/3.24.3
# Save as collection
module save development
# Later, restore quickly
module restore development
Compilation Environment
# For compiling CUDA code
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
# Compile CUDA program
nvcc -o program program.cu
# For C++ with GPU support
g++ -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart program.cpp -o program
Next Steps
8 - VS Code Remote Development
Set up Visual Studio Code for remote development on the Prometheus cluster
Overview
Visual Studio Code provides excellent remote development capabilities for the Prometheus cluster. You can edit code, run Jupyter notebooks, and debug applications directly on the cluster while using your local VS Code interface.
Prerequisites
- VS Code Desktop installed on your local machine
- Remote-SSH extension for VS Code
- SSH access to Prometheus cluster (see Getting Started)
- Valid cluster account with SSH keys configured
Install Required Extensions
Install these essential VS Code extensions:
- Remote - SSH (ms-vscode-remote.remote-ssh)
- Python (ms-python.python)
- Jupyter (ms-toolsai.jupyter)
- Git integration (built-in)
Optional but recommended:
- Remote - SSH: Editing Configuration Files (ms-vscode-remote.remote-ssh-edit)
- GitLens (eamodio.gitlens)
- Thunder Client for API testing (rangav.vscode-thunder-client)
SSH Configuration
Basic SSH Setup
First, ensure your ~/.ssh/config file is properly configured:
# ~/.ssh/config
Host prometheus
Hostname prometheus.cyens.org.cy
User <your-username>
IdentityFile ~/.ssh/id_rsa
Host *.cluster
User <your-username>
IdentityFile ~/.ssh/prometheus_user_sshd
ProxyJump prometheus
Replace <your-username> with your actual cluster username.
User SSHD Process Setup
To connect VS Code directly to compute nodes, you need to set up a user SSHD process through SLURM.
Step 1: Generate SSH Keys for User SSHD
Connect to Prometheus and create SSH keys for the user SSHD process:
ssh prometheus
ssh-keygen -t rsa -f ~/.ssh/prometheus_user_sshd
This creates:
- Private key: ~/.ssh/prometheus_user_sshd
- Public key: ~/.ssh/prometheus_user_sshd.pub
Step 2: Create SSHD Job Script
Create ~/sshd.sh script for launching the user SSHD process:
#!/bin/bash
#SBATCH -o res_%j.txt # output file
#SBATCH -e res_%j.err # error file
#SBATCH -J sshd # job name
#SBATCH --partition=defq # partition
#SBATCH --qos=normal # priority queue
#SBATCH --ntasks=1 # number of tasks
#SBATCH --cpus-per-task=2 # CPU cores
#SBATCH --gres=gpu:1 # number of GPUs
#SBATCH --mem=1000 # memory in MB
#SBATCH --time=0-04:00 # 4 hours maximum
# Find an available port
PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "********************************************************************"
echo "Starting sshd in Slurm as user"
echo "Environment information:"
echo "Date:" $(date)
echo "Allocated node:" $(hostname)
echo "Path:" $(pwd)
echo "Listening on:" $PORT
echo "********************************************************************"
# Start user SSHD process
/usr/sbin/sshd -D -p ${PORT} -f /dev/null -h ${HOME}/.ssh/prometheus_user_sshd
Make the script executable:
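chmod +x ~/sshd.sh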
Step 3: Submit SSHD Job
Submit the SSHD job to get a compute node:
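sbatch sshd.sh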
Check the job status and get the allocated node and port:
# Check job status
squeue -u $USER
# View the output file to get connection details
cat res_<job_id>.txt
The output will show something like:
Starting sshd in Slurm as user
Date: Thu Jun 5 10:30:00 UTC 2025
Allocated node: gpu02
Listening on: 45672
Connecting VS Code
Method 1: Direct Connection
- Open VS Code on your local machine
- Press F1 or Ctrl/Cmd+Shift+P to open command palette
- Type: “Remote-SSH: Connect to Host”
- Enter: ssh -p 45672 gpu02.cluster (use your actual port and node)
VS Code will automatically update your SSH config file.
Method 2: Manual SSH Config
Add the connection details to your ~/.ssh/config:
Host gpu02.cluster
HostName gpu02.cluster
Port 45672
User <your-username>
IdentityFile ~/.ssh/prometheus_user_sshd
ProxyJump prometheus
Then connect using “Remote-SSH: Connect to Host” → gpu02.cluster
Development Workflow
1. Connect to Compute Node
# Submit SSHD job
sbatch sshd.sh
# Wait for job to start (check with squeue)
squeue -u $USER
# Get connection details
cat res_<job_id>.txt
# Connect VS Code to the allocated node
2. Open Your Project
Once connected to the compute node:
- File → Open Folder
- Navigate to: /lustreFS/data/mygroup/experiments/myproject
- Open the folder
3. Set Up Python Environment
In the VS Code terminal on the remote machine:
# Load required modules
module load CUDA/11.3.1 Python/3.9.5
# Activate your conda environment
conda activate myenv
# Verify GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
- Press Ctrl/Cmd+Shift+P
- Type: “Python: Select Interpreter”
- Choose: Your conda environment interpreter
- Usually: ~/miniconda3/envs/myenv/bin/python
Working with Jupyter Notebooks
Start Jupyter Server
In the VS Code terminal on the remote machine:
# Load modules and activate environment
module load Python/3.9.5
conda activate myenv
# Start Jupyter (no browser needed)
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
Connect VS Code to Jupyter
- Open a .ipynb file in VS Code
- Click “Select Kernel” in the top-right
- Choose “Existing Jupyter Server”
- Enter: http://localhost:8888
- Enter the token from the Jupyter output
Development Best Practices
Resource Management
Request appropriate resources for development:
#SBATCH --cpus-per-task=4 # Not 32 for development
#SBATCH --gres=gpu:1 # Usually sufficient for development
#SBATCH --mem=8000 # 8GB for most development tasks
#SBATCH --time=0-04:00 # 4 hours for development session
Use longer sessions for intensive work:
#SBATCH --time=0-08:00 # 8 hours for longer development
#SBATCH --qos=long # If you need more than 1 day
File Organization
Set up a consistent workspace structure:
# Recommended project structure
/lustreFS/data/mygroup/experiments/myproject/
├── data/ # Datasets and data files
├── notebooks/ # Jupyter notebooks
├── src/ # Source code
├── configs/ # Configuration files
├── scripts/ # Training and utility scripts
├── results/ # Experiment results
└── README.md # Project documentation
Environment Configuration
Create a workspace settings file (.vscode/settings.json):
{
"python.defaultInterpreterPath": "~/miniconda3/envs/myenv/bin/python",
"python.terminal.activateEnvironment": true,
"jupyter.jupyterServerType": "local",
"files.watcherExclude": {
"**/data/**": true,
"**/results/**": true,
"**/.git/**": true
}
}
Git Integration
Configure Git for your project:
# Set up Git credentials
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
# Initialize repository (if new project)
cd /lustreFS/data/mygroup/experiments/myproject
git init
git add .
git commit -m "Initial commit"
Troubleshooting
Connection Issues
“Could not establish connection”:
- Check if SSHD job is running: squeue -u $USER
- Verify node name and port from job output
- Ensure SSH keys are properly configured
“Permission denied”:
- Check SSH key permissions: chmod 600 ~/.ssh/prometheus_user_sshd
- Verify ProxyJump configuration in SSH config
- Test SSH connection manually: ssh -p <port> <node>.cluster
Slow file operations:
- Exclude large directories from VS Code watcher
- Use local storage for frequently accessed files
- Consider using VS Code on the head node for file browsing only
High memory usage:
- Close unused notebooks and files
- Restart VS Code Python extension if needed
- Request more memory in SSHD job if necessary
SSHD Job Management
Job terminated unexpectedly:
- Check job logs: cat res_<job_id>.err
- Resubmit SSHD job: sbatch sshd.sh
- Update VS Code connection with new port/node
Need longer development time:
# Modify sshd.sh for longer sessions
#SBATCH --time=0-08:00 # 8 hours
#SBATCH --qos=long # For >24 hours
Cleanup and Best Practices
End Development Session
When finishing your work:
- Save all files in VS Code
- Close VS Code connection
- Cancel the SSHD job: scancel <job_id>
- Remove SSH config entries added by VS Code (optional)
Security Considerations
- Don’t leave SSHD jobs running when not needed
- Use strong passphrases for SSH keys
- Regularly rotate SSH keys if required by policy
- Monitor your running jobs:
squeue -u $USER
Advanced Configuration
Multiple Concurrent Sessions
You can run multiple SSHD jobs for different projects:
# Submit multiple jobs
sbatch sshd.sh
sbatch sshd.sh
# Connect VS Code to different nodes
# Node 1: gpu01.cluster:45672
# Node 2: gpu03.cluster:45673
Custom SSHD Configuration
Create specialized SSHD scripts for different use cases:
# sshd_gpu4.sh - For multi-GPU development
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=64000
# sshd_a6000.sh - For high-memory development
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --gres=gpu:1
Next Steps
9 - Software Installation
Installing and managing software packages on the Prometheus cluster
Overview
The Prometheus cluster provides several methods for installing and managing software packages. This guide covers both system-wide modules and user-specific installations.
Installation Methods
1. Environment Modules (Recommended)
Use pre-installed software via the module system when available:
module avail python
module load Python/3.9.5
2. Conda/Mamba Package Manager
Install packages in isolated environments:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
3. pip Package Manager
Install Python packages via pip:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
4. Source Installation
Compile software from source when needed:
git clone https://github.com/project/repo.git
cd repo && python setup.py install
Setting Up Python Environments
Conda Installation
If conda is not available, install Miniconda:
# Download Miniconda
cd /lustreFS/data/mygroup/$USER
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install Miniconda
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
# Initialize conda
~/miniconda3/bin/conda init bash
source ~/.bashrc
Create Virtual Environments
# Create a new environment
conda create -n pytorch-env python=3.9
# Activate environment
conda activate pytorch-env
# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install jupyter matplotlib pandas scikit-learn
Environment Management
# List environments
conda env list
# Export environment
conda env export > environment.yml
# Create from file
conda env create -f environment.yml
# Remove environment
conda env remove -n old-env
Deep Learning Frameworks
PyTorch Installation
# Create PyTorch environment
conda create -n pytorch python=3.9
conda activate pytorch
# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"
TensorFlow Installation
# Create TensorFlow environment
conda create -n tensorflow python=3.9
conda activate tensorflow
# Install TensorFlow
pip install tensorflow[and-cuda]
# Verify GPU support
python -c "import tensorflow as tf; print('TensorFlow', tf.__version__, 'GPUs:', len(tf.config.list_physical_devices('GPU')))"
JAX Installation
# Create JAX environment
conda create -n jax python=3.9
conda activate jax
# Install JAX with CUDA support
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
# Verify installation
python -c "import jax; print(f'JAX devices: {jax.devices()}')"
Specialized Libraries
MinkowskiEngine
MinkowskiEngine is an auto-differentiation library for sparse tensors, particularly useful for 3D computer vision tasks.
Installation Steps
Create dedicated environment:
conda create -n py3-mink python=3.8
conda activate py3-mink
Install dependencies:
conda install openblas-devel -c anaconda
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
Load required modules:
module load CUDA/11.3.1 gnu9
Submit interactive job for compilation:
srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
Install MinkowskiEngine:
conda activate py3-mink
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" \
--install-option="--blas=openblas"
Usage Example
import torch
import MinkowskiEngine as ME
# Coordinates and features for five points
coords = torch.IntTensor([[0, 1], [0, 1], [0, 2], [1, 0], [1, 2]])
feats = torch.FloatTensor([[1], [2], [3], [4], [5]])
# Create sparse tensor
sparse_tensor = ME.SparseTensor(feats, coords)
print(f"Sparse tensor shape: {sparse_tensor.shape}")
PointGPT
PointGPT extends GPT concepts to point clouds for 3D understanding tasks.
Installation Steps
Create environment:
conda create -n pointgpt python=3.8
conda activate pointgpt
Install PyTorch and dependencies:
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 tensorboard -c pytorch -c conda-forge
pip install easydict h5py matplotlib open3d opencv-python pyyaml timm tqdm transforms3d termcolor scipy ninja plyfile numpy==1.23.4
pip install setuptools==59.5.0
Load CUDA module:
module load CUDA/11.3.1
Clone PointGPT repository:
cd /lustreFS/data/mygroup/$USER
git clone https://github.com/CGuangyan-BIT/PointGPT.git
cd PointGPT
Submit interactive job for compilation:
srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
Install extensions:
conda activate pointgpt
# Chamfer Distance & EMD
cd ./extensions/chamfer_dist
python setup.py install --user
cd ../emd
python setup.py install --user
cd ../
# PointNet++
pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
# GPU kNN
pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
Computer Vision Libraries
OpenCV Installation
conda activate myenv
conda install opencv -c conda-forge
# Or install from pip
pip install opencv-python opencv-contrib-python
Open3D for 3D Processing
conda activate myenv
pip install open3d
# Test installation
python -c "import open3d as o3d; print(f'Open3D {o3d.__version__}')"
PIL/Pillow for Image Processing
conda install pillow
# or
pip install Pillow
Scientific Computing
NumPy, SciPy, Pandas
conda install numpy scipy pandas matplotlib seaborn
# or
pip install numpy scipy pandas matplotlib seaborn
Jupyter and IPython
conda install jupyter ipython ipykernel
# or
pip install jupyter ipython ipykernel
# Add environment to Jupyter
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
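To see which kernels are registered, or to remove one you no longer need, use the kernelspec command (removing a kernel does not delete the conda environment itself):
# List registered kernels
jupyter kernelspec list
# Remove a kernel registration
jupyter kernelspec uninstall myenv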
Scikit-learn
conda install scikit-learn
# or
pip install scikit-learn
Git and Version Control
# Git is usually available by default
git --version
# Configure Git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
Build Tools
# Install build essentials
conda install cmake make ninja
# For C++ development
conda install gxx_linux-64 gcc_linux-64
Debugging and Profiling
# Install debugging tools
pip install pdb++ ipdb
# Memory profiling
pip install memory_profiler
# Line profiling
pip install line_profiler
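Both profilers are driven from the command line once the functions of interest are decorated with @profile (train.py below is just an illustrative script name):
# Line-by-line timing of @profile-decorated functions
kernprof -l -v train.py
# Line-by-line memory usage of @profile-decorated functions
python -m memory_profiler train.py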
Installation in SLURM Jobs
Interactive Installation
# Submit interactive job for installation
srun --partition=defq --qos=normal --gres=gpu:1 --mem=16000 --time=2:00:00 --pty /bin/bash
# Load modules
module load CUDA/11.3.1 Python/3.9.5
# Activate environment
conda activate myenv
# Install packages
pip install package-name
Batch Installation Script
#!/bin/bash
#SBATCH -J install_packages
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --time=1:00:00
# Load modules
module load Python/3.9.5
# Make `conda activate` available in this non-interactive shell
# (assumes Miniconda installed in ~/miniconda3 as shown earlier)
source ~/miniconda3/etc/profile.d/conda.sh
# Activate environment
conda activate myenv
# Install packages
pip install -r requirements.txt
echo "Installation completed"
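Save the script under any name (install_packages.sh below is illustrative) and submit it like a regular batch job:
sbatch install_packages.sh
# Monitor progress
squeue -u $USER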
Package Management Best Practices
Requirements Files
Create requirements.txt for reproducibility:
torch==1.12.1+cu117
torchvision==0.13.1+cu117
torchaudio==0.12.1+cu117
numpy==1.23.4
pandas==1.5.2
matplotlib==3.6.2
jupyter==1.0.0
Install from requirements (the +cu117 builds come from the PyTorch wheel index, not PyPI):
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu117
Environment Files
Create environment.yml for conda:
name: myproject
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.9
  - pytorch=1.12.1
  - torchvision=0.13.1
  - torchaudio=0.12.1
  - pytorch-cuda=11.7
  - numpy
  - pandas
  - matplotlib
  - jupyter
  - pip
  - pip:
      - some-pip-package
Create environment:
conda env create -f environment.yml
Storage Considerations
Install packages in shared group storage to avoid quota issues:
# Set conda environments path
echo "envs_dirs:
- /lustreFS/data/mygroup/conda/envs" > ~/.condarc
# Set pip cache directory
export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache
echo 'export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache' >> ~/.bashrc
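The echo above overwrites any existing ~/.condarc. If you already have one, conda config can append the same settings non-destructively, and redirecting the package cache to group storage also helps with the 20GB home quota (paths follow the group-storage convention used above):
# Append rather than overwrite ~/.condarc
conda config --add envs_dirs /lustreFS/data/mygroup/conda/envs
conda config --add pkgs_dirs /lustreFS/data/mygroup/conda/pkgs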
Troubleshooting
Common Installation Issues
CUDA compatibility errors:
# Check CUDA version
nvidia-smi
# Install matching PyTorch version
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Memory errors during installation:
# Request more memory for installation
srun --mem=32000 --pty /bin/bash
# Or increase pip timeout
pip install --timeout 1000 package-name
Permission errors:
# Install in user space
pip install --user package-name
# Or check conda environment ownership
ls -la ~/miniconda3/envs/
Network timeouts:
# Use conda-forge channel
conda install -c conda-forge package-name
# Or use pip with retries
pip install --retries 10 package-name
Compilation Issues
Missing compilers:
# Load compiler modules
module load GCC/10.3.0
# Check compiler availability
gcc --version
nvcc --version
Missing headers:
# Install development packages
conda install gxx_linux-64 gcc_linux-64
# For CUDA development
module load CUDA/11.3.1
echo $CUDA_HOME
Environment Conflicts
Package conflicts:
# Create fresh environment
conda create -n clean-env python=3.9
conda activate clean-env
# Install packages one by one
conda install pytorch -c pytorch
Module vs conda conflicts:
# Always load modules before activating conda
module load Python/3.9.5
conda activate myenv
Package Documentation
Keep track of installed packages:
# List conda packages
conda list > conda_packages.txt
# List pip packages
pip freeze > pip_requirements.txt
# Environment information
conda info --envs > environments.txt
Next Steps