Prometheus Cluster Documentation

Complete guide to using the Prometheus deep learning cluster at CYENS

This section contains comprehensive documentation for the Prometheus cluster - a high-performance computing environment for deep learning research at CYENS.

Overview

The Prometheus cluster is a state-of-the-art deep learning computing facility featuring:

  • 64 NVIDIA A5000 GPUs (24GB each) across 8 compute nodes
  • 4 NVIDIA A6000 Ada GPUs (48GB each) on a dedicated node
  • 1.7TB total GPU memory for large-scale model training
  • High-performance Lustre storage with 305TB capacity
  • SLURM job scheduler for efficient resource management

This documentation walks you through cluster access, environment setup, job submission, partitions and queues, storage, and development tooling.

Quick Start

  1. Generate SSH keys and request cluster access
  2. Connect via SSH to prometheus.cyens.org.cy
  3. Submit your first job using SLURM
  4. Set up development environment with modules or containers

Cluster Specifications

Compute Resources

  • 9 compute nodes total
  • GPU nodes gpu[01-08]: 8×A5000 GPUs each (64 total GPUs)
  • GPU node gpu09: 4×A6000 Ada GPUs (48GB VRAM each)
  • 512GB RAM per compute node
  • 32 CPU cores per node (2× AMD EPYC 7313)

Storage

  • Home directories: 20GB SSD per user
  • Shared storage: 30TB Lustre filesystem per group
  • Local storage: 1TB NVMe SSD per compute node

Networking Infrastructure

  • Management Network: Netgear M4300-52G switch with 48×1G ports plus 2×10GBASE-T and 2×SFP+
  • High-Performance Interconnect: Mellanox HDR InfiniBand switch with 40×QSFP56 ports
  • InfiniBand Speed: 200Gb/s HDR connectivity with hybrid copper cables
  • Low Latency: Sub-microsecond messaging for distributed computing workloads

Software Environment

  • Rocky Linux 8.5 operating system
  • SLURM workload manager
  • Lmod environment modules
  • CUDA 11.3+ with deep learning frameworks

Support & Resources

  • System administrators: Contact your MRG leader
  • Documentation: This site and /opt/cluster/docs/
  • Cluster status: Monitor with sinfo and squeue

For detailed instructions, start with the Getting Started guide.

1 - Getting Started

SSH access and initial setup for the Prometheus cluster

Prerequisites

Before accessing the Prometheus cluster, you need:

  • A valid cluster account (contact your MRG leader)
  • SSH client installed on your local machine
  • Basic familiarity with Linux command line

Generate SSH Keys

The Prometheus cluster uses RSA key authentication for secure access. You need to generate a public/private key pair:

Step 1: Create SSH Key Pair

Open your terminal and run:

ssh-keygen -t rsa

Follow the on-screen instructions. This creates:

  • Private key: ~/.ssh/id_rsa (keep this secure!)
  • Public key: ~/.ssh/id_rsa.pub (share this with administrators)

Step 2: Secure Your Private Key

For Linux/Mac users, set proper permissions:

chmod 600 ~/.ssh/id_rsa

Step 3: Request Cluster Access

  1. Send your public key to your MRG leader
  2. Request a Prometheus account
  3. Wait for account confirmation

For additional security, add a passphrase to your key:

ssh-keygen -p -f ~/.ssh/id_rsa

Connect to Prometheus

Configure SSH Client

Create or edit ~/.ssh/config file with the following content:

Host prometheus
  Hostname prometheus.cyens.org.cy
  User <your-username>
  IdentityFile ~/.ssh/id_rsa

Replace <your-username> with your actual cluster username.

Connect via SSH

Once your account is activated, connect using:

ssh prometheus

You should now be logged into the Prometheus head node!

First Login Setup

Check Your Environment

# Check current directory
pwd

# List available partitions
sinfo

# Check your groups
groups

# View your home directory quota
quota -us

Understand the File System

# Your home directory (20GB limit)
ls -la /trinity/home/$USER

# Shared group storage (30TB per group)
ls -la /lustreFS/data/

# Check group quota
lfs quota -gh <group-name> /lustreFS/

Cluster Architecture Overview

The Prometheus cluster consists of:

Head Node

  • Login and job submission point
  • DO NOT run compute jobs here
  • Used for file management and job scheduling

Compute Nodes

  • gpu[01-08]: 8 nodes with A5000 GPUs (8 GPUs each)
  • gpu09: 1 node with A6000 Ada GPUs (4 GPUs)
  • 512GB RAM and 32 CPU cores per node

Storage Systems

  • /trinity/home/: Personal home directories (SSD, 20GB limit)
  • /lustreFS/data/: Shared group storage (305TB Lustre filesystem)
  • Local storage: 1TB NVMe on each compute node

Important Usage Rules

  • Do not run compute jobs on the head node; it is only for login, file management, and job submission
  • Request compute resources through SLURM (interactive srun sessions or sbatch batch jobs)
  • Keep large datasets and results in /lustreFS/data/<group-name>; home directories are limited to 20GB
  • Request only the resources (GPUs, CPUs, memory, and time) your jobs actually need
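
For example, instead of running heavy processes directly on the head node, request an interactive shell on a compute node first (the same command is covered in detail in the Job Submission section):

srun -p defq --qos=normal --gres=gpu:1 --time=1:00:00 --pty /bin/bash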

Checking Cluster Status

# View partition information
sinfo

# Check job queue
squeue

# Check your running jobs
squeue -u $USER

# View detailed node information
scontrol show nodes

Example sinfo output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      6  idle~ gpu[03-08]
defq*        up   infinite      1    mix gpu02
defq*        up   infinite      1   idle gpu01
a6000        up   infinite      1    mix gpu09

Next Steps

Now that you’re connected to Prometheus:

  1. Set up your development environment - See Environment Modules
  2. Learn about partitions and queues - Read Partitions & Queues
  3. Submit your first job - Follow Job Submission
  4. Configure VS Code (optional) - See VS Code Setup

Getting Help

  • Cluster status: Use sinfo and squeue commands
  • Documentation: Check /opt/cluster/docs/ on the cluster
  • Support: Contact your MRG leader
  • System issues: Report to cluster administrators

Common First-Time Issues

“Permission denied (publickey)”

  • Verify your public key was added to the cluster
  • Check your SSH config file syntax
  • Ensure private key permissions are correct (chmod 600)

“Connection refused”

  • Verify the hostname: prometheus.cyens.org.cy
  • Check if you’re connected to the internet
  • Confirm your account is activated

“Quota exceeded”

  • Home directory has a 20GB limit
  • Use group storage in /lustreFS/data/ for large files
  • Check usage with quota -us

2 - Hardware Specifications

Detailed hardware specifications of the Prometheus cluster

Cluster Overview

The Prometheus cluster features a modern architecture optimized for deep learning workloads with high-performance GPUs, abundant memory, and fast storage systems.

Head Node

Management and login node for the cluster

Hardware Configuration

  • Chassis: GIGABYTE R182-Z90-00
  • Motherboard: GIGABYTE MZ92-FS0-00
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads
  • RAM: 16× 32GB Samsung M393A4K40EB3-CWE
  • Total RAM: 512GB DDR4
  • Storage: 2× 1.92TB Intel SSDSC2KB019T8 SSD
  • File System: /trinity/home (400GB allocated)

Purpose

  • SSH login and job submission
  • File management and transfers
  • SLURM job scheduling
  • Not for compute workloads

Compute Nodes

GPU Nodes gpu[01-08] (8 nodes)

Primary compute nodes with NVIDIA A5000 GPUs

Hardware Configuration

  • Chassis: Supermicro AS-4124GS-TNR
  • Motherboard: Supermicro H12DSG-O-CPU
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads per node
  • RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
  • Total RAM: 512GB DDR4 per node
  • Local Storage: 1× 1TB Samsung SSD 980 NVMe

GPU Specifications

  • GPU Model: NVIDIA A5000
  • GPU Count: 8 GPUs per node (64 total across all nodes)
  • GPU Memory: 24GB GDDR6 per GPU
  • CUDA Cores: 8,192 per GPU
  • Tensor Cores: 256 (3rd gen)
  • RT Cores: 64 (2nd gen)
  • Peak Performance: 27.8 TFLOPS FP32 per GPU
  • Memory Bandwidth: 768 GB/s per GPU

Total Resources (gpu[01-08])

  • Total GPUs: 64× NVIDIA A5000
  • Total GPU Memory: 1,536GB (1.5TB)
  • Total CPU Cores: 256 cores / 512 threads
  • Total System RAM: 4TB DDR4

GPU Node gpu09 (1 node)

High-memory GPU node with NVIDIA A6000 Ada

Hardware Configuration

  • Chassis: ASUS RS720A-E11-RS12
  • Motherboard: ASUS KMPP-D32
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads
  • RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
  • Total RAM: 512GB DDR4
  • Local Storage: 1× 1TB Samsung SSD 980 NVMe

GPU Specifications

  • GPU Model: NVIDIA RTX A6000 Ada Generation
  • GPU Count: 4 GPUs
  • GPU Memory: 48GB GDDR6 per GPU
  • CUDA Cores: 18,176 per GPU
  • Tensor Cores: 568 (4th gen)
  • Peak Performance: 91.06 TFLOPS FP32 per GPU
  • Memory Bandwidth: 960 GB/s per GPU

Total Resources (gpu09)

  • Total GPUs: 4× NVIDIA A6000 Ada
  • Total GPU Memory: 192GB
  • Total CPU Cores: 32 cores / 64 threads
  • Total System RAM: 512GB DDR4

Storage Nodes

Storage Architecture

High-performance parallel file system

Hardware Configuration (2 storage nodes)

  • Chassis: Supermicro Super Server
  • Motherboard: Supermicro H12SSL-i
  • CPU: 1× AMD EPYC 7302P (16 cores/32 threads)
  • RAM: 8× 16GB Samsung M393A2K40DB3-CWE
  • Total RAM: 256GB DDR4 per node
  • OS Storage: 2× 240GB Intel SSDSC2KB240G7 SSD
  • Data Storage: 24× 7.68TB Samsung MZILT7T6HALA/007 NVMe SSD

Storage Specifications

  • File System: Lustre parallel file system
  • Mount Point: /lustreFS
  • Raw Capacity: ~184TB per storage node (24× 7.68TB)
  • Total Raw Capacity: ~368TB across both nodes
  • Usable Capacity: ~305TB (after RAID and file system overhead)
  • Performance: High-throughput parallel I/O

Software Environment

Operating System

  • Distribution: Rocky Linux 8.5 (Green Obsidian)
  • Kernel Version: 4.18.0-348.23.1.el8_5.x86_64
  • Architecture: x86_64

Management Software

  • Job Scheduler: SLURM Workload Manager
  • Module System: Lmod (Lua-based Environment Modules)
  • File System: Lustre for parallel storage

Development Tools

  • CUDA Toolkit: 11.3+ with cuDNN
  • Compilers: GCC, Intel, NVCC
  • MPI: OpenMPI, MPICH
  • Python: Multiple versions with conda/pip
  • Deep Learning: PyTorch, TensorFlow, JAX
  • Containers: Singularity/Apptainer support

Network Architecture

Interconnect

  • Compute Network: High-speed Ethernet
  • Storage Network: Dedicated Lustre network
  • Management Network: Separate administrative network

Bandwidth

  • Node-to-Node: High-bandwidth for distributed training
  • Storage Access: Optimized for parallel I/O workloads
  • External Access: Internet connectivity for downloads

Performance Characteristics

Compute Performance

  • Total GPU Performance:
    • A5000 nodes: 1,779 TFLOPS FP32 (64 × 27.8)
    • A6000 node: 364 TFLOPS FP32 (4 × 91.06)
    • Combined: ~2,143 TFLOPS FP32
  • Memory Bandwidth:
    • A5000 total: 49,152 GB/s
    • A6000 total: 3,840 GB/s
    • Combined: ~53TB/s GPU memory bandwidth

Storage Performance

  • Lustre File System: High-throughput parallel I/O
  • Local NVMe: High IOPS for temporary data
  • Home Directories: SSD-backed for fast access

Resource Allocation

Per-Node Resources

  • CPU Cores: 32 physical / 64 logical per node
  • System Memory: 512GB DDR4 per node
  • GPU Memory:
    • A5000 nodes: 192GB per node (8 × 24GB)
    • A6000 node: 192GB (4 × 48GB)
  • Local Storage: 1TB NVMe SSD per compute node

Total Cluster Resources

  • Compute Nodes: 9 total
  • CPU Cores: 288 physical / 576 logical
  • System Memory: 4.5TB DDR4
  • GPUs: 68 total (64 A5000 + 4 A6000 Ada)
  • GPU Memory: 1.728TB total
  • Shared Storage: 305TB Lustre + local NVMe

Use Cases and Workloads

Optimized For

  • Large Language Models: High GPU memory for transformer models
  • Computer Vision: Parallel training on multiple GPUs
  • Distributed Training: Multi-node deep learning
  • High-throughput Computing: Batch processing workflows
  • Interactive Development: Jupyter notebooks and VS Code

Performance Considerations

  • Memory-bound workloads: Benefit from A6000’s 48GB VRAM
  • Compute-intensive tasks: Leverage A5000’s efficiency
  • Data-intensive jobs: Utilize high-performance Lustre storage
  • Multi-GPU training: Scale across nodes with SLURM

3 - Environment Setup

Configure your development environment on the Prometheus cluster

Development Environment Options

The Prometheus cluster supports multiple development environments:

  1. Container-based (Recommended)
  2. Module-based (Traditional HPC)
  3. Custom Python environments

Container-Based Setup

Using Pre-built Containers

The cluster provides optimized containers for common deep learning frameworks:

# List available containers
ls /shared/containers/

# Use PyTorch container
singularity shell --nv /shared/containers/pytorch-gpu.sif

# Use TensorFlow container
singularity shell --nv /shared/containers/tensorflow-gpu.sif

Building Custom Containers

Create a definition file (pytorch-custom.def):

Bootstrap: docker
From: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

%post
    apt-get update && apt-get install -y \
        git \
        vim \
        htop \
        tmux
    
    pip install \
        transformers \
        datasets \
        wandb \
        jupyter \
        matplotlib \
        seaborn

%environment
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    export PYTHONPATH=/opt/code:$PYTHONPATH

%runscript
    exec "$@"

Build the container:

singularity build pytorch-custom.sif pytorch-custom.def

Python Environment Setup

Using Conda

# Load conda module
module load conda

# Create environment
conda create -n myenv python=3.9

# Activate environment
conda activate myenv

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge jupyter matplotlib pandas

Using pip with virtual environments

# Load Python module
module load python/3.9

# Create virtual environment
python -m venv ~/venvs/deeplearning
source ~/venvs/deeplearning/bin/activate

# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install jupyter notebook jupyterlab
pip install transformers datasets wandb

GPU Environment Configuration

Checking GPU Availability

# Check available GPUs
nvidia-smi

# Check CUDA version
nvcc --version

# Test PyTorch GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

Setting GPU Visibility

# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1

# Use all available GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Jupyter Notebook Setup

Local Jupyter on Compute Node

  1. Request an interactive session:

    srun --partition=defq --qos=normal --gres=gpu:1 --time=4:00:00 --pty bash
    
  2. Start Jupyter:

    module load python/3.9
    source ~/venvs/deeplearning/bin/activate
    jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
    
  3. Set up SSH tunnel (from your local machine):

    ssh -L 8888:compute-node:8888 username@prometheus.cyens.org.cy
    

JupyterHub Access

If available, access JupyterHub directly:

https://jupyter.prometheus-cluster.example.com

Development Tools

VS Code Remote Development

  1. Install VS Code with Remote-SSH extension
  2. Configure SSH connection in VS Code
  3. Connect to cluster and open your project folder

tmux for Session Management

# Start new session
tmux new-session -s training

# Detach session (Ctrl+b, then d)
# Reattach session
tmux attach-session -t training

# List sessions
tmux list-sessions

Storage and Data Access

Home Directory Setup

# Create project structure
mkdir -p ~/projects/{experiments,datasets,models,scripts}
mkdir -p ~/logs

Using Shared Storage

# Link shared datasets
ln -s /shared/datasets ~/datasets

# Copy models to your space
cp -r /shared/models/pretrained ~/models/

# Use scratch space for temporary files
export TMPDIR=/scratch/$USER
mkdir -p $TMPDIR

Environment Variables

Create ~/.cluster_env:

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_CACHE_PATH=/scratch/$USER/cuda_cache

# Python settings
export PYTHONPATH=$HOME/projects:$PYTHONPATH
export JUPYTER_CONFIG_DIR=$HOME/.jupyter

# Weights & Biases
export WANDB_DIR=$HOME/logs/wandb
export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache

# Hugging Face
export HF_DATASETS_CACHE=/scratch/$USER/hf_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache

Source it in your .bashrc:

echo 'source ~/.cluster_env' >> ~/.bashrc

Troubleshooting

Common Issues

CUDA out of memory:

# Check which processes are holding GPU memory (resetting a GPU requires admin rights)
nvidia-smi

# Kill your own stale processes if needed
kill <pid>

# Monitor GPU usage
watch -n 1 nvidia-smi

Module not found:

# Check loaded modules
module list

# Reload environment
source ~/.bashrc

Permission denied:

# Check file permissions
ls -la

# Fix permissions
chmod 755 script.py

4 - Job Submission

Submit and manage jobs using SLURM on the Prometheus cluster

SLURM Job Scheduler

The Prometheus cluster uses SLURM (Simple Linux Utility for Resource Management) for job scheduling and resource allocation. SLURM ensures fair resource sharing and efficient cluster utilization.

Interactive Jobs

Interactive jobs are perfect for development, testing, and debugging. Use srun to request resources immediately.

Basic Interactive Session

# Request 1 GPU, 2 CPUs, 1GB RAM for 1 hour
srun -c 2 -n 1 -p defq --qos=normal --mem=1000 --gres=gpu:1 -t 1:00:00 --pty /bin/bash

Interactive Session Parameters

# Request specific resources
srun --partition=defq \
     --qos=normal \
     --cpus-per-task=4 \
     --gres=gpu:2 \
     --mem=20000 \
     --time=04:00:00 \
     --pty /bin/bash

A6000 Interactive Session

# Request A6000 GPU with high memory
srun --partition=a6000 \
     --qos=normal-a6000 \
     --cpus-per-task=8 \
     --gres=gpu:1 \
     --mem=50000 \
     --time=02:00:00 \
     --pty /bin/bash

Batch Jobs

Batch jobs run unattended and are ideal for training models, parameter sweeps, and long-running experiments.

Basic Batch Script

Create a file train_model.slurm:

#!/bin/bash
#SBATCH -o res_%j.txt      # output file
#SBATCH -e res_%j.err      # error file
#SBATCH -J my_training     # job name
#SBATCH --partition=defq   # partition
#SBATCH --qos=normal       # priority queue
#SBATCH --ntasks=1         # number of tasks
#SBATCH --cpus-per-task=4  # CPU cores per task
#SBATCH --gres=gpu:2       # number of GPUs
#SBATCH --mem=32000        # memory in MB
#SBATCH --time=1-12:00     # time limit (1 day, 12 hours)

# Load required modules
module load CUDA/11.3.1
module load Python/3.9.5

# Activate your environment
source ~/anaconda3/bin/activate
conda activate myenv

# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1

# Run your training script
cd /lustreFS/data/mygroup/myproject
python train_model.py --epochs 100 --batch-size 64

Submit the Batch Job

sbatch train_model.slurm

SLURM Parameters Reference

Common SBATCH Directives

Parameter              Description         Example
-J, --job-name         Job name            #SBATCH -J training_job
-o, --output           Output file         #SBATCH -o output_%j.txt
-e, --error            Error file          #SBATCH -e error_%j.err
-p, --partition        Partition           #SBATCH --partition=defq
--qos                  Priority queue      #SBATCH --qos=normal
-n, --ntasks           Number of tasks     #SBATCH --ntasks=1
-c, --cpus-per-task    CPUs per task       #SBATCH --cpus-per-task=8
--gres                 Generic resources   #SBATCH --gres=gpu:4
--mem                  Memory (MB)         #SBATCH --mem=64000
-t, --time             Time limit          #SBATCH --time=2-00:00

Time Format Examples

# 30 minutes
#SBATCH --time=00:30:00

# 4 hours
#SBATCH --time=04:00:00

# 1 day, 12 hours
#SBATCH --time=1-12:00:00

# 7 days (maximum for long queue)
#SBATCH --time=7-00:00:00

GPU Allocation Examples

# Request any available GPU
#SBATCH --gres=gpu:1

# Request multiple GPUs
#SBATCH --gres=gpu:4

# All GPUs on A5000 node (8 GPUs)
#SBATCH --gres=gpu:8

# A6000 GPU specifically
#SBATCH --partition=a6000 --gres=gpu:1

Advanced Job Examples

Multi-GPU Training Script

#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=2-00:00:00
#SBATCH -o logs/multi_gpu_%j.out
#SBATCH -e logs/multi_gpu_%j.err

# Ensure logs directory exists
mkdir -p logs

# Load modules
module load CUDA/11.3.1 Python/3.9.5

# Activate environment
conda activate pytorch-env

# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4

# Run distributed training
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12355 \
    train_distributed.py \
    --config config.yaml \
    --output-dir /lustreFS/data/mygroup/results

Parameter Sweep with Job Arrays

#!/bin/bash
#SBATCH -J param_sweep
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --array=1-20%5      # 20 jobs, max 5 running
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --time=08:00:00
#SBATCH -o logs/sweep_%A_%a.out
#SBATCH -e logs/sweep_%A_%a.err

# Parameter arrays
learning_rates=(0.001 0.01 0.1 0.2)
batch_sizes=(16 32 64 128 256)

# Calculate parameters for this task
lr_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) / ${#batch_sizes[@]} ))
bs_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) % ${#batch_sizes[@]} ))

lr=${learning_rates[$lr_idx]}
bs=${batch_sizes[$bs_idx]}

# Load environment
module load Python/3.9.5
conda activate myenv

# Run experiment
python train.py \
    --learning-rate $lr \
    --batch-size $bs \
    --experiment-name "sweep_${SLURM_ARRAY_TASK_ID}" \
    --output-dir /lustreFS/data/mygroup/sweep_results

A6000 High-Memory Job

#!/bin/bash
#SBATCH -J large_model_training
#SBATCH --partition=a6000
#SBATCH --qos=long-a6000
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:3        # Use 3 of 4 A6000 GPUs
#SBATCH --mem=256000        # 256GB RAM
#SBATCH --time=5-00:00:00   # 5 days
#SBATCH -o logs/large_model_%j.out
#SBATCH -e logs/large_model_%j.err

# Load modules
module load CUDA/11.3.1

# Activate environment with large model libraries
conda activate large-models

# Set memory optimization
export CUDA_VISIBLE_DEVICES=0,1,2
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Train large language model
python train_llm.py \
    --model-size 70B \
    --gradient-checkpointing \
    --fp16 \
    --data-dir /lustreFS/data/mygroup/datasets \
    --output-dir /lustreFS/data/mygroup/llm_checkpoints

Job Management Commands

Monitoring Jobs

# Check job queue
squeue

# Check your jobs only
squeue -u $USER

# Detailed job information
scontrol show job <job_id>

# Job history
sacct -u $USER --starttime=2024-01-01

# Job efficiency statistics
seff <job_id>

Job Control

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# Cancel jobs by name
scancel --name=training_job

# Hold a job (prevent it from running)
scontrol hold <job_id>

# Release a held job
scontrol release <job_id>

Job Information

# Show partition information
sinfo

# Show detailed node information
scontrol show nodes

# Show QoS information
sacctmgr show qos

# Show your job priorities
sprio -u $USER

Resource Monitoring

During Job Execution

# Monitor GPU usage on your job
srun --jobid=<job_id> nvidia-smi

# Check memory usage
srun --jobid=<job_id> free -h

# Monitor CPU usage
srun --jobid=<job_id> htop

Job Performance Analysis

# Job efficiency report
seff <job_id>

# Detailed job accounting
sacct -j <job_id> --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,Elapsed,MaxRSS,MaxVMSize

Best Practices

Resource Allocation

  1. Request appropriate resources:

    # Don't over-allocate
    #SBATCH --cpus-per-task=4   # Not 32 if you only use 4
    #SBATCH --mem=16000         # Not 500000 if you only need 16GB
    
  2. Use job arrays for parameter sweeps:

    #SBATCH --array=1-100%10    # Limit concurrent jobs
    
  3. Choose appropriate partitions:

    • Use defq for most workloads
    • Use a6000 only when you need >24GB GPU memory

Data Management

  1. Use appropriate storage:

    # Large datasets and results
    cd /lustreFS/data/mygroup
    
    # Temporary files during job
    export TMPDIR=/tmp/$SLURM_JOB_ID
    mkdir -p $TMPDIR
    
  2. Clean up after jobs:

    # Add to end of script
    rm -rf /tmp/$SLURM_JOB_ID
    

Debugging

  1. Test interactively first:

    srun -p defq --qos=normal --gres=gpu:1 --time=1:00:00 --pty /bin/bash
    
  2. Use smaller datasets for debugging:

    python train.py --debug --max-samples 1000
    
  3. Check logs regularly:

    tail -f logs/training_12345.out
    

Troubleshooting

Common Issues

Job pending forever:

# Check why job is pending
squeue -u $USER -t PENDING
scontrol show job <job_id> | grep Reason

Out of memory errors:

# Reduce batch size or request more memory
#SBATCH --mem=64000  # Increase memory

CUDA out of memory:

# In your script, add:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Job killed by time limit:

# Request more time or use checkpointing
#SBATCH --time=3-00:00:00

Cannot access files:

# Check file permissions and paths
ls -la /lustreFS/data/mygroup/

5 - Partitions & Queues

Understanding SLURM partitions and priority queues on Prometheus

Overview

The Prometheus cluster has two partitions with different priority queues (QoS) that control resource limits and scheduling priority. All limits are applied per group, and the default time limit is 4 hours for all partitions.

Partition Architecture

defq Partition (Default)

  • Nodes: 8 compute nodes (gpu[01-08])
  • GPU Type: NVIDIA A5000 (24GB VRAM each)
  • Total GPUs: 64 (8 GPUs per node)
  • Default partition: Jobs submitted without specifying partition go here

a6000 Partition

  • Nodes: 1 compute node (gpu09)
  • GPU Type: NVIDIA RTX A6000 Ada Generation (48GB VRAM each)
  • Total GPUs: 4
  • Use case: High-memory GPU workloads

Priority Queues (QoS)

defq Partition Queues

Priority Queue   Time Limit   Max CPUs   Max GPUs   Max RAM   Max Jobs   Priority
normal           1 day        384        48         3TB       30         High
long             7 days       384        48         3TB       20         Medium
preemptive       Infinite     All*       All*       All*      10         Low

a6000 Partition Queues

Priority Queue     Time Limit   Max CPUs   Max GPUs   Max RAM   Max Jobs   Priority
normal-a6000       1 day        48         3          384GB     6          High
long-a6000         7 days       48         3          384GB     4          Medium
preemptive-a6000   Infinite     All*       All*       All*      2          Low

* Preemptive queues can use all available resources but jobs may be automatically terminated when higher-priority jobs need resources.

Queue Selection Guidelines

Use normal or normal-a6000 for:

  • Interactive development and testing
  • Short training runs (< 24 hours)
  • Production jobs that need guaranteed completion
  • Debugging and experimentation

#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=12:00:00

Use long or long-a6000 for:

  • Extended training (1-7 days)
  • Large model training requiring multiple days
  • Parameter sweeps with many iterations
  • Production workloads with longer time requirements

#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --time=3-00:00:00  # 3 days

Use preemptive queues sparingly for:

  • Low-priority background jobs
  • Opportunistic computing when cluster is idle
  • Jobs that can handle interruption (with checkpointing)
  • Testing with unlimited time

#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --requeue          # Automatically resubmit if preempted
#SBATCH --time=7-00:00:00

Choosing the Right Partition

Use defq partition when:

  • Your models fit in 24GB GPU memory
  • You need multiple GPUs (up to 8 per node)
  • Running distributed training across nodes
  • Working with standard deep learning models

Use a6000 partition when:

  • Your models require > 24GB GPU memory
  • Training large language models (70B+ parameters)
  • Working with high-resolution images or long sequences
  • Need maximum GPU memory per device
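
As a rough rule of thumb (an illustrative estimate, not a guarantee), FP32 weights take about 4 bytes per parameter, and training with Adam typically needs roughly 4x that once gradients and optimizer state are included, before counting activations:

# Back-of-the-envelope check for a 1-billion-parameter model
# weights: 1e9 x 4 bytes ~ 4 GB; with gradients + Adam state (~4x) ~ 16 GB
# => close to the 24GB limit of an A5000; larger models are candidates for the a6000 partition
python -c "p=1e9; print(f'{p*4/1e9:.0f} GB weights, ~{p*4*4/1e9:.0f} GB with Adam state')"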

Example Job Submissions

Standard Training Job (defq/normal)

#!/bin/bash
#SBATCH -J standard_training
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem=64000
#SBATCH --time=18:00:00

# Your training code here
python train_model.py --gpus 2

Long Training Job (defq/long)

#!/bin/bash
#SBATCH -J long_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=5-00:00:00  # 5 days

# Long-running training with checkpointing
python train_model.py --gpus 4 --checkpoint-freq 1000

Large Model Training (a6000/normal-a6000)

#!/bin/bash
#SBATCH -J large_model
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2
#SBATCH --mem=256000
#SBATCH --time=20:00:00

# Large model requiring high GPU memory
python train_llm.py --model-size 70B --gpus 2

Preemptive Job with Checkpointing

#!/bin/bash
#SBATCH -J preemptive_job
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --mem=32000
#SBATCH --time=7-00:00:00
#SBATCH --requeue
#SBATCH --signal=SIGUSR1@90  # Signal 90 seconds before termination

# Handle preemption gracefully
trap 'echo "Job preempted, saving checkpoint..."; python save_checkpoint.py' SIGUSR1

python train_model.py --resume-from-checkpoint

Monitoring Queue Status

Check Partition Information

# View all partitions and their status
sinfo

# Example output:
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# defq*        up   infinite      6  idle~ gpu[03-08]
# defq*        up   infinite      1    mix gpu02
# defq*        up   infinite      1   idle gpu01
# a6000        up   infinite      1    mix gpu09

Check Queue Status

# View current job queue
squeue

# View jobs by partition
squeue -p defq
squeue -p a6000

# View jobs by QoS
squeue --qos=normal
squeue --qos=long

Check Your Resource Usage

# View your running jobs
squeue -u $USER

# Check job priorities
sprio -u $USER

# View resource limits
sacctmgr show qos format=Name,MaxWall,MaxTRES,MaxJobs

Resource Planning

Calculate Resource Needs

Before submitting jobs, consider:

  1. GPU Memory Requirements:

    • Small models (< 1B params): A5000 (24GB)
    • Large models (> 10B params): A6000 (48GB)
  2. Training Time Estimates:

    • Quick experiments: normal queue (< 1 day)
    • Full training: long queue (1-7 days)
  3. Number of GPUs:

    • Single GPU: Any node
    • Multi-GPU: Consider node topology
    • Distributed: Multiple nodes in defq
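
For the multi-node case, below is a minimal sketch of a two-node job; it assumes PyTorch's torchrun launcher (PyTorch 1.10+) and that train_distributed.py reads the standard torchrun environment variables, so adapt the script name and resource numbers to your own code:

#!/bin/bash
#SBATCH -J multi_node_sketch
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:8          # 8 GPUs per node, 16 in total
#SBATCH --mem=128000
#SBATCH --time=12:00:00

# Use the first allocated node as the rendezvous point
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun per node; torchrun spawns 8 workers per node
srun torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 \
    train_distributed.py --config config.yaml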

Group Coordination

Since limits are per group:

  1. Communicate with group members
  2. Check current group usage:
    squeue -A your-group-name
    
  3. Plan resource allocation to avoid conflicts

Best Practices

Queue Selection Strategy

  1. Start with normal queues for development
  2. Use long queues only when necessary
  3. Avoid preemptive queues unless jobs can handle interruption
  4. Test on smaller resources before scaling up

Resource Efficiency

  1. Don’t over-allocate resources:

    # Bad: Requesting 8 GPUs for single-GPU code
    #SBATCH --gres=gpu:8
    
    # Good: Request what you actually use
    #SBATCH --gres=gpu:1
    
  2. Use appropriate memory:

    # Calculate actual memory needs
    #SBATCH --mem=32000  # 32GB, not 500GB
    
  3. Estimate time accurately:

    # Add buffer but don't overestimate
    #SBATCH --time=18:00:00  # 18 hours, not 7 days
    

Troubleshooting

Job Stuck in Queue

# Check why job is pending
scontrol show job <job_id> | grep Reason

# Common reasons:
# - Resources: Requesting more than available
# - Priority: Lower priority than other jobs
# - QoSMaxJobsPerUser: Too many jobs running

Resource Limit Exceeded

# Check current group usage
squeue -A your-group

# Reduce resource requests or wait for jobs to complete

Wrong Partition Choice

# Cancel and resubmit with correct partition
scancel <job_id>
# Edit script and resubmit
sbatch corrected_script.slurm

6 - Storage Systems

File systems, quotas, and data management on the Prometheus cluster

Storage Overview

The Prometheus cluster provides multiple storage systems optimized for different use cases, from personal files to high-performance parallel computing workloads.

Storage Architecture

Home Directories (/trinity/home/)

  • Type: SSD-backed storage
  • Mount point: /trinity/home/<username>
  • Quota: 20GB per user
  • Purpose: Personal configuration files, small scripts
  • Backup: Regular backups maintained
  • Performance: Fast random I/O, moderate capacity

Shared Group Storage (/lustreFS/data/)

  • Type: Lustre parallel file system
  • Mount point: /lustreFS/data/<group-name>
  • Quota: 30TB per group (or 20,971,520 files)
  • Purpose: Primary workspace for research data and results
  • Performance: High-throughput parallel I/O
  • Shared: All group members have access

Local Node Storage

  • Type: NVMe SSD (1TB per compute node)
  • Purpose: Temporary files during job execution
  • Access: Only available during allocated jobs
  • Performance: Highest IOPS for temporary data

File System Details

Home Directory Usage

# Check your home directory quota
quota -us

# View home directory contents
ls -la /trinity/home/$USER

# Typical home directory structure
/trinity/home/username/
├── .bashrc                 # Shell configuration
├── .ssh/                   # SSH keys and config
├── .jupyter/               # Jupyter configuration
├── .conda/                 # Conda configuration
├── scripts/                # Small utility scripts
└── .local/                 # Local Python packages

Best practices for home directories:

  • Store only configuration files and small scripts
  • Link to shared storage for data access
  • Use symbolic links to avoid quota issues
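
A minimal sketch of the symbolic-link approach (assuming your group directory is /lustreFS/data/mygroup; adjust the paths to your own group):

# Keep bulky data on the Lustre filesystem, but reach it from your home directory
mkdir -p /lustreFS/data/mygroup/$USER/datasets
ln -s /lustreFS/data/mygroup/$USER/datasets ~/datasets

# The same pattern works for model checkpoints, caches, and results
mkdir -p /lustreFS/data/mygroup/$USER/checkpoints
ln -s /lustreFS/data/mygroup/$USER/checkpoints ~/checkpoints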

Shared Group Storage

The /lustreFS/data/ directory provides high-performance storage for your research work:

# Access your group's shared storage
cd /lustreFS/data/<group-name>

# Check group quota
lfs quota -gh <group-name> /lustreFS/

# Example group directory structure
/lustreFS/data/mygroup/
├── datasets/               # Shared datasets
├── models/                 # Pre-trained and trained models
├── experiments/            # Individual user experiments
│   ├── user1/
│   ├── user2/
│   └── shared/
├── code/                   # Shared code repositories
├── results/                # Experiment results
└── tmp/                    # Temporary files

Quota information:

  • Space limit: 30TB per group
  • File limit: 20,971,520 files per group
  • Shared: All group members can read/write

Local Node Storage

Each compute node has local NVMe storage for temporary files:

# During a SLURM job, use local storage for temporary files
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Example usage in job script
#SBATCH --job-name=training
#SBATCH --gres=gpu:1

# Create temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Copy data to local storage for faster I/O
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz

# Run training with local data
python train.py --data-dir $TMPDIR/dataset

# Copy results back to shared storage
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/

Quota Management

Checking Quotas

# Check your home directory quota
quota -us

# Check group quota on Lustre filesystem
lfs quota -gh <group-name> /lustreFS/

# Example quota output:
# Disk quotas for group mygroup (gid 1001):
#      Filesystem    used   quota   limit   grace   files   quota   limit   grace
#        /lustreFS   15.2T     30T     30T       -  1234567  20971520 20971520   -

Understanding Quota Output

  • used: Current usage
  • quota: Soft limit (warning threshold)
  • limit: Hard limit (cannot exceed)
  • grace: Time allowed to exceed soft quota
  • files: Number of files/inodes used

Managing Quota Issues

When approaching quota limits:

  1. Clean up temporary files:

    # Find large files
    find /lustreFS/data/mygroup -type f -size +1G -ls
    
    # Find old temporary files
    find /lustreFS/data/mygroup -name "*.tmp" -mtime +7 -delete
    
  2. Archive old data:

    # Compress old experiments
    tar -czf old_experiments.tar.gz experiments/2023/
    rm -rf experiments/2023/
    
  3. Use efficient storage:

    # Use compressed formats for datasets
    # Store checkpoints selectively
    # Remove duplicate files
    

Data Management Best Practices

Directory Organization

Organize your group’s shared storage efficiently:

# Recommended structure
/lustreFS/data/mygroup/
├── datasets/
│   ├── imagenet/           # Large shared datasets
│   ├── coco/
│   └── custom/
├── models/
│   ├── pretrained/         # Downloaded pre-trained models
│   └── checkpoints/        # Training checkpoints
├── experiments/
│   ├── user1/
│   │   ├── project_a/
│   │   └── project_b/
│   └── user2/
├── code/
│   ├── shared_utils/       # Shared code libraries
│   └── experiments/        # Experiment code
└── results/
    ├── papers/             # Results for publications
    └── ongoing/            # Current experiment results
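
A one-liner to create this skeleton (a sketch; substitute your actual group name and preferred directory names):

mkdir -p /lustreFS/data/mygroup/{datasets/{imagenet,coco,custom},models/{pretrained,checkpoints},experiments/$USER,code/{shared_utils,experiments},results/{papers,ongoing}}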

File Permissions

Set appropriate permissions for shared access:

# Make directories group-writable
chmod g+w /lustreFS/data/mygroup/datasets/

# Set default permissions for new files
umask 002

# Change group ownership if needed
chgrp -R mygroup /lustreFS/data/mygroup/shared/

Data Transfer

Small Files (< 1GB)

# Copy from local machine using scp
scp dataset.tar.gz prometheus:/lustreFS/data/mygroup/datasets/

# Copy between directories on cluster
cp -r /lustreFS/data/mygroup/datasets/source /lustreFS/data/mygroup/experiments/

Large Files (> 1GB)

# Use rsync for large transfers with progress
rsync -avP large_dataset/ prometheus:/lustreFS/data/mygroup/datasets/

# Parallel compression for large datasets
tar -cf - dataset/ | pigz > dataset.tar.gz

Download Datasets

# Download directly to shared storage
cd /lustreFS/data/mygroup/datasets/
wget https://example.com/large_dataset.tar.gz

# Use aria2 for faster parallel downloads
aria2c -x 8 -s 8 https://example.com/dataset.tar.gz

Backup Strategies

While the cluster provides reliable storage, implement your own backup strategy:

  1. Important results: Copy to external storage
  2. Code: Use git repositories
  3. Large datasets: Document download sources for re-acquisition
  4. Models: Keep important checkpoints on external storage
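
For the first two points, a minimal sketch (backup.example.org and myproject are placeholders; use whatever external storage and repositories your group actually has):

# Mirror important results to external storage
rsync -avP /lustreFS/data/mygroup/results/ user@backup.example.org:/backups/prometheus/results/

# Keep code in a remote git repository rather than only on cluster storage
cd /lustreFS/data/mygroup/code/myproject && git push origin main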

Performance Optimization

Lustre File System Tips

  1. Use parallel I/O for large files:

    # PyTorch DataLoader with multiple workers
    dataloader = DataLoader(dataset, batch_size=64, num_workers=8)
    
  2. Avoid small random writes:

    # Bad: Many small writes
    for i in {1..1000}; do echo $i >> file.txt; done
    
    # Good: Batch writes
    seq 1 1000 > file.txt
    
  3. Use appropriate stripe settings for large files:

    # Set stripe count for large files (> 1GB)
    lfs setstripe -c 4 /lustreFS/data/mygroup/large_dataset/
    

Local Storage Performance

  1. Copy frequently accessed data to local storage during jobs
  2. Use local storage for temporary files and intermediate results
  3. Copy final results back to shared storage

# Example job with local storage optimization
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00

# Set up local temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Copy dataset to local storage
echo "Copying dataset to local storage..."
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz

# Run training with local data (much faster I/O)
python train.py --data-dir $TMPDIR/dataset --output-dir $TMPDIR/results

# Copy results back to shared storage
echo "Copying results back..."
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/

Environment Variables

Set up useful environment variables for data management:

# Add to your ~/.bashrc
export GROUP_DATA="/lustreFS/data/mygroup"
export DATASETS="$GROUP_DATA/datasets"
export MODELS="$GROUP_DATA/models"
export EXPERIMENTS="$GROUP_DATA/experiments/$USER"
export RESULTS="$GROUP_DATA/results"

# Create your experiment directory
mkdir -p $EXPERIMENTS

Common Storage Issues

Quota Exceeded

# Error: "Disk quota exceeded"
# Solution: Check and clean up usage
lfs quota -gh mygroup /lustreFS/
find $GROUP_DATA -type f -size +1G -ls

Permission Denied

# Error: "Permission denied"
# Solution: Check file permissions and group membership
ls -la /lustreFS/data/mygroup/
groups  # Check your group membership

Slow I/O Performance

# Solutions:
# 1. Use local storage for temporary files
# 2. Reduce number of small files
# 3. Use parallel I/O libraries
# 4. Check stripe settings for large files
lfs getstripe /lustreFS/data/mygroup/large_file

File System Full

# Check available space
df -h /lustreFS

# If file system is full, clean up:
# 1. Remove temporary files
# 2. Compress old data
# 3. Archive completed experiments

7 - Environment Modules

Using Lmod environment modules to manage software on Prometheus

Overview

The Prometheus cluster uses Lmod (Lua-based Environment Modules) to manage software packages and their dependencies. This system allows you to easily load and unload different software versions without conflicts.

Module System Basics

Environment modules modify your shell environment to provide access to specific software packages. When you load a module, it typically:

  • Adds software to your PATH
  • Sets environment variables
  • Loads required dependencies
  • Configures library paths
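
You can see exactly what a module will change before loading it, and verify the effect afterwards (Python/3.9.5 is used as an example here):

# Show the environment changes a module makes, without loading it
module show Python/3.9.5

# Load it and confirm the new interpreter is first on PATH
module load Python/3.9.5
which python
python --version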

Basic Module Commands

List Available Modules

# Show all available modules
module available
module avail
module av
ml av

# Search for specific modules
module avail gcc
module avail python
module avail cuda

Load Modules

# Load a specific module
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Short form using 'ml'
ml GCC/10.3.0 CUDA/11.3.1 Python/3.9.5

Check Loaded Modules

# List currently loaded modules
module list
ml list

Unload Modules

# Unload a specific module
module unload GCC/10.3.0

# Unload all modules
module purge
ml purge

Common Software Modules

Compilers

# GNU Compiler Collection
module load GCC/10.3.0
module load GCC/11.2.0

# Intel Compilers (if available)
module load intel/2021.4.0

CUDA and GPU Development

# CUDA Toolkit
module load CUDA/11.3.1
module load CUDA/11.7.0
module load CUDA/12.0.0

# Check CUDA after loading
nvcc --version
nvidia-smi

Python Environments

# Python interpreter
module load Python/3.9.5
module load Python/3.10.8

# Python with scientific libraries
module load Python/3.9.5-GCCcore-10.3.0

Deep Learning Frameworks

# PyTorch (if pre-installed as module)
module load PyTorch/1.12.1-foss-2022a-CUDA-11.7.0

# TensorFlow (if pre-installed as module)
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

Development Tools

# Git (usually available by default)
module load git/2.36.0

# CMake
module load CMake/3.24.3

# HDF5 for data storage
module load HDF5/1.12.2

Module Dependencies

Lmod automatically handles dependencies. When you load a module, it loads required dependencies:

# Loading Python might automatically load GCC
module load Python/3.9.5

# Check what was loaded
module list
# Might show:
# GCCcore/10.3.0
# Python/3.9.5-GCCcore-10.3.0

Setting Up Your Environment

Create a Module Loading Script

Create ~/load_modules.sh for consistent environment setup:

#!/bin/bash
# ~/load_modules.sh - Load standard development environment

# Clear any existing modules
module purge

# Load core development tools
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Optional: Load additional tools
# module load git/2.36.0
# module load CMake/3.24.3

echo "Development environment loaded:"
module list

Make it executable and use it:

chmod +x ~/load_modules.sh
source ~/load_modules.sh

Add to Your Shell Configuration

Add common modules to your ~/.bashrc:

# Add to ~/.bashrc
# Load standard modules at login
if [ -f ~/load_modules.sh ]; then
    source ~/load_modules.sh
fi

SLURM Job Scripts with Modules

Basic Job with Modules

#!/bin/bash
#SBATCH -J module_job
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

# Load required modules
module purge
module load CUDA/11.3.1
module load Python/3.9.5

# Verify modules are loaded
echo "Loaded modules:"
module list

# Check CUDA availability
echo "CUDA version:"
nvcc --version

# Activate your conda environment
source ~/anaconda3/bin/activate
conda activate myenv

# Run your script
python train.py

Multiple GPU Job with Modules

#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --gres=gpu:4
#SBATCH --time=1-00:00:00

# Load modules for CUDA development
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Load MPI for distributed computing (if available)
# module load OpenMPI/4.1.4-GCC-10.3.0

# Set CUDA environment
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Activate environment
conda activate pytorch-gpu

# Run distributed training
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    train_distributed.py

Python Package Management

Using Conda with Modules

# Load Python module
module load Python/3.9.5

# Install conda (if not already available)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3

# Add conda to path
echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Create environment
conda create -n myenv python=3.9
conda activate myenv

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Using pip with Modules

# Load Python module
module load Python/3.9.5

# Create virtual environment
python -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install jupyter numpy pandas matplotlib

Module Collections

Save frequently used module combinations:

# Save current modules as a collection
module save my_collection

# List saved collections
module savelist

# Restore a collection
module restore my_collection

Custom Module Paths

If your group has custom modules:

# Add custom module path
module use /lustreFS/data/mygroup/modules

# Check the module search path
echo $MODULEPATH

Troubleshooting Modules

Common Issues

Module not found:

# Check available modules
module avail | grep -i package_name

# Check if you have access to the module path
ls -la /opt/modules/

Conflicting modules:

# Clear all modules and start fresh
module purge
module load GCC/10.3.0 CUDA/11.3.1

CUDA not found after loading:

# Verify CUDA module is loaded
module list | grep -i cuda

# Check CUDA environment
echo $CUDA_HOME
echo $CUDA_PATH
which nvcc

Python packages not found:

# Ensure Python module is loaded before using pip/conda
module load Python/3.9.5
which python
python --version

Module Information

# Show detailed module information
module show CUDA/11.3.1
module help CUDA/11.3.1

# See what a module does before loading
module display CUDA/11.3.1

Best Practices

For Interactive Development

  1. Create a standard environment script
  2. Use module collections for frequently used combinations
  3. Load modules before activating conda/venv

For SLURM Jobs

  1. Always start with module purge
  2. Load modules explicitly in job scripts
  3. Verify modules are loaded with module list
  4. Document module requirements in your scripts

For Reproducibility

  1. Pin module versions in scripts:

    module load CUDA/11.3.1  # Not just 'CUDA'
    
  2. Document module requirements:

    # Required modules:
    # - GCC/10.3.0
    # - CUDA/11.3.1
    # - Python/3.9.5
    
  3. Use environment files for complex setups
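
For the environment-file approach, a minimal sketch using conda (the environment name myenv is an example):

# Capture the exact package set so others can reproduce it
module load Python/3.9.5
conda activate myenv
conda env export > environment.yml

# Recreate it elsewhere (or after a cleanup)
conda env create -f environment.yml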

Example Workflows

Deep Learning Setup

# Standard deep learning environment
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Activate conda environment
conda activate pytorch-env

# Verify setup
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Development Environment

# Development tools
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
module load git/2.36.0
module load CMake/3.24.3

# Save as collection
module save development

# Later, restore quickly
module restore development

Compilation Environment

# For compiling CUDA code
module purge
module load GCC/10.3.0
module load CUDA/11.3.1

# Compile CUDA program
nvcc -o program program.cu

# For C++ with GPU support
g++ -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart program.cpp -o program

8 - VS Code Remote Development

Set up Visual Studio Code for remote development on the Prometheus cluster

Overview

Visual Studio Code provides excellent remote development capabilities for the Prometheus cluster. You can edit code, run Jupyter notebooks, and debug applications directly on the cluster while using your local VS Code interface.

Prerequisites

  • VS Code Desktop installed on your local machine
  • Remote-SSH extension for VS Code
  • SSH access to Prometheus cluster (see Getting Started)
  • Valid cluster account with SSH keys configured

Install Required Extensions

Install these essential VS Code extensions:

  1. Remote - SSH (ms-vscode-remote.remote-ssh)
  2. Python (ms-python.python)
  3. Jupyter (ms-toolsai.jupyter)
  4. Git integration (built-in)

Optional but recommended:

  • Remote - SSH: Editing Configuration Files (ms-vscode-remote.remote-ssh-edit)
  • GitLens (eamodio.gitlens)
  • Thunder Client for API testing (rangav.vscode-thunder-client)
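
If you prefer the command line (this assumes the code CLI that ships with VS Code is on your PATH), the same extensions can be installed like this:

code --install-extension ms-vscode-remote.remote-ssh
code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter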

SSH Configuration

Basic SSH Setup

First, ensure your ~/.ssh/config file is properly configured:

# ~/.ssh/config
Host prometheus
  Hostname prometheus.cyens.org.cy
  User <your-username>
  IdentityFile ~/.ssh/id_rsa

Host *.cluster
  User <your-username>
  IdentityFile ~/.ssh/prometheus_user_sshd
  ProxyJump prometheus

Replace <your-username> with your actual cluster username.

User SSHD Process Setup

To connect VS Code directly to compute nodes, you need to set up a user SSHD process through SLURM.

Step 1: Generate SSH Keys for User SSHD

Connect to Prometheus and create SSH keys for the user SSHD process:

ssh prometheus
ssh-keygen -t rsa -f ~/.ssh/prometheus_user_sshd

This creates:

  • Private key: ~/.ssh/prometheus_user_sshd
  • Public key: ~/.ssh/prometheus_user_sshd.pub

Step 2: Create SSHD Job Script

Create ~/sshd.sh script for launching the user SSHD process:

#!/bin/bash
#SBATCH -o res_%j.txt      # output file
#SBATCH -e res_%j.err      # error file
#SBATCH -J sshd            # job name
#SBATCH --partition=defq   # partition
#SBATCH --qos=normal       # priority queue
#SBATCH --ntasks=1         # number of tasks
#SBATCH --cpus-per-task=2  # CPU cores
#SBATCH --gres=gpu:1       # number of GPUs
#SBATCH --mem=1000         # memory in MB
#SBATCH --time=0-04:00     # 4 hours maximum

# Find an available port
PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

echo "********************************************************************"
echo "Starting sshd in Slurm as user"
echo "Environment information:"
echo "Date:" $(date)
echo "Allocated node:" $(hostname)
echo "Path:" $(pwd)
echo "Listening on:" $PORT
echo "********************************************************************"

# Start user SSHD process
/usr/sbin/sshd -D -p ${PORT} -f /dev/null -h ${HOME}/.ssh/prometheus_user_sshd

Make the script executable:

chmod +x ~/sshd.sh

Step 3: Submit SSHD Job

Submit the SSHD job to get a compute node:

sbatch sshd.sh

Check the job status and get the allocated node and port:

# Check job status
squeue -u $USER

# View the output file to get connection details
cat res_<job_id>.txt

The output will show something like:

Starting sshd in Slurm as user
Date: Thu Jun 5 10:30:00 UTC 2025
Allocated node: gpu02
Listening on: 45672

Connecting VS Code

Method 1: Direct Connection

  1. Open VS Code on your local machine
  2. Press F1 or Ctrl/Cmd+Shift+P to open command palette
  3. Type: “Remote-SSH: Connect to Host”
  4. Enter: ssh -p 45672 gpu02.cluster (use your actual port and node)

VS Code will automatically update your SSH config file.

Method 2: Manual SSH Config

Add the connection details to your ~/.ssh/config:

Host gpu02.cluster
    HostName gpu02.cluster
    Port 45672
    User <your-username>
    IdentityFile ~/.ssh/prometheus_user_sshd
    ProxyJump prometheus

Then connect using “Remote-SSH: Connect to Host” → gpu02.cluster

Development Workflow

1. Connect to Compute Node

# Submit SSHD job
sbatch sshd.sh

# Wait for job to start (check with squeue)
squeue -u $USER

# Get connection details
cat res_<job_id>.txt

# Connect VS Code to the allocated node

2. Open Your Project

Once connected to the compute node:

  1. File → Open Folder
  2. Navigate to: /lustreFS/data/mygroup/experiments/myproject
  3. Open the folder

3. Set Up Python Environment

In the VS Code terminal on the remote machine:

# Load required modules
module load CUDA/11.3.1 Python/3.9.5

# Activate your conda environment
conda activate myenv

# Verify GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

4. Configure Python Interpreter

  1. Press Ctrl/Cmd+Shift+P
  2. Type: “Python: Select Interpreter”
  3. Choose: Your conda environment interpreter
    • Usually: ~/miniconda3/envs/myenv/bin/python

Working with Jupyter Notebooks

Start Jupyter Server

In the VS Code terminal on the remote machine:

# Load modules and activate environment
module load Python/3.9.5
conda activate myenv

# Start Jupyter (no browser needed)
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

Connect VS Code to Jupyter

  1. Open a .ipynb file in VS Code
  2. Click “Select Kernel” in the top-right
  3. Choose “Existing Jupyter Server”
  4. Enter: http://localhost:8888
  5. Enter the token from the Jupyter output

Development Best Practices

Resource Management

  1. Request appropriate resources for development:

    #SBATCH --cpus-per-task=4  # Not 32 for development
    #SBATCH --gres=gpu:1       # Usually sufficient for development
    #SBATCH --mem=8000         # 8GB for most development tasks
    #SBATCH --time=0-04:00     # 4 hours for development session
    
  2. Use longer sessions for intensive work:

    #SBATCH --time=0-08:00     # 8 hours for longer development
    #SBATCH --qos=long         # If you need more than 1 day
    

File Organization

Set up a consistent workspace structure:

# Recommended project structure
/lustreFS/data/mygroup/experiments/myproject/
├── data/                   # Datasets and data files
├── notebooks/              # Jupyter notebooks
├── src/                    # Source code
├── configs/                # Configuration files
├── scripts/                # Training and utility scripts
├── results/                # Experiment results
└── README.md               # Project documentation

Environment Configuration

Create a workspace settings file (.vscode/settings.json):

{
    "python.defaultInterpreterPath": "~/miniconda3/envs/myenv/bin/python",
    "python.terminal.activateEnvironment": true,
    "jupyter.jupyterServerType": "local",
    "files.watcherExclude": {
        "**/data/**": true,
        "**/results/**": true,
        "**/.git/**": true
    }
}

Git Integration

Configure Git for your project:

# Set up Git credentials
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

# Initialize repository (if new project)
cd /lustreFS/data/mygroup/experiments/myproject
git init
git add .
git commit -m "Initial commit"

Troubleshooting

Connection Issues

“Could not establish connection”:

  1. Check if SSHD job is running: squeue -u $USER
  2. Verify node name and port from job output
  3. Ensure SSH keys are properly configured

“Permission denied”:

  1. Check SSH key permissions: chmod 600 ~/.ssh/prometheus_user_sshd
  2. Verify ProxyJump configuration in SSH config
  3. Test SSH connection manually: ssh -p <port> <node>.cluster

Performance Issues

Slow file operations:

  1. Exclude large directories from VS Code watcher
  2. Use local storage for frequently accessed files
  3. Consider using VS Code on the head node for file browsing only

High memory usage:

  1. Close unused notebooks and files
  2. Restart VS Code Python extension if needed
  3. Request more memory in SSHD job if necessary

SSHD Job Management

Job terminated unexpectedly:

  1. Check job logs: cat res_<job_id>.err
  2. Resubmit SSHD job: sbatch sshd.sh
  3. Update VS Code connection with new port/node

Need longer development time:

# Modify sshd.sh for longer sessions
#SBATCH --time=0-08:00     # 8 hours
#SBATCH --qos=long         # For >24 hours

Cleanup and Best Practices

End Development Session

When finishing your work:

  1. Save all files in VS Code
  2. Close VS Code connection
  3. Cancel the SSHD job:
    scancel <job_id>
    
  4. Remove SSH config entries added by VS Code (optional)

Security Considerations

  1. Don’t leave SSHD jobs running when not needed
  2. Use strong passphrases for SSH keys
  3. Regularly rotate SSH keys if required by policy
  4. Monitor your running jobs: squeue -u $USER
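
To act on the first and last points, you can list and cancel leftover SSHD jobs by name (the job name sshd comes from the -J directive in sshd.sh above):

# Any SSHD jobs still running under your account?
squeue -u $USER --name=sshd

# Cancel them when you are done developing
scancel --name=sshd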

Advanced Configuration

Multiple Concurrent Sessions

You can run multiple SSHD jobs for different projects:

# Submit multiple jobs
sbatch sshd.sh
sbatch sshd.sh

# Connect VS Code to different nodes
# Node 1: gpu01.cluster:45672
# Node 2: gpu03.cluster:45673
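
To check which node each SSHD job landed on before pointing VS Code at it, squeue's format options are handy; the listening port itself is printed in the job's output file, whose exact name depends on your sshd.sh script:

# Show job ID, name, node and elapsed time for your running jobs
squeue -u $USER -o "%.10i %.14j %.12R %.10M"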

Custom SSHD Configuration

Create specialized SSHD scripts for different use cases:

# sshd_gpu4.sh - For multi-GPU development
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=64000

# sshd_a6000.sh - For high-memory development
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --gres=gpu:1

Next Steps

9 - Software Installation

Installing and managing software packages on the Prometheus cluster

Overview

The Prometheus cluster provides several methods for installing and managing software packages. This guide covers both system-wide modules and user-specific installations.

Installation Methods

1. Environment Modules

Use pre-installed software via the module system when available:

module avail python
module load Python/3.9.5
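
Two related commands help keep the module environment tidy:

# Show currently loaded modules
module list

# Unload all modules and start from a clean state
module purge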

2. Conda/Mamba Package Manager

Install packages in isolated environments:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

3. pip Package Manager

Install Python packages via pip:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

4. Source Installation

Compile software from source when needed:

git clone https://github.com/project/repo.git
cd repo && python setup.py install
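
For projects you intend to modify, an editable install is often more convenient than setup.py install (run from inside the cloned repository):

# Editable install: changes to the source take effect without reinstalling
pip install -e .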

Setting Up Python Environments

Conda Installation

If conda is not available, install Miniconda:

# Download Miniconda
cd /lustreFS/data/mygroup/$USER
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install Miniconda
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3

# Initialize conda
~/miniconda3/bin/conda init bash
source ~/.bashrc

Create Virtual Environments

# Create a new environment
conda create -n pytorch-env python=3.9

# Activate environment
conda activate pytorch-env

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install jupyter matplotlib pandas scikit-learn

Environment Management

# List environments
conda env list

# Export environment
conda env export > environment.yml

# Create from file
conda env create -f environment.yml

# Remove environment
conda env remove -n old-env

Deep Learning Frameworks

PyTorch Installation

# Create PyTorch environment
conda create -n pytorch python=3.9
conda activate pytorch

# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

TensorFlow Installation

# Create TensorFlow environment
conda create -n tensorflow python=3.9
conda activate tensorflow

# Install TensorFlow
pip install tensorflow[and-cuda]

# Verify GPU support
python -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__}, GPUs: {len(tf.config.list_physical_devices("GPU"))}')"

JAX Installation

# Create JAX environment
conda create -n jax python=3.9
conda activate jax

# Install JAX with CUDA support
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Verify installation
python -c "import jax; print(f'JAX devices: {jax.devices()}')"

Specialized Libraries

MinkowskiEngine

MinkowskiEngine is an auto-differentiation library for sparse tensors, particularly useful for 3D computer vision tasks.

Installation Steps

  1. Create dedicated environment:

    conda create -n py3-mink python=3.8
    conda activate py3-mink
    
  2. Install dependencies:

    conda install openblas-devel -c anaconda
    conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
    
  3. Load required modules:

    module load CUDA/11.3.1 gnu9
    
  4. Submit interactive job for compilation:

    srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
    
  5. Install MinkowskiEngine:

    conda activate py3-mink
    pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
        --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" \
        --install-option="--blas=openblas"
    

Usage Example

import torch
import MinkowskiEngine as ME

# Coordinates are (batch index, spatial coordinate) pairs
coords = torch.IntTensor([[0, 1], [0, 2], [0, 3], [1, 0], [1, 2]])
# One feature vector per coordinate
feats = torch.FloatTensor([[1], [2], [3], [4], [5]])

# Build the sparse tensor
sparse_tensor = ME.SparseTensor(features=feats, coordinates=coords)
print(f"Sparse tensor features shape: {sparse_tensor.F.shape}")

PointGPT

PointGPT extends GPT concepts to point clouds for 3D understanding tasks.

Installation Steps

  1. Create environment:

    conda create -n pointgpt python=3.8
    conda activate pointgpt
    
  2. Install PyTorch and dependencies:

    conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 tensorboard -c pytorch -c conda-forge
    pip install easydict h5py matplotlib open3d opencv-python pyyaml timm tqdm transforms3d termcolor scipy ninja plyfile numpy==1.23.4
    pip install setuptools==59.5.0
    
  3. Load CUDA module:

    module load CUDA/11.3.1
    
  4. Clone PointGPT repository:

    cd /lustreFS/data/mygroup/$USER
    git clone https://github.com/CGuangyan-BIT/PointGPT.git
    cd PointGPT
    
  5. Submit interactive job for compilation:

    srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
    
  6. Install extensions:

    conda activate pointgpt
    
    # Chamfer Distance & EMD
    cd ./extensions/chamfer_dist
    python setup.py install --user
    cd ../emd
    python setup.py install --user
    cd ../
    
    # PointNet++
    pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
    
    # GPU kNN
    pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
    

Computer Vision Libraries

OpenCV Installation

conda activate myenv
conda install opencv -c conda-forge

# Or install from pip
pip install opencv-python opencv-contrib-python

Open3D for 3D Processing

conda activate myenv
pip install open3d

# Test installation
python -c "import open3d as o3d; print(f'Open3D {o3d.__version__}')"

PIL/Pillow for Image Processing

conda install pillow
# or
pip install Pillow

Scientific Computing

NumPy, SciPy, Pandas

conda install numpy scipy pandas matplotlib seaborn
# or
pip install numpy scipy pandas matplotlib seaborn

Jupyter and IPython

conda install jupyter ipython ipykernel
# or
pip install jupyter ipython ipykernel

# Add environment to Jupyter
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
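
To confirm that the kernel was registered and is visible to Jupyter:

# List all kernels known to Jupyter
jupyter kernelspec list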

Scikit-learn

conda install scikit-learn
# or
pip install scikit-learn

Development Tools

Git and Version Control

# Git is usually available by default
git --version

# Configure Git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Build Tools

# Install build essentials
conda install cmake make ninja

# For C++ development
conda install gxx_linux-64 gcc_linux-64

Debugging Tools

# Install debugging tools
pip install pdbpp ipdb

# Memory profiling
pip install memory_profiler

# Line profiling
pip install line_profiler
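
As a quick illustration of how these profilers are typically run (the script name is a placeholder, and both tools expect the functions of interest to be decorated with @profile):

# Line-by-line memory usage of decorated functions
python -m memory_profiler train.py

# Line-by-line timing with line_profiler
kernprof -l -v train.py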

Installation in SLURM Jobs

Interactive Installation

# Submit interactive job for installation
srun --partition=defq --qos=normal --gres=gpu:1 --mem=16000 --time=2:00:00 --pty /bin/bash

# Load modules
module load CUDA/11.3.1 Python/3.9.5

# Activate environment
conda activate myenv

# Install packages
pip install package-name

Batch Installation Script

#!/bin/bash
#SBATCH -J install_packages
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --time=1:00:00

# Load modules
module load Python/3.9.5

# Make conda available in the non-interactive batch shell, then activate
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# Install packages
pip install -r requirements.txt

echo "Installation completed"

Package Management Best Practices

Requirements Files

Create requirements.txt for reproducibility:

--extra-index-url https://download.pytorch.org/whl/cu117
torch==1.12.1+cu117
torchvision==0.13.1+cu117
torchaudio==0.12.1+cu117
numpy==1.23.4
pandas==1.5.2
matplotlib==3.6.2
jupyter==1.0.0

Install from requirements:

pip install -r requirements.txt

Environment Files

Create environment.yml for conda:

name: myproject
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.9
  - pytorch=1.12.1
  - torchvision=0.13.1
  - torchaudio=0.12.1
  - pytorch-cuda=11.7
  - numpy
  - pandas
  - matplotlib
  - jupyter
  - pip
  - pip:
    - some-pip-package

Create environment:

conda env create -f environment.yml

Storage Considerations

Install packages in shared group storage to avoid quota issues:

# Set conda environments path
echo "envs_dirs:
  - /lustreFS/data/mygroup/conda/envs" > ~/.condarc

# Set pip cache directory
export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache
echo 'export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache' >> ~/.bashrc
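
To verify that the new locations are picked up:

# Conda should list the group directory under envs_dirs
conda config --show envs_dirs

# The pip cache directory should point at group storage
echo $PIP_CACHE_DIR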

Troubleshooting

Common Installation Issues

CUDA compatibility errors:

# Check CUDA version
nvidia-smi

# Install matching PyTorch version
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Memory errors during installation:

# Request more memory for installation
srun --mem=32000 --pty /bin/bash

# Or increase pip timeout
pip install --timeout 1000 package-name

Permission errors:

# Install in user space
pip install --user package-name

# Or check conda environment ownership
ls -la ~/miniconda3/envs/

Network timeouts:

# Use conda-forge channel
conda install -c conda-forge package-name

# Or use pip with retries
pip install --retries 10 package-name

Compilation Issues

Missing compilers:

# Load compiler modules
module load GCC/10.3.0

# Check compiler availability
gcc --version
nvcc --version

Missing headers:

# Install development packages
conda install gxx_linux-64 gcc_linux-64

# For CUDA development
module load CUDA/11.3.1
echo $CUDA_HOME

Environment Conflicts

Package conflicts:

# Create fresh environment
conda create -n clean-env python=3.9
conda activate clean-env

# Install packages one by one
conda install pytorch -c pytorch

Module vs conda conflicts:

# Always load modules before activating conda
module load Python/3.9.5
conda activate myenv

Package Documentation

Keep track of installed packages:

# List conda packages
conda list > conda_packages.txt

# List pip packages
pip freeze > pip_requirements.txt

# Environment information
conda info --envs > environments.txt

Next Steps