Prometheus Cluster Documentation

Complete guide to using the Prometheus deep learning cluster at CYENS

This section contains comprehensive documentation for the Prometheus cluster - a high-performance computing environment for deep learning research at CYENS.

Overview

The Prometheus cluster is a state-of-the-art deep learning computing facility featuring:

  • 64 NVIDIA A5000 GPUs (24GB each) across 8 compute nodes
  • 4 NVIDIA A6000 Ada GPUs (48GB each) on a dedicated node
  • 1.7TB total GPU memory for large-scale model training
  • High-performance Lustre storage with 305TB capacity
  • SLURM job scheduler for efficient resource management

This documentation walks you through cluster access, environment setup, job submission, partitions and queues, storage, and development tooling.

Quick Start

  1. Generate SSH keys and request cluster access
  2. Connect via SSH to prometheus.cyens.org.cy
  3. Submit your first job using SLURM
  4. Set up development environment with modules or containers

Cluster Specifications

Compute Resources

  • 9 compute nodes total
  • GPU nodes gpu[01-08]: 8×A5000 GPUs each (64 total GPUs)
  • GPU node gpu09: 4×A6000 Ada GPUs (48GB VRAM each)
  • 512GB RAM per compute node
  • 32 CPU cores per node (2× AMD EPYC 7313)

Storage

  • Home directories: 20GB SSD per user
  • Shared storage: 30TB Lustre filesystem per group
  • Local storage: 1TB NVMe SSD per compute node

Networking Infrastructure

  • Management Network: Netgear M4300-52G switch with 48×1G ports plus 2×10GBASE-T and 2×SFP+
  • High-Performance Interconnect: Mellanox HDR InfiniBand switch with 40×QSFP56 ports
  • InfiniBand Speed: 200Gb/s HDR connectivity with hybrid copper cables
  • Low Latency: Sub-microsecond messaging for distributed computing workloads

Software Environment

  • Rocky Linux 8.5 operating system
  • SLURM workload manager
  • Lmod environment modules
  • CUDA 11.3+ with deep learning frameworks

Support & Resources

  • System administrators: Contact your MRG leader
  • Documentation: This site and /opt/cluster/docs/
  • Cluster status: Monitor with sinfo and squeue

For detailed instructions, start with the Getting Started guide.

1 - Getting Started

SSH access and initial setup for the Prometheus cluster

Prerequisites

Before accessing the Prometheus cluster, you need:

  • A valid cluster account (contact your MRG leader)
  • SSH client installed on your local machine
  • Basic familiarity with Linux command line

Generate SSH Keys

The Prometheus cluster uses RSA key authentication for secure access. You need to generate a public/private key pair:

Step 1: Create SSH Key Pair

Open your terminal and run:

ssh-keygen -t rsa

Follow the on-screen instructions. This creates:

  • Private key: ~/.ssh/id_rsa (keep this secure!)
  • Public key: ~/.ssh/id_rsa.pub (share this with administrators)

Step 2: Secure Your Private Key

For Linux/Mac users, set proper permissions:

chmod 600 ~/.ssh/id_rsa

Step 3: Request Cluster Access

  1. Send your public key to your MRG leader
  2. Request a Prometheus account
  3. Wait for account confirmation

For additional security, add a passphrase to your key:

ssh-keygen -p -f ~/.ssh/id_rsa

Connect to Prometheus

Configure SSH Client

Create or edit ~/.ssh/config file with the following content:

Host prometheus
  Hostname prometheus.cyens.org.cy
  User <your-username>
  IdentityFile ~/.ssh/id_rsa

Replace <your-username> with your actual cluster username.

Connect via SSH

Once your account is activated, connect using:

ssh prometheus

You should now be logged into the Prometheus head node!

First Login Setup

Check Your Environment

# Check current directory
pwd

# List available partitions
sinfo

# Check your groups
groups

# View your home directory quota
quota -us

Understand the File System

# Your home directory (20GB limit)
ls -la /trinity/home/$USER

# Shared group storage (30TB per group)
ls -la /lustreFS/data/

# Check group quota
lfs quota -gh <group-name> /lustreFS/

Cluster Architecture Overview

The Prometheus cluster consists of:

Head Node

  • Login and job submission point
  • DO NOT run compute jobs here
  • Used for file management and job scheduling

Compute Nodes

  • gpu[01-08]: 8 nodes with A5000 GPUs (8 GPUs each)
  • gpu09: 1 node with A6000 Ada GPUs (4 GPUs)
  • 512GB RAM and 32 CPU cores per node

Storage Systems

  • /trinity/home/: Personal home directories (SSD, 20GB limit)
  • /lustreFS/data/: Shared group storage (305TB Lustre filesystem)
  • Local storage: 1TB NVMe on each compute node

Important Usage Rules

  • Do not run compute jobs on the head node; it is only for login, file management, and job submission
  • Request compute resources through SLURM (interactive srun sessions or sbatch batch jobs)
  • Keep large datasets and results in /lustreFS/data/<group-name>; home directories are limited to 20GB
  • Request only the resources (GPUs, CPUs, memory, and time) your jobs actually need
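
For example, instead of running heavy processes directly on the head node, request an interactive shell on a compute node first (the same command is covered in detail in the Job Submission section):

srun -p defq --qos=normal --gres=gpu:1 --time=1:00:00 --pty /bin/bash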

Checking Cluster Status

# View partition information
sinfo

# Check job queue
squeue

# Check your running jobs
squeue -u $USER

# View detailed node information
scontrol show nodes

Example sinfo output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      6  idle~ gpu[03-08]
defq*        up   infinite      1    mix gpu02
defq*        up   infinite      1   idle gpu01
a6000        up   infinite      1    mix gpu09

Next Steps

Now that you’re connected to Prometheus:

  1. Set up your development environment - See Environment Modules
  2. Learn about partitions and queues - Read Partitions & Queues
  3. Submit your first job - Follow Job Submission
  4. Configure VS Code (optional) - See VS Code Setup

Getting Help

  • Cluster status: Use sinfo and squeue commands
  • Documentation: Check /opt/cluster/docs/ on the cluster
  • Support: Contact your MRG leader
  • System issues: Report to cluster administrators

Common First-Time Issues

“Permission denied (publickey)”

  • Verify your public key was added to the cluster
  • Check your SSH config file syntax
  • Ensure private key permissions are correct (chmod 600)

“Connection refused”

  • Verify the hostname: prometheus.cyens.org.cy
  • Check if you’re connected to the internet
  • Confirm your account is activated

“Quota exceeded”

  • Home directory has a 20GB limit
  • Use group storage in /lustreFS/data/ for large files
  • Check usage with quota -us

2 - Hardware Specifications

Detailed hardware specifications of the Prometheus cluster

Cluster Overview

The Prometheus cluster features a modern architecture optimized for deep learning workloads with high-performance GPUs, abundant memory, and fast storage systems.

Head Node

Management and login node for the cluster

Hardware Configuration

  • Chassis: GIGABYTE R182-Z90-00
  • Motherboard: GIGABYTE MZ92-FS0-00
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads
  • RAM: 16× 32GB Samsung M393A4K40EB3-CWE
  • Total RAM: 512GB DDR4
  • Storage: 2× 1.92TB Intel SSDSC2KB019T8 SSD
  • File System: /trinity/home (400GB allocated)

Purpose

  • SSH login and job submission
  • File management and transfers
  • SLURM job scheduling
  • Not for compute workloads

Compute Nodes

GPU Nodes gpu[01-08] (8 nodes)

Primary compute nodes with NVIDIA A5000 GPUs

Hardware Configuration

  • Chassis: Supermicro AS-4124GS-TNR
  • Motherboard: Supermicro H12DSG-O-CPU
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads per node
  • RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
  • Total RAM: 512GB DDR4 per node
  • Local Storage: 1× 1TB Samsung SSD 980 NVMe

GPU Specifications

  • GPU Model: NVIDIA A5000
  • GPU Count: 8 GPUs per node (64 total across all nodes)
  • GPU Memory: 24GB GDDR6 per GPU
  • CUDA Cores: 8,192 per GPU
  • Tensor Cores: 256 (3rd gen)
  • RT Cores: 64 (2nd gen)
  • Peak Performance: 27.8 TFLOPS FP32 per GPU
  • Memory Bandwidth: 768 GB/s per GPU

Total Resources (gpu[01-08])

  • Total GPUs: 64× NVIDIA A5000
  • Total GPU Memory: 1,536GB (1.5TB)
  • Total CPU Cores: 256 cores / 512 threads
  • Total System RAM: 4TB DDR4

GPU Node gpu09 (1 node)

High-memory GPU node with NVIDIA A6000 Ada

Hardware Configuration

  • Chassis: ASUS RS720A-E11-RS12
  • Motherboard: ASUS KMPP-D32
  • CPU: 2× AMD EPYC 7313 (16 cores/32 threads each)
  • Total CPU Cores: 32 cores / 64 threads
  • RAM: 16× 32GB SK Hynix HMAA4GR7AJR8N-XN
  • Total RAM: 512GB DDR4
  • Local Storage: 1× 1TB Samsung SSD 980 NVMe

GPU Specifications

  • GPU Model: NVIDIA RTX A6000 Ada Generation
  • GPU Count: 4 GPUs
  • GPU Memory: 48GB GDDR6 per GPU
  • CUDA Cores: 18,176 per GPU
  • Tensor Cores: 568 (4th gen)
  • Peak Performance: 91.06 TFLOPS FP32 per GPU
  • Memory Bandwidth: 960 GB/s per GPU

Total Resources (gpu09)

  • Total GPUs: 4× NVIDIA A6000 Ada
  • Total GPU Memory: 192GB
  • Total CPU Cores: 32 cores / 64 threads
  • Total System RAM: 512GB DDR4

Storage Nodes

Storage Architecture

High-performance parallel file system

Hardware Configuration (2 storage nodes)

  • Chassis: Supermicro Super Server
  • Motherboard: Supermicro H12SSL-i
  • CPU: 1× AMD EPYC 7302P (16 cores/32 threads)
  • RAM: 8× 16GB Samsung M393A2K40DB3-CWE
  • Total RAM: 256GB DDR4 per node
  • OS Storage: 2× 240GB Intel SSDSC2KB240G7 SSD
  • Data Storage: 24× 7.68TB Samsung MZILT7T6HALA/007 NVMe SSD

Storage Specifications

  • File System: Lustre parallel file system
  • Mount Point: /lustreFS
  • Raw Capacity: ~184TB per storage node (24× 7.68TB)
  • Total Raw Capacity: ~368TB across both nodes
  • Usable Capacity: ~305TB (after RAID and file system overhead)
  • Performance: High-throughput parallel I/O

Software Environment

Operating System

  • Distribution: Rocky Linux 8.5 (Green Obsidian)
  • Kernel Version: 4.18.0-348.23.1.el8_5.x86_64
  • Architecture: x86_64

Management Software

  • Job Scheduler: SLURM Workload Manager
  • Module System: Lmod (Lua-based Environment Modules)
  • File System: Lustre for parallel storage

Development Tools

  • CUDA Toolkit: 11.3+ with cuDNN
  • Compilers: GCC, Intel, NVCC
  • MPI: OpenMPI, MPICH
  • Python: Multiple versions with conda/pip
  • Deep Learning: PyTorch, TensorFlow, JAX
  • Containers: Singularity/Apptainer support

Network Architecture

Interconnect

  • Compute Network: High-speed Ethernet
  • Storage Network: Dedicated Lustre network
  • Management Network: Separate administrative network

Bandwidth

  • Node-to-Node: High-bandwidth for distributed training
  • Storage Access: Optimized for parallel I/O workloads
  • External Access: Internet connectivity for downloads

Performance Characteristics

Compute Performance

  • Total GPU Performance:
    • A5000 nodes: 1,779 TFLOPS FP32 (64 × 27.8)
    • A6000 node: 364 TFLOPS FP32 (4 × 91.06)
    • Combined: ~2,143 TFLOPS FP32
  • Memory Bandwidth:
    • A5000 total: 49,152 GB/s
    • A6000 total: 3,840 GB/s
    • Combined: ~53TB/s GPU memory bandwidth

Storage Performance

  • Lustre File System: High-throughput parallel I/O
  • Local NVMe: High IOPS for temporary data
  • Home Directories: SSD-backed for fast access

Resource Allocation

Per-Node Resources

  • CPU Cores: 32 physical / 64 logical per node
  • System Memory: 512GB DDR4 per node
  • GPU Memory:
    • A5000 nodes: 192GB per node (8 × 24GB)
    • A6000 node: 192GB (4 × 48GB)
  • Local Storage: 1TB NVMe SSD per compute node

Total Cluster Resources

  • Compute Nodes: 9 total
  • CPU Cores: 288 physical / 576 logical
  • System Memory: 4.5TB DDR4
  • GPUs: 68 total (64 A5000 + 4 A6000 Ada)
  • GPU Memory: 1.728TB total
  • Shared Storage: 305TB Lustre + local NVMe

Use Cases and Workloads

Optimized For

  • Large Language Models: High GPU memory for transformer models
  • Computer Vision: Parallel training on multiple GPUs
  • Distributed Training: Multi-node deep learning
  • High-throughput Computing: Batch processing workflows
  • Interactive Development: Jupyter notebooks and VS Code

Performance Considerations

  • Memory-bound workloads: Benefit from A6000’s 48GB VRAM
  • Compute-intensive tasks: Leverage A5000’s efficiency
  • Data-intensive jobs: Utilize high-performance Lustre storage
  • Multi-GPU training: Scale across nodes with SLURM

3 - Environment Setup

Configure your development environment on the Prometheus cluster

Development Environment Options

The Prometheus cluster supports multiple development environments:

  1. Container-based (Recommended)
  2. Module-based (Traditional HPC)
  3. Custom Python environments

Container-Based Setup

Using Pre-built Containers

The cluster provides optimized containers for common deep learning frameworks:

# List available containers
ls /shared/containers/

# Use PyTorch container
singularity shell --nv /shared/containers/pytorch-gpu.sif

# Use TensorFlow container
singularity shell --nv /shared/containers/tensorflow-gpu.sif

Building Custom Containers

Create a definition file (pytorch-custom.def):

Bootstrap: docker
From: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

%post
    apt-get update && apt-get install -y \
        git \
        vim \
        htop \
        tmux
    
    pip install \
        transformers \
        datasets \
        wandb \
        jupyter \
        matplotlib \
        seaborn

%environment
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    export PYTHONPATH=/opt/code:$PYTHONPATH

%runscript
    exec "$@"

Build the container:

singularity build pytorch-custom.sif pytorch-custom.def

Python Environment Setup

Using Conda

# Load conda module
module load conda

# Create environment
conda create -n myenv python=3.9

# Activate environment
conda activate myenv

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c conda-forge jupyter matplotlib pandas

Using pip with virtual environments

# Load Python module
module load python/3.9

# Create virtual environment
python -m venv ~/venvs/deeplearning
source ~/venvs/deeplearning/bin/activate

# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install jupyter notebook jupyterlab
pip install transformers datasets wandb

GPU Environment Configuration

Checking GPU Availability

# Check available GPUs
nvidia-smi

# Check CUDA version
nvcc --version

# Test PyTorch GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

Setting GPU Visibility

# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,1

# Use all available GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Jupyter Notebook Setup

Local Jupyter on Compute Node

  1. Request an interactive session:

    srun --partition=defq --qos=normal --gres=gpu:1 --time=4:00:00 --pty bash
    
  2. Start Jupyter:

    module load python/3.9
    source ~/venvs/deeplearning/bin/activate
    jupyter notebook --no-browser --port=8888 --ip=0.0.0.0
    
  3. Set up SSH tunnel (from your local machine):

    ssh -L 8888:compute-node:8888 username@prometheus.cyens.org.cy
    

JupyterHub Access

If available, access JupyterHub directly:

https://jupyter.prometheus-cluster.example.com

Development Tools

VS Code Remote Development

  1. Install VS Code with Remote-SSH extension
  2. Configure SSH connection in VS Code
  3. Connect to cluster and open your project folder

tmux for Session Management

# Start new session
tmux new-session -s training

# Detach session (Ctrl+b, then d)
# Reattach session
tmux attach-session -t training

# List sessions
tmux list-sessions

Storage and Data Access

Home Directory Setup

# Create project structure
mkdir -p ~/projects/{experiments,datasets,models,scripts}
mkdir -p ~/logs

Using Shared Storage

# Link shared datasets
ln -s /shared/datasets ~/datasets

# Copy models to your space
cp -r /shared/models/pretrained ~/models/

# Use scratch space for temporary files
export TMPDIR=/scratch/$USER
mkdir -p $TMPDIR

Environment Variables

Create ~/.cluster_env:

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_CACHE_PATH=/scratch/$USER/cuda_cache

# Python settings
export PYTHONPATH=$HOME/projects:$PYTHONPATH
export JUPYTER_CONFIG_DIR=$HOME/.jupyter

# Weights & Biases
export WANDB_DIR=$HOME/logs/wandb
export WANDB_CACHE_DIR=/scratch/$USER/wandb_cache

# Hugging Face
export HF_DATASETS_CACHE=/scratch/$USER/hf_cache
export TRANSFORMERS_CACHE=/scratch/$USER/transformers_cache

Source it in your .bashrc:

echo 'source ~/.cluster_env' >> ~/.bashrc

Troubleshooting

Common Issues

CUDA out of memory:

# Check which processes are holding GPU memory (resetting a GPU requires admin rights)
nvidia-smi

# Kill your own stale processes if needed
kill <pid>

# Monitor GPU usage
watch -n 1 nvidia-smi

Module not found:

# Check loaded modules
module list

# Reload environment
source ~/.bashrc

Permission denied:

# Check file permissions
ls -la

# Fix permissions
chmod 755 script.py

4 - Job Submission

Submit and manage jobs using SLURM on the Prometheus cluster

SLURM Job Scheduler

The Prometheus cluster uses SLURM (Simple Linux Utility for Resource Management) for job scheduling and resource allocation. SLURM ensures fair resource sharing and efficient cluster utilization.

Interactive Jobs

Interactive jobs are perfect for development, testing, and debugging. Use srun to request resources immediately.

Basic Interactive Session

# Request 1 GPU, 2 CPUs, 1GB RAM for 1 hour
srun -c 2 -n 1 -p defq --qos=normal --mem=1000 --gres=gpu:1 -t 1:00:00 --pty /bin/bash

Interactive Session Parameters

# Request specific resources
srun --partition=defq \
     --qos=normal \
     --cpus-per-task=4 \
     --gres=gpu:2 \
     --mem=20000 \
     --time=04:00:00 \
     --pty /bin/bash

A6000 Interactive Session

# Request A6000 GPU with high memory
srun --partition=a6000 \
     --qos=normal-a6000 \
     --cpus-per-task=8 \
     --gres=gpu:1 \
     --mem=50000 \
     --time=02:00:00 \
     --pty /bin/bash

Batch Jobs

Batch jobs run unattended and are ideal for training models, parameter sweeps, and long-running experiments.

Basic Batch Script

Create a file train_model.slurm:

#!/bin/bash
#SBATCH -o res_%j.txt      # output file
#SBATCH -e res_%j.err      # error file
#SBATCH -J my_training     # job name
#SBATCH --partition=defq   # partition
#SBATCH --qos=normal       # priority queue
#SBATCH --ntasks=1         # number of tasks
#SBATCH --cpus-per-task=4  # CPU cores per task
#SBATCH --gres=gpu:2       # number of GPUs
#SBATCH --mem=32000        # memory in MB
#SBATCH --time=1-12:00     # time limit (1 day, 12 hours)

# Load required modules
module load CUDA/11.3.1
module load Python/3.9.5

# Activate your environment
source ~/anaconda3/bin/activate
conda activate myenv

# Set CUDA devices
export CUDA_VISIBLE_DEVICES=0,1

# Run your training script
cd /lustreFS/data/mygroup/myproject
python train_model.py --epochs 100 --batch-size 64

Submit the Batch Job

sbatch train_model.slurm

SLURM Parameters Reference

Common SBATCH Directives

Parameter              Description         Example
-J, --job-name         Job name            #SBATCH -J training_job
-o, --output           Output file         #SBATCH -o output_%j.txt
-e, --error            Error file          #SBATCH -e error_%j.err
-p, --partition        Partition           #SBATCH --partition=defq
--qos                  Priority queue      #SBATCH --qos=normal
-n, --ntasks           Number of tasks     #SBATCH --ntasks=1
-c, --cpus-per-task    CPUs per task       #SBATCH --cpus-per-task=8
--gres                 Generic resources   #SBATCH --gres=gpu:4
--mem                  Memory (MB)         #SBATCH --mem=64000
-t, --time             Time limit          #SBATCH --time=2-00:00

Time Format Examples

# 30 minutes
#SBATCH --time=00:30:00

# 4 hours
#SBATCH --time=04:00:00

# 1 day, 12 hours
#SBATCH --time=1-12:00:00

# 7 days (maximum for long queue)
#SBATCH --time=7-00:00:00

GPU Allocation Examples

# Request any available GPU
#SBATCH --gres=gpu:1

# Request multiple GPUs
#SBATCH --gres=gpu:4

# All GPUs on A5000 node (8 GPUs)
#SBATCH --gres=gpu:8

# A6000 GPU specifically
#SBATCH --partition=a6000 --gres=gpu:1

Advanced Job Examples

Multi-GPU Training Script

#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=2-00:00:00
#SBATCH -o logs/multi_gpu_%j.out
#SBATCH -e logs/multi_gpu_%j.err

# Ensure logs directory exists
mkdir -p logs

# Load modules
module load CUDA/11.3.1 Python/3.9.5

# Activate environment
conda activate pytorch-env

# Set environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=4

# Run distributed training
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --master_port=12355 \
    train_distributed.py \
    --config config.yaml \
    --output-dir /lustreFS/data/mygroup/results

Parameter Sweep with Job Arrays

#!/bin/bash
#SBATCH -J param_sweep
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --array=1-20%5      # 20 jobs, max 5 running
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:1
#SBATCH --mem=16000
#SBATCH --time=08:00:00
#SBATCH -o logs/sweep_%A_%a.out
#SBATCH -e logs/sweep_%A_%a.err

# Parameter arrays
learning_rates=(0.001 0.01 0.1 0.2)
batch_sizes=(16 32 64 128 256)

# Calculate parameters for this task
lr_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) / ${#batch_sizes[@]} ))
bs_idx=$(( ($SLURM_ARRAY_TASK_ID - 1) % ${#batch_sizes[@]} ))

lr=${learning_rates[$lr_idx]}
bs=${batch_sizes[$bs_idx]}

# Load environment
module load Python/3.9.5
conda activate myenv

# Run experiment
python train.py \
    --learning-rate $lr \
    --batch-size $bs \
    --experiment-name "sweep_${SLURM_ARRAY_TASK_ID}" \
    --output-dir /lustreFS/data/mygroup/sweep_results

A6000 High-Memory Job

#!/bin/bash
#SBATCH -J large_model_training
#SBATCH --partition=a6000
#SBATCH --qos=long-a6000
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:3        # Use 3 of 4 A6000 GPUs
#SBATCH --mem=256000        # 256GB RAM
#SBATCH --time=5-00:00:00   # 5 days
#SBATCH -o logs/large_model_%j.out
#SBATCH -e logs/large_model_%j.err

# Load modules
module load CUDA/11.3.1

# Activate environment with large model libraries
conda activate large-models

# Set memory optimization
export CUDA_VISIBLE_DEVICES=0,1,2
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

# Train large language model
python train_llm.py \
    --model-size 70B \
    --gradient-checkpointing \
    --fp16 \
    --data-dir /lustreFS/data/mygroup/datasets \
    --output-dir /lustreFS/data/mygroup/llm_checkpoints

Job Management Commands

Monitoring Jobs

# Check job queue
squeue

# Check your jobs only
squeue -u $USER

# Detailed job information
scontrol show job <job_id>

# Job history
sacct -u $USER --starttime=2024-01-01

# Job efficiency statistics
seff <job_id>

Job Control

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# Cancel jobs by name
scancel --name=training_job

# Hold a job (prevent it from running)
scontrol hold <job_id>

# Release a held job
scontrol release <job_id>

Job Information

# Show partition information
sinfo

# Show detailed node information
scontrol show nodes

# Show QoS information
sacctmgr show qos

# Show your job priorities
sprio -u $USER

Resource Monitoring

During Job Execution

# Monitor GPU usage on your job
srun --jobid=<job_id> nvidia-smi

# Check memory usage
srun --jobid=<job_id> free -h

# Monitor CPU usage
srun --jobid=<job_id> htop

Job Performance Analysis

# Job efficiency report
seff <job_id>

# Detailed job accounting
sacct -j <job_id> --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Start,End,Elapsed,MaxRSS,MaxVMSize

Best Practices

Resource Allocation

  1. Request appropriate resources:

    # Don't over-allocate
    #SBATCH --cpus-per-task=4   # Not 32 if you only use 4
    #SBATCH --mem=16000         # Not 500000 if you only need 16GB
    
  2. Use job arrays for parameter sweeps:

    #SBATCH --array=1-100%10    # Limit concurrent jobs
    
  3. Choose appropriate partitions:

    • Use defq for most workloads
    • Use a6000 only when you need >24GB GPU memory

Data Management

  1. Use appropriate storage:

    # Large datasets and results
    cd /lustreFS/data/mygroup
    
    # Temporary files during job
    export TMPDIR=/tmp/$SLURM_JOB_ID
    mkdir -p $TMPDIR
    
  2. Clean up after jobs:

    # Add to end of script
    rm -rf /tmp/$SLURM_JOB_ID
    

Debugging

  1. Test interactively first:

    srun -p defq --qos=normal --gres=gpu:1 --time=1:00:00 --pty /bin/bash
    
  2. Use smaller datasets for debugging:

    python train.py --debug --max-samples 1000
    
  3. Check logs regularly:

    tail -f logs/training_12345.out
    

Troubleshooting

Common Issues

Job pending forever:

# Check why job is pending
squeue -u $USER -t PENDING
scontrol show job <job_id> | grep Reason

Out of memory errors:

# Reduce batch size or request more memory
#SBATCH --mem=64000  # Increase memory

CUDA out of memory:

# In your script, add:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Job killed by time limit:

# Request more time or use checkpointing
#SBATCH --time=3-00:00:00

Cannot access files:

# Check file permissions and paths
ls -la /lustreFS/data/mygroup/

5 - Partitions & Queues

Understanding SLURM partitions and priority queues on Prometheus

Overview

The Prometheus cluster has two partitions with different priority queues (QoS) that control resource limits and scheduling priority. All limits are applied per group, and the default time limit is 4 hours for all partitions.

Partition Architecture

defq Partition (Default)

  • Nodes: 8 compute nodes (gpu[01-08])
  • GPU Type: NVIDIA A5000 (24GB VRAM each)
  • Total GPUs: 64 (8 GPUs per node)
  • Default partition: Jobs submitted without specifying partition go here

a6000 Partition

  • Nodes: 1 compute node (gpu09)
  • GPU Type: NVIDIA RTX A6000 Ada Generation (48GB VRAM each)
  • Total GPUs: 4
  • Use case: High-memory GPU workloads

Priority Queues (QoS)

defq Partition Queues

Priority Queue   Time Limit   Max CPUs   Max GPUs   Max RAM   Max Jobs   Priority
normal           1 day        384        48         3TB       30         High
long             7 days       384        48         3TB       20         Medium
preemptive       Infinite     All*       All*       All*      10         Low

a6000 Partition Queues

Priority Queue     Time Limit   Max CPUs   Max GPUs   Max RAM   Max Jobs   Priority
normal-a6000       1 day        48         3          384GB     6          High
long-a6000         7 days       48         3          384GB     4          Medium
preemptive-a6000   Infinite     All*       All*       All*      2          Low

* Preemptive queues can use all available resources but jobs may be automatically terminated when higher-priority jobs need resources.

Queue Selection Guidelines

Use normal or normal-a6000 for:

  • Interactive development and testing
  • Short training runs (< 24 hours)
  • Production jobs that need guaranteed completion
  • Debugging and experimentation

#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --time=12:00:00

Use long or long-a6000 for:

  • Extended training (1-7 days)
  • Large model training requiring multiple days
  • Parameter sweeps with many iterations
  • Production workloads with longer time requirements

#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --time=3-00:00:00  # 3 days

Use preemptive queues sparingly for:

  • Low-priority background jobs
  • Opportunistic computing when cluster is idle
  • Jobs that can handle interruption (with checkpointing)
  • Testing with unlimited time

#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --requeue          # Automatically resubmit if preempted
#SBATCH --time=7-00:00:00

Choosing the Right Partition

Use defq partition when:

  • Your models fit in 24GB GPU memory
  • You need multiple GPUs (up to 8 per node)
  • Running distributed training across nodes
  • Working with standard deep learning models

Use a6000 partition when:

  • Your models require > 24GB GPU memory
  • Training large language models (70B+ parameters)
  • Working with high-resolution images or long sequences
  • Need maximum GPU memory per device
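
As a rough rule of thumb (an illustrative estimate, not a guarantee), FP32 weights take about 4 bytes per parameter, and training with Adam typically needs roughly 4x that once gradients and optimizer state are included, before counting activations:

# Back-of-the-envelope check for a 1-billion-parameter model
# weights: 1e9 x 4 bytes ~ 4 GB; with gradients + Adam state (~4x) ~ 16 GB
# => close to the 24GB limit of an A5000; larger models are candidates for the a6000 partition
python -c "p=1e9; print(f'{p*4/1e9:.0f} GB weights, ~{p*4*4/1e9:.0f} GB with Adam state')"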

Example Job Submissions

Standard Training Job (defq/normal)

#!/bin/bash
#SBATCH -J standard_training
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2
#SBATCH --mem=64000
#SBATCH --time=18:00:00

# Your training code here
python train_model.py --gpus 2

Long Training Job (defq/long)

#!/bin/bash
#SBATCH -J long_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --mem=128000
#SBATCH --time=5-00:00:00  # 5 days

# Long-running training with checkpointing
python train_model.py --gpus 4 --checkpoint-freq 1000

Large Model Training (a6000/normal-a6000)

#!/bin/bash
#SBATCH -J large_model
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2
#SBATCH --mem=256000
#SBATCH --time=20:00:00

# Large model requiring high GPU memory
python train_llm.py --model-size 70B --gpus 2

Preemptive Job with Checkpointing

#!/bin/bash
#SBATCH -J preemptive_job
#SBATCH --partition=defq
#SBATCH --qos=preemptive
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --mem=32000
#SBATCH --time=7-00:00:00
#SBATCH --requeue
#SBATCH --signal=SIGUSR1@90  # Signal 90 seconds before termination

# Handle preemption gracefully
trap 'echo "Job preempted, saving checkpoint..."; python save_checkpoint.py' SIGUSR1

python train_model.py --resume-from-checkpoint

Monitoring Queue Status

Check Partition Information

# View all partitions and their status
sinfo

# Example output:
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# defq*        up   infinite      6  idle~ gpu[03-08]
# defq*        up   infinite      1    mix gpu02
# defq*        up   infinite      1   idle gpu01
# a6000        up   infinite      1    mix gpu09

Check Queue Status

# View current job queue
squeue

# View jobs by partition
squeue -p defq
squeue -p a6000

# View jobs by QoS
squeue --qos=normal
squeue --qos=long

Check Your Resource Usage

# View your running jobs
squeue -u $USER

# Check job priorities
sprio -u $USER

# View resource limits
sacctmgr show qos format=Name,MaxWall,MaxTRES,MaxJobs

Resource Planning

Calculate Resource Needs

Before submitting jobs, consider:

  1. GPU Memory Requirements:

    • Small models (< 1B params): A5000 (24GB)
    • Large models (> 10B params): A6000 (48GB)
  2. Training Time Estimates:

    • Quick experiments: normal queue (< 1 day)
    • Full training: long queue (1-7 days)
  3. Number of GPUs:

    • Single GPU: Any node
    • Multi-GPU: Consider node topology
    • Distributed: Multiple nodes in defq
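
For the multi-node case, below is a minimal sketch of a two-node job; it assumes PyTorch's torchrun launcher (PyTorch 1.10+) and that train_distributed.py reads the standard torchrun environment variables, so adapt the script name and resource numbers to your own code:

#!/bin/bash
#SBATCH -J multi_node_sketch
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:8          # 8 GPUs per node, 16 in total
#SBATCH --mem=128000
#SBATCH --time=12:00:00

# Use the first allocated node as the rendezvous point
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun per node; torchrun spawns 8 workers per node
srun torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:29500 \
    train_distributed.py --config config.yaml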

Group Coordination

Since limits are per group:

  1. Communicate with group members
  2. Check current group usage:
    squeue -A your-group-name
    
  3. Plan resource allocation to avoid conflicts

Best Practices

Queue Selection Strategy

  1. Start with normal queues for development
  2. Use long queues only when necessary
  3. Avoid preemptive queues unless jobs can handle interruption
  4. Test on smaller resources before scaling up

Resource Efficiency

  1. Don’t over-allocate resources:

    # Bad: Requesting 8 GPUs for single-GPU code
    #SBATCH --gres=gpu:8
    
    # Good: Request what you actually use
    #SBATCH --gres=gpu:1
    
  2. Use appropriate memory:

    # Calculate actual memory needs
    #SBATCH --mem=32000  # 32GB, not 500GB
    
  3. Estimate time accurately:

    # Add buffer but don't overestimate
    #SBATCH --time=18:00:00  # 18 hours, not 7 days
    

Troubleshooting

Job Stuck in Queue

# Check why job is pending
scontrol show job <job_id> | grep Reason

# Common reasons:
# - Resources: Requesting more than available
# - Priority: Lower priority than other jobs
# - QoSMaxJobsPerUser: Too many jobs running

Resource Limit Exceeded

# Check current group usage
squeue -A your-group

# Reduce resource requests or wait for jobs to complete

Wrong Partition Choice

# Cancel and resubmit with correct partition
scancel <job_id>
# Edit script and resubmit
sbatch corrected_script.slurm

6 - Storage Systems

File systems, quotas, and data management on the Prometheus cluster

Storage Overview

The Prometheus cluster provides multiple storage systems optimized for different use cases, from personal files to high-performance parallel computing workloads.

Storage Architecture

Home Directories (/trinity/home/)

  • Type: SSD-backed storage
  • Mount point: /trinity/home/<username>
  • Quota: 20GB per user
  • Purpose: Personal configuration files, small scripts
  • Backup: Regular backups maintained
  • Performance: Fast random I/O, moderate capacity

Shared Group Storage (/lustreFS/data/)

  • Type: Lustre parallel file system
  • Mount point: /lustreFS/data/<group-name>
  • Quota: 30TB per group (or 20,971,520 files)
  • Purpose: Primary workspace for research data and results
  • Performance: High-throughput parallel I/O
  • Shared: All group members have access

Local Node Storage

  • Type: NVMe SSD (1TB per compute node)
  • Purpose: Temporary files during job execution
  • Access: Only available during allocated jobs
  • Performance: Highest IOPS for temporary data

File System Details

Home Directory Usage

# Check your home directory quota
quota -us

# View home directory contents
ls -la /trinity/home/$USER

# Typical home directory structure
/trinity/home/username/
├── .bashrc                 # Shell configuration
├── .ssh/                   # SSH keys and config
├── .jupyter/               # Jupyter configuration
├── .conda/                 # Conda configuration
├── scripts/                # Small utility scripts
└── .local/                 # Local Python packages

Best practices for home directories:

  • Store only configuration files and small scripts
  • Link to shared storage for data access
  • Use symbolic links to avoid quota issues
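
A minimal sketch of the symbolic-link approach (assuming your group directory is /lustreFS/data/mygroup; adjust the paths to your own group):

# Keep bulky data on the Lustre filesystem, but reach it from your home directory
mkdir -p /lustreFS/data/mygroup/$USER/datasets
ln -s /lustreFS/data/mygroup/$USER/datasets ~/datasets

# The same pattern works for model checkpoints, caches, and results
mkdir -p /lustreFS/data/mygroup/$USER/checkpoints
ln -s /lustreFS/data/mygroup/$USER/checkpoints ~/checkpoints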

Shared Group Storage

The /lustreFS/data/ directory provides high-performance storage for your research work:

# Access your group's shared storage
cd /lustreFS/data/<group-name>

# Check group quota
lfs quota -gh <group-name> /lustreFS/

# Example group directory structure
/lustreFS/data/mygroup/
├── datasets/               # Shared datasets
├── models/                 # Pre-trained and trained models
├── experiments/            # Individual user experiments
│   ├── user1/
│   ├── user2/
│   └── shared/
├── code/                   # Shared code repositories
├── results/                # Experiment results
└── tmp/                    # Temporary files

Quota information:

  • Space limit: 30TB per group
  • File limit: 20,971,520 files per group
  • Shared: All group members can read/write

Local Node Storage

Each compute node has local NVMe storage for temporary files:

# During a SLURM job, use local storage for temporary files
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Example usage in job script
#SBATCH --job-name=training
#SBATCH --gres=gpu:1

# Create temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Copy data to local storage for faster I/O
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz

# Run training with local data
python train.py --data-dir $TMPDIR/dataset

# Copy results back to shared storage
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/

Quota Management

Checking Quotas

# Check your home directory quota
quota -us

# Check group quota on Lustre filesystem
lfs quota -gh <group-name> /lustreFS/

# Example quota output:
# Disk quotas for group mygroup (gid 1001):
#      Filesystem    used   quota   limit   grace   files   quota   limit   grace
#        /lustreFS   15.2T     30T     30T       -  1234567  20971520 20971520   -

Understanding Quota Output

  • used: Current usage
  • quota: Soft limit (warning threshold)
  • limit: Hard limit (cannot exceed)
  • grace: Time allowed to exceed soft quota
  • files: Number of files/inodes used

Managing Quota Issues

When approaching quota limits:

  1. Clean up temporary files:

    # Find large files
    find /lustreFS/data/mygroup -type f -size +1G -ls
    
    # Find old temporary files
    find /lustreFS/data/mygroup -name "*.tmp" -mtime +7 -delete
    
  2. Archive old data:

    # Compress old experiments
    tar -czf old_experiments.tar.gz experiments/2023/
    rm -rf experiments/2023/
    
  3. Use efficient storage:

    # Use compressed formats for datasets
    # Store checkpoints selectively
    # Remove duplicate files
    

Data Management Best Practices

Directory Organization

Organize your group’s shared storage efficiently:

# Recommended structure
/lustreFS/data/mygroup/
├── datasets/
│   ├── imagenet/           # Large shared datasets
│   ├── coco/
│   └── custom/
├── models/
│   ├── pretrained/         # Downloaded pre-trained models
│   └── checkpoints/        # Training checkpoints
├── experiments/
│   ├── user1/
│   │   ├── project_a/
│   │   └── project_b/
│   └── user2/
├── code/
│   ├── shared_utils/       # Shared code libraries
│   └── experiments/        # Experiment code
└── results/
    ├── papers/             # Results for publications
    └── ongoing/            # Current experiment results
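
A one-liner to create this skeleton (a sketch; substitute your actual group name and preferred directory names):

mkdir -p /lustreFS/data/mygroup/{datasets/{imagenet,coco,custom},models/{pretrained,checkpoints},experiments/$USER,code/{shared_utils,experiments},results/{papers,ongoing}}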

File Permissions

Set appropriate permissions for shared access:

# Make directories group-writable
chmod g+w /lustreFS/data/mygroup/datasets/

# Set default permissions for new files
umask 002

# Change group ownership if needed
chgrp -R mygroup /lustreFS/data/mygroup/shared/

Data Transfer

Small Files (< 1GB)

# Copy from local machine using scp
scp dataset.tar.gz prometheus:/lustreFS/data/mygroup/datasets/

# Copy between directories on cluster
cp -r /lustreFS/data/mygroup/datasets/source /lustreFS/data/mygroup/experiments/

Large Files (> 1GB)

# Use rsync for large transfers with progress
rsync -avP large_dataset/ prometheus:/lustreFS/data/mygroup/datasets/

# Parallel compression for large datasets
tar -cf - dataset/ | pigz > dataset.tar.gz

Download Datasets

# Download directly to shared storage
cd /lustreFS/data/mygroup/datasets/
wget https://example.com/large_dataset.tar.gz

# Use aria2 for faster parallel downloads
aria2c -x 8 -s 8 https://example.com/dataset.tar.gz

Backup Strategies

While the cluster provides reliable storage, implement your own backup strategy:

  1. Important results: Copy to external storage
  2. Code: Use git repositories
  3. Large datasets: Document download sources for re-acquisition
  4. Models: Keep important checkpoints on external storage
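
For the first two points, a minimal sketch (backup.example.org and myproject are placeholders; use whatever external storage and repositories your group actually has):

# Mirror important results to external storage
rsync -avP /lustreFS/data/mygroup/results/ user@backup.example.org:/backups/prometheus/results/

# Keep code in a remote git repository rather than only on cluster storage
cd /lustreFS/data/mygroup/code/myproject && git push origin main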

Performance Optimization

Lustre File System Tips

  1. Use parallel I/O for large files:

    # PyTorch DataLoader with multiple workers
    dataloader = DataLoader(dataset, batch_size=64, num_workers=8)
    
  2. Avoid small random writes:

    # Bad: Many small writes
    for i in {1..1000}; do echo $i >> file.txt; done
    
    # Good: Batch writes
    seq 1 1000 > file.txt
    
  3. Use appropriate stripe settings for large files:

    # Set stripe count for large files (> 1GB)
    lfs setstripe -c 4 /lustreFS/data/mygroup/large_dataset/
    

Local Storage Performance

  1. Copy frequently accessed data to local storage during jobs
  2. Use local storage for temporary files and intermediate results
  3. Copy final results back to shared storage

# Example job with local storage optimization
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00

# Set up local temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR

# Copy dataset to local storage
echo "Copying dataset to local storage..."
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz

# Run training with local data (much faster I/O)
python train.py --data-dir $TMPDIR/dataset --output-dir $TMPDIR/results

# Copy results back to shared storage
echo "Copying results back..."
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/

Environment Variables

Set up useful environment variables for data management:

# Add to your ~/.bashrc
export GROUP_DATA="/lustreFS/data/mygroup"
export DATASETS="$GROUP_DATA/datasets"
export MODELS="$GROUP_DATA/models"
export EXPERIMENTS="$GROUP_DATA/experiments/$USER"
export RESULTS="$GROUP_DATA/results"

# Create your experiment directory
mkdir -p $EXPERIMENTS

Common Storage Issues

Quota Exceeded

# Error: "Disk quota exceeded"
# Solution: Check and clean up usage
lfs quota -gh mygroup /lustreFS/
find $GROUP_DATA -type f -size +1G -ls

Permission Denied

# Error: "Permission denied"
# Solution: Check file permissions and group membership
ls -la /lustreFS/data/mygroup/
groups  # Check your group membership

Slow I/O Performance

# Solutions:
# 1. Use local storage for temporary files
# 2. Reduce number of small files
# 3. Use parallel I/O libraries
# 4. Check stripe settings for large files
lfs getstripe /lustreFS/data/mygroup/large_file

File System Full

# Check available space
df -h /lustreFS

# If file system is full, clean up:
# 1. Remove temporary files
# 2. Compress old data
# 3. Archive completed experiments

7 - Environment Modules

Using Lmod environment modules to manage software on Prometheus

Overview

The Prometheus cluster uses Lmod (Lua-based Environment Modules) to manage software packages and their dependencies. This system allows you to easily load and unload different software versions without conflicts.

Module System Basics

Environment modules modify your shell environment to provide access to specific software packages. When you load a module, it typically:

  • Adds software to your PATH
  • Sets environment variables
  • Loads required dependencies
  • Configures library paths
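
You can see exactly what a module will change before loading it, and verify the effect afterwards (Python/3.9.5 is used as an example here):

# Show the environment changes a module makes, without loading it
module show Python/3.9.5

# Load it and confirm the new interpreter is first on PATH
module load Python/3.9.5
which python
python --version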

Basic Module Commands

List Available Modules

# Show all available modules
module available
module avail
module av
ml av

# Search for specific modules
module avail gcc
module avail python
module avail cuda

Load Modules

# Load a specific module
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Short form using 'ml'
ml GCC/10.3.0 CUDA/11.3.1 Python/3.9.5

Check Loaded Modules

# List currently loaded modules
module list
ml list

Unload Modules

# Unload a specific module
module unload GCC/10.3.0

# Unload all modules
module purge
ml purge

Common Software Modules

Compilers

# GNU Compiler Collection
module load GCC/10.3.0
module load GCC/11.2.0

# Intel Compilers (if available)
module load intel/2021.4.0

CUDA and GPU Development

# CUDA Toolkit
module load CUDA/11.3.1
module load CUDA/11.7.0
module load CUDA/12.0.0

# Check CUDA after loading
nvcc --version
nvidia-smi

Python Environments

# Python interpreter
module load Python/3.9.5
module load Python/3.10.8

# Python with scientific libraries
module load Python/3.9.5-GCCcore-10.3.0

Deep Learning Frameworks

# PyTorch (if pre-installed as module)
module load PyTorch/1.12.1-foss-2022a-CUDA-11.7.0

# TensorFlow (if pre-installed as module)
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

Development Tools

# Git (usually available by default)
module load git/2.36.0

# CMake
module load CMake/3.24.3

# HDF5 for data storage
module load HDF5/1.12.2

Module Dependencies

Lmod automatically handles dependencies. When you load a module, it loads required dependencies:

# Loading Python might automatically load GCC
module load Python/3.9.5

# Check what was loaded
module list
# Might show:
# GCCcore/10.3.0
# Python/3.9.5-GCCcore-10.3.0

Setting Up Your Environment

Create a Module Loading Script

Create ~/load_modules.sh for consistent environment setup:

#!/bin/bash
# ~/load_modules.sh - Load standard development environment

# Clear any existing modules
module purge

# Load core development tools
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Optional: Load additional tools
# module load git/2.36.0
# module load CMake/3.24.3

echo "Development environment loaded:"
module list

Make it executable and use it:

chmod +x ~/load_modules.sh
source ~/load_modules.sh

Add to Your Shell Configuration

Add common modules to your ~/.bashrc:

# Add to ~/.bashrc
# Load standard modules at login
if [ -f ~/load_modules.sh ]; then
    source ~/load_modules.sh
fi

SLURM Job Scripts with Modules

Basic Job with Modules

#!/bin/bash
#SBATCH -J module_job
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

# Load required modules
module purge
module load CUDA/11.3.1
module load Python/3.9.5

# Verify modules are loaded
echo "Loaded modules:"
module list

# Check CUDA availability
echo "CUDA version:"
nvcc --version

# Activate your conda environment
source ~/anaconda3/bin/activate
conda activate myenv

# Run your script
python train.py

Multiple GPU Job with Modules

#!/bin/bash
#SBATCH -J multi_gpu_training
#SBATCH --partition=defq
#SBATCH --qos=long
#SBATCH --gres=gpu:4
#SBATCH --time=1-00:00:00

# Load modules for CUDA development
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Load MPI for distributed computing (if available)
# module load OpenMPI/4.1.4-GCC-10.3.0

# Set CUDA environment
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Activate environment
conda activate pytorch-gpu

# Run distributed training
python -m torch.distributed.launch \
    --nproc_per_node=4 \
    train_distributed.py

Python Package Management

Using Conda with Modules

# Load Python module
module load Python/3.9.5

# Install conda (if not already available)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3

# Add conda to path
echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Create environment
conda create -n myenv python=3.9
conda activate myenv

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Using pip with Modules

# Load Python module
module load Python/3.9.5

# Create virtual environment
python -m venv ~/venvs/myproject
source ~/venvs/myproject/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
pip install jupyter numpy pandas matplotlib

Module Collections

Save frequently used module combinations:

# Save current modules as a collection
module save my_collection

# List saved collections
module savelist

# Restore a collection
module restore my_collection

Custom Module Paths

If your group has custom modules:

# Add custom module path
module use /lustreFS/data/mygroup/modules

# Check the module search path
echo $MODULEPATH

Troubleshooting Modules

Common Issues

Module not found:

# Check available modules
module avail | grep -i package_name

# Check if you have access to the module path
ls -la /opt/modules/

Conflicting modules:

# Clear all modules and start fresh
module purge
module load GCC/10.3.0 CUDA/11.3.1

CUDA not found after loading:

# Verify CUDA module is loaded
module list | grep -i cuda

# Check CUDA environment
echo $CUDA_HOME
echo $CUDA_PATH
which nvcc

Python packages not found:

# Ensure Python module is loaded before using pip/conda
module load Python/3.9.5
which python
python --version

Module Information

# Show detailed module information
module show CUDA/11.3.1
module help CUDA/11.3.1

# See what a module does before loading
module display CUDA/11.3.1

Best Practices

For Interactive Development

  1. Create a standard environment script
  2. Use module collections for frequently used combinations
  3. Load modules before activating conda/venv

For SLURM Jobs

  1. Always start with module purge
  2. Load modules explicitly in job scripts
  3. Verify modules are loaded with module list
  4. Document module requirements in your scripts

For Reproducibility

  1. Pin module versions in scripts:

    module load CUDA/11.3.1  # Not just 'CUDA'
    
  2. Document module requirements:

    # Required modules:
    # - GCC/10.3.0
    # - CUDA/11.3.1
    # - Python/3.9.5
    
  3. Use environment files for complex setups
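
For the environment-file approach, a minimal sketch using conda (the environment name myenv is an example):

# Capture the exact package set so others can reproduce it
module load Python/3.9.5
conda activate myenv
conda env export > environment.yml

# Recreate it elsewhere (or after a cleanup)
conda env create -f environment.yml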

Example Workflows

Deep Learning Setup

# Standard deep learning environment
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5

# Activate conda environment
conda activate pytorch-env

# Verify setup
python -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

Development Environment

# Development tools
module purge
module load GCC/10.3.0
module load CUDA/11.3.1
module load Python/3.9.5
module load git/2.36.0
module load CMake/3.24.3

# Save as collection
module save development

# Later, restore quickly
module restore development

Compilation Environment

# For compiling CUDA code
module purge
module load GCC/10.3.0
module load CUDA/11.3.1

# Compile CUDA program
nvcc -o program program.cu

# For C++ with GPU support
g++ -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart program.cpp -o program

8 - VS Code Remote Development

Set up Visual Studio Code for remote development on the Prometheus cluster

Overview

Visual Studio Code provides excellent remote development capabilities for the Prometheus cluster. You can edit code, run Jupyter notebooks, and debug applications directly on the cluster while using your local VS Code interface.

Prerequisites

  • VS Code Desktop installed on your local machine
  • Remote-SSH extension for VS Code
  • SSH access to Prometheus cluster (see Getting Started)
  • Valid cluster account with SSH keys configured

Install Required Extensions

Install these essential VS Code extensions:

  1. Remote - SSH (ms-vscode-remote.remote-ssh)
  2. Python (ms-python.python)
  3. Jupyter (ms-toolsai.jupyter)
  4. Git integration (built-in)

Optional but recommended:

  • Remote - SSH: Editing Configuration Files (ms-vscode-remote.remote-ssh-edit)
  • GitLens (eamodio.gitlens)
  • Thunder Client for API testing (rangav.vscode-thunder-client)
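
If you prefer the command line (this assumes the code CLI that ships with VS Code is on your PATH), the same extensions can be installed like this:

code --install-extension ms-vscode-remote.remote-ssh
code --install-extension ms-python.python
code --install-extension ms-toolsai.jupyter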

SSH Configuration

Basic SSH Setup

First, ensure your ~/.ssh/config file is properly configured:

# ~/.ssh/config
Host prometheus
  Hostname prometheus.cyens.org.cy
  User <your-username>
  IdentityFile ~/.ssh/id_rsa

Host *.cluster
  User <your-username>
  IdentityFile ~/.ssh/prometheus_user_sshd
  ProxyJump prometheus

Replace <your-username> with your actual cluster username.

User SSHD Process Setup

To connect VS Code directly to compute nodes, you need to set up a user SSHD process through SLURM.

Step 1: Generate SSH Keys for User SSHD

Connect to Prometheus and create SSH keys for the user SSHD process:

ssh prometheus
ssh-keygen -t rsa -f ~/.ssh/prometheus_user_sshd

This creates:

  • Private key: ~/.ssh/prometheus_user_sshd
  • Public key: ~/.ssh/prometheus_user_sshd.pub

Step 2: Create SSHD Job Script

Create ~/sshd.sh script for launching the user SSHD process:

#!/bin/bash
#SBATCH -o res_%j.txt      # output file
#SBATCH -e res_%j.err      # error file
#SBATCH -J sshd            # job name
#SBATCH --partition=defq   # partition
#SBATCH --qos=normal       # priority queue
#SBATCH --ntasks=1         # number of tasks
#SBATCH --cpus-per-task=2  # CPU cores
#SBATCH --gres=gpu:1       # number of GPUs
#SBATCH --mem=1000         # memory in MB
#SBATCH --time=0-04:00     # 4 hours maximum

# Find an available port
PORT=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')

echo "********************************************************************"
echo "Starting sshd in Slurm as user"
echo "Environment information:"
echo "Date:" $(date)
echo "Allocated node:" $(hostname)
echo "Path:" $(pwd)
echo "Listening on:" $PORT
echo "********************************************************************"

# Start user SSHD process
/usr/sbin/sshd -D -p ${PORT} -f /dev/null -h ${HOME}/.ssh/prometheus_user_sshd

Make the script executable:

chmod +x ~/sshd.sh

Step 3: Submit SSHD Job

Submit the SSHD job to get a compute node:

sbatch sshd.sh

Check the job status and get the allocated node and port:

# Check job status
squeue -u $USER

# View the output file to get connection details
cat res_<job_id>.txt

The output will show something like:

Starting sshd in Slurm as user
Date: Thu Jun 5 10:30:00 UTC 2025
Allocated node: gpu02
Listening on: 45672

Connecting VS Code

Method 1: Direct Connection

  1. Open VS Code on your local machine
  2. Press F1 or Ctrl/Cmd+Shift+P to open command palette
  3. Type: “Remote-SSH: Connect to Host”
  4. Enter: ssh -p 45672 gpu02.cluster (use your actual port and node)

VS Code will automatically update your SSH config file.

Method 2: Manual SSH Config

Add the connection details to your ~/.ssh/config:

Host gpu02.cluster
    HostName gpu02.cluster
    Port 45672
    User <your-username>
    IdentityFile ~/.ssh/prometheus_user_sshd
    ProxyJump prometheus

Then connect using “Remote-SSH: Connect to Host” → gpu02.cluster

Development Workflow

1. Connect to Compute Node

# Submit SSHD job
sbatch sshd.sh

# Wait for job to start (check with squeue)
squeue -u $USER

# Get connection details
cat res_<job_id>.txt

# Connect VS Code to the allocated node

2. Open Your Project

Once connected to the compute node:

  1. File → Open Folder
  2. Navigate to: /lustreFS/data/mygroup/experiments/myproject
  3. Open the folder

3. Set Up Python Environment

In the VS Code terminal on the remote machine:

# Load required modules
module load CUDA/11.3.1 Python/3.9.5

# Activate your conda environment
conda activate myenv

# Verify GPU access
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

4. Configure Python Interpreter

  1. Press Ctrl/Cmd+Shift+P
  2. Type: “Python: Select Interpreter”
  3. Choose: Your conda environment interpreter
    • Usually: ~/miniconda3/envs/myenv/bin/python

Working with Jupyter Notebooks

Start Jupyter Server

In the VS Code terminal on the remote machine:

# Load modules and activate environment
module load Python/3.9.5
conda activate myenv

# Start Jupyter (no browser needed)
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

Connect VS Code to Jupyter

  1. Open a .ipynb file in VS Code
  2. Click “Select Kernel” in the top-right
  3. Choose “Existing Jupyter Server”
  4. Enter: http://localhost:8888
  5. Enter the token from the Jupyter output

Development Best Practices

Resource Management

  1. Request appropriate resources for development:

    #SBATCH --cpus-per-task=4  # Not 32 for development
    #SBATCH --gres=gpu:1       # Usually sufficient for development
    #SBATCH --mem=8000         # 8GB for most development tasks
    #SBATCH --time=0-04:00     # 4 hours for development session
    
  2. Use longer sessions for intensive work:

    #SBATCH --time=0-08:00     # 8 hours for longer development
    #SBATCH --qos=long         # If you need more than 1 day
    

File Organization

Set up a consistent workspace structure:

# Recommended project structure
/lustreFS/data/mygroup/experiments/myproject/
├── data/                   # Datasets and data files
├── notebooks/              # Jupyter notebooks
├── src/                    # Source code
├── configs/                # Configuration files
├── scripts/                # Training and utility scripts
├── results/                # Experiment results
└── README.md               # Project documentation

Environment Configuration

Create a workspace settings file (.vscode/settings.json):

{
    "python.defaultInterpreterPath": "~/miniconda3/envs/myenv/bin/python",
    "python.terminal.activateEnvironment": true,
    "jupyter.jupyterServerType": "local",
    "files.watcherExclude": {
        "**/data/**": true,
        "**/results/**": true,
        "**/.git/**": true
    }
}

Git Integration

Configure Git for your project:

# Set up Git credentials
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

# Initialize repository (if new project)
cd /lustreFS/data/mygroup/experiments/myproject
git init
git add .
git commit -m "Initial commit"

Troubleshooting

Connection Issues

“Could not establish connection”:

  1. Check if SSHD job is running: squeue -u $USER
  2. Verify node name and port from job output
  3. Ensure SSH keys are properly configured

“Permission denied”:

  1. Check SSH key permissions: chmod 600 ~/.ssh/prometheus_user_sshd
  2. Verify ProxyJump configuration in SSH config
  3. Test SSH connection manually: ssh -p <port> <node>.cluster

Performance Issues

Slow file operations:

  1. Exclude large directories from VS Code watcher
  2. Use local storage for frequently accessed files
  3. Consider using VS Code on the head node for file browsing only

High memory usage:

  1. Close unused notebooks and files
  2. Restart VS Code Python extension if needed
  3. Request more memory in SSHD job if necessary

SSHD Job Management

Job terminated unexpectedly:

  1. Check job logs: cat res_<job_id>.err
  2. Resubmit SSHD job: sbatch sshd.sh
  3. Update VS Code connection with new port/node

Need longer development time:

# Modify sshd.sh for longer sessions
#SBATCH --time=0-08:00     # 8 hours
#SBATCH --qos=long         # For >24 hours

Cleanup and Best Practices

End Development Session

When finishing your work:

  1. Save all files in VS Code
  2. Close VS Code connection
  3. Cancel the SSHD job:
    scancel <job_id>
    
  4. Remove SSH config entries added by VS Code (optional)

Security Considerations

  1. Don’t leave SSHD jobs running when not needed
  2. Use strong passphrases for SSH keys
  3. Regularly rotate SSH keys if required by policy
  4. Monitor your running jobs: squeue -u $USER
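
To act on the first and last points, you can list and cancel leftover SSHD jobs by name (the job name sshd comes from the -J directive in sshd.sh above):

# Any SSHD jobs still running under your account?
squeue -u $USER --name=sshd

# Cancel them when you are done developing
scancel --name=sshd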

Advanced Configuration

Multiple Concurrent Sessions

You can run multiple SSHD jobs for different projects:

# Submit multiple jobs
sbatch sshd.sh
sbatch sshd.sh

# Connect VS Code to different nodes
# Node 1: gpu01.cluster:45672
# Node 2: gpu03.cluster:45673
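
To check which node each SSHD job landed on before pointing VS Code at it, squeue's format options are handy; the listening port itself is printed in the job's output file, whose exact name depends on your sshd.sh script:

# Show job ID, name, node and elapsed time for your running jobs
squeue -u $USER -o "%.10i %.14j %.12R %.10M"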

Custom SSHD Configuration

Create specialized SSHD scripts for different use cases:

# sshd_gpu4.sh - For multi-GPU development
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=64000

# sshd_a6000.sh - For high-memory development
#SBATCH --partition=a6000
#SBATCH --qos=normal-a6000
#SBATCH --gres=gpu:1

Next Steps

9 - Software Installation

Installing and managing software packages on the Prometheus cluster

Overview

The Prometheus cluster provides several methods for installing and managing software packages. This guide covers both system-wide modules and user-specific installations.

Installation Methods

1. Environment Modules

Use pre-installed software via the module system when available:

module avail python
module load Python/3.9.5
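
Two related commands help keep the module environment tidy:

# Show currently loaded modules
module list

# Unload all modules and start from a clean state
module purge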

2. Conda/Mamba Package Manager

Install packages in isolated environments:

conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

3. pip Package Manager

Install Python packages via pip:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

4. Source Installation

Compile software from source when needed:

git clone https://github.com/project/repo.git
cd repo && python setup.py install
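
For projects you intend to modify, an editable install is often more convenient than setup.py install (run from inside the cloned repository):

# Editable install: changes to the source take effect without reinstalling
pip install -e .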

Setting Up Python Environments

Conda Installation

If conda is not available, install Miniconda:

# Download Miniconda
cd /lustreFS/data/mygroup/$USER
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install Miniconda
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3

# Initialize conda
~/miniconda3/bin/conda init bash
source ~/.bashrc

Create Virtual Environments

# Create a new environment
conda create -n pytorch-env python=3.9

# Activate environment
conda activate pytorch-env

# Install packages
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install jupyter matplotlib pandas scikit-learn

Environment Management

# List environments
conda env list

# Export environment
conda env export > environment.yml

# Create from file
conda env create -f environment.yml

# Remove environment
conda env remove -n old-env

Deep Learning Frameworks

PyTorch Installation

# Create PyTorch environment
conda create -n pytorch python=3.9
conda activate pytorch

# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

TensorFlow Installation

# Create TensorFlow environment
conda create -n tensorflow python=3.9
conda activate tensorflow

# Install TensorFlow
pip install tensorflow[and-cuda]

# Verify GPU support
python -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__}, GPUs: {len(tf.config.list_physical_devices("GPU"))}')"

JAX Installation

# Create JAX environment
conda create -n jax python=3.9
conda activate jax

# Install JAX with CUDA support
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Verify installation
python -c "import jax; print(f'JAX devices: {jax.devices()}')"

Specialized Libraries

MinkowskiEngine

MinkowskiEngine is an auto-differentiation library for sparse tensors, particularly useful for 3D computer vision tasks.

Installation Steps

  1. Create dedicated environment:

    conda create -n py3-mink python=3.8
    conda activate py3-mink
    
  2. Install dependencies:

    conda install openblas-devel -c anaconda
    conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
    
  3. Load required modules:

    module load CUDA/11.3.1 gnu9
    
  4. Submit interactive job for compilation:

    srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
    
  5. Install MinkowskiEngine:

    conda activate py3-mink
    pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
        --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" \
        --install-option="--blas=openblas"
    

Usage Example

import torch
import MinkowskiEngine as ME

# Coordinates are (batch index, spatial coordinate) pairs
coords = torch.IntTensor([[0, 1], [0, 2], [0, 3], [1, 0], [1, 2]])
# One feature vector per coordinate
feats = torch.FloatTensor([[1], [2], [3], [4], [5]])

# Build the sparse tensor
sparse_tensor = ME.SparseTensor(features=feats, coordinates=coords)
print(f"Sparse tensor features shape: {sparse_tensor.F.shape}")

PointGPT

PointGPT extends GPT concepts to point clouds for 3D understanding tasks.

Installation Steps

  1. Create environment:

    conda create -n pointgpt python=3.8
    conda activate pointgpt
    
  2. Install PyTorch and dependencies:

    conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 tensorboard -c pytorch -c conda-forge
    pip install easydict h5py matplotlib open3d opencv-python pyyaml timm tqdm transforms3d termcolor scipy ninja plyfile numpy==1.23.4
    pip install setuptools==59.5.0
    
  3. Load CUDA module:

    module load CUDA/11.3.1
    
  4. Clone PointGPT repository:

    cd /lustreFS/data/mygroup/$USER
    git clone https://github.com/CGuangyan-BIT/PointGPT.git
    cd PointGPT
    
  5. Submit interactive job for compilation:

    srun -n 1 -c 4 --gres=gpu:1 --mem=20000 --pty /bin/bash
    
  6. Install extensions:

    conda activate pointgpt
    
    # Chamfer Distance & EMD
    cd ./extensions/chamfer_dist
    python setup.py install --user
    cd ../emd
    python setup.py install --user
    cd ../
    
    # PointNet++
    pip install "git+https://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"
    
    # GPU kNN
    pip install --upgrade https://github.com/unlimblue/KNN_CUDA/releases/download/0.2/KNN_CUDA-0.2-py3-none-any.whl
    

Computer Vision Libraries

OpenCV Installation

conda activate myenv
conda install opencv -c conda-forge

# Or install from pip
pip install opencv-python opencv-contrib-python

Open3D for 3D Processing

conda activate myenv
pip install open3d

# Test installation
python -c "import open3d as o3d; print(f'Open3D {o3d.__version__}')"

PIL/Pillow for Image Processing

conda install pillow
# or
pip install Pillow

Scientific Computing

NumPy, SciPy, Pandas

conda install numpy scipy pandas matplotlib seaborn
# or
pip install numpy scipy pandas matplotlib seaborn

Jupyter and IPython

conda install jupyter ipython ipykernel
# or
pip install jupyter ipython ipykernel

# Add environment to Jupyter
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
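
To confirm that the kernel was registered and is visible to Jupyter:

# List all kernels known to Jupyter
jupyter kernelspec list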

Scikit-learn

conda install scikit-learn
# or
pip install scikit-learn

Development Tools

Git and Version Control

# Git is usually available by default
git --version

# Configure Git
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

Build Tools

# Install build essentials
conda install cmake make ninja

# For C++ development
conda install gxx_linux-64 gcc_linux-64

Debugging Tools

# Install debugging tools
pip install pdbpp ipdb

# Memory profiling
pip install memory_profiler

# Line profiling
pip install line_profiler
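
As a quick illustration of how these profilers are typically run (the script name is a placeholder, and both tools expect the functions of interest to be decorated with @profile):

# Line-by-line memory usage of decorated functions
python -m memory_profiler train.py

# Line-by-line timing with line_profiler
kernprof -l -v train.py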

Installation in SLURM Jobs

Interactive Installation

# Submit interactive job for installation
srun --partition=defq --qos=normal --gres=gpu:1 --mem=16000 --time=2:00:00 --pty /bin/bash

# Load modules
module load CUDA/11.3.1 Python/3.9.5

# Activate environment
conda activate myenv

# Install packages
pip install package-name

Batch Installation Script

#!/bin/bash
#SBATCH -J install_packages
#SBATCH --partition=defq
#SBATCH --qos=normal
#SBATCH --cpus-per-task=4
#SBATCH --mem=8000
#SBATCH --time=1:00:00

# Load modules
module load Python/3.9.5

# Make conda available in the non-interactive batch shell, then activate
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# Install packages
pip install -r requirements.txt

echo "Installation completed"

Package Management Best Practices

Requirements Files

Create requirements.txt for reproducibility:

--extra-index-url https://download.pytorch.org/whl/cu117
torch==1.12.1+cu117
torchvision==0.13.1+cu117
torchaudio==0.12.1+cu117
numpy==1.23.4
pandas==1.5.2
matplotlib==3.6.2
jupyter==1.0.0

Install from requirements:

pip install -r requirements.txt

Environment Files

Create environment.yml for conda:

name: myproject
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.9
  - pytorch=1.12.1
  - torchvision=0.13.1
  - torchaudio=0.12.1
  - pytorch-cuda=11.7
  - numpy
  - pandas
  - matplotlib
  - jupyter
  - pip
  - pip:
    - some-pip-package

Create environment:

conda env create -f environment.yml

Storage Considerations

Install packages in shared group storage to avoid quota issues:

# Set conda environments path
echo "envs_dirs:
  - /lustreFS/data/mygroup/conda/envs" > ~/.condarc

# Set pip cache directory
export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache
echo 'export PIP_CACHE_DIR=/lustreFS/data/mygroup/pip-cache' >> ~/.bashrc
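
To verify that the new locations are picked up:

# Conda should list the group directory under envs_dirs
conda config --show envs_dirs

# The pip cache directory should point at group storage
echo $PIP_CACHE_DIR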

Troubleshooting

Common Installation Issues

CUDA compatibility errors:

# Check CUDA version
nvidia-smi

# Install matching PyTorch version
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Memory errors during installation:

# Request more memory for installation
srun --mem=32000 --pty /bin/bash

# Or increase pip timeout
pip install --timeout 1000 package-name

Permission errors:

# Install in user space
pip install --user package-name

# Or check conda environment ownership
ls -la ~/miniconda3/envs/

Network timeouts:

# Use conda-forge channel
conda install -c conda-forge package-name

# Or use pip with retries
pip install --retries 10 package-name

Compilation Issues

Missing compilers:

# Load compiler modules
module load GCC/10.3.0

# Check compiler availability
gcc --version
nvcc --version

Missing headers:

# Install development packages
conda install gxx_linux-64 gcc_linux-64

# For CUDA development
module load CUDA/11.3.1
echo $CUDA_HOME

Environment Conflicts

Package conflicts:

# Create fresh environment
conda create -n clean-env python=3.9
conda activate clean-env

# Install packages one by one
conda install pytorch -c pytorch

Module vs conda conflicts:

# Always load modules before activating conda
module load Python/3.9.5
conda activate myenv

Package Documentation

Keep track of installed packages:

# List conda packages
conda list > conda_packages.txt

# List pip packages
pip freeze > pip_requirements.txt

# Environment information
conda info --envs > environments.txt

Next Steps