Prometheus Cluster Documentation

Complete guide to using the Prometheus deep learning cluster at CYENS

This section contains comprehensive documentation for the Prometheus cluster - a high-performance computing environment for deep learning research at CYENS.

Overview

The Prometheus cluster is a GPU computing facility for deep learning, featuring:

64 NVIDIA A5000 GPUs (24GB each) across 8 compute nodes
4 NVIDIA A6000 Ada GPUs (48GB each) on a dedicated node
About 1.7TB total GPU memory (64×24GB + 4×48GB = 1,728GB) and 4.6TB total system RAM (9×512GB) for large-scale model training
High-performance Lustre storage with 305TB capacity
SLURM job scheduler for efficient resource management

This documentation will guide you through:

Getting Started - SSH access and account setup
Hardware Specifications - Detailed cluster architecture
Job Submission - SLURM batch and interactive jobs
Partitions & Queues - Resource allocation policies
Storage Systems - File systems and quotas
Environment Modules - Software stack management
VS Code Setup - Remote development environment
Software Installation - Third-party libraries and tools

Quick Start

Generate SSH keys and request cluster access
Connect via SSH to prometheus.cyens.org.cy
Submit your first job using SLURM
Set up a development environment with modules or containers
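
As a rough illustration of the job submission step, the Python sketch below builds the text of a minimal SLURM batch script. The `#SBATCH` options are standard SLURM flags, but the job name, the resource amounts, the time limit, and the assumption that GPUs are requested as `--gres=gpu:N` are illustrative and not taken from the Prometheus configuration; the Job Submission and Partitions & Queues pages are the place to confirm actual values.

```python
def build_job_script(job_name: str = "first-job", gpus: int = 1,
                     cpus: int = 4, time_limit: str = "00:10:00") -> str:
    """Return the text of a minimal SLURM batch script (all values are illustrative)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",       # assumes GPUs are exposed as the 'gpu' GRES
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",
        "#SBATCH --output=%x-%j.out",       # %x = job name, %j = job ID
        "",
        "nvidia-smi",                       # list the GPUs allocated to the job
    ]
    return "\n".join(lines) + "\n"


script_text = build_job_script()
print(script_text)
```

Saving the printed text on the login node under a name such as `first_job.sh` (a hypothetical file name) and running `sbatch first_job.sh` would submit it; `squeue -u $USER` then shows the job's state.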

Cluster Specifications

Compute Resources

9 compute nodes total
GPU nodes gpu[01-08]: 8×A5000 GPUs each (64 total GPUs)
GPU node gpu09: 4×A6000 Ada GPUs (48GB VRAM each)
512GB RAM per compute node
32 CPU cores per node (AMD EPYC 7313)
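
The per-node figures above can be summed into cluster-wide totals. The short Python sketch below does that arithmetic; the `NodeGroup` class and its field names are hypothetical helpers introduced only for this example, and the 24GB A5000 figure is taken from the Overview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str
    nodes: int
    gpus_per_node: int
    gpu_mem_gb: int
    ram_gb: int = 512     # RAM per compute node
    cpu_cores: int = 32   # CPU cores per node


groups = [
    NodeGroup("gpu[01-08] (A5000)", nodes=8, gpus_per_node=8, gpu_mem_gb=24),
    NodeGroup("gpu09 (A6000 Ada)", nodes=1, gpus_per_node=4, gpu_mem_gb=48),
]

total_nodes = sum(g.nodes for g in groups)
total_gpus = sum(g.nodes * g.gpus_per_node for g in groups)
total_gpu_mem_gb = sum(g.nodes * g.gpus_per_node * g.gpu_mem_gb for g in groups)
total_ram_gb = sum(g.nodes * g.ram_gb for g in groups)
total_cores = sum(g.nodes * g.cpu_cores for g in groups)

print(f"nodes={total_nodes} gpus={total_gpus} gpu_mem={total_gpu_mem_gb}GB "
      f"ram={total_ram_gb}GB cores={total_cores}")
# prints: nodes=9 gpus=68 gpu_mem=1728GB ram=4608GB cores=288
```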

Storage

Home directories: 20GB SSD per user
Shared storage: 30TB Lustre filesystem per group
Local storage: 1TB NVMe SSD per compute node
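
With a 20GB home directory, it can help to estimate how much space a directory tree uses before it fills up. The sketch below is one way to do that in Python, assuming the quota is counted in decimal gigabytes and is based on file sizes; how the cluster actually accounts for and enforces quotas is not stated here, so treat the result as an estimate. The function names are hypothetical.

```python
import os
from pathlib import Path

HOME_QUOTA_BYTES = 20 * 10**9  # 20GB home quota, assuming decimal gigabytes


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = HOME_QUOTA_BYTES) -> float:
    """Return the fraction of the quota that used_bytes represents."""
    return used_bytes / quota_bytes


# Example: 5GB used out of the 20GB quota
print(f"{quota_fraction(5 * 10**9):.0%}")  # prints: 25%
```

Calling `quota_fraction(directory_size_bytes(Path.home()))` on the cluster would give the estimate for your own home directory; large trees can take a while to scan.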

Networking Infrastructure

Management Network: Netgear M4300-52G switch with 48×1G ports plus 2×10GBASE-T and 2×SFP+
High-Performance Interconnect: Mellanox HDR InfiniBand switch with 40×QSFP56 ports
InfiniBand Speed: 200Gb/s HDR connectivity with hybrid copper cables
Low Latency: Sub-microsecond messaging for distributed computing workloads
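
To put the 200Gb/s figure in perspective, the sketch below converts the line rate to bytes per second and computes an idealized transfer time for a payload. It is a lower bound only: it assumes the full line rate is available and ignores protocol overhead, encoding, and contention, so real transfers will be slower. The function name is hypothetical.

```python
LINK_GBIT_PER_S = 200  # HDR InfiniBand line rate in gigabits per second


def ideal_transfer_seconds(payload_gb: float,
                           link_gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Lower bound on transfer time for a payload in decimal GB over a link in Gb/s."""
    return payload_gb * 8 / link_gbit_per_s


print(f"Peak rate: {LINK_GBIT_PER_S / 8:.0f} GB/s")               # prints: Peak rate: 25 GB/s
print(f"24 GB payload: {ideal_transfer_seconds(24):.2f} s")       # prints: 24 GB payload: 0.96 s
```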