________  ________  ________  _____ ______   _______  _________  ___  ___  _______   ___  ___  ________        
  |\   __  \|\   __  \|\   __  \|\   _ \  _   \|\  ___ \|\___   ___\\  \|\  \|\  ___ \ |\  \|\  \|\   ____\       
  \ \  \|\  \ \  \|\  \ \  \|\  \ \  \\\__\ \  \ \   __/\|___ \  \_\ \  \\\  \ \   __/|\ \  \\\  \ \  \___|_      
   \ \   ____\ \   _  _\ \  \\\  \ \  \\|__| \  \ \  \_|/__  \ \  \ \ \   __  \ \  \_|/_\ \  \\\  \ \_____  \     
    \ \  \___|\ \  \\  \\ \  \\\  \ \  \    \ \  \ \  \_|\ \  \ \  \ \ \  \ \  \ \  \_|\ \ \  \\\  \|____|\  \    
     \ \__\    \ \__\\ _\\ \_______\ \__\    \ \__\ \_______\  \ \__\ \ \__\ \__\ \_______\ \_______\____\_\  \   
      \|__|     \|__|\|__|\|_______|\|__|     \|__|\|_______|   \|__|  \|__|\|__|\|_______|\|_______|\_________\  
                                                                                                   \|_________|  

High-Performance GPU Computing Cluster at CYENS Centre of Excellence

Prometheus is the flagship high-performance computing cluster at CYENS Centre of Excellence, specifically designed for deep learning and artificial intelligence research. This state-of-the-art facility combines powerful NVIDIA GPUs, AMD EPYC processors, and enterprise-grade storage to deliver exceptional performance for the most demanding computational workloads.

Running Rocky Linux 8.5 with professional-grade job scheduling and environment management, Prometheus provides researchers with a robust, scalable platform for breakthrough AI research and development.

Hardware Architecture

Head Node

  • Chassis: GIGABYTE R182-Z90-00
  • Motherboard: GIGABYTE MZ92-FS0-00
  • CPU: 2x AMD EPYC 7313 (16C/32T each)
  • RAM: 16x 32GB Samsung modules (512GB total)
  • Storage: 2x 1.92TB Intel SSDs

Compute Nodes (gpu01-08)

  • Chassis: Supermicro AS-4124GS-TNR
  • Motherboard: Supermicro H12DSG-O-CPU
  • CPU: 2x AMD EPYC 7313 (16C/32T each)
  • GPU: 8x NVIDIA A5000 (24GB VRAM, 8192 CUDA cores, 256 Tensor cores)
  • Performance: 27.8 TFLOPS FP32 per GPU
  • RAM: 16x 32GB SK Hynix modules (512GB total)
  • Storage: 1x 1TB Samsung SSD 980

Specialized Compute Node (gpu09)

  • Chassis: ASUS RS720A-E11-RS12
  • Motherboard: ASUS KMPP-D32
  • CPU: 2x AMD EPYC 7313 (16C/32T each)
  • GPU: 4x NVIDIA A6000 Ada (48GB VRAM, 18176 CUDA cores, 568 Tensor cores)
  • Performance: 91.06 TFLOPS FP32 per GPU
  • RAM: 16x 32GB SK Hynix modules (512GB total)
  • Storage: 1x 1TB Samsung SSD 980

Storage Infrastructure

  • Storage Nodes: 2x Supermicro servers
  • CPU: AMD EPYC 7302P (16C/32T each)
  • RAM: 8x 16GB Samsung modules (256GB total)
  • Storage System: 24x 7.68TB Samsung SSDs
  • Total Capacity: 305TB Lustre parallel filesystem

Networking Infrastructure

  • Management Switch: Netgear M4300-52G - 48x1G Stackable Managed Switch with 2x10GBASE-T and 2xSFP+
  • Interconnection Switch: Mellanox HDR InfiniBand QM8700 Switch, 40x QSFP56 ports, 2 Power Supplies (AC), x86 dual core, standard depth, P2C airflow
  • Interconnection Cables: 11x Mellanox passive copper hybrid cable, IB HDR 200Gb/s to 2x100Gb/s, QSFP28 to 2xQSFP28, colored pulltabs, 2m, 26AWG

Operating Environment

System Software

  • Operating System: Rocky Linux 8.5 (Green Obsidian)
  • Kernel: Linux 4.18.0-348.23.1.el8_5.x86_64
  • Job Scheduler: Slurm Workload Manager
  • Module System: Lmod Environment Modules

Partitions and Scheduling

  • defq Partition: 8 nodes (gpu01-08) with NVIDIA A5000 GPUs
  • a6000 Partition: 1 node (gpu09) with NVIDIA A6000 Ada GPUs
  • Priority Queues: normal, long, preemptive with different time limits and priorities
  • Resource Limits: Per-group quotas for fair resource allocation

Storage Quotas

  • Home Directories: 20GB per user (/trinity/home)
  • Work Directories: 30TB per group (/lustreFS/data)
  • Performance: High-throughput Lustre filesystem optimized for parallel I/O

Access and Development

Connection Methods

  • SSH Access: Secure shell connections with RSA key authentication
  • VS Code Remote: Direct IDE integration via Remote-SSH extension
  • Interactive Jobs: Real-time development with srun command
  • Batch Processing: Automated job submission with sbatch

Software Environment

  • Environment Modules: Easy software stack management with Lmod
  • Pre-configured Libraries: CUDA, deep learning frameworks, scientific computing tools
  • Container Support: Singularity/Apptainer for reproducible environments
  • Development Tools: Git, compilers, debuggers, and profiling tools

Supported Frameworks

  • Deep Learning: PyTorch, TensorFlow, JAX, Hugging Face Transformers
  • Scientific Computing: NumPy, SciPy, scikit-learn, OpenCV
  • Specialized Libraries: MinkowskiEngine, PointGPT, and domain-specific tools
  • Data Processing: Pandas, Dask, and high-performance data analytics

Research Capabilities

AI and Machine Learning

Prometheus excels at training large neural networks, from computer vision models to natural language processing systems. The combination of NVIDIA A5000 and A6000 Ada GPUs provides the computational power needed for state-of-the-art research in artificial intelligence.

Scientific Computing

Beyond AI, the cluster supports a wide range of scientific applications including computational biology, physics simulations, climate modeling, and data-intensive research across multiple disciplines.

Collaborative Research

With group-based resource allocation and shared storage, Prometheus facilitates collaborative research projects while maintaining security and fair resource distribution among research teams.

Performance Optimization

Professional-grade job scheduling ensures optimal resource utilization, while the high-performance Lustre filesystem minimizes I/O bottlenecks for data-intensive applications.