Storage Systems
Storage Overview
The Prometheus cluster provides multiple storage systems optimized for different use cases, from personal files to high-performance parallel computing workloads.
Storage Architecture
Home Directories (/trinity/home/)
- Type: SSD-backed storage
- Mount point:
/trinity/home/<username> - Quota: 20GB per user
- Purpose: Personal configuration files, small scripts
- Backup: Regular backups maintained
- Performance: Fast random I/O, moderate capacity
Shared Group Storage (/lustreFS/data/)
- Type: Lustre parallel file system
- Mount point:
/lustreFS/data/<group-name> - Quota: 30TB per group (or 20,971,520 files)
- Purpose: Primary workspace for research data and results
- Performance: High-throughput parallel I/O
- Shared: All group members have access
Local Node Storage
- Type: NVMe SSD (1TB per compute node)
- Purpose: Temporary files during job execution
- Access: Only available during allocated jobs
- Performance: Highest IOPS for temporary data
File System Details
Home Directory Usage
Quota Limit
Home directories have a strict 20GB limit. Use them only for configuration files, not data or models.
# Check your home directory quota
quota -us
# View home directory contents
ls -la /trinity/home/$USER
# Typical home directory structure
/trinity/home/username/
├── .bashrc # Shell configuration
├── .ssh/ # SSH keys and config
├── .jupyter/ # Jupyter configuration
├── .conda/ # Conda configuration
├── scripts/ # Small utility scripts
└── .local/ # Local Python packages
Best practices for home directories:
- Store only configuration files and small scripts
- Link to shared storage for data access
- Use symbolic links to avoid quota issues (see the sketch below)
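For example, a minimal sketch, assuming your group directory is mygroup (substitute your own group name):
# Keep data on Lustre, reference it from your home directory
ln -s /lustreFS/data/mygroup/datasets ~/datasets
ln -s /lustreFS/data/mygroup/experiments/$USER ~/experiments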
Shared Group Storage
The /lustreFS/data/ directory provides high-performance storage for your research work:
# Access your group's shared storage
cd /lustreFS/data/<group-name>
# Check group quota
lfs quota -gh <group-name> /lustreFS/
# Example group directory structure
/lustreFS/data/mygroup/
├── datasets/ # Shared datasets
├── models/ # Pre-trained and trained models
├── experiments/ # Individual user experiments
│ ├── user1/
│ ├── user2/
│ └── shared/
├── code/ # Shared code repositories
├── results/ # Experiment results
└── tmp/ # Temporary files
Quota information:
- Space limit: 30TB per group
- File limit: 20,971,520 files per group (a quick file-count check follows this list)
- Shared: All group members can read/write
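Because the file (inode) limit can be exhausted long before the space limit, count files before copying in a many-file dataset. A quick check (the path is illustrative):
# Count files under a directory tree
find /lustreFS/data/mygroup/datasets -type f | wc -l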
Local Node Storage
Each compute node has local NVMe storage for temporary files:
# During a SLURM job, use local storage for temporary files
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Example usage in a job script
#!/bin/bash
#SBATCH --job-name=training
#SBATCH --gres=gpu:1
# Create temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Copy data to local storage for faster I/O
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz
# Run training with local data
python train.py --data-dir $TMPDIR/dataset
# Copy results back to shared storage
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/
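Local /tmp space is limited and shared among jobs on the node, so it is good practice to remove your temporary directory when the job ends. A minimal sketch, placed near the top of the script after TMPDIR is created (site cleanup policies may vary):
# Remove the job's temporary directory on exit, even if the script fails
trap 'rm -rf "$TMPDIR"' EXIT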
Quota Management
Checking Quotas
# Check your home directory quota
quota -us
# Check group quota on Lustre filesystem
lfs quota -gh <group-name> /lustreFS/
# Example quota output:
# Disk quotas for group mygroup (gid 1001):
#     Filesystem    used   quota   limit   grace     files     quota     limit   grace
#      /lustreFS   15.2T     30T     30T       -   1234567  20971520  20971520       -
Understanding Quota Output
- used: Current usage
- quota: Soft limit (warning threshold)
- limit: Hard limit (cannot exceed)
- grace: Time allowed to exceed soft quota
- files: Number of files/inodes used
Managing Quota Issues
When approaching quota limits:
Clean up temporary files:
# Find large files
find /lustreFS/data/mygroup -type f -size +1G -ls
# Find old temporary files
find /lustreFS/data/mygroup -name "*.tmp" -mtime +7 -delete
Archive old data:
# Compress old experiments
tar -czf old_experiments.tar.gz experiments/2023/
rm -rf experiments/2023/
Use efficient storage:
# Use compressed formats for datasets
# Store checkpoints selectively
# Remove duplicate files
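To see which directories are actually consuming space, a standard check (the path is illustrative):
# Show per-directory usage, largest first
du -sh /lustreFS/data/mygroup/*/ | sort -rh | head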
Data Management Best Practices
Directory Organization
Organize your group’s shared storage efficiently:
# Recommended structure
/lustreFS/data/mygroup/
├── datasets/
│ ├── imagenet/ # Large shared datasets
│ ├── coco/
│ └── custom/
├── models/
│ ├── pretrained/ # Downloaded pre-trained models
│ └── checkpoints/ # Training checkpoints
├── experiments/
│ ├── user1/
│ │ ├── project_a/
│ │ └── project_b/
│ └── user2/
├── code/
│ ├── shared_utils/ # Shared code libraries
│ └── experiments/ # Experiment code
└── results/
├── papers/ # Results for publications
└── ongoing/ # Current experiment results
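A sketch to scaffold this layout in one step, assuming your group directory is mygroup (adjust names to your group's conventions):
# Create the recommended directory tree
cd /lustreFS/data/mygroup
mkdir -p datasets/{imagenet,coco,custom} \
         models/{pretrained,checkpoints} \
         experiments/$USER \
         code/{shared_utils,experiments} \
         results/{papers,ongoing}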
File Permissions
Set appropriate permissions for shared access:
# Make directories group-writable
chmod g+w /lustreFS/data/mygroup/datasets/
# Set default permissions for new files
umask 002
# Change group ownership if needed
chgrp -R mygroup /lustreFS/data/mygroup/shared/
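On shared directories it is also common to set the setgid bit so new files and subdirectories inherit the group automatically; a minimal sketch (check with your administrators before changing permissions broadly):
# Make new files inherit the group on all subdirectories
find /lustreFS/data/mygroup/shared/ -type d -exec chmod g+s {} +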
Data Transfer
Small Files (< 1GB)
# Copy from local machine using scp
scp dataset.tar.gz prometheus:/lustreFS/data/mygroup/datasets/
# Copy between directories on cluster
cp -r /lustreFS/data/mygroup/datasets/source /lustreFS/data/mygroup/experiments/
Large Files (> 1GB)
# Use rsync for large transfers with progress
rsync -avP large_dataset/ prometheus:/lustreFS/data/mygroup/datasets/
# Parallel compression for large datasets
tar -cf - dataset/ | pigz > dataset.tar.gz
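The matching decompression step, assuming pigz is available on the cluster:
# Parallel decompression
pigz -dc dataset.tar.gz | tar -xf -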
Download Datasets
# Download directly to shared storage
cd /lustreFS/data/mygroup/datasets/
wget https://example.com/large_dataset.tar.gz
# Use aria2 for faster parallel downloads
aria2c -x 8 -s 8 https://example.com/dataset.tar.gz
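After a large download, verify integrity if the publisher provides a checksum (the .sha256 file name here is hypothetical):
# Verify the download against a published checksum file
sha256sum -c dataset.tar.gz.sha256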
Backup Strategies
While the cluster provides reliable storage, implement your own backup strategy:
- Important results: Copy to external storage (a minimal rsync sketch follows this list)
- Code: Use git repositories
- Large datasets: Document download sources for re-acquisition
- Models: Keep important checkpoints on external storage
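A minimal sketch of an off-cluster backup, where mybackuphost and the destination path are placeholders for a machine you control:
# Mirror publication results to external storage
rsync -avP /lustreFS/data/mygroup/results/papers/ mybackuphost:/backups/prometheus/papers/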
Performance Optimization
Lustre File System Tips
Use parallel I/O for large files:
# PyTorch DataLoader with multiple workers
dataloader = DataLoader(dataset, batch_size=64, num_workers=8)
Avoid small random writes:
# Bad: Many small writes
for i in {1..1000}; do echo $i >> file.txt; done
# Good: Batch writes
seq 1 1000 > file.txt
Use appropriate stripe settings for large files:
# Set stripe count for large files (> 1GB)
lfs setstripe -c 4 /lustreFS/data/mygroup/large_dataset/
Local Storage Performance
- Copy frequently accessed data to local storage during jobs
- Use local storage for temporary files and intermediate results
- Copy final results back to shared storage
# Example job with local storage optimization
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
# Set up local temporary directory
export TMPDIR=/tmp/$SLURM_JOB_ID
mkdir -p $TMPDIR
# Copy dataset to local storage
echo "Copying dataset to local storage..."
cp /lustreFS/data/mygroup/dataset.tar.gz $TMPDIR/
cd $TMPDIR
tar -xzf dataset.tar.gz
# Run training with local data (much faster I/O)
python train.py --data-dir $TMPDIR/dataset --output-dir $TMPDIR/results
# Copy results back to shared storage
echo "Copying results back..."
cp -r $TMPDIR/results /lustreFS/data/mygroup/experiments/
Environment Variables
Set up useful environment variables for data management:
# Add to your ~/.bashrc
export GROUP_DATA="/lustreFS/data/mygroup"
export DATASETS="$GROUP_DATA/datasets"
export MODELS="$GROUP_DATA/models"
export EXPERIMENTS="$GROUP_DATA/experiments/$USER"
export RESULTS="$GROUP_DATA/results"
# Create your experiment directory
mkdir -p $EXPERIMENTS
Common Storage Issues
Quota Exceeded
# Error: "Disk quota exceeded"
# Solution: Check and clean up usage
lfs quota -gh mygroup /lustreFS/
find $GROUP_DATA -type f -size +1G -ls
Permission Denied
# Error: "Permission denied"
# Solution: Check file permissions and group membership
ls -la /lustreFS/data/mygroup/
groups # Check your group membership
Slow I/O Performance
# Solutions:
# 1. Use local storage for temporary files
# 2. Reduce number of small files
# 3. Use parallel I/O libraries
# 4. Check stripe settings for large files
lfs getstripe /lustreFS/data/mygroup/large_file
File System Full
# Check available space
df -h /lustreFS
# If file system is full, clean up:
# 1. Remove temporary files
# 2. Compress old data
# 3. Archive completed experiments
Next Steps
- Set up your development environment: Environment Modules
- Configure VS Code for remote development: VS Code Setup
- Learn about software installation: Software Installation