Getting Started with Distributed Training Using Ray, PyTorch, and DeepSpeed

What is Distributed Training?

Distributed training enables training deep learning models across multiple GPUs or machines by parallelizing computation. The most common approach is data parallelism, where:

  • Each GPU holds a complete copy of the model
  • Training data is split across GPUs (each GPU processes different batches)
  • Gradients are synchronized across all GPUs after each backward pass
  • Model weights are updated identically on all GPUs

This allows larger effective batch sizes and, ideally, reduces training time roughly in proportion to the number of GPUs.
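The four steps above can be sketched in plain Python, with no framework at all. This is a toy illustration, not real DDP: two "workers" each hold a full copy of a one-weight model, compute gradients on their own data shard, average them (the all-reduce step), and apply the same update, so the replicas never drift apart.

```python
# Toy data parallelism: fit y = 2x with SGD on two simulated workers.

def grad(w, shard):
    # dL/dw for L = mean((w*x - y)^2) over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(x, 2.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]       # split the data across "GPUs"
weights = [0.0, 0.0]                    # one full model copy per worker
lr = 0.01

for _ in range(100):
    grads = [grad(w, s) for w, s in zip(weights, shards)]
    avg = sum(grads) / len(grads)       # "all-reduce": average gradients
    weights = [w - lr * avg for w in weights]  # identical update everywhere

print(weights)  # both replicas converge toward w = 2.0 and stay in sync
```

Because every worker applies the same averaged gradient, the two weights are bit-identical after every step, which is exactly the invariant DDP maintains across GPUs.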

For large models that don't fit in a single GPU's memory, you can use:

  • FSDP2 (Fully Sharded Data Parallel): PyTorch's native solution for model sharding across GPUs
  • DeepSpeed ZeRO: Microsoft's memory optimization technology that partitions optimizer states, gradients, and parameters

PyTorch DistributedDataParallel (DDP)

DistributedDataParallel is PyTorch's native solution for distributed training:

DDP implements data parallelism at the module level by synchronizing gradients across processes. It uses collective communications from torch.distributed to coordinate gradient averaging during the backward pass.

Key characteristics:

  • One process per GPU (unlike DataParallel, which uses threads)
  • Uses NCCL backend for efficient GPU-to-GPU communication
  • Requires manual setup of process groups, distributed samplers, and multi-node coordination
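A minimal sketch of that manual setup is below. It uses the gloo backend on CPU so it runs anywhere; for real GPU training you would use backend="nccl" and move the model and data to the local device. Under torchrun, the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables are set for you; the defaults here let the sketch also run as a single plain Python process.

```python
# Minimal DDP setup sketch (CPU + gloo; see 01-ddp-pytorch-only for the
# full training script). Launch with e.g.:
#   torchrun --nproc_per_node=4 this_file.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun sets these; default to a single-process group otherwise.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")

def train_step():
    model = DDP(torch.nn.Linear(10, 1))  # DDP wraps the module; gradients
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()                      # gradient all-reduce happens here
    opt.step()
    return loss.item()

if __name__ == "__main__":
    setup()
    print("loss:", train_step())
    dist.destroy_process_group()
```

Note how much boilerplate exists before any training happens; this is exactly what Ray Train (next section) automates.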

Ray Train

Ray Train is a library for distributed deep learning that simplifies scaling PyTorch training:

Ray Train handles the complexity of distributed training setup, including process management, data distribution, and checkpoint handling across workers.

Key benefits:

  • Single command to scale from 1 to N GPUs
  • Automatic process group initialization
  • Built-in fault tolerance and checkpointing
  • Seamless integration with Ray's distributed computing ecosystem

Getting Started with Anyscale

This workshop is designed to run on Anyscale, a managed Ray platform.

Creating a Cluster

  1. Sign in to Anyscale Console

  2. Create a new Workspace

    • Click "Workspaces" in the left sidebar
    • Click "Create Workspace"
    • Select a compute configuration with GPUs (e.g., g5.4xlarge instances)
    • Choose the number of worker nodes based on your GPU requirements
  3. Clone this repository

    git clone https://github.com/debnsuma/vhol-ray-train.git
    cd vhol-ray-train
  4. Set up

    # Install all dependencies from requirements.txt
    pip install -r requirements.txt
    

Workshop Structure

This workshop covers distributed training from basic DDP to advanced memory optimization techniques, culminating in a real-world voice cloning project:

vhol-ray-train/
├── 01-ddp-pytorch-only/              # Vanilla PyTorch DDP
│   ├── train_ddp.py                  # Training script with manual DDP setup
│   └── launch_multinode_ddp.sh       # Multi-node launch instructions
│
├── 02-ddp-pytorch-ray/               # Ray Train (DDP)
│   ├── Ray_Train_Intro.ipynb         # Comprehensive tutorial notebook (start here)
│   ├── train_ray_ddp.py              # Ray Train with PyTorch DataLoader
│   └── train_ray_ddp_with_ray_data.py # Ray Train with Ray Data
│
├── 03-fsdp-pytorch-ray-deepspeed/    # Advanced: FSDP2 & DeepSpeed
│   ├── FSDP2_RayTrain_Tutorial.ipynb # PyTorch FSDP2 with Ray Train
│   └── DeepSpeed_RayTrain_Tutorial.ipynb # DeepSpeed ZeRO with Ray Train
│
└── 04-qwen3-tts-ft-ray/              # Real-World Project: Voice Cloning
    ├── README.md                     # Detailed project documentation
    ├── src/
    │   ├── data_processing.py        # Ray Data for audio processing
    │   ├── prepare_data.py           # Audio code extraction
    │   ├── train_qwen_tts.py         # SFT with Ray Train
    │   └── inference.py              # Base vs fine-tuned comparison
    └── scripts/
        ├── run_pipeline.py           # End-to-end pipeline runner
        └── zero_shot_clone.py        # Standalone zero-shot voice cloning

Quick Start

Vanilla PyTorch DDP

cd 01-ddp-pytorch-only

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py --epochs 3

# For multi-node, see the launch instructions
./launch_multinode_ddp.sh

Ray Train (DDP)

cd 02-ddp-pytorch-ray

# Start with the interactive tutorial notebook
jupyter notebook Ray_Train_Intro.ipynb

# Or run the scripts directly
python train_ray_ddp.py --num-workers 2 --epochs 3

# With Ray Data for distributed preprocessing
python train_ray_ddp_with_ray_data.py --num-workers 2 --epochs 3

FSDP2 and DeepSpeed

For training large models that don't fit in a single GPU's memory:

cd 03-fsdp-pytorch-ray-deepspeed

# Start with FSDP2 tutorial (PyTorch native)
jupyter notebook FSDP2_RayTrain_Tutorial.ipynb

# Then try DeepSpeed (Microsoft's ZeRO technology)
jupyter notebook DeepSpeed_RayTrain_Tutorial.ipynb

Voice Cloning with Qwen3-TTS

Apply everything you've learned to fine-tune a 1.7B parameter TTS model:

cd 04-qwen3-tts-ft-ray

# Run the complete pipeline
python scripts/run_pipeline.py --all --num-epochs 10

# Or zero-shot voice cloning (no fine-tuning needed)
python scripts/zero_shot_clone.py \
    --ref-audio samples/voice.wav \
    --text "Hello, this is my cloned voice." \
    --output output.wav

Comparison

| Aspect | Vanilla PyTorch DDP | Ray Train (DDP) | FSDP2 + Ray Train | DeepSpeed + Ray Train |
|---|---|---|---|---|
| Launch | torchrun on each node | Single Python command | Single Python command | Single Python command |
| Process groups | Manual init/cleanup | Automatic | Automatic | Automatic |
| Distributed sampler | Must create manually | Handled by prepare_data_loader() | Handled by prepare_data_loader() | Manual DistributedSampler |
| Multi-node setup | SSH, shared storage, coordination | Cluster handles it | Cluster handles it | Cluster handles it |
| Fault tolerance | None; any failure stops training | Built-in recovery | Built-in recovery | Built-in recovery |
| Checkpointing | Manual implementation | Integrated with ray.train.report() | PyTorch DCP + Ray Train | DeepSpeed built-in + Ray Train |
| Memory optimization | None (full model per GPU) | None (full model per GPU) | Model sharding, CPU offload, mixed precision | ZeRO stages, CPU/NVMe offload |
| Best for | Small-medium models | Small-medium models | Large models (PyTorch native) | Very large models (LLMs) |
