Distributed training enables training deep learning models across multiple GPUs or machines by parallelizing computation. The most common approach is data parallelism, where:
- Each GPU holds a complete copy of the model
- Training data is split across GPUs (each GPU processes different batches)
- Gradients are synchronized across all GPUs after each backward pass
- Model weights are updated identically on all GPUs
This allows training with larger effective batch sizes and, in the ideal case, cuts training time roughly in proportion to the number of GPUs.
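The gradient-averaging step above can be simulated in plain Python (no GPUs or `torch.distributed` needed; the worker count, toy linear model, and function names here are purely illustrative):

```python
# Toy simulation of data parallelism: each "GPU" computes a gradient on
# its own shard of the batch, gradients are averaged (an all-reduce),
# and every replica applies the identical weight update.

def gradient(w, shard):
    # d/dw of mean squared error for the linear model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_gpus, lr=0.01):
    shards = [batch[i::num_gpus] for i in range(num_gpus)]  # split the batch
    local_grads = [gradient(w, s) for s in shards]          # per-"GPU" backward pass
    avg_grad = sum(local_grads) / num_gpus                  # all-reduce (average)
    return w - lr * avg_grad                                # identical update everywhere

batch = [(x, 3.0 * x) for x in range(1, 9)]  # targets follow y = 3x
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, batch, num_gpus=4)
print(round(w, 2))
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the result matches single-device training on the whole batch.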
For large models that don't fit in a single GPU's memory, you can use:
- FSDP2 (Fully Sharded Data Parallel): PyTorch's native solution for model sharding across GPUs
- DeepSpeed ZeRO: Microsoft's memory optimization technology that partitions optimizer states, gradients, and parameters
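As a back-of-envelope illustration of why partitioning matters, the sketch below applies the ZeRO paper's mixed-precision memory accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus Adam momentum and variance) to a hypothetical 1.7B-parameter model on 8 GPUs; activations are ignored, so these are rough lower bounds, not exact figures:

```python
# Approximate per-GPU memory (GB) for model states under ZeRO stages,
# using the ZeRO paper's mixed-precision Adam accounting (16 bytes/param
# total). Activation memory is deliberately ignored.

def zero_memory_gb(num_params, num_gpus, stage):
    p = 2 * num_params   # fp16 parameters
    g = 2 * num_params   # fp16 gradients
    o = 12 * num_params  # fp32 master weights + Adam momentum + variance
    if stage >= 1: o /= num_gpus  # ZeRO-1: shard optimizer states
    if stage >= 2: g /= num_gpus  # ZeRO-2: also shard gradients
    if stage >= 3: p /= num_gpus  # ZeRO-3: also shard parameters
    return (p + g + o) / 1e9

for stage in range(4):
    gb = zero_memory_gb(1.7e9, num_gpus=8, stage=stage)
    print(f"ZeRO-{stage}: {gb:.1f} GB/GPU")
```

Stage 0 (plain DDP) needs about 27 GB per GPU just for model states, while ZeRO-3 drops that to roughly 3.4 GB, which is why sharding makes models fit that otherwise would not.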
`DistributedDataParallel` (DDP) is PyTorch's native solution for distributed training:

DDP implements data parallelism at the module level by synchronizing gradients across processes. It uses collective communications from `torch.distributed` to coordinate gradient averaging during the backward pass.
Key characteristics:
- One process per GPU (unlike DataParallel which uses threads)
- Uses NCCL backend for efficient GPU-to-GPU communication
- Requires manual setup of process groups, distributed samplers, and multi-node coordination
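The manual setup the last bullet refers to looks roughly like the sketch below. To keep it runnable anywhere it uses the `gloo` backend on CPU with a single process (`world_size=1`); a real run would launch one process per GPU via `torchrun` with NCCL, and the toy model and data are illustrative:

```python
# Minimal sketch of manual DDP setup: process group init, DDP wrapping,
# DistributedSampler, and cleanup. Single-process CPU/gloo so it runs
# without GPUs; torchrun normally sets RANK/WORLD_SIZE/MASTER_* for you.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # 1. Manual process-group init
    dist.init_process_group("gloo", rank=0, world_size=1)
    rank = dist.get_rank()

    # 2. Wrap the model so gradients are all-reduced during backward()
    model = DDP(torch.nn.Linear(8, 1))

    # 3. Manual DistributedSampler so each rank sees a distinct shard
    ds = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
    sampler = DistributedSampler(ds, num_replicas=dist.get_world_size(), rank=rank)
    loader = DataLoader(ds, batch_size=8, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()       # DDP averages gradients here
            opt.step()

    # 4. Manual cleanup
    dist.destroy_process_group()

main()
```

Every one of these steps (environment variables, init, sampler, cleanup) must be repeated, correctly, on every node, which is exactly the boilerplate Ray Train removes.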
Ray Train is a library for distributed deep learning that simplifies scaling PyTorch training:
Ray Train handles the complexity of distributed training setup, including process management, data distribution, and checkpoint handling across workers.
Key benefits:
- Single command to scale from 1 to N GPUs
- Automatic process group initialization
- Built-in fault tolerance and checkpointing
- Seamless integration with Ray's distributed computing ecosystem
This workshop is designed to run on Anyscale, a managed Ray platform.
1. **Sign in to Anyscale Console**
   - Go to https://console.anyscale.com/
   - Create an account or sign in

2. **Create a new Workspace**
   - Click "Workspaces" in the left sidebar
   - Click "Create Workspace"
   - Select a compute configuration with GPUs (e.g., `g5.4xlarge` instances)
   - Choose the number of worker nodes based on your GPU requirements

3. **Clone this repository**

   ```bash
   git clone https://github.com/debnsuma/vhol-ray-train.git
   cd vhol-ray-train
   ```

4. **Set up**

   ```bash
   # Install all dependencies from requirements.txt
   pip install -r requirements.txt
   ```
This workshop covers distributed training from basic DDP to advanced memory optimization techniques, culminating in a real-world voice cloning project:
```
vhol-ray-train/
├── 01-ddp-pytorch-only/               # Vanilla PyTorch DDP
│   ├── train_ddp.py                   # Training script with manual DDP setup
│   └── launch_multinode_ddp.sh        # Multi-node launch instructions
│
├── 02-ddp-pytorch-ray/                # Ray Train (DDP)
│   ├── Ray_Train_Intro.ipynb          # Comprehensive tutorial notebook (start here)
│   ├── train_ray_ddp.py               # Ray Train with PyTorch DataLoader
│   └── train_ray_ddp_with_ray_data.py # Ray Train with Ray Data
│
├── 03-fsdp-pytorch-ray-deepspeed/     # Advanced: FSDP2 & DeepSpeed
│   ├── FSDP2_RayTrain_Tutorial.ipynb  # PyTorch FSDP2 with Ray Train
│   └── DeepSpeed_RayTrain_Tutorial.ipynb # DeepSpeed ZeRO with Ray Train
│
└── 04-qwen3-tts-ft-ray/               # Real-World Project: Voice Cloning
    ├── README.md                      # Detailed project documentation
    ├── src/
    │   ├── data_processing.py         # Ray Data for audio processing
    │   ├── prepare_data.py            # Audio code extraction
    │   ├── train_qwen_tts.py          # SFT with Ray Train
    │   └── inference.py               # Base vs fine-tuned comparison
    └── scripts/
        ├── run_pipeline.py            # End-to-end pipeline runner
        └── zero_shot_clone.py         # Standalone zero-shot voice cloning
```
```bash
cd 01-ddp-pytorch-only

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py --epochs 3

# For multi-node, see the launch instructions
./launch_multinode_ddp.sh
```

```bash
cd 02-ddp-pytorch-ray

# Start with the interactive tutorial notebook
jupyter notebook Ray_Train_Intro.ipynb

# Or run the scripts directly
python train_ray_ddp.py --num-workers 2 --epochs 3

# With Ray Data for distributed preprocessing
python train_ray_ddp_with_ray_data.py --num-workers 2 --epochs 3
```

For training large models that don't fit in a single GPU's memory:
```bash
cd 03-fsdp-pytorch-ray-deepspeed

# Start with FSDP2 tutorial (PyTorch native)
jupyter notebook FSDP2_RayTrain_Tutorial.ipynb

# Then try DeepSpeed (Microsoft's ZeRO technology)
jupyter notebook DeepSpeed_RayTrain_Tutorial.ipynb
```

Apply everything you've learned to fine-tune a 1.7B-parameter TTS model:
```bash
cd 04-qwen3-tts-ft-ray

# Run the complete pipeline
python scripts/run_pipeline.py --all --num-epochs 10

# Or zero-shot voice cloning (no fine-tuning needed)
python scripts/zero_shot_clone.py \
    --ref-audio samples/voice.wav \
    --text "Hello, this is my cloned voice." \
    --output output.wav
```

| Aspect | Vanilla PyTorch DDP | Ray Train (DDP) | FSDP2 + Ray Train | DeepSpeed + Ray Train |
|---|---|---|---|---|
| Launch | `torchrun` on each node | Single Python command | Single Python command | Single Python command |
| Process Groups | Manual init/cleanup | Automatic | Automatic | Automatic |
| Distributed Sampler | Must create manually | Handled by `prepare_data_loader()` | Handled by `prepare_data_loader()` | Manual `DistributedSampler` |
| Multi-node Setup | SSH, shared storage, coordination | Cluster handles it | Cluster handles it | Cluster handles it |
| Fault Tolerance | None (any failure stops training) | Built-in recovery | Built-in recovery | Built-in recovery |
| Checkpointing | Manual implementation | Integrated with `ray.train.report()` | PyTorch DCP + Ray Train | DeepSpeed built-in + Ray Train |
| Memory Optimization | None (full model per GPU) | None (full model per GPU) | Model sharding, CPU offload, mixed precision | ZeRO stages, CPU/NVMe offload |
| Best For | Small-medium models | Small-medium models | Large models (PyTorch native) | Very large models (LLMs) |