Distributed training enables training deep learning models across multiple GPUs or machines by parallelizing computation. The most common approach is data parallelism, where:
- Each GPU holds a complete copy of the model
- Training data is split across GPUs (each GPU processes different batches)
- Gradients are synchronized across all GPUs after each backward pass
- Model weights are updated identically on all GPUs
This allows training with larger effective batch sizes and, in the ideal case, cuts training time roughly in proportion to the number of GPUs.
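The gradient-averaging step above can be simulated in plain Python (no GPUs or `torch.distributed` needed; the worker count, toy linear model, and function names here are purely illustrative):

```python
# Toy simulation of data parallelism: each "GPU" computes a gradient on
# its own shard of the batch, gradients are averaged (an all-reduce),
# and every replica applies the identical weight update.

def gradient(w, shard):
    # d/dw of mean squared error for the linear model y_hat = w * x
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_gpus, lr=0.01):
    shards = [batch[i::num_gpus] for i in range(num_gpus)]  # split the batch
    local_grads = [gradient(w, s) for s in shards]          # per-"GPU" backward pass
    avg_grad = sum(local_grads) / num_gpus                  # all-reduce (average)
    return w - lr * avg_grad                                # identical update everywhere

batch = [(x, 3.0 * x) for x in range(1, 9)]  # targets follow y = 3x
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, batch, num_gpus=4)
print(round(w, 2))
```

Because the shards are equal-sized, the averaged gradient equals the full-batch gradient, so the result matches single-device training on the whole batch.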
For large models that don't fit in a single GPU's memory, you can use:
- FSDP2 (Fully Sharded Data Parallel): PyTorch's native solution for model sharding across GPUs
- DeepSpeed ZeRO: Microsoft's memory optimization technology that partitions optimizer states, gradients, and parameters
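As a back-of-envelope illustration of why partitioning matters, the sketch below applies the ZeRO paper's mixed-precision memory accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus Adam momentum and variance) to a hypothetical 1.7B-parameter model on 8 GPUs; activations are ignored, so these are rough lower bounds, not exact figures:

```python
# Approximate per-GPU memory (GB) for model states under ZeRO stages,
# using the ZeRO paper's mixed-precision Adam accounting (16 bytes/param
# total). Activation memory is deliberately ignored.

def zero_memory_gb(num_params, num_gpus, stage):
    p = 2 * num_params   # fp16 parameters
    g = 2 * num_params   # fp16 gradients
    o = 12 * num_params  # fp32 master weights + Adam momentum + variance
    if stage >= 1: o /= num_gpus  # ZeRO-1: shard optimizer states
    if stage >= 2: g /= num_gpus  # ZeRO-2: also shard gradients
    if stage >= 3: p /= num_gpus  # ZeRO-3: also shard parameters
    return (p + g + o) / 1e9

for stage in range(4):
    gb = zero_memory_gb(1.7e9, num_gpus=8, stage=stage)
    print(f"ZeRO-{stage}: {gb:.1f} GB/GPU")
```

Stage 0 (plain DDP) needs about 27 GB per GPU just for model states, while ZeRO-3 drops that to roughly 3.4 GB, which is why sharding makes models fit that otherwise would not.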
`DistributedDataParallel` (DDP) is PyTorch's native solution for distributed training:

DDP implements data parallelism at the module level by synchronizing gradients across processes. It uses collective communications from `torch.distributed` to coordinate gradient averaging during the backward pass.
Key characteristics:
- One process per GPU (unlike DataParallel which uses threads)
- Uses NCCL backend for efficient GPU-to-GPU communication
- Requires manual setup of process groups, distributed samplers, and multi-node coordination
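The manual setup the last bullet refers to looks roughly like the sketch below. To keep it runnable anywhere it uses the `gloo` backend on CPU with a single process (`world_size=1`); a real run would launch one process per GPU via `torchrun` with NCCL, and the toy model and data are illustrative:

```python
# Minimal sketch of manual DDP setup: process group init, DDP wrapping,
# DistributedSampler, and cleanup. Single-process CPU/gloo so it runs
# without GPUs; torchrun normally sets RANK/WORLD_SIZE/MASTER_* for you.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # 1. Manual process-group init
    dist.init_process_group("gloo", rank=0, world_size=1)
    rank = dist.get_rank()

    # 2. Wrap the model so gradients are all-reduced during backward()
    model = DDP(torch.nn.Linear(8, 1))

    # 3. Manual DistributedSampler so each rank sees a distinct shard
    ds = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
    sampler = DistributedSampler(ds, num_replicas=dist.get_world_size(), rank=rank)
    loader = DataLoader(ds, batch_size=8, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()       # DDP averages gradients here
            opt.step()

    # 4. Manual cleanup
    dist.destroy_process_group()

main()
```

Every one of these steps (environment variables, init, sampler, cleanup) must be repeated, correctly, on every node, which is exactly the boilerplate Ray Train removes.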
Ray Train is a library for distributed deep learning that simplifies scaling PyTorch training:
Ray Train handles the complexity of distributed training setup, including process management, data distribution, and checkpoint handling across workers.
Key benefits:
- Single command to scale from 1 to N GPUs
- Automatic process group initialization
- Built-in fault tolerance and checkpointing
- Seamless integration with Ray's distributed computing ecosystem
This workshop is designed to run on Anyscale, a managed Ray platform.
1. **Sign in to Anyscale Console**
   - Go to https://console.anyscale.com/
   - Create an account or sign in

2. **Create a new Workspace**
   - Click "Workspaces" in the left sidebar
   - Click "Create Workspace"
   - Select a compute configuration with GPUs (e.g., `g5.4xlarge` instances)
   - Choose the number of worker nodes based on your GPU requirements

3. **Clone this repository**

   ```bash
   git clone https://github.com/debnsuma/vhol-ray-train.git
   cd vhol-ray-train
   ```

4. **Set up**

   ```bash
   # Install all dependencies from requirements.txt
   pip install -r requirements.txt
   ```
This workshop covers distributed training from basic DDP to advanced memory optimization techniques, culminating in a real-world voice cloning project:
```
vhol-ray-train/
├── 01-ddp-pytorch-only/               # Vanilla PyTorch DDP
│   ├── train_ddp.py                   # Training script with manual DDP setup
│   └── launch_multinode_ddp.sh        # Multi-node launch instructions
│
├── 02-ddp-pytorch-ray/                # Ray Train (DDP)
│   ├── Ray_Train_Intro.ipynb          # Comprehensive tutorial notebook (start here)
│   ├── train_ray_ddp.py               # Ray Train with PyTorch DataLoader
│   └── train_ray_ddp_with_ray_data.py # Ray Train with Ray Data
│
├── 03-fsdp-pytorch-ray-deepspeed/     # Advanced: FSDP2 & DeepSpeed
│   ├── FSDP2_RayTrain_Tutorial.ipynb  # PyTorch FSDP2 with Ray Train
│   └── DeepSpeed_RayTrain_Tutorial.ipynb # DeepSpeed ZeRO with Ray Train
│
└── 04-qwen3-tts-ft-ray/               # Real-World Project: Voice Cloning
    ├── README.md                      # Detailed project documentation
    ├── src/
    │   ├── data_processing.py         # Ray Data for audio processing
    │   ├── prepare_data.py            # Audio code extraction
    │   ├── train_qwen_tts.py          # SFT with Ray Train
    │   └── inference.py               # Base vs fine-tuned comparison
    └── scripts/
        ├── run_pipeline.py            # End-to-end pipeline runner
        └── zero_shot_clone.py         # Standalone zero-shot voice cloning
```
```bash
cd 01-ddp-pytorch-only

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train_ddp.py --epochs 3

# For multi-node, see the launch instructions
./launch_multinode_ddp.sh
```

```bash
cd 02-ddp-pytorch-ray

# Start with the interactive tutorial notebook
jupyter notebook Ray_Train_Intro.ipynb

# Or run the scripts directly
python train_ray_ddp.py --num-workers 2 --epochs 3

# With Ray Data for distributed preprocessing
python train_ray_ddp_with_ray_data.py --num-workers 2 --epochs 3
```

For training large models that don't fit in a single GPU's memory:
```bash
cd 03-fsdp-pytorch-ray-deepspeed

# Start with FSDP2 tutorial (PyTorch native)
jupyter notebook FSDP2_RayTrain_Tutorial.ipynb

# Then try DeepSpeed (Microsoft's ZeRO technology)
jupyter notebook DeepSpeed_RayTrain_Tutorial.ipynb
```

Apply everything you've learned to fine-tune a 1.7B-parameter TTS model:
```bash
cd 04-qwen3-tts-ft-ray

# Run the complete pipeline
python scripts/run_pipeline.py --all --num-epochs 10

# Or zero-shot voice cloning (no fine-tuning needed)
python scripts/zero_shot_clone.py \
    --ref-audio samples/voice.wav \
    --text "Hello, this is my cloned voice." \
    --output output.wav
```

| Aspect | Vanilla PyTorch DDP | Ray Train (DDP) | FSDP2 + Ray Train | DeepSpeed + Ray Train |
|---|---|---|---|---|
| Launch | `torchrun` on each node | Single Python command | Single Python command | Single Python command |
| Process Groups | Manual init/cleanup | Automatic | Automatic | Automatic |
| Distributed Sampler | Must create manually | Handled by `prepare_data_loader()` | Handled by `prepare_data_loader()` | Manual `DistributedSampler` |
| Multi-node Setup | SSH, shared storage, coordination | Cluster handles it | Cluster handles it | Cluster handles it |
| Fault Tolerance | None (any failure stops training) | Built-in recovery | Built-in recovery | Built-in recovery |
| Checkpointing | Manual implementation | Integrated with `ray.train.report()` | PyTorch DCP + Ray Train | DeepSpeed built-in + Ray Train |
| Memory Optimization | None (full model per GPU) | None (full model per GPU) | Model sharding, CPU offload, mixed precision | ZeRO stages, CPU/NVMe offload |
| Best For | Small-medium models | Small-medium models | Large models (PyTorch native) | Very large models (LLMs) |