A research project exploring the adaptation of diffusion models from image generation to structured text generation, with comparative analysis against transformer-based approaches.
Can diffusion models, originally designed for continuous image data, successfully generate coherent and stylistically authentic poetic text when adapted with appropriate conditioning mechanisms and embedding strategies?
📄 For detailed research methodology, architecture deep-dives, and comprehensive analysis, see TextDiffusion-Research_Details.md
Instead of using frozen pre-trained embeddings or separate decoder networks, we developed a trainable embedding system:
```python
logits = denoised_embedding @ text_embeddings.weight.T
# Embeddings optimize for both noise prediction AND reconstruction
```

The transpose of the embedding matrix serves as the decoder, allowing gradients from both tasks to reshape the embedding space.
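The weight-tying idea can be sketched with a toy NumPy example. The sizes here are illustrative only (the real model uses a 3,432-word vocabulary and 768-dimensional embeddings), and the idealized, well-separated embedding rows stand in for a trained embedding matrix:

```python
import numpy as np

# Toy weight-tied decoder. Hypothetical sizes: 5-word vocab, 8-dim embeddings.
rng = np.random.default_rng(0)
vocab, dim = 5, 8
E = np.eye(vocab, dim) * 2.0  # idealized, well-separated embedding rows

token_id = 3
# A "denoised" vector close to the true embedding of token 3.
denoised_embedding = E[token_id] + 0.1 * rng.normal(size=dim)

# Decoding: the transpose of the embedding matrix acts as the decoder.
logits = denoised_embedding @ E.T
decoded = int(np.argmax(logits))  # recovers token 3
```

Because the same matrix `E` is used for embedding and (transposed) for decoding, any gradient that improves reconstruction also reshapes the embedding geometry itself.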
- N-gram position markers (START/MIDDLE/END/COMPLETE)
- Positional triplets (first, middle, last word)
- Word count and position embeddings
- Provides structural guidance while maintaining creative freedom
- Fully trainable 768-dimensional embeddings
- Joint optimization with dual loss: `diffusion_loss + α × reconstruction_loss`
- Embeddings self-organize for effective decoding through learned geometry
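A minimal NumPy sketch of the dual objective, with hypothetical shapes and an assumed α = 0.5 (the source does not state its value). The U-Net output is faked with a perturbed copy of the true noise:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, dim, alpha = 5, 8, 0.5    # alpha is an assumed value
E = rng.normal(size=(vocab, dim))            # trainable embedding matrix
token_id = 2
x0 = E[token_id]                             # clean embedding of the token
noise = rng.normal(size=dim)
noise_pred = noise + 0.1 * rng.normal(size=dim)  # stand-in for U-Net output

# Diffusion term: MSE between true and predicted noise.
diffusion_loss = np.mean((noise_pred - noise) ** 2)

# Reconstruction term: cross-entropy of weight-tied logits vs. the true token.
logits = x0 @ E.T
log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
reconstruction_loss = -log_probs[token_id]

total_loss = diffusion_loss + alpha * reconstruction_loss
```

Both terms backpropagate into `E`, which is what lets the embeddings self-organize for decoding.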
Core Components:
- U-Net structure with channels: [768, 512, 384, 256]
- Dilated convolutions (dilations: 1, 2, 4, 8) for multi-scale text pattern capture
- Multi-head cross-attention (8 heads) to conditioning embeddings
- Skip connections between encoder and decoder paths
- DDPM scheduler for training, DDIM for inference
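The multi-scale claim for the dilated convolutions can be checked with a back-of-envelope calculation: stacking convolutions with dilations 1, 2, 4, 8 grows the receptive field to 31 tokens, assuming kernel size 3 throughout (the kernel size is an assumption, not stated in the source):

```python
# Receptive field of stacked 1-D convolutions:
# each layer with kernel k and dilation d adds (k - 1) * d positions.
def receptive_field(kernel: int, dilations: list[int]) -> int:
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rf = receptive_field(kernel=3, dilations=[1, 2, 4, 8])  # 3 -> 7 -> 15 -> 31
```

This is why a few dilated layers suffice to relate words far apart in an n-gram without pooling away positional detail.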
Embeddings:
- Text embeddings: (3,432 vocab × 768 dim) - Trainable
- Length embeddings: (12 × 768 dim) - Encode n-gram length
- Position embeddings: (4 × 768 dim) - Encode structural position
Training Process:
- Convert text to embeddings
- Add noise according to timestep
- Predict noise with U-Net conditioned on position/length
- Denoise and decode via weight-tying
- Optimize: `diffusion_loss + α × reconstruction_loss`
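The "add noise according to timestep" step follows the standard DDPM forward process, `x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε`. A NumPy sketch with a linear β schedule (the schedule hyperparameters are illustrative, not taken from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 1000, 768

# Linear beta schedule and cumulative alpha-bar, as in standard DDPM.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.normal(size=dim)   # clean word embedding
eps = rng.normal(size=dim)  # Gaussian noise
t = 500                     # sampled timestep

# Forward process q(x_t | x_0): interpolate between signal and noise.
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

At small `t` the embedding is nearly intact; as `t` approaches `T`, `alpha_bar` decays toward zero and `x_t` is dominated by noise, which is the signal the conditioned U-Net learns to remove.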
Core Components:
- 2-layer encoder with multi-head self-attention (8 heads)
- 2-layer decoder with self-attention + cross-attention
- 512-dimensional model, 2048-dimensional feed-forward networks
- Sinusoidal positional encoding
- Teacher forcing with shifted decoder inputs
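The sinusoidal positional encoding follows the original Transformer formulation, `PE[pos, 2i] = sin(pos / 10000^(2i/d))` and `PE[pos, 2i+1] = cos(...)`. A NumPy sketch with the 512-dimensional model size listed above:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

pe = sinusoidal_encoding(max_len=20, d_model=512)
```

Each position gets a unique, smoothly varying signature, so the attention layers can infer relative word order without any trainable position parameters.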
Training Process:
- Encoder processes input sequence with self-attention
- Decoder generates output with:
- Self-attention (understanding previous outputs)
- Cross-attention (attending to encoder outputs)
- Final dense layer projects to vocabulary
- Optimize with sparse categorical cross-entropy
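Teacher forcing with shifted decoder inputs can be sketched in plain Python (the token IDs and BOS value here are hypothetical):

```python
# Teacher forcing: the decoder reads the gold sequence shifted right by one
# and is trained to predict the unshifted sequence at every step.
BOS = 0                              # hypothetical start-of-sequence ID
target = [12, 7, 33, 4]              # gold token IDs for one line

decoder_input = [BOS] + target[:-1]  # what the decoder sees
labels = target                      # what it must predict

pairs = list(zip(decoder_input, labels))  # (input token, expected next token)
```

During training the decoder never sees its own (possibly wrong) predictions, which stabilizes learning but contributes to the repetitive n-gram patterns observed at inference time.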
Key Mechanism: Attention heads learn which words relate to each other, enabling strong sequential coherence and context understanding.
- Source: Shakespeare's sonnets (~2,185 lines)
- N-gram Expansion: 60,426 training samples (2-20 word sequences)
- Rationale: Shakespeare's archaic syntax makes standard subject-verb-object (SVO) extraction ineffective; n-grams capture poetic patterns better
- Distribution: 54% MIDDLE, 21.2% START, 21.2% END, 3.6% COMPLETE
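A sketch of the n-gram expansion with position labels. The labeling rule is an assumption inferred from the marker names: an n-gram spanning the whole line is COMPLETE; otherwise it is START, END, or MIDDLE according to where it sits:

```python
def expand_ngrams(words, n_min=2, n_max=20):
    """Return (ngram, position_label) pairs for one line of text."""
    samples = []
    for n in range(n_min, min(n_max, len(words)) + 1):
        for start in range(len(words) - n + 1):
            end = start + n
            if start == 0 and end == len(words):
                label = "COMPLETE"
            elif start == 0:
                label = "START"
            elif end == len(words):
                label = "END"
            else:
                label = "MIDDLE"
            samples.append((words[start:end], label))
    return samples

line = "shall i compare thee to a summers day".split()
samples = expand_ngrams(line)  # 28 labeled n-grams from one 8-word line
```

Interior positions dominate combinatorially, which is consistent with the 54% MIDDLE share reported above.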
| Metric | Diffusion Model | Transformer Model |
|---|---|---|
| Final Loss | 0.93 (epoch 50) | 0.67 (epoch 40) |
| Training Samples | 60,426 n-grams | 60,426 n-grams |
| Parameters | 70.4M | ~45M |
| Training Time/Epoch | ~4.5 min | ~12 min |
Prompt: ['love', 'time', 'beauty']
Buried acceptable remove elder captive junes familiar lying
Thou best though dressing wood candles privilege; unswayed
Thou felt at dressing chaste familiar increase; but
Delayed dial's composition bitter o'ersways life's neck redeem
Prompt: "Shall I compare thee to a summer's day"
Line 1: Shall I compare thee to a summer's day
Line 2: spirit of youth winter's day
Line 3: and barren rage of death's eternal barren thine day
Line 4: of thine eye in thy view
Line 5: in thy view is pleased to dote
Sequential Coherence: Transformer > Diffusion
- Transformer: Attention mechanisms enable strong grammatical structure and contextual flow. Each line builds naturally from the previous, with clear subject-verb relationships.
- Diffusion: Words are thematically connected but lack sentence-level grammar. Coherence degrades as sequence length increases.
Creative Diversity: Diffusion > Transformer
- Diffusion: Explores wider vocabulary space with unexpected combinations ("captive junes," "forests feathered dressing")
- Transformer: Falls into repetitive patterns (e.g., "of view" repeated 4 times), relies on frequently-seen n-grams
Vocabulary Authenticity: Diffusion ≈ Transformer
- Both successfully learned Shakespearean vocabulary and archaic expressions
- Diffusion: More varied word choices per generation
- Transformer: More consistent poetic meter
Controllability: Diffusion > Transformer
- Diffusion: Responds to multiple simultaneous conditions (position, length, specific words)
- Transformer: Primarily continues from textual prompts with limited attribute control
Why These Differences?
The diffusion model operates through iterative refinement in embedding space, allowing it to explore multiple word possibilities at each position simultaneously. This parallel processing encourages creativity but lacks the sequential dependency modeling that transformers achieve through attention.
The transformer's attention mechanism explicitly models "what word comes after what," creating strong sequential dependencies. It understands context across the entire sequence but can get trapped in high-probability n-gram patterns it learned during training.
Despite being adapted from image generation, the diffusion model successfully:
- Learned authentic Shakespearean language - Captured archaic vocabulary, poetic expressions, and thematic elements
- Generated creative word combinations - Produced novel phrases like "death's eternal barren" and "fire--my candles dreading"
- Responded to positional conditioning - Incorporated prompt words and structural guidance effectively
- Navigated semantic embedding space - Self-organized embeddings for meaningful word relationships
- Maintained poetic meter - Lines often follow appropriate syllable counts and rhythm
- Demonstrated vocabulary diversity - Explored wider lexical range than transformer baseline
Technical Achievement: Successfully bridged discrete text tokens to continuous embedding space, enabling diffusion processes to work on linguistic data. The weight-tying strategy solved the fundamental challenge of decoding without separate networks.
- Latent diffusion for text: Compress sentences into latent space for better long-range coherence
- Enhanced conditioning: Add rhyme scheme, meter, semantic theme controls
- Larger training corpus: Expand to complete Shakespeare works and Elizabethan poetry
- Hybrid architectures: Combine diffusion's creativity with transformer's coherence
- Cross-modal generation: Poetry conditioned on images or music
- Controllable creative writing tools: Interactive refinement with multi-constraint optimization
This research demonstrates that diffusion models can generate stylistically authentic poetic text, despite being designed for continuous image data. The weight-tied decoding innovation enables trainable embeddings to self-organize for effective generation.
While transformers currently produce more grammatically coherent sonnets, diffusion models offer unique advantages in creative exploration and controllable generation. The promising results—authentic vocabulary, thematic coherence, and responsive conditioning—suggest diffusion-based text generation could become valuable for applications requiring controlled creativity.