Comparative Genomics Final Project: SNP-Based Outbreak Analysis

This README documents a comparative genomics pipeline that performs phylogenetic analysis using both read-based (Snippy) and assembly-based (Parsnp) approaches. The pipeline aims to assess genetic relatedness among bacterial isolates by detecting SNPs, aligning core genomes, and visualizing their evolutionary relationships. Results from both approaches are compared to highlight isolate similarities and differences, with a focus on distinguishing outbreak-associated strains from sporadic ones using phylogenetic trees.

Pipeline Overview

This project consists of two main analysis branches:

1. Read-Based Analysis using Snippy

2. Assembly-Based Analysis using SKESA and Parsnp

Both pipelines produce phylogenetic trees that are visualized with ggtree for interpretation.

Reference Genome Selection

Two different reference genomes were used to evaluate their impact on alignment and tree quality:

NCBI Ref Genome: GCF_000009085.1_ASM908v1_genomic.fna.gz

Sample-based Ref Genome: E1376901.fna (chosen from our assembly dataset)

Why E1376901?

We selected E1376901 based on:

- High completeness (99.99%)

- Low contamination (0.05%)

- High N50 (226,829 bp)

- Fewest contigs (12)
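The contig count and N50 listed above can be recomputed from any assembly FASTA file. The following is a minimal sketch, not necessarily how the values above were obtained: the helper name assembly_stats and the toy input are hypothetical, and completeness and contamination are not covered because they require a dedicated QC tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<contig count> <N50>" for one FASTA file.
assembly_stats() {
    # Emit one length per contig, then sort longest first.
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             # N50: length of the contig at which the running sum
             # first reaches half of the total assembly length.
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print NR, lens[i]; exit }
             }
         }'
}

# Toy input with contigs of 10, 6 and 4 bp (not project data).
toy=$(mktemp)
printf '>c1\nAAAAAAAAAA\n>c2\nCCCCCC\n>c3\nGGGG\n' > "$toy"
assembly_stats "$toy"   # prints "3 10"
```

Running the helper over every file in an assemblies folder gives a quick table for comparing candidate references on these two criteria.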

Read-Based Analysis

Read-based analysis calls SNPs directly from the raw sequencing reads, rather than from assembled genomes, and builds phylogenetic trees from those calls. This approach is often more accurate for SNP calling, particularly in outbreak scenarios, because it preserves the original read information and avoids assembly-induced artifacts.

Tool Description: Snippy

Snippy is a rapid and lightweight bacterial variant calling pipeline. It aligns raw Illumina reads to a reference genome, calls SNPs, and generates core genome alignments that can be used for phylogenetic inference. Snippy is highly effective for microbial comparative genomics and is widely used in pathogen surveillance and outbreak tracing.

GitHub: https://github.com/tseemann/snippy

Installation

Fastp

Version: fastp v0.24.1

    conda install -c bioconda fastp

Snippy

Version: Snippy v4.6.0

    conda create -n snippy_env -c bioconda -c conda-forge snippy

Step by Step Analysis

Step 1: Quality Control with Fastp

Paired-end FASTQ files were cleaned using fastp to remove low-quality reads. This was performed in a loop for all samples.

    mkdir -p cleaned_reads

    for read in raw_reads/*_R1_001.fastq.gz; do
        sample=$(basename "${read}" _R1_001.fastq.gz)
        fastp \
            -i "${read}" \
            -I "${read/_R1_/_R2_}" \
            -o cleaned_reads/"${sample}_R1_trimmed.fq.gz" \
            -O cleaned_reads/"${sample}_R2_trimmed.fq.gz"
    done
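The loop above derives the sample name with basename and the R2 path with the substitution ${read/_R1_/_R2_}, so a missing or misnamed R2 file only surfaces when fastp fails. A pre-flight check along the following lines can catch that earlier. This is a sketch under assumptions: the function name list_pairs and the toy file names are hypothetical, and the files are empty placeholders rather than real reads.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: print "sample<TAB>R1<TAB>R2" for every
# complete pair and warn on stderr when the R2 mate is missing.
list_pairs() {
    local dir=$1 r1 r2 sample
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue              # glob matched nothing
        sample=$(basename "$r1" _R1_001.fastq.gz)
        r2=${r1/_R1_/_R2_}                    # same substitution as the fastp loop
        if [ -e "$r2" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
        else
            echo "missing R2 for ${sample}" >&2
        fi
    done
}

# Toy directory with one complete pair (A) and one orphan R1 (B).
demo=$(mktemp -d)
touch "$demo/A_S1_L001_R1_001.fastq.gz" "$demo/A_S1_L001_R2_001.fastq.gz" \
      "$demo/B_S2_L001_R1_001.fastq.gz"
list_pairs "$demo"
```

Only sample A_S1_L001 is listed on standard output; the orphan B_S2_L001 is reported on standard error.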

Step 2: SNP Calling with Snippy

Snippy was used to call variants using the E1376901 reference genome located in the ref/ folder. Each cleaned sample was processed and the outputs were saved in snippy_outputs/.

    mkdir -p snippy_outputs

    for R1 in cleaned_reads/*_R1_trimmed.fq.gz; do
        sample=$(basename "${R1}" _R1_trimmed.fq.gz)
        R2="cleaned_reads/${sample}_R2_trimmed.fq.gz"
        snippy \
            --cpus 4 \
            --outdir "snippy_outputs/mysnps-${sample}" \
            --ref ref/E1376901_S01_L001_contigs.fasta \
            --R1 "${R1}" \
            --R2 "${R2}"
    done

Step 3: Core SNP Alignment

Snippy-core was used to combine all individual sample SNPs into a single core SNP alignment.

    snippy-core \
        --prefix snippy_outputs/core \
        --ref ref/E1376901_S01_L001_contigs.fasta \
        snippy_outputs/mysnps-*
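A core SNP alignment is often summarized as pairwise SNP distances before any tree is built, since closely related isolates differ at few positions. The sketch below counts differences between every pair of sequences in a FASTA alignment, ignoring positions where either sequence has anything other than A, C, G or T. The function name pairwise_snps and the toy alignment are hypothetical, no distance threshold is implied, and dedicated tools are faster on large alignments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "seqA<TAB>seqB<TAB>SNP count" for each pair
# of sequences in an aligned FASTA file (equal-length sequences assumed).
pairwise_snps() {
    awk '
        /^>/ { name[++n] = substr($1, 2); next }
        { seq[n] = seq[n] toupper($0) }       # join wrapped sequence lines
        END {
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    for (k = 1; k <= length(seq[i]); k++) {
                        a = substr(seq[i], k, 1); b = substr(seq[j], k, 1)
                        # Count only unambiguous mismatches.
                        if (a != b && a ~ /[ACGT]/ && b ~ /[ACGT]/) d++
                    }
                    print name[i] "\t" name[j] "\t" d
                }
        }' "$1"
}

# Toy alignment (not project data).
aln=$(mktemp)
printf '>s1\nACGTAC\n>s2\nACGTAT\n>s3\nTCGNAT\n' > "$aln"
pairwise_snps "$aln"
```

On the toy input, s1 and s3 differ at two counted positions; the N in s3 is skipped rather than counted as a difference.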

Assembly-Based Analysis

Assembly-based analysis uses assembled genome sequences rather than raw sequencing reads. In this approach, paired-end reads are first assembled into longer contigs, which are then aligned to identify the core genome and construct phylogenetic trees. This method allows comparisons at the whole-genome level.

Tool Description: SKESA and Parsnp

SKESA

Version: SKESA v2.4.0

SKESA is a fast and reliable de novo assembler for raw Illumina sequencing reads. It produces contigs and automatically removes low-quality or very short ones. It works well for assembling bacterial genomes.

GitHub: https://github.com/ncbi/SKESA

Parsnp

Version: Parsnp v2.1.3

Parsnp is a tool for efficient core genome alignment of microbial assemblies. It rapidly identifies homologous genomic regions among multiple assemblies, aligns them, and builds a phylogenetic tree. It is commonly used for outbreak analysis, comparative pathogen genomics and phylogenetic reconstruction.

GitHub: https://github.com/marbl/parsnp

Installation

Fastp

    conda install -c bioconda fastp

SKESA

    conda create -n skesa_env -c bioconda skesa
    conda activate skesa_env

Parsnp

    conda create -n harvestsuite -c bioconda parsnp fasttree
    conda activate harvestsuite

Step by Step Analysis

Step 1: Read Quality Control using fastp

Paired-end FASTQ files were filtered and trimmed to remove low-quality reads using fastp.

    mkdir -p cleaned_reads

    for read in raw_reads/*_R1_001.fastq.gz; do
        sample="$(basename "${read}" _R1_001.fastq.gz)"
        fastp \
            -i "${read}" \
            -I "${read/_R1_/_R2_}" \
            -o cleaned_reads/"${sample}_R1_cleaned.fq.gz" \
            -O cleaned_reads/"${sample}_R2_cleaned.fq.gz"
    done

Step 2: Genome Assembly using SKESA

Filtered reads were assembled using SKESA.

    mkdir -p assemblies

    for read in cleaned_reads/*_R1_cleaned.fq.gz; do
        sample="$(basename "${read}" _R1_cleaned.fq.gz)"
        skesa \
            --reads "${read}" "${read/_R1_cleaned.fq.gz/_R2_cleaned.fq.gz}" \
            --cores 4 \
            --min_contig 1000 \
            --contigs_out assemblies/"${sample}.fna"
    done

Step 3: Core Genome Alignment with Parsnp

Core genome alignment was performed on all assemblies using E1376901 as the reference genome. The --use-fasttree flag was added to build a tree directly from the aligned regions.

    parsnp \
        -d assemblies \
        -r E1376901_S01_L001_contigs.fasta \
        -o parsnp_outdir \
        -p 8 \
        --use-fasttree --fo

Tree Visualization and Interpretation

This section describes the visualization of phylogenetic trees derived from both assembly-based and read-based alignments. The trees were visualized using ggtree and further refined using Inkscape and ggplot2 for better visual presentation.

Tool Description: ggtree, Inkscape, IQ-TREE

ggtree

Version: ggtree v3.14.0

ggtree is an R package for visualizing and annotating phylogenetic trees read from tree files.

The full code for clustering and tree visualization with ggtree is in the final results folder.

Inkscape

Inkscape is a free and open-source vector graphics editor. It was used to refine the ggtree output, for example by adjusting fonts, labels, and node shapes.

Installation:

https://inkscape.org/

IQ-TREE

Version: IQ-TREE v2.4.0

IQ-TREE is a fast and efficient phylogenetic software used to construct maximum likelihood trees. It supports a wide range of evolutionary models and includes tools like:

- ModelFinder: For automatic model selection

- UFBoot: Ultrafast bootstrap approximation

- SH-aLRT: Approximate likelihood ratio test for branch support

IQ-TREE is optimized for multi-threaded execution, making it well-suited for large-scale or high-throughput genomic analyses.

Installation:

http://www.iqtree.org/#download

    conda install -c bioconda iqtree

IQ-TREE was used to construct a phylogenetic tree from the core SNP alignment generated by Snippy.

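Run 1: With E1376901 reference genome

    iqtree -s core.aln -nt AUTO

Run 2: With GCF_000009085.1_ASM908v1 as reference genome. The -redo flag makes IQ-TREE overwrite output files left by a previous run on an alignment with the same name.

    iqtree -s core.aln -nt AUTO -redo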
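Because both runs read a file named core.aln, the second run replaces the output of the first unless the files are moved in between. One way to keep both trees side by side is to give each run its own output prefix with IQ-TREE's --prefix option. The sketch below only prints the commands by default; the directory layout (snippy_<reference>/core.aln), the prefix names, and the run_cmd helper are assumptions for illustration, not the layout used in this project.

```shell
#!/usr/bin/env bash
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run_cmd() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Hypothetical layout: one snippy-core output directory per reference.
for ref in E1376901 GCF_000009085.1_ASM908v1; do
    run_cmd iqtree -s "snippy_${ref}/core.aln" -nt AUTO --prefix "iqtree_${ref}"
done
```

Setting DRY_RUN=0 runs the commands for real, provided iqtree is installed and the alignment paths exist.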

Performance Metrics and Result Interpretation

Tool	Reference Used	Threads	Runtime (Real Time)	CPU Time (User)	Max Memory Usage
Snippy	GCF_000009085.1_ASM908v1	4	34m 48s	96m 54s	~200–300 MB
Snippy	E1376901	4	31m 17s	84m 35s	~200–300 MB
Parsnp	GCF_000009085.1_ASM908v1	8	1m 06s	2m 5s	~100–150 MB
Parsnp	E1376901	8	1m 09s	1m 53s	~100–150 MB
IQ-TREE	GCF_000009085.1_ASM908v1	AUTO (1)	43.5s	~43s	18.1 MB
IQ-TREE	E1376901	AUTO (1)	51.2s	~51s	15.3 MB
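Dividing CPU time by runtime gives a rough estimate of how many cores a tool kept busy on average, which helps judge how well the requested threads were used. The sketch below applies that arithmetic to the Snippy and Parsnp rows of the table. The helper names to_seconds and busy_cores are hypothetical, and the estimate is approximate because user CPU time excludes system time.

```shell
#!/usr/bin/env bash
# Convert a duration such as "34m 48s" to seconds.
to_seconds() {
    echo "$1" | awk '{ m = $1; s = $2; sub(/m/, "", m); sub(/s/, "", s); print m * 60 + s }'
}

# Average number of busy cores = CPU (user) time / runtime (real time).
busy_cores() {
    awk -v cpu="$(to_seconds "$1")" -v real="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", cpu / real }'
}

# Values copied from the table above: CPU time first, then runtime.
echo "Snippy, GCF_000009085.1_ASM908v1: $(busy_cores '96m 54s' '34m 48s') of 4 threads"
echo "Snippy, E1376901:                 $(busy_cores '84m 35s' '31m 17s') of 4 threads"
echo "Parsnp, GCF_000009085.1_ASM908v1: $(busy_cores '2m 5s' '1m 06s') of 8 threads"
echo "Parsnp, E1376901:                 $(busy_cores '1m 53s' '1m 09s') of 8 threads"
```

By this estimate Snippy averaged about 2.7 to 2.8 busy cores out of 4, and Parsnp fewer than 2 out of 8, which suggests that requesting more threads would have had limited effect on runs this short.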
