Skip to content

Lasya-Pasumarthy/Comp-Genomics-P4

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

90 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Comparative Genomics Final Project: SNP-Based Outbreak Analysis

This README documents a comparative genomics pipeline that performs phylogenetic analysis using both read-based (Snippy) and assembly-based (Parsnp) approaches. The pipeline aims to assess genetic relatedness among bacterial isolates by detecting SNPs, aligning core genomes, and visualizing their evolutionary relationships. Results from both approaches are compared to highlight isolate similarities and differences, with a focus on distinguishing outbreak-associated strains from sporadic ones using phylogenetic trees.

Pipeline Overview

This project consists of two main analysis branches:

1. Read-Based Analysis using Snippy

2. Assembly-Based Analysis using SKESA and Parsnp

Both pipelines produce phylogenetic trees that are visualized with ggtree for interpretation.

Reference Genome Selection

Two different reference genomes were used to evaluate their impact on alignment and tree quality:

NCBI Ref Genome: GCF_000009085.1_ASM908v1_genomic.fna.gz

Sample-based Ref Genome: E1376901.fna (chosen from our assembly dataset)

Why E1376901?

We selected E1376901 based on:

- High completeness (99.99%)

- Low contamination (0.05%)

- High N50 (226,829)

- Fewest contigs (12)

Read-Based Analysis

Read-based analysis uses raw sequencing reads of our assembly files to perform SNP detection and build phylogenetic trees. This approach is often more accurate for SNP calling, particularly in outbreak scenarios, because it preserves original read information and reduces assembly-induced artifacts.

Tool Description: Snippy

Snippy is a rapid and lightweight bacterial variant calling pipeline. It aligns raw Illumina reads to a reference genome, calls SNPs, and generates core genome alignments that can be used for phylogenetic inference. Snippy is highly effective for microbial comparative genomics and is widely used in pathogen surveillance and outbreak tracing.

GitHub: https://github.com/tseemann/snippy

Installation

Fastp

Version: fastp v0.24.1

    conda install -c bioconda fastp

Snippy

Version: Snippy v4.6.0

    conda create -n snippy_env -c bioconda -c conda-forge snippy

Step by Step Analysis

Step 1: Quality Control with Fastp

Paired-end FASTQ files were cleaned using fastp to remove low-quality reads. This was performed in a loop for all samples.

mkdir -p cleaned_reads

for read in raw_reads/*_R1_001.fastq.gz; do
    sample=$(basename "${read}" _R1_001.fastq.gz)
    fastp \
        -i "${read}" \
        -I "${read/_R1_/_R2_}" \
        -o cleaned_reads/"${sample}_R1_trimmed.fq.gz" \
        -O cleaned_reads/"${sample}_R2_trimmed.fq.gz"
done

Step 2: SNP Calling with Snippy

Snippy was used to call variants using the E1376901 reference genome located in the ref/ folder. Each cleaned sample was processed and the outputs were saved in snippy_outputs/.

       mkdir -p snippy_outputs

for R1 in cleaned_reads/*_R1_trimmed.fq.gz; do
    sample=$(basename "${R1}" _R1_trimmed.fq.gz)
    R2=cleaned_reads/${sample}_R2_trimmed.fq.gz
    snippy \
        --cpus 4 \
        --outdir snippy_outputs/mysnps-${sample} \
        --ref ref/E1376901_S01_L001_contigs.fasta \
        --R1 "${R1}" \
        --R2 "${R2}"
done

Step 3: Core SNP Alignment

Snippy-core was used to combine all individual sample SNPs into a single core SNP alignment.

    snippy-core \
--prefix snippy_outputs/core \
--ref ref/E1376901_S01_L001_contigs.fasta \
snippy_outputs/mysnps-*

Assembly-Based Analysis

Assembly-based analysis uses the fully assembled genome sequences, rather than raw sequencing reads. In this approach, paired-end reads are first assembled into longer contigs, which are then used to identify core genome alignments and construct phylogenetic trees. This method provides a higher-quality comparisons at the whole-genome level.

Tool Description: Parsnp and Skesa

SKESA

Version: Skesa 2.4.0.

SKESA is a fast and reliable tool used to build genomes from raw Illumina sequencing reads. It creates contigs and automatically removes low-quality or very short ones. It works well for assembling bacterial genomes.

GitHub: https://github.com/ncbi/SKESA

Parsnp

Version: Parsnp v2.1.3

Parsnp is a tool for efficient core genome alignment of microbial assemblies. It rapidly identifies homologous genomic regions among multiple assemblies, aligns them, and builds a phylogenetic tree. It is commonly used for outbreak analysis, comparative pathogen genomics and phylogenetic reconstruction.

GitHub: https://github.com/marbl/parsnp

Installation

Fastp

    conda install -c bioconda fastp

SKESA

    conda create -n skesa_env -c bioconda skesa
    conda activate skesa_env

Parsnp

    conda create -n harvestsuite -c bioconda parsnp fasttree
    conda activate harvestsuite

Step by Step Analysis

Step 1: Read Quality Control using fastp

Paired-end FASTQ files were filtered and trimmed to remove low-quality reads using fastp

    for read in raw_reads/*_R1_001.fastq.gz; do
sample="$(basename "${read}" _R1_001.fastq.gz)"
fastp \
    -i "${read}" \
    -I "${read/_R1_/_R2_}" \
    -o cleaned_reads/"${sample}_R1_cleaned.fq.gz" \
    -O cleaned_reads/"${sample}_R2_cleaned.fq.gz"
done

Step 2: Genome Assembly using SKESA

Filtered reads were assembled using SKESA

   mkdir -p assemblies
for read in cleaned_reads/*_R1_cleaned.fq.gz; do
    sample="$(basename "${read}" _R1_cleaned.fq.gz)"
    skesa \
        --reads "${read}" "${read/_R1_cleaned.fq.gz/_R2_cleaned.fq.gz}" \
        --cores 4 \
        --min_contig 1000 \
        --contigs_out assemblies/"${sample}.fna"
done

Step 3: Core Genome Alignment with Parsnp

Core genome alignment was performed on all assemblies using E1376901 as the reference genome. The --use-fasttree flag was added to build a tree directly from the aligned regions.

parsnp \
-d assemblies \
-r E1376901_S01_L001_contigs.fasta \
-o parsnp_outdir \
-p 8 \
--use-fasttree --fo

Tree Visualization and Interpretation

This section describes the visualization of phylogenetic trees derived from both assembly-based and read-based alignments. The trees were visualized using ggtree and further refined using Inkscape and ggplot2 for better visual presentation.

Tool Description: ggtree, IQ-TREE, Inkscape

version: ggtree 3.14.0

ggtree is an R package that is used for generating phylogenetic trees from tree files.

You can find the full code for clustering and visualization of the tree using this method, in the final results folder.

Inkscape

Inkscape is a free and open-source vector graphics editor. It was used to refine outputs from ggtree such as adjusting fonts, labels, node shapes.

Installation:

IQ-TREE

version: IQ-TREE v2.4.0

IQ-TREE is a fast and efficient phylogenetic software used to construct maximum likelihood trees. It supports a wide range of evolutionary models and includes tools like:

- ModelFinder: For automatic model selection

- UFBoot: Ultrafast bootstrap approximation

- SH-aLRT: Approximate likelihood ratio test for branch support

IQ-TREE is optimized for multi-threaded execution, making it well-suited for large-scale or high-throughput genomic analyses.

Installation:

IQ-TREE was used to construct a phylogenetic tree from the core SNP alignment generated by Snippy.

Run 1: With E1376901 reference genome

iqtree -s core.aln -nt AUTO

Run 2: With GCF_000009085.1_ASM908v1 as reference genome

iqtree -s core.aln -nt AUTO -redo

Performance Metrics and Result Interpretation

Tool Reference Used Threads Runtime (Real Time) CPU Time (User) Max Memory Usage
Snippy GCF_000009085.1_ASM908v1 4 34m 48s 96m 54s ~200–300 MB
Snippy E1376901 4 31m 17s 84m 35s ~200–300 MB
Parsnp GCF_000009085.1_ASM908v1 8 1m 06s 2m 5s ~100–150 MB
Parsnp E1376901 8 1m 09s 1m 53s ~100–150 MB
IQ-TREE GCF_000009085.1_ASM908v1 AUTO (1) 43.5s ~43s 18.1 MB
IQ-TREE E1376901 AUTO (1) 51.2s ~51s 15.3 MB

About

Phase 4 of BIOL7210 (Computational Genomics)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages