This README documents a comparative genomics pipeline that performs phylogenetic analysis using both read-based (Snippy) and assembly-based (Parsnp) approaches. The pipeline aims to assess genetic relatedness among bacterial isolates by detecting SNPs, aligning core genomes, and visualizing their evolutionary relationships. Results from both approaches are compared to highlight isolate similarities and differences, with a focus on distinguishing outbreak-associated strains from sporadic ones using phylogenetic trees.
This project consists of two main analysis branches:
We selected E1376901 based on:
Read-based analysis uses raw sequencing reads of our assembly files to perform SNP detection and build phylogenetic trees. This approach is often more accurate for SNP calling, particularly in outbreak scenarios, because it preserves original read information and reduces assembly-induced artifacts.
Snippy is a rapid and lightweight bacterial variant calling pipeline. It aligns raw Illumina reads to a reference genome, calls SNPs, and generates core genome alignments that can be used for phylogenetic inference. Snippy is highly effective for microbial comparative genomics and is widely used in pathogen surveillance and outbreak tracing.
GitHub: https://github.com/tseemann/snippy
conda install -c bioconda fastp
conda create -n snippy_env -c bioconda -c conda-forge snippy
Paired-end FASTQ files were cleaned using fastp to remove low-quality reads. This was performed in a loop for all samples.
mkdir -p cleaned_reads
for read in raw_reads/*_R1_001.fastq.gz; do
sample=$(basename "${read}" _R1_001.fastq.gz)
fastp \
-i "${read}" \
-I "${read/_R1_/_R2_}" \
-o cleaned_reads/"${sample}_R1_trimmed.fq.gz" \
-O cleaned_reads/"${sample}_R2_trimmed.fq.gz"
done
Snippy was used to call variants using the E1376901 reference genome located in the ref/ folder. Each cleaned sample was processed and the outputs were saved in snippy_outputs/.
mkdir -p snippy_outputs
for R1 in cleaned_reads/*_R1_trimmed.fq.gz; do
sample=$(basename "${R1}" _R1_trimmed.fq.gz)
R2=cleaned_reads/${sample}_R2_trimmed.fq.gz
snippy \
--cpus 4 \
--outdir snippy_outputs/mysnps-${sample} \
--ref ref/E1376901_S01_L001_contigs.fasta \
--R1 "${R1}" \
--R2 "${R2}"
done
snippy-core \
--prefix snippy_outputs/core \
--ref ref/E1376901_S01_L001_contigs.fasta \
snippy_outputs/mysnps-*
Assembly-based analysis uses the fully assembled genome sequences, rather than raw sequencing reads. In this approach, paired-end reads are first assembled into longer contigs, which are then used to identify core genome alignments and construct phylogenetic trees. This method provides a higher-quality comparisons at the whole-genome level.
SKESA is a fast and reliable tool used to build genomes from raw Illumina sequencing reads. It creates contigs and automatically removes low-quality or very short ones. It works well for assembling bacterial genomes.
GitHub: https://github.com/ncbi/SKESA
Parsnp is a tool for efficient core genome alignment of microbial assemblies. It rapidly identifies homologous genomic regions among multiple assemblies, aligns them, and builds a phylogenetic tree. It is commonly used for outbreak analysis, comparative pathogen genomics and phylogenetic reconstruction.
GitHub: https://github.com/marbl/parsnp
conda install -c bioconda fastp
conda create -n skesa_env -c bioconda skesa
conda activate skesa_env
conda create -n harvestsuite -c bioconda parsnp fasttree
conda activate harvestsuite
for read in raw_reads/*_R1_001.fastq.gz; do
sample="$(basename "${read}" _R1_001.fastq.gz)"
fastp \
-i "${read}" \
-I "${read/_R1_/_R2_}" \
-o cleaned_reads/"${sample}_R1_cleaned.fq.gz" \
-O cleaned_reads/"${sample}_R2_cleaned.fq.gz"
done
mkdir -p assemblies
for read in cleaned_reads/*_R1_cleaned.fq.gz; do
sample="$(basename "${read}" _R1_cleaned.fq.gz)"
skesa \
--reads "${read}" "${read/_R1_cleaned.fq.gz/_R2_cleaned.fq.gz}" \
--cores 4 \
--min_contig 1000 \
--contigs_out assemblies/"${sample}.fna"
done
Core genome alignment was performed on all assemblies using E1376901 as the reference genome. The --use-fasttree flag was added to build a tree directly from the aligned regions.
parsnp \
-d assemblies \
-r E1376901_S01_L001_contigs.fasta \
-o parsnp_outdir \
-p 8 \
--use-fasttree --fo
This section describes the visualization of phylogenetic trees derived from both assembly-based and read-based alignments. The trees were visualized using ggtree and further refined using Inkscape and ggplot2 for better visual presentation.
ggtree is an R package that is used for generating phylogenetic trees from tree files.
You can find the full code for clustering and visualization of the tree using this method, in the final results folder.
Inkscape is a free and open-source vector graphics editor. It was used to refine outputs from ggtree such as adjusting fonts, labels, node shapes.
Installation:
IQ-TREE is a fast and efficient phylogenetic software used to construct maximum likelihood trees. It supports a wide range of evolutionary models and includes tools like:
IQ-TREE is optimized for multi-threaded execution, making it well-suited for large-scale or high-throughput genomic analyses.
Installation:
- http://www.iqtree.org/#download
conda install iqtree
iqtree -s core.aln -nt AUTO
iqtree -s core.aln -nt AUTO -redo
| Tool | Reference Used | Threads | Runtime (Real Time) | CPU Time (User) | Max Memory Usage |
|---|---|---|---|---|---|
| Snippy | GCF_000009085.1_ASM908v1 | 4 | 34m 48s | 96m 54s | ~200–300 MB |
| Snippy | E1376901 | 4 | 31m 17s | 84m 35s | ~200–300 MB |
| Parsnp | GCF_000009085.1_ASM908v1 | 8 | 1m 06s | 2m 5s | ~100–150 MB |
| Parsnp | E1376901 | 8 | 1m 09s | 1m 53s | ~100–150 MB |
| IQ-TREE | GCF_000009085.1_ASM908v1 | AUTO (1) | 43.5s | ~43s | 18.1 MB |
| IQ-TREE | E1376901 | AUTO (1) | 51.2s | ~51s | 15.3 MB |