15 Phylogenetics
15.1 A quick primer on Phylogenetic Analysis in Bioinformatics
Phylogenetic analysis is a key step in understanding evolutionary relationships among biological entities, such as species, genes, or populations. These relationships are represented using phylogenetic trees, which are graphical hypotheses about evolutionary history. Trees are built using molecular sequence data (DNA, RNA, or proteins), providing insight into lineage divergence and ancestral relationships.
15.1.1 Key Concepts
- Nodes: Represent taxa (species, genes, etc.). Leaves are current taxa, while internal nodes represent hypothetical ancestors.
- Branches: Connect nodes and can represent genetic change or time.
- Rooted vs. Unrooted Trees: Rooted trees specify a common ancestor and therefore a direction of evolution, while unrooted trees show relationships without inferring directionality.
15.1.2 Tree Types
- Cladogram: Depicts relationships; branch lengths have no meaning.
- Phylogram: Branch lengths reflect genetic change or evolutionary time.
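Both tree types are commonly stored in Newick format, the plain-text format written by tree-building programs such as FastTree. As a minimal sketch (the taxon names A, B and C and the branch lengths are invented for illustration), the snippet below holds a small phylogram and strips its branch lengths with sed, leaving only the branching order of the corresponding cladogram:

```shell
# A toy phylogram in Newick format: each taxon or clade is followed by ":<branch length>".
# Taxon names and branch lengths are invented for illustration.
phylogram='((A:0.10,B:0.20):0.05,C:0.30);'

# Removing every ":<number>" leaves only the branching order, i.e. a cladogram.
cladogram=$(printf '%s\n' "$phylogram" | sed -E 's/:[0-9.eE+-]+//g')

echo "phylogram: $phylogram"
echo "cladogram: $cladogram"   # prints: cladogram: ((A,B),C);
```

The nesting of the parentheses is identical in both strings; only the numbers after the colons carry the information that distinguishes a phylogram from a cladogram.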
15.1.3 Methods for Constructing Trees
- Distance-Based Methods: Use sequence data to calculate genetic distances and cluster taxa (e.g., UPGMA, Neighbor-Joining).
- Character-Based Methods: Evaluate the characters at each alignment site directly, usually under an explicit model of sequence evolution (e.g., Maximum Likelihood, Bayesian Inference).
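The simplest genetic distance used by distance-based methods is the p-distance: the proportion of aligned sites at which two sequences differ. The sketch below assumes the two sequences are already aligned and of equal length; p_distance is a hypothetical helper name and the sequences are made up.

```shell
# p_distance SEQ1 SEQ2: proportion of aligned sites at which two sequences differ.
# Assumes both sequences are already aligned and have the same length.
p_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        if (n == 0 || n != length(b)) { print "NA"; exit 1 }
        diff = 0
        for (i = 1; i <= n; i++)
            if (substr(a, i, 1) != substr(b, i, 1)) diff++
        printf "%.3f\n", diff / n
    }'
}

# Toy sequences (made up): 2 of 10 sites differ.
p_distance ACGTACGTAC ACGTTCGTAA   # prints 0.200
```

Computing this value for every pair of taxa gives the distance matrix that clustering methods such as UPGMA or Neighbor-Joining take as input.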
15.1.4 Molecular Evolution Models
Because the differences we observe between sequences today do not necessarily reflect the true evolutionary distance (several substitutions at the same site can overwrite one another), models such as Jukes-Cantor have been devised to correct the observed sequence differences. This results in a more accurate estimate of the true evolutionary distance, accounting for mutation rates and substitution patterns.
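As an illustration of such a correction, the Jukes-Cantor model converts an observed proportion of differing sites p into a corrected distance d = -3/4 ln(1 - 4p/3). The sketch below is a minimal implementation; jc69_distance is a hypothetical helper name and the input values are arbitrary.

```shell
# jc69_distance P: Jukes-Cantor corrected distance for an observed
# proportion of differing sites P, using d = -3/4 * ln(1 - 4P/3).
jc69_distance() {
    awk -v p="$1" 'BEGIN {
        # The correction is undefined once P reaches 0.75 (saturation).
        if (p < 0 || p >= 0.75) { print "NA"; exit 1 }
        printf "%.4f\n", -0.75 * log(1 - 4 * p / 3)
    }'
}

jc69_distance 0.10   # prints 0.1073
jc69_distance 0.30   # prints 0.3831
```

The correction is small for closely related sequences and grows as the observed difference increases, because more of the substitutions are hidden by repeated changes at the same site.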
15.1.5 Applications
Phylogenetic trees are widely used to trace the evolution of pathogens, to understand species divergence, and even to investigate forensic cases involving pathogen transmission. Nowadays, they are the preferred method for outbreak tracking. In addition, some bioinformatics algorithms use a clustering or phylogenetic tree construction step to speed up their analyses.
15.1.6 Caveats
During the lecture, we discussed several important caveats to keep in mind when interpreting phylogenetic trees:
- Phylogenetic trees illustrate evolutionary relatedness, but do not represent evolutionary progress or advancement.
- Branch lengths may reflect evolutionary distances, indicating the degree of genetic divergence from the nearest internal node.
- To compare the “relatedness” of taxa, focus on shared common ancestors rather than just the proximity of branches.
- Phylogenetic analyses often rely on incomplete data. Your evolutionary hypothesis may change as more samples are included, so be cautious about drawing definitive conclusions.
15.2 Making a Phylogenetic Tree from WGS Data
By the time you reach this point in the course (week 2), you will have downloaded or generated your own whole-genome sequencing (WGS) data. Whether you’ve worked with parasite genomes using BWA and GATK, or prokaryotic genomes using snippy, you will have VCF files containing your called variants. The good news is that you’re almost ready to build a phylogenetic tree! Most tree-building programs require a FASTA file of aligned SNPs as input, so you just need to carry out some additional processing.
15.2.1 Option 1: Combine VCFs from GATK
If you have been working with Plasmodium or other parasites and used BWA and GATK, there are multiple ways to combine your data into a format suitable for phylogenetic tree-building programs.
One option is to use the Python script that we used during the course. You can find it here: https://github.com/FreBio/mytools/blob/master/vcf2fasta.py . You can run it on the combined VCF files produced with GATK, after keeping only the SNPs with BCFtools:
bcftools view -v snps -Oz -o filtered_output.snps.vcf.gz filtered_output.vcf.gz
vcf2fasta.py filtered_output.snps.vcf.gz
After filtering with BCFtools, the script generates a FASTA file that you can use as input for a phylogenetic tree-building program, for example:
FastTree -gtr -nt filtered_output.snps.vcf.gz.fa > mytree.nw
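Tree-building programs such as FastTree expect an alignment, so every sequence in the FASTA file should have the same length. The sketch below is one way to check this before building the tree; check_alignment is a hypothetical helper name, and the toy sequences are made up.

```shell
# check_alignment FASTA: report the number of sequences and the alignment length,
# and return non-zero if the sequences are not all the same length.
check_alignment() {
    awk '
        /^>/ { if (name != "") len[name] = n; name = $0; n = 0; next }
             { n += length($0) }
        END  {
            if (name != "") len[name] = n
            count = 0; ok = 1
            for (s in len) {
                count++
                if (count == 1) first = len[s]
                else if (len[s] != first) ok = 0
            }
            if (ok) print count " sequences, " first " aligned sites"
            else    print "sequences differ in length: not a valid alignment"
            exit !ok
        }
    ' "$1"
}

# Toy alignment (made up) written to a temporary file.
toy=$(mktemp)
printf '>s1\nACGT\n>s2\nACGA\n>s3\nTCGA\n' > "$toy"
check_alignment "$toy"   # prints: 3 sequences, 4 aligned sites
rm -f "$toy"
```

On the course data you would call it on the FASTA file produced by the script, e.g. check_alignment filtered_output.snps.vcf.gz.fa.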
15.2.2 Option 2: Run snippy Core
If your project involves bacteria or viruses, you may have used snippy to perform mapping and variant calling on your trimmed FASTQ files. The command might have looked something like this:
cpus=8
REFERENCE=""   # path to your reference genome
for read_1 in *_R1_001.fastq.gz
do
    sample_name=$(basename "$read_1" _R1_001.fastq.gz)
    snippy \
        --cpus "$cpus" \
        --ref "$REFERENCE" \
        --R1 "$read_1" \
        --R2 "${sample_name}_R2_001.fastq.gz" \
        --outdir "snippy/snippy_${sample_name}" \
        --prefix "$sample_name"
done
This command will generate a series of folders inside the snippy directory, with each folder named snippy_$sample_name. These folders contain multiple output files, including consensus VCF files.
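A loop like the one above assumes that every R1 file has a matching R2 file. The sketch below lists samples whose R2 mate is missing, so that you can catch them before starting a long run. It assumes the same _R1_001.fastq.gz / _R2_001.fastq.gz naming; missing_mates is a hypothetical helper name.

```shell
# missing_mates DIR: print the sample names in DIR that have an R1 file
# but no matching R2 file (naming: <sample>_R1_001.fastq.gz / <sample>_R2_001.fastq.gz).
missing_mates() {
    for read_1 in "$1"/*_R1_001.fastq.gz
    do
        # If DIR holds no R1 files, the pattern is left unexpanded: skip it.
        [ -e "$read_1" ] || continue
        sample_name=$(basename "$read_1" _R1_001.fastq.gz)
        [ -e "$1/${sample_name}_R2_001.fastq.gz" ] || echo "$sample_name"
    done
}

# Check the current directory; no output means every R1 file has its mate.
missing_mates .
```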
snippy simplifies the process of creating a core SNP alignment. You can run the following command:
REF=""      # reference genome used for the snippy runs
prefix=""   # prefix for the output files, e.g. core
snippy-core --ref "$REF" --prefix "$prefix" snippy/snippy_*
If the command runs successfully (and you set the prefix to core), you will find two files: core.aln and core.full.aln. The first contains only the core SNP sites for each sample, whereas the second is the full alignment that also includes the invariant sites.
At this point, it’s useful to check whether all your samples are included, as well as the number of variant and invariant sites. You can use the following bash commands along with seqtk:
# Number of isolates with aligned sites
grep -c ">" core.full.aln
grep -c ">" core.aln
# Number of aligned sites, including invariant sites
seqtk comp core.full.aln | cut -f1,2
# Number of core SNP sites
seqtk comp core.aln | cut -f1,2
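Beyond sequence counts and lengths, it can help to see how many missing or ambiguous characters each sample carries, since samples with many of them tend to be poorly resolved in the tree. The sketch below uses only awk; ambiguity_report is a hypothetical helper name, and it treats upper- and lower-case bases alike (an assumption you may want to change).

```shell
# ambiguity_report FASTA: for each sequence, print its name, its length and the
# number of characters that are not A, C, G or T (N, gaps, other symbols).
ambiguity_report() {
    awk '
        function report() {
            if (name != "") printf "%s\t%d\t%d\n", name, total, total - acgt
        }
        /^>/ { report(); name = substr($1, 2); total = 0; acgt = 0; next }
             {
                 seq = toupper($0)
                 total += length(seq)
                 acgt += gsub(/[ACGT]/, "", seq)   # gsub returns the number of matches
             }
        END  { report() }
    ' "$1"
}

# Toy alignment (made up) written to a temporary file.
toy=$(mktemp)
printf '>s1\nACGTNN-A\n>s2\nacgtACGT\n' > "$toy"
ambiguity_report "$toy"
rm -f "$toy"
```

For the toy file this reports 3 non-ACGT characters out of 8 for s1 and none for s2. On real data you would run it as ambiguity_report core.full.aln.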
Sometimes, you might need to perform an additional cleanup step to remove ‘N’ characters. For this, the snp-sites program is very handy: its -c option keeps only the alignment columns that consist exclusively of A, C, G and T. After cleaning, you can perform phylogenetic tree inference on the alignment:
FastTree -gtr -nt clean.core.aln > clean.core.tree
This process will yield a phylogenetic tree based on the cleaned core SNP alignment.
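A quick sanity check on the result is to confirm that the tree has as many tips as the alignment has sequences. In a Newick file the number of tips equals the number of commas plus one, provided no sample name contains a comma. The sketch below relies on that property; count_tips is a hypothetical helper name and the toy tree is made up.

```shell
# count_tips NEWICK: number of tips (leaves) in a Newick tree file,
# computed as the number of commas plus one.
# Assumes that sample names do not contain commas.
count_tips() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}

# Toy tree (made up) with four tips and one internal support value.
toy=$(mktemp)
printf '((A:0.1,B:0.2)0.95:0.05,C:0.3,D:0.1);\n' > "$toy"
count_tips "$toy"   # prints 4
rm -f "$toy"
```

On the course data, the output of count_tips clean.core.tree should match that of grep -c ">" clean.core.aln.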
15.3 Tree visualisation
TBC
15.4 More complex tree building programs
TBC
Now, it’s time to apply your phylogenetics skills to determine where the patient in our storyline contracted their malaria infection.
You will be working with a dataset called ‘contextual data.’ Although the data is simulated for this exercise, in real scenarios, we often rely on public datasets with robust metadata to make such inferences.
Start your analysis and explore what insights you can draw from the phylogeny. Don’t forget to create a visualization of the phylogenetic tree!