15  Phylogenetics

15.1 A quick primer on Phylogenetic Analysis in Bioinformatics

Phylogenetic analysis is a key step in understanding evolutionary relationships among biological entities, such as species, genes, or populations. These relationships are represented using phylogenetic trees, which are graphical hypotheses about evolutionary history. Trees are built using molecular sequence data (DNA, RNA, or proteins), providing insight into lineage divergence and ancestral relationships.

15.1.1 Key Concepts

  • Nodes: Represent taxa (species, genes, etc.). Leaves are current taxa, while internal nodes represent hypothetical ancestors.
  • Branches: Connect nodes and can represent genetic change or time.
  • Rooted vs. Unrooted Trees: Rooted trees identify the most recent common ancestor of all taxa and therefore give evolution a direction, while unrooted trees show only how taxa are related, without implying direction.

15.1.2 Tree Types

  1. Cladogram: Depicts relationships; branch lengths have no meaning.
  2. Phylogram: Branch lengths reflect genetic change or evolutionary time.

15.1.3 Methods for Constructing Trees

  • Distance-Based Methods: Use sequence data to calculate genetic distances and cluster taxa (e.g., UPGMA, Neighbor-Joining).
  • Character-Based Methods: Evaluate candidate trees directly against the aligned characters (sites), usually under an explicit model of sequence evolution (e.g., Maximum Likelihood, Bayesian Inference).
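
To make the idea of a genetic distance concrete, here is a minimal sketch of the simplest one, the p-distance (the proportion of aligned sites at which two sequences differ). The helper name p_distance and the three toy sequences are made up for this example, and the sketch assumes the sequences are already aligned, of equal length and free of gaps. Real distance-based programs do considerably more than this.

```shell
# p_distance SEQ1 SEQ2: proportion of aligned sites at which two sequences differ.
# Hypothetical helper for illustration; assumes equal-length, gap-free sequences.
p_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    # Sequences of different (or zero) length cannot be compared site by site
    if (length(a) != length(b) || length(a) == 0) { print "NA"; exit }
    diff = 0
    for (i = 1; i <= length(a); i++)
      if (substr(a, i, 1) != substr(b, i, 1)) diff++
    printf "%.3f\n", diff / length(a)
  }'
}

# Three toy sequences (invented for this example)
A=ACGTACGTAC
B=ACGTACGTTC
C=ACGAACCTTC

echo "A-B $(p_distance "$A" "$B")"
echo "A-C $(p_distance "$A" "$C")"
echo "B-C $(p_distance "$B" "$C")"
```

A clustering method such as UPGMA or Neighbor-Joining would start from a table of such pairwise distances and join the closest taxa first.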

15.1.4 Molecular Evolution Models

Because the differences we observe between sequences today do not necessarily reflect the true evolutionary distance (the same site may have mutated more than once, hiding earlier changes), models such as Jukes-Cantor have been devised to correct the observed differences. This gives a more accurate estimate of the true evolutionary distance. More complex models also account for unequal base frequencies and substitution rates.
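
As an illustration, the Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated distance d = -3/4 ln(1 - 4p/3). The sketch below applies that formula with awk. The function name jc69_distance and the example values of p are assumptions made for this example, not part of any tool used in the course.

```shell
# jc69_distance P: Jukes-Cantor corrected distance for an observed
# proportion of differing sites P (hypothetical helper for illustration).
jc69_distance() {
  awk -v p="$1" 'BEGIN {
    # The correction is only defined for 0 <= p < 0.75
    if (p < 0 || p >= 0.75) { print "undefined"; exit }
    printf "%.4f\n", -0.75 * log(1 - 4 * p / 3)
  }'
}

# Example values of p, chosen arbitrarily
for p in 0.05 0.10 0.30 0.50; do
  echo "observed p = $p  ->  JC69 d = $(jc69_distance "$p")"
done
```

The corrected distance is always at least as large as the observed one, and the gap widens as sequences become more divergent, because more substitutions are hidden by repeated changes at the same site.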

15.1.5 Applications

Phylogenetic trees are widely used to trace the evolution of pathogens, to understand species divergence, and even in forensic cases involving pathogen transmission. Nowadays, they are the preferred method for tracking outbreaks. Some bioinformatics algorithms also use a clustering or tree-construction step to speed up their analyses (a guide tree in multiple sequence alignment, for example).

15.1.6 Caveats

During the lecture, we discussed several important caveats to keep in mind when interpreting phylogenetic trees:

  • Phylogenetic trees illustrate evolutionary relatedness, but do not represent evolutionary progress or advancement.
  • Branch lengths may reflect evolutionary distances, indicating the degree of genetic divergence from the nearest internal node.
  • To compare the “relatedness” of taxa, focus on shared common ancestors rather than just the proximity of branches.
  • Phylogenetic analyses often rely on incomplete data. Your evolutionary hypothesis may change as more samples are included, so be cautious about drawing definitive conclusions.

15.2 Making a Phylogenetic Tree from WGS Data

By the time you reach this point in the course (week 2), you will have downloaded or generated your own whole-genome sequencing (WGS) data. Whether you’ve worked with parasite genomes using BWA and GATK, or prokaryotic genomes using snippy, you will have VCF files containing your called variants. The good news is that you’re almost ready to build a phylogenetic tree! Most tree-building programs require a FASTA file of aligned SNPs as input, so you just need to carry out some additional processing.

15.2.1 Option 1: Combine VCFs from GATK

If you have been working with Plasmodium or other parasites and used BWA and GATK, there are multiple ways to combine your data into a format suitable for phylogenetic tree-building programs.

One option is to use the Python script that we used during the course. You can find it here: https://github.com/FreBio/mytools/blob/master/vcf2fasta.py . You can run it on the combined VCF file produced with GATK.

# Keep only SNPs (-v snps) and write a compressed VCF (-Oz)
bcftools view -v snps -Oz -o filtered_output.snps.vcf.gz filtered_output.vcf.gz
vcf2fasta.py filtered_output.snps.vcf.gz

The bcftools step keeps only the SNPs; the script then writes a FASTA file that you can use as input for a phylogenetic tree-building program, for example:

FastTree -gtr -nt filtered_output.snps.vcf.gz.fa > mytree.nw
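
Tree-building programs expect every sequence in the FASTA file to have the same length. The sketch below is one way to check that before running them, using only awk. The function name check_alignment and the toy sequences are made up for this example, so treat it as a rough sanity check rather than a validated tool.

```shell
# check_alignment FASTA: print the number of sequences and the alignment length,
# and return a non-zero exit status if the sequences differ in length.
# Hypothetical helper for illustration.
check_alignment() {
  awk '
    # Called whenever a sequence is complete
    function finish() {
      if (name == "") return
      count++
      if (count == 1) first = n
      else if (n != first) {
        printf "length mismatch: %s has %d sites, expected %d\n", name, n, first
        bad = 1
      }
    }
    /^>/ { finish(); name = substr($0, 2); n = 0; next }
         { gsub(/[ \t\r]/, ""); n += length($0) }   # sequences may span several lines
    END {
      finish()
      if (bad) exit 1
      printf "%d sequences, %d aligned sites\n", count, first
    }
  ' "$1"
}

# Toy alignment (invented for this example) written to a temporary file
example=$(mktemp)
printf '>sample1\nACGTAC\n>sample2\nACGAAC\n>sample3\nACG\nTAC\n' > "$example"
check_alignment "$example"
rm -f "$example"
```

On your own data you would call it on the FASTA file produced by the script, for instance check_alignment filtered_output.snps.vcf.gz.fa.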

15.2.2 Option 2: Run snippy-core

If your project involves bacteria or viruses, you may have used snippy to perform mapping and variant calling on your trimmed FASTQ files. The command might have looked something like this:

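cpus=8
REFERENCE=""   # set this to the path of your reference genome

# Make sure the parent output directory exists
mkdir -p snippy

for read_1 in *_R1_001.fastq.gz
do
  # Strip the R1 suffix to get the sample name
  sample_name=$(basename "$read_1" _R1_001.fastq.gz)

  snippy \
    --cpus "$cpus" \
    --ref "$REFERENCE" \
    --R1 "$read_1" \
    --R2 "${sample_name}_R2_001.fastq.gz" \
    --outdir "snippy/snippy_${sample_name}" \
    --prefix "$sample_name"
done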
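
A loop like this assumes that every R1 file has a matching R2 file. The sketch below is an optional pre-flight check for that assumption. The function name missing_mates is made up for this example, and the file-name pattern is the one used in the loop above, so adjust it if your files are named differently.

```shell
# missing_mates DIR: print every *_R1_001.fastq.gz in DIR that has no matching R2 file.
# Hypothetical helper for illustration.
missing_mates() {
  for read_1 in "$1"/*_R1_001.fastq.gz; do
    [ -e "$read_1" ] || continue            # the glob matched nothing
    read_2="${read_1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    [ -e "$read_2" ] || echo "$read_1"
  done
}

# Check the current directory before starting the snippy loop
missing=$(missing_mates .)
if [ -n "$missing" ]; then
  echo "No R2 file found for:"
  echo "$missing"
else
  echo "All R1 files have a matching R2 file."
fi
```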

This loop creates one folder per sample inside the snippy directory, each named snippy_$sample_name. Each folder contains multiple output files, including VCF files with the called variants.

snippy simplifies the process of creating a core SNP alignment. You can run the following command:

REF=""          # the same reference genome you used for snippy
prefix="core"   # output files will be named core.*
snippy-core --ref "$REF" --prefix "$prefix" snippy/snippy_*

If the command runs successfully, its output includes two alignment files: core.aln and core.full.aln. core.aln contains only the core SNP sites, while core.full.aln contains every position of the reference, including invariant sites.

At this point, it’s useful to check whether all your samples are included, as well as the number of variant and invariant sites. You can use the following bash commands along with seqtk:

# Number of isolates with aligned sites
grep -c ">" core.full.aln
grep -c ">" core.aln

# Number of aligned sites, including invariant sites
seqtk comp core.full.aln | cut -f1,2

# Number of core SNP sites
seqtk comp core.aln | cut -f1,2
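
As a further check, it can be useful to see how many ambiguous or missing characters each sample carries in the alignment. The sketch below counts, per sequence, the characters that are not A, C, G or T (ignoring case), using only awk. The function name count_non_acgt and the toy alignment are made up for this example.

```shell
# count_non_acgt FASTA: for each sequence print its name, its total number of
# sites, and the number of characters other than A, C, G or T (case-insensitive).
# Hypothetical helper for illustration.
count_non_acgt() {
  awk '
    function report() {
      if (name != "") printf "%s\t%d\t%d\n", name, total, total - acgt
    }
    /^>/ { report(); name = substr($0, 2); total = 0; acgt = 0; next }
    {
      gsub(/[ \t\r]/, "")                 # drop whitespace and carriage returns
      total += length($0)
      acgt  += gsub(/[ACGTacgt]/, "")     # gsub returns the number of matches
    }
    END { report() }
  ' "$1"
}

# Toy alignment (invented for this example) written to a temporary file
example=$(mktemp)
printf '>Reference\nACGTAC\n>sample1\nACNTAC\n>sample2\nA-NTnC\n' > "$example"
count_non_acgt "$example"
rm -f "$example"
```

On your own data you would run count_non_acgt core.full.aln. Samples with a very high count in the last column are candidates for a closer look before tree building.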

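Sometimes you need an additional cleanup step to remove columns containing ‘N’ or other ambiguous characters. The snp-sites program is very handy for this: with the -c option it outputs only columns made up exclusively of A, C, G and T. After cleaning (here the cleaned alignment is saved as clean.core.aln), you can perform phylogenetic tree inference on the alignment: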

FastTree -gtr -nt clean.core.aln > clean.core.tree

This process will yield a phylogenetic tree based on the cleaned core SNP alignment.

15.3 Tree visualisation

TBC

15.4 More complex tree building programs

TBC

Storyline: Malaria in Ethiopia [exercise not performed during course, will be updated]

Now, it’s time to apply your phylogenetics skills to determine where the patient in our storyline contracted their malaria infection.

You will be working with a dataset called ‘contextual data.’ Although the data is simulated for this exercise, in real scenarios, we often rely on public datasets with robust metadata to make such inferences.

Start your analysis and explore what insights you can draw from the phylogeny. Don’t forget to create a visualization of the phylogenetic tree!