16  Results

16.1 Example results on one sample (without for loops)

16.1.1 Settings on the AHRI computers

To get started, activate your conda environment and navigate to the directory where the reference genome is located:

conda activate Fa5
cd ~/Desktop/Fa/FA5-bioinformatics/training/data/reference

16.1.2 Indexing the Reference Genome for BWA

Before performing the read mapping, you need to index the reference genome. Run the following command to index it:

bwa index plasmodium_falciparum_3D7.fasta

16.1.3 Mapping Reads, Sorting, and BAM Conversion

Now, navigate to your results directory and create a new folder for storing the BWA output:

cd ~/Desktop/Fa/FA5-bioinformatics/training/results
mkdir bwa
cd bwa

You can adapt the following bwa mem command to map the reads to the indexed reference genome. This example uses paired-end FASTQ files:

bwa mem \
    ../../data/reference/plasmodium_falciparum_3D7.fasta \
    ../../data/fastq/PF0080_S44_L001_R1_001.fastq.gz \
    ../../data/fastq/PF0080_S44_L001_R2_001.fastq.gz \
    -t 4 \
    -Y -K 100000000 \
    -R "@RG\tID:L001\tPL:ILLUMINA\tSM:PF0080_S44" \
    -o PF0080_S44.sam

Note that the data was not in the working directory (/training/results/bwa), so we used relative paths to refere to the FASTQ files (../../data/fastq/).

The resulting .sam file is now in your current working directory. Next, sort the .sam file and convert it to a .bam file using the following command:

samtools sort -o PF0080_S44.sorted.bam PF0080_S44.sam

You now have a sorted BAM file ready for variant calling!

16.1.4 Variant Calling using GATK

Before running GATK, you need to ensure the reference genome is properly indexed. Navigate to the reference directory and run the following commands to index the reference and create the necessary files:

cd ~/Desktop/Fa/FA5-bioinformatics/training/data/reference/
samtools faidx plasmodium_falciparum_3D7.fasta
gatk CreateSequenceDictionary -R plasmodium_falciparum_3D7.fasta

Make sure your BAM file is indexed before proceeding with variant calling:

cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa/
samtools index PF0080_S44.sorted.bam

Now, create an output directory for storing VCF files:

cd ~/Desktop/Fa/FA5-bioinformatics/training/results/
mkdir vcf
cd vcf

From this directory, run the GATK HaplotypeCaller to perform variant calling:

gatk HaplotypeCaller \
    -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
    -I ../bwa/PF0080_S44.sorted.bam \
    -O PF0080_S44.g.vcf.gz \
    -ERC GVCF

16.1.5 Combining Multiple GVCF Files

If you have multiple GVCF files, you can combine them using the CombineGVCFs command:


gatk CombineGVCFs \
    -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
    --variant PF0080_S44.g.vcf.gz \
    --variant PF0097_S43.g.vcf.gz \
    --variant PF0157_S55.g.vcf.gz \
    --variant PF0275_S68.g.vcf.gz \
    --variant PF0302_S20.g.vcf.gz \
    -O samples_combined.g.vcf.gz

16.1.6 Genotyping the GVCF Files

Finally, use GenotypeGVCFs to genotype the combined GVCF files:

gatk GenotypeGVCFs \
    -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
    -V samples_combined.g.vcf.gz \
    -O samples_combined.vcf.gz

16.2 Example results using for loops

16.2.1 Initial Setup and Environment

Activate your Fa5 conda environment:

conda activate Fa5

You can automate the mapping of multiple paired-end FASTQ files using a for loop. Start by navigating to the directory containing the reference genome:


cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa
# we first put the absolute path to the reference genome in a variable
ref=/home/bioinfo1/Desktop/Fa/FA5-bioinformatics/training/data/reference/plasmodium_falciparum_3D7.fasta


for fastq_file_R1 in ../../data/fastq/*_L001_R1.trim.fastq.gz
do
  
  sample=$(basename ${fastq_file_R1} _L001_R1.trim.fastq.gz)
  fastq_file_R2=${fastq_file_R1/_R1.trim.fastq.gz/_R2.trim.fastq.gz}
  echo "Processing: ${sample}"
  
  bwa mem \
    "${ref}" \
    "${fastq_file_R1}" \
    "${fastq_file_R2}" \
    -t 4 \
    -Y -K 100000000 \
    -R "@RG\tID:L001\tPL:ILLUMINA\tSM:${sample}" \
    -o "${sample}.sam"
done

After mapping, sort the resulting SAM files and convert them into BAM files:

for sam_file in *.sam
do
  bam_file=${sam_file/.sam/.sorted.bam}
  samtools sort -o ${bam_file} ${sam_file}
done

16.2.2 variant calling with GATK

Ensure the reference genome is indexed properly, and index the sorted BAM files for variant calling:

cd ~Desktop/Fa/FA5-bioinformatics/training/data/reference/
samtools faidx plasmodium_falciparum_3D7.fasta
gatk CreateSequenceDictionary -R plasmodium_falciparum_3D7.fasta

cd cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa
for bam_file in *.sorted.bam; do
  samtools index ${bam_file}
done

16.2.3 Running GATK HaplotypeCaller for Multiple Samples

Now, create a directory for VCF files and run HaplotypeCaller for each BAM file:

cd ~/Desktop/Fa/FA5-bioinformatics/training/results/
mkdir vcf

for bam_file in *.sorted.bam; do
  sample=$(basename ${bam_file} .sorted.bam)
  gatk HaplotypeCaller \
    -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
    -I ${bam_file} \
    -O ../vcf/${sample}.g.vcf.gz \
    -ERC GVCF
done

After calling variants, you can combine the resulting GVCF files into a cohort file:


gatk CombineGVCFs \
   -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
   --variant PF0080_S44.g.vcf.gz \
   --variant PF0097_S43.g.vcf.gz \
   --variant PF0157_S55.g.vcf.gz \
   --variant PF0275_S68.g.vcf.gz \
   --variant PF0302_S20.g.vcf.gz \
   -O samples_combined.g.vcf.gz

16.2.4 Genotyping the Combined GVCF Files

Finally, run GenotypeGVCFs to generate a VCF file with the genotyped variants:

gatk GenotypeGVCFs \
    -R ../../data/reference/plasmodium_falciparum_3D7.fasta \
    -V samples_combined.g.vcf.gz \
    -O samples_combined.vcf.gz

This approach uses loops to automate the process, which is especially useful when working with multiple samples. Example scripts for automation can also be found in the ./training/scripts/ directory.