16 Results
16.1 Example results on one sample (without for loops)
16.1.1 Settings on the AHRI computers
To get started, activate your conda environment and navigate to the directory where the reference genome is located:
conda activate Fa5
cd ~/Desktop/Fa/FA5-bioinformatics/training/data/reference
16.1.2 Indexing the Reference Genome for BWA
Before performing the read mapping, you need to index the reference genome. Run the following command to index it:
bwa index plasmodium_falciparum_3D7.fasta
16.1.3 Mapping Reads, Sorting, and BAM Conversion
Now, navigate to your results directory and create a new folder for storing the BWA output:
cd ~/Desktop/Fa/FA5-bioinformatics/training/results
mkdir bwa
cd bwa
You can adapt the following bwa mem
command to map the reads to the indexed reference genome. This example uses paired-end FASTQ files:
bwa mem \
\
../../data/reference/plasmodium_falciparum_3D7.fasta \
../../data/fastq/PF0080_S44_L001_R1_001.fastq.gz \
../../data/fastq/PF0080_S44_L001_R2_001.fastq.gz -t 4 \
-Y -K 100000000 \
-R "@RG\tID:L001\tPL:ILLUMINA\tSM:PF0080_S44" \
-o PF0080_S44.sam
Note that the data was not in the working directory (/training/results/bwa
), so we used relative paths to refere to the FASTQ files (../../data/fastq/
).
The resulting .sam
file is now in your current working directory. Next, sort the .sam
file and convert it to a .bam
file using the following command:
samtools sort -o PF0080_S44.sorted.bam PF0080_S44.sam
You now have a sorted BAM file ready for variant calling!
16.1.4 Variant Calling using GATK
Before running GATK, you need to ensure the reference genome is properly indexed. Navigate to the reference directory and run the following commands to index the reference and create the necessary files:
cd ~/Desktop/Fa/FA5-bioinformatics/training/data/reference/
samtools faidx plasmodium_falciparum_3D7.fasta
gatk CreateSequenceDictionary -R plasmodium_falciparum_3D7.fasta
Make sure your BAM file is indexed before proceeding with variant calling:
cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa/
samtools index PF0080_S44.sorted.bam
Now, create an output directory for storing VCF files:
cd ~/Desktop/Fa/FA5-bioinformatics/training/results/
mkdir vcf
cd vcf
From this directory, run the GATK HaplotypeCaller to perform variant calling:
gatk HaplotypeCaller \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
-I ../bwa/PF0080_S44.sorted.bam \
-O PF0080_S44.g.vcf.gz \
-ERC GVCF
16.1.5 Combining Multiple GVCF Files
If you have multiple GVCF files, you can combine them using the CombineGVCFs
command:
gatk CombineGVCFs \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
--variant PF0080_S44.g.vcf.gz \
--variant PF0097_S43.g.vcf.gz \
--variant PF0157_S55.g.vcf.gz \
--variant PF0275_S68.g.vcf.gz \
--variant PF0302_S20.g.vcf.gz \
-O samples_combined.g.vcf.gz
16.1.6 Genotyping the GVCF Files
Finally, use GenotypeGVCFs
to genotype the combined GVCF files:
gatk GenotypeGVCFs \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
-V samples_combined.g.vcf.gz \
-O samples_combined.vcf.gz
16.2 Example results using for loops
16.2.1 Initial Setup and Environment
Activate your Fa5
conda environment:
conda activate Fa5
You can automate the mapping of multiple paired-end FASTQ files using a for
loop. Start by navigating to the directory containing the reference genome:
cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa
# we first put the absolute path to the reference genome in a variable
ref=/home/bioinfo1/Desktop/Fa/FA5-bioinformatics/training/data/reference/plasmodium_falciparum_3D7.fasta
for fastq_file_R1 in ../../data/fastq/*_L001_R1.trim.fastq.gz
do
sample=$(basename ${fastq_file_R1} _L001_R1.trim.fastq.gz)
fastq_file_R2=${fastq_file_R1/_R1.trim.fastq.gz/_R2.trim.fastq.gz}
echo "Processing: ${sample}"
bwa mem \
"${ref}" \
"${fastq_file_R1}" \
"${fastq_file_R2}" \
-t 4 \
-Y -K 100000000 \
-R "@RG\tID:L001\tPL:ILLUMINA\tSM:${sample}" \
-o "${sample}.sam"
done
After mapping, sort the resulting SAM files and convert them into BAM files:
for sam_file in *.sam
do
bam_file=${sam_file/.sam/.sorted.bam}
samtools sort -o ${bam_file} ${sam_file}
done
16.2.2 variant calling with GATK
Ensure the reference genome is indexed properly, and index the sorted BAM files for variant calling:
cd ~Desktop/Fa/FA5-bioinformatics/training/data/reference/
samtools faidx plasmodium_falciparum_3D7.fasta
gatk CreateSequenceDictionary -R plasmodium_falciparum_3D7.fasta
cd cd ~/Desktop/Fa/FA5-bioinformatics/training/results/bwa
for bam_file in *.sorted.bam; do
samtools index ${bam_file}
done
16.2.3 Running GATK HaplotypeCaller for Multiple Samples
Now, create a directory for VCF files and run HaplotypeCaller
for each BAM file:
cd ~/Desktop/Fa/FA5-bioinformatics/training/results/
mkdir vcf
for bam_file in *.sorted.bam; do
sample=$(basename ${bam_file} .sorted.bam)
gatk HaplotypeCaller \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
-I ${bam_file} \
-O ../vcf/${sample}.g.vcf.gz \
-ERC GVCF
done
After calling variants, you can combine the resulting GVCF files into a cohort file:
gatk CombineGVCFs \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
--variant PF0080_S44.g.vcf.gz \
--variant PF0097_S43.g.vcf.gz \
--variant PF0157_S55.g.vcf.gz \
--variant PF0275_S68.g.vcf.gz \
--variant PF0302_S20.g.vcf.gz \
-O samples_combined.g.vcf.gz
16.2.4 Genotyping the Combined GVCF Files
Finally, run GenotypeGVCFs
to generate a VCF file with the genotyped variants:
gatk GenotypeGVCFs \
-R ../../data/reference/plasmodium_falciparum_3D7.fasta \
-V samples_combined.g.vcf.gz \
-O samples_combined.vcf.gz
This approach uses loops to automate the process, which is especially useful when working with multiple samples. Example scripts for automation can also be found in the ./training/scripts/
directory.