A simple guide to running GATK
Motivation
There are probably hundreds of “how to use GATK” guides published online, not including the detailed documentation on the GATK website. But in my (admittedly limited) experience, running GATK can be a bit finicky, and the workflow for producing a genotyped VCF using GATK “best practices” can take a little bit of troubleshooting. Below I’ve outlined my own simple workflow for calling SNVs in human data using GATK, including the various preprocessing steps and file/software downloads. This is mostly for my own reference in the future, when the ramblings in my lab notebook eventually lose all meaning to me.
Installing GATK
Go to the releases on the GATK GitHub, and wget
the most recent version (in my case, v.4.1.8.1). Then, just unzip the package.
wget https://github.com/broadinstitute/gatk/releases/download/4.1.8.1/gatk-4.1.8.1.zip
unzip gatk-4.1.8.1.zip
The gatk
binary should be sitting in that directory. Just run as follows:
/path/to/gatk/binary/gatk
Preparing BAM files for initial variant calling
Let’s assume you’re running GATK separately on individual sample BAMs. Even though each BAM contains a single sample’s reads, it’s still necessary to add read groups to the BAM header and read entries.
samtools addreplacerg -r "@RG\tID:${sample_name}\tSM:${sample_name}" -o {output} {input}
Although GATK will run just fine on a BAM with no read groups, exactly zero variants will be output to VCF. In fact, my version of GATK HaplotypeCaller stated that all reads were filtered out by MappingQualityFilter
.
Of course, if you’re planning on merging multiple sample BAMs into a single BAM before running HaplotypeCaller, you should add read groups in order to know exactly which sample a given read is derived from.
Then, as usual, it’s good to sort
and index
the BAM, and mark duplicates.
samtools sort -O BAM -o {output} {input}
samtools index {input}
For duplicate marking, I tend to use picard
(available here).
java -jar /path/to/picard.jar MarkDuplicates INPUT={input} OUTPUT={output} METRICS_FILE={metrics} TMP_DIR={tmpdir}
As it marches along the BAM and marks duplicates, picard
will output lots of temporary files, so it’s good to specify a $tmpdir with enough space for potentially hundreds (or more) temporary files.
Running GATK HaplotypeCaller
The first step of variant calling involves running HaplotypeCaller
, which will produce an initial list of variants. By design, HaplotypeCaller
is very sensitive, so it’s possible that most of the variants in the output VCF will be junk. We’ll do some filtering and “calibration” of variants in later steps.
gatk HaplotypeCaller \
--input ${input_bam} \
--output ${output_VCF} \
--reference ${reference_genome} \
--java-options "-Xmx8G"
-ERC GVCF
Since GATK is written in Java, we can also pass in some standard Java options (such as max memory to be used) as an argument!
Also, by passing in the
-ERC GVCF
argument, we tell HaplotypeCaller to produce GVCF (generic VCF) output, which looks a bit different from normal VCF. The benefit of outputting GVCFs is that we can then run joint genotyping on many samples’ GVCFs together quite quickly.
Joint genotyping GVCFs
gatk GenotypeGVCFs \
--variant ${input_gvcfs} \
--output {output} \
--reference {input.ref} \
--java-options "-Xmx8G"
Here, we can run GenotypeGVCFs on one or many GVCFs together. By passing in multiple GVCFs, we can take advantage of the joint genotyping process to consider evidence from multiple samples at a given variant site. In any case, the output here will be “true” VCF.
Variant quality score recalibration (VQSR)
Now, we arrive at the most arcane (in my opinion) step in running GATK. Essentially, VQSR takes in a few “truth sets,” which are just VCF files containing variants considered to be either “very good” or “very bad,” so that GATK can get a feeling for the qualities that make a “good variant.” But first, we need to get a hold of these truth sets.
Downloading GATK resources for VQSR
GATK hosts its “resource bundle,” containing various files that are important for running GATK, on Google. The four truth sets we can use are hosted here.
To download, we can use a tool called gsutil
, developed by Google specifically for fetching files from Google Cloud.
# download "omni" truth set
/path/to/gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz
# download "hapmap" truth set
/path/to/gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/hapmap_3.3.hg38.vcf.gz
# download "1000G" truth set
/path/to/gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
# download "dbsnp" truth set
/path/to/gsutil cp gs://genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf
We’ll also need the tabix
indices for these VCFs, which can also be download from the Google Cloud repository, or done by hand. The “dbsnp” truth set also needs to be bgzip
-ed!
Running VQSR
Now that we’ve collected the various files we need, we can run VQSR. Importantly, VQSR does not output VCF files. Instead it outputs a .recal
file and a .tranches
file, which are “applied” to the previous VCF file in the final step of calibration.
A couple of notes here:
- we can specify exactly which annotations we want VQSR to add by specifying
-an
multiple times - we can specify the “tranches” we want VQSR to divide our variants into by specifying
tranche
multiple times - in GATK v4.1.8.1, we must include a space between the
--resource
truth set info and the path to the truth set - it looks as though GATK now has a convolutional neural net approach to VQSR in GATK v4+ for single samples (see here)
More detailed documentation on running VQSR (for human data) is here.
gatk VariantRecalibrator \
--variant {input} \
--output {recal_file_output} \
--tranches-file {tranch_file_output} \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
--mode SNP \
--trust-all-polymorphic true \
--java-options "-Xmx8G" \
-tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
--resource:omni,known=false,training=true,truth=true,prior=12 /path/to/1000G_omni2.5.hg38.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10 /path/to/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15 /path/to/hapmap_3.3.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=7 /path/to/Homo_sapiens_assembly38.dbsnp138.vcf.gz
Applying VQSR
Now, we arrive at the final step of VQSR. Here, we use the .tranches
and .recal
files produced in the previous step, and “apply” them to the VCF produced in the first step of this section. The output is a recalibrated VCF, which we can then pass into any downstream filtering or analysis pipelines we want!
gatk ApplyVQSR \
--variant {input_vcf} \
--recal-file {recal_file} \
--tranches-file {tranches_file} \
--truth-sensitivity-filter-level 99.7 \
--create-output-variant-index true \
-mode SNP \
--output {output} \
--java-options "-Xmx8G"