An introduction to Snakemake for pipeline management
A quick-start guide for Snakemake
Motivation
There are few (if any) scientific questions that you can answer by running a single program or script. Calling variants involves aligning reads, sorting reads, indexing reads, running a variant caller, filtering those calls, etc. Running a simulation will inevitably require changing parameters or including new combinations of those parameters.
And there are few (if any) programs or scripts that will run/compile correctly on the first try, thanks to user error, code bugs, dependency conflicts, improperly configured environments, or some combination thereof.
For these reasons, it’s often crucial to package the various steps of your analysis into a “pipeline.” Ideally, this pipeline would accept one or more files as input, do some stuff with those files, and generate one or more output files. For example, your pipeline might take a FASTQ file and a reference genome FASTA as input, and output an aligned, sorted, and indexed BAM.
In theory, a pipeline could just be a Bash script in which you enumerate each step of the process. But what if you want to run the pipeline on hundreds of samples’ FASTQ files? And what if the final step in your Bash script fails? You’ll have to re-run the entire pipeline all over again.
This is where Snakemake comes in. Snakemake is a flexible Python-based pipeline manager, and it’s well suited to running on the Sun Grid Engine (or pretty much any other compute environment).
As an example, let’s imagine that we want to take paired-end FASTQ from 3 different mouse samples (A, B, and C) and generate a preliminary set of variant calls for each sample.
To start, let’s imagine we only want to process one sample: A.
Every step of the pipeline gets its own “rule”
The first step of the pipeline will be to download an M. musculus reference genome so that we can align reads.
rule download_reference:
    input:
    output:
        "GRCm38.primary_assembly.genome.fa.gz"
    shell:
        """
        wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
        """
This rule, which is named download_reference, doesn’t take any input, since it’s just downloading a FASTA directly. We specify that the expected output of this rule is a single gzipped FASTA. After shell:, we simply list the commands we’d normally type at the command line to produce the specified output. These commands can be wrapped in a triple-quoted string for easy formatting.
Next, we want to align the FASTQ data from sample A to the reference.
rule bwa_align:
    input:
        ref = "GRCm38.primary_assembly.genome.fa.gz",
        fq1 = "A_1.fastq.gz",
        fq2 = "A_2.fastq.gz"
    output:
        "A.sorted.bam"
    shell:
        """
        bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
        """
This rule takes as input a reference genome and two FASTQ files. As you can see, it’s possible to name individual input or output files (using keyword-style assignment) so that we can access particular files in our shell command.
Note that if a rule has more than one input (or output) file, the files should be comma-separated.
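(One practical aside, and an assumption about your setup rather than something the pipeline above handles: bwa mem expects the reference to have been indexed with bwa index first. A sketch of a rule that does this also happens to be a nice example of multiple comma-separated outputs; the extensions below are the companion files bwa index writes alongside the reference.)
rule bwa_index:
    input:
        "GRCm38.primary_assembly.genome.fa.gz"
    output:
        # bwa index writes five companion files next to the reference
        "GRCm38.primary_assembly.genome.fa.gz.amb",
        "GRCm38.primary_assembly.genome.fa.gz.ann",
        "GRCm38.primary_assembly.genome.fa.gz.bwt",
        "GRCm38.primary_assembly.genome.fa.gz.pac",
        "GRCm38.primary_assembly.genome.fa.gz.sa"
    shell:
        """
        bwa index {input}
        """
Adding these index files to bwa_align’s input list would tell Snakemake to build the index before attempting any alignments.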
Snakemake will only run a rule if it has to
Let’s take a look at the full pipeline we’ve defined so far.
rule all:
    input:
        "A.sorted.bam"

rule download_reference:
    input:
    output:
        "GRCm38.primary_assembly.genome.fa.gz"
    shell:
        """
        wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
        """

rule bwa_align:
    input:
        ref = "GRCm38.primary_assembly.genome.fa.gz",
        fq1 = "A_1.fastq.gz",
        fq2 = "A_2.fastq.gz"
    output:
        "A.sorted.bam"
    shell:
        """
        bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
        """
A huge advantage of Snakemake is that it will only run a rule if its output doesn’t already exist and is needed by a downstream rule.
You can see that I’ve added a rule (called all) to the top of the pipeline. This is because Snakemake runs in a “bottom-up” fashion.
The all rule tells Snakemake what the final output of the entire pipeline should be. In this case, we want the final output to be an aligned BAM.
In this example, Snakemake finds the rule that outputs A.sorted.bam (which is bwa_align), and checks to see if that rule has access to all of its necessary inputs (a reference and two FASTQ files). If not, Snakemake finds the rules that produce those files and runs them. And so on. Once bwa_align has all of the inputs it needs, Snakemake runs it to produce the final output.
And let’s say that, for whatever reason, bwa_align fails when it’s run. When we re-run the pipeline, Snakemake will check whether each rule’s inputs are present. Since all of the upstream rules will have been run before bwa_align was invoked, Snakemake will see that its inputs are already present and won’t re-run any of those upstream steps!
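You can watch this logic in action before committing to anything: Snakemake’s -n (dry-run) flag prints the rules it would execute without actually running them, and -p also prints the underlying shell commands.
# preview which rules Snakemake would run, without executing anything
snakemake -n -p

# actually run the pipeline
snakemake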
Using wildcards to avoid re-writing the same command over and over again
In the previous example, we were only running the pipeline on a single sample. To generalize the pipeline to run on any list of samples, we can make use of the expand function, as well as Snakemake “wildcards.”
Were this a Bash pipeline, we’d have to write out a separate set of commands for every sample. With Snakemake, it’s much easier.
Here’s an example of what our pipeline would look like with wildcard placeholders instead of explicit sample names.
samples = ["A", "B", "C"]

rule all:
    input:
        expand("{sample}.sorted.bam", sample=samples)

rule download_reference:
    input:
    output:
        "GRCm38.primary_assembly.genome.fa.gz"
    shell:
        """
        wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
        """

rule bwa_align:
    input:
        ref = "GRCm38.primary_assembly.genome.fa.gz",
        fq1 = "{sample}_1.fastq.gz",
        fq2 = "{sample}_2.fastq.gz"
    output:
        "{sample}.sorted.bam"
    shell:
        """
        bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
        """
Here, I’ve just replaced every instance of a sample name in an input or output file with a {sample} wildcard.
In the all rule, I’m using the expand function to tell Snakemake that the expected output is a list of sorted BAMs, with the sample names in samples filled in.
In this case, the result of the expand call would just be:
>>> expand("{sample}.sorted.bam", sample=samples)
["A.sorted.bam", "B.sorted.bam", "C.sorted.bam"]
But we can also use expand for more complicated chaining of multiple sample names and parameters.
Using expand to run a pipeline on many samples or with many parameters
Let’s say that once we’ve produced aligned and sorted BAMs, we want to then run a simple variant calling program on each sample’s alignments.
To speed up execution, it might help to parallelize the variant calling step so that each chromosome is processed separately.
To do this, we’ll again make use of expand.
samples = ["A", "B", "C"]
# GENCODE's GRCm38 contigs are named chr1 ... chr19, chrX, chrY
chromosomes = ["chr" + str(c) for c in range(1, 20)]
chromosomes.extend(["chrX", "chrY"])

rule all:
    input:
        expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)

rule download_reference:
    input:
    output:
        "GRCm38.primary_assembly.genome.fa.gz"
    shell:
        """
        wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
        """

rule bwa_align:
    input:
        ref = "GRCm38.primary_assembly.genome.fa.gz",
        fq1 = "{sample}_1.fastq.gz",
        fq2 = "{sample}_2.fastq.gz"
    output:
        "{sample}.sorted.bam"
    shell:
        """
        bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
        """

rule call_variants:
    input:
        ref = "GRCm38.primary_assembly.genome.fa.gz",
        bam = "{sample}.sorted.bam"
    output:
        "{sample}.{chrom}.vcf"
    shell:
        """
        freebayes -f {input.ref} -r {wildcards.chrom} {input.bam} > {output}
        """
We’ve now added a step to the pipeline which takes a reference genome and a BAM as input, and outputs a variant call file (VCF).
Notice that in the call_variants rule, my freebayes command takes a -r argument that specifies the region we want to analyze. Whenever you want to access a wildcard inside the shell: portion of a rule, it’s necessary to prefix the wildcard name with wildcards. (as in {wildcards.chrom}).
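(Another caveat, about the tools rather than about Snakemake itself: freebayes wants an uncompressed, samtools-indexed reference, so a real version of this pipeline would likely include a rule along the lines of the sketch below, with call_variants pointed at the uncompressed FASTA.)
rule index_reference:
    input:
        "GRCm38.primary_assembly.genome.fa.gz"
    output:
        fa = "GRCm38.primary_assembly.genome.fa",
        fai = "GRCm38.primary_assembly.genome.fa.fai"
    shell:
        """
        # -k keeps the gzipped copy around for other rules
        gunzip -k {input}
        samtools faidx {output.fa}
        """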
And in this example, expand takes the Cartesian product (i.e., itertools.product()) of the lists of parameters. So, the output of expand would be:
>>> expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)
["A.chr1.vcf",
 "A.chr2.vcf",
 "A.chr3.vcf",
 ...
 "C.chrY.vcf"]
This way, we can enumerate every possible combination of samples and chromosomes instead of typing out 3 * 21 = 63 separate commands.
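And if you don’t want the full Cartesian product, expand accepts an optional combinator as its second argument; passing zip pairs the lists element-wise instead:
>>> expand("{sample}.{chrom}.vcf", zip, sample=["A", "B", "C"], chrom=["chr1", "chr2", "chr3"])
["A.chr1.vcf", "B.chr2.vcf", "C.chr3.vcf"]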
Including pure Python in a rule
So far, our rules have only invoked shell commands like bwa or wget. But Snakemake also allows you to make a rule that is just a block of Python code. For example, the following rule would be totally acceptable.
rule count_snps:
    input:
        expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)
    output:
        "per_sample.snp_counts.tsv"
    run:
        from cyvcf2 import VCF
        output_fh = open(output[0], "w")
        # loop over every input VCF
        for vcf_fh in input:
            sample_name = vcf_fh.split('.')[0]
            chrom = vcf_fh.split('.')[1]
            # initialize a VCF object using each
            # file path in the input list
            vcf = VCF(vcf_fh)
            snp_count = 0
            for v in vcf:
                if v.var_type == "snp": snp_count += 1
            print('\t'.join([sample_name, chrom, str(snp_count)]), file=output_fh)
        output_fh.close()
Notice that when a rule includes Python code, we use the run: syntax instead of the shell: syntax at the top of the code block. Inside a run: block, input and output are plain Python variables, so there’s no need for the curly-brace formatting we use in shell: commands.
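For anything longer than a few lines, Snakemake also offers a script: directive that points a rule at an external Python file; inside that script, the rule’s properties are exposed through a global snakemake object (e.g., snakemake.input and snakemake.output[0]). A minimal sketch (the script path below is just an example):
rule count_snps:
    input:
        expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)
    output:
        "per_sample.snp_counts.tsv"
    script:
        # a hypothetical script, resolved relative to the Snakefile
        "scripts/count_snps.py"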
Running a Snakemake pipeline on the Sun Grid Engine (SGE)
In our lab, we use the SGE to submit and manage compute jobs on the cluster. Snakemake is actually compatible with SGE (and SLURM, etc.), which makes it super easy to submit jobs to SGE directly from a Snakemake pipeline.
As an example, we could execute our pipeline as follows:
snakemake -j 10 \
          --rerun-incomplete \
          --cluster "qsub -l centos=7 -l mfree=16G -l h_rt=12:0:0 -o /path/to/outdir -e /path/to/errdir"
The -j flag specifies the maximum number of jobs Snakemake is allowed to submit to SGE at a time.
The --rerun-incomplete flag is awesome. If we run our pipeline and it fails for some reason, using --rerun-incomplete will tell Snakemake to re-run a rule if that rule’s output is incomplete (i.e., if the rule didn’t finish due to a job failure during the last pipeline execution).
After --cluster, we just put the qsub command we’d normally use to submit a .sh or .sge script to the cluster, specifying the memory required by each job, the wall time, and so on.
Using a config file to flexibly change grid requirements for particular rules
One natural issue with the above command is that some rules might require different cluster specifications than others.
For example, I’ve written my bwa_align rule such that bwa mem will use 4 threads during alignment, and it’ll probably need much more memory than downloading a reference genome.
Thankfully, Snakemake lets us create individual cluster specifications for each of our rules using a cluster config file. See the example config file below, which is written in JSON.
"__default__":
{ "memory": "8G",
"time": "1:0:0",
"threads": "1",
"os": "7" },
"bwa_align":
{ "memory": "4G",
"time": "8:0:0",
"threads": "4",
"os": "7" }
Then, when we run Snakemake, we could do the following. (Note that on many SGE setups, mfree is requested per slot, so bwa_align’s 4 threads at 4G apiece work out to 16G in total.)
snakemake -j 10 \
          --cluster-config /path/to/config.json \
          --rerun-incomplete \
          --cluster "qsub -l centos={cluster.os} -l mfree={cluster.memory} -l h_rt={cluster.time} -pe serial {cluster.threads}"
Visualizing the full pipeline with a DAG
We can also visualize the various steps of the pipeline in a directed acyclic graph (DAG).
After putting the full pipeline in a file called Snakefile, we can run the following from the directory in which the Snakefile resides:
snakemake --dag | dot -Tsvg > dag.svg
This will produce an image showing us exactly which steps Snakemake will perform during execution. This plot ignores the variant calling steps, since the DAG gets pretty unwieldy with that many nodes!
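If the full DAG gets too busy, the --rulegraph flag condenses it to a single node per rule, which stays readable no matter how many samples and chromosomes you add:
snakemake --rulegraph | dot -Tsvg > rulegraph.svg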