An introduction to Snakemake for pipeline management

A quick-start guide for Snakemake

Motivation

There are few (if any) scientific questions that you can answer by running a single program or script. Calling variants involves aligning reads, sorting and indexing the alignments, running a variant caller, filtering those calls, and so on. Running a simulation will inevitably require changing parameters or including new combinations of those parameters.

And there are few (if any) programs or scripts that will run/compile correctly on the first try, due to user error, code bugs, dependency conflicts, improperly configured environments, or any combination thereof.

For these reasons, it’s often crucial to package the various steps of your analysis into a “pipeline.” Ideally, this pipeline would accept one or more files as input, do some stuff with those files, and generate one or more output files. For example, your pipeline might take a FASTQ file and a reference genome FASTA as input, and output an aligned, sorted, and indexed BAM.

In theory, a pipeline could just be a Bash script in which you enumerate each step of the process. But what if you want to run the pipeline on hundreds of samples’ FASTQ files? And what if the final step in your Bash script fails? You’d have to re-run the entire pipeline from scratch.

This is where Snakemake comes in. Snakemake is a flexible Python-based pipeline manager, and it’s even tuned for running on the Sun Grid Engine (or pretty much any other compute environment).
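
(If you’d like to follow along, Snakemake is available from both PyPI and Bioconda, so either of the following should get you a working installation.)

pip install snakemake

conda install -c bioconda snakemake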

As an example, let’s imagine that we want to take paired-end FASTQ from 3 different mouse samples (A, B, and C) and generate a preliminary set of variant calls for each sample.

To start, let’s imagine we only want to process one sample: A.

Every step of the pipeline gets its own “rule”

The first step of the pipeline will be to download an M. musculus reference genome so that we can align reads.

rule download_reference:
  input:
  output:
    "GRCm38.primary_assembly.genome.fa.gz"
  shell:
    """
    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
    """

This rule, which is named download_reference, doesn’t take any input, since it’s just downloading a FASTA directly.

We specify that the expected output of this rule is a single gzipped FASTA.

After shell:, we simply list the commands we’d normally type at the command line to produce the specified output. These commands can be wrapped in a triple-quoted string, which makes multi-line commands easy to format.

Next, we want to align the FASTQ data from sample A to the reference.

rule bwa_align:
  input:
    ref = "GRCm38.primary_assembly.genome.fa.gz",
    fq1 = "A_1.fastq.gz",
    fq2 = "A_2.fastq.gz"
  output:
    "A.sorted.bam"
  shell:
    """
    bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
    """

This rule takes as input a reference genome and two FASTQ files. As you can see, it’s possible to name individual input or output files (using keyword-style assignments) so that we can access particular files in our shell command. (One caveat I’m glossing over: bwa mem expects the reference to have been indexed with bwa index beforehand; I’ve omitted that step to keep the example short.)

Note that if a rule has more than one input (or output) file, they should be comma-separated.
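
The same naming trick works for outputs. Here’s a minimal, hypothetical sort_and_index rule, just to illustrate the syntax for accessing named outputs:

rule sort_and_index:
  input:
    "A.bam"
  output:
    bam = "A.sorted.bam",
    bai = "A.sorted.bam.bai"
  shell:
    """
    sambamba sort -o {output.bam} {input}
    sambamba index {output.bam} {output.bai}
    """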

Snakemake will only run a rule if it has to

Let’s take a look at the full pipeline we’ve defined so far.

rule all:
  input:
    "A.sorted.bam"

rule download_reference:
  input:
  output:
    "GRCm38.primary_assembly.genome.fa.gz"
  shell:
    """
    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
    """

rule bwa_align:
  input:
    ref = "GRCm38.primary_assembly.genome.fa.gz",
    fq1 = "A_1.fastq.gz",
    fq2 = "A_2.fastq.gz"
  output:
    "A.sorted.bam"
  shell:
    """
    bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
    """

A huge advantage of Snakemake is that it will only run a rule if its output is needed by a downstream rule (or by the final target) and that output doesn’t already exist or is out of date.

You can see that I’ve added a rule (called all) to the top of the pipeline. This is because Snakemake works backward from a target: by default, it tries to produce the output of the first rule in the Snakefile.

The all rule tells Snakemake what the final output of the entire pipeline should be. In this case, we want the final output to be an aligned BAM.

In this example, Snakemake finds the rule that outputs A.sorted.bam (which is bwa_align), and checks to see if that rule has access to all of its necessary inputs (a reference and two FASTQ files). If not, Snakemake finds the rules that produce those files and runs them. And so on. Once bwa_align has all of the inputs it needs, Snakemake runs it to produce the final output.

And let’s say that, for whatever reason, bwa_align fails when it’s run. When we re-run the pipeline, Snakemake will check whether bwa_align’s input files are present. Since all of the upstream rules will already have run, Snakemake will see that those inputs exist and won’t re-run any of the upstream steps!
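
A handy way to watch this dependency resolution happen is a dry run: the -n (--dry-run) flag makes Snakemake print the jobs it would run without executing anything, and adding -p also prints each job’s shell command.

snakemake -n -p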

Using wildcards to avoid re-writing the same command over and over again

In the previous example, we were only running the pipeline on a single sample. But to generalize the pipeline to run on any list of samples, we can make use of the expand feature, as well as Snakemake “wildcards.”

Were this a Bash pipeline, we’d have to write out a separate set of commands for every sample. With Snakemake, it’s much easier.

Here’s an example of what our pipeline would look like with wildcard placeholders instead of explicit sample names.

samples = ["A", "B", "C"]

rule all:
  input:
    expand("{sample}.sorted.bam", sample=samples)

rule download_reference:
  input: 
  output:
    "GRCm38.primary_assembly.genome.fa.gz"
  shell:
    """
    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
    """

rule bwa_align:
  input: 
    ref = "GRCm38.primary_assembly.genome.fa.gz",
    fq1 = "{sample}_1.fastq.gz",
    fq2 = "{sample}_2.fastq.gz"
  output:
    "{sample}.sorted.bam"
  shell:
    """
    bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
    """

Here, I’ve just replaced every instance of a sample name in an output or input file with a {sample} wildcard.

In the all rule, I’m using the expand function to tell Snakemake that the expected output is a list of sorted BAMs, with the sample names in samples filled in.

In this case, the result of the expand would just be:

>>> expand("{sample}.sorted.bam", sample=samples)

["A.sorted.bam", "B.sorted.bam", "C.sorted.bam"]

But we can also use expand for more complicated chaining of multiple sample names and parameters.

Using expand to run a pipeline on many samples or with many parameters

Let’s say that once we’ve produced aligned and sorted BAMs, we want to then run a simple variant calling program on each sample’s alignments.

To speed up execution of the variant calling pipeline, it might help to parallelize our pipeline to run on every chromosome separately.

To do this, we’ll again make use of expand.

samples = ["A", "B", "C"]

chromosomes = ["chr" + str(c) for c in range(1, 20)]
chromosomes.extend(["chrX", "chrY"])

rule all:
  input:
    expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)

rule download_reference:
  input: 
  output:
    "GRCm38.primary_assembly.genome.fa.gz"
  shell:
    """
    wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M10/GRCm38.primary_assembly.genome.fa.gz
    """

rule bwa_align:
  input: 
    ref = "GRCm38.primary_assembly.genome.fa.gz",
    fq1 = "{sample}_1.fastq.gz",
    fq2 = "{sample}_2.fastq.gz"
  output:
    "{sample}.sorted.bam"
  shell:
    """
    bwa mem -t 4 {input.ref} {input.fq1} {input.fq2} | sambamba view -S -f bam /dev/stdin | sambamba sort -o {output} /dev/stdin
    """

rule call_variants:
  input:
    ref = "GRCm38.primary_assembly.genome.fa.gz",
    bam = "{sample}.sorted.bam"
  output:
    "{sample}.{chrom}.vcf"
  shell:
    """
    freebayes -f {input.ref} -r {wildcards.chrom} {input.bam} > {output}
    """

We’ve now added a step to the pipeline that takes a reference genome and a BAM as input, and outputs a variant call file (VCF). (Another caveat: in practice, freebayes needs an uncompressed reference that’s been indexed with samtools faidx; again, I’m glossing over that here.)

Notice that in the call_variants rule, my freebayes command takes a -r argument that specifies the region we want to analyze. Whenever you want to access a wildcard inside the shell: portion of a rule, it’s necessary to prefix the wildcard name with wildcards., as in {wildcards.chrom}.

And in this example, expand takes the Cartesian product (i.e., itertools.product()) of the lists of parameters. So, the output of expand would be:

>>> expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)

["A.chr1.vcf",
 "A.chr2.vcf",
 "A.chr3.vcf",
 ...
 "C.chrY.vcf"]

This way, we can enumerate every possible combination of input parameters instead of typing out 3 * 21 = 63 separate commands.
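
(You can sanity-check that count in a Python shell, since expand’s combinations mirror itertools.product:)

>>> import itertools
>>> len(list(itertools.product(samples, chromosomes)))
63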

Including pure Python in a rule

So far, our rules have only invoked shell commands like bwa or wget. But Snakemake also allows you to make a rule that is just a block of Python code. For example, the following rule would be totally acceptable.

rule count_snps:
  input:
    expand("{sample}.{chrom}.vcf", sample=samples, chrom=chromosomes)
  output:
    "per_sample.snp_counts.tsv"
  run:
    from cyvcf2 import VCF

    output_fh = open(output[0], "w")

    # loop over every VCF in the input list
    for vcf_fh in input:

      sample_name = vcf_fh.split('.')[0]
      chrom = vcf_fh.split('.')[1]

      # initialize a VCF object using each
      # file path in the input list
      vcf = VCF(vcf_fh)

      # count the SNPs in this sample's VCF
      snp_count = 0
      for v in vcf:
        if v.var_type == "snp": snp_count += 1

      print('\t'.join([sample_name, chrom, str(snp_count)]), file=output_fh)

    output_fh.close()

Notice that when a rule includes Python code, we use the run: syntax instead of the shell: syntax at the top of the code block. Inside a run: block, the code is plain Python (not a string), and input and output are available directly as Python objects, so there’s no need for the curly-brace substitution we use in shell: commands.

Running a Snakemake pipeline on the Sun Grid Engine (SGE)

In our lab, we use SGE to submit and manage compute jobs on the cluster. Snakemake is compatible with SGE (and SLURM, among others), which makes it super easy to submit jobs to SGE directly from a Snakemake pipeline.

As an example, we could execute our pipeline as follows:

snakemake -j 10 \
      --rerun-incomplete \
      --cluster \
      "qsub -l centos=7 -l mfree=16G -l h_rt=12:0:0 -o /path/to/outdir -e /path/to/errdir"

The -j flag specifies the maximum number of jobs Snakemake is allowed to submit to SGE at a time.

The --rerun-incomplete flag is awesome. If we run our pipeline and it fails for some reason, using --rerun-incomplete will tell Snakemake to re-run a rule if the output of that rule is incomplete (i.e., if the rule didn’t finish due to a job failure in the last pipeline execution).

After specifying --cluster, we simply put the qsub command we’d normally use to submit a .sh or .sge script to the cluster, specifying the memory required by each job, wall time, etc.

Using a config file to flexibly change grid requirements for particular rules

One natural issue with the above command is that some rules might require different cluster specifications than others.

For example, I’ve written my bwa_align rule such that bwa mem will use 4 threads during alignment, and it’ll probably use much more memory than downloading a reference genome.

Thankfully, Snakemake lets us create individual cluster specifications for each of our rules using a config file. See the example config file below, which is written in JSON.

"__default__":

{ "memory": "8G",
  "time": "1:0:0",
  "threads": "1",
  "os": "7" },

"bwa_align":

{ "memory": "4G",
  "time": "8:0:0",
  "threads": "4",
  "os": "7" }

Then, when we run Snakemake, we could do the following:

snakemake -j 10 \
      --rerun-incomplete \
      --cluster-config /path/to/config.json \
      --cluster \
      "qsub -l centos={cluster.os} -l mfree={cluster.memory} -l h_rt={cluster.time} -pe serial {cluster.threads}"

Visualizing the full pipeline with a DAG

We can also visualize the various steps of the pipeline in a directed acyclic graph (DAG).

After putting the full pipeline in a file called Snakefile, we can run the following from the directory in which the Snakefile resides:

snakemake --dag | dot -Tsvg > dag.svg

This will produce an image showing exactly which steps Snakemake will run during execution. This plot ignores the VCF calling steps, since the DAG gets pretty unwieldy with that many jobs!
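
If the job-level DAG is too busy to read, the --rulegraph flag condenses it to one node per rule instead of one node per job:

snakemake --rulegraph | dot -Tsvg > rulegraph.svg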