ML for genomics, an evolving reading list

Introductions to ML

The CS230 course at Stanford has made slides and lectures available online. A fantastic resource.

The Amidi brothers have created cheat sheets for CS230 and other Stanford ML courses. Also great.

Andrej Karpathy’s videos are also excellent introductions to ML concepts. In his own words, the first video in his main series (called The spelled-out intro to neural networks and backpropagation: building micrograd) “only assumes basic knowledge of Python and a vague recollection of calculus from high school.” Highly recommended.

Applications

Papers are broadly grouped by ML architecture. Many of these papers involve a mix of architectures, so the groups should be considered “fuzzy.”

Effective gene expression prediction from sequence by integrating long-range interactions

Describes the “Enformer” model, which uses a transformer-based architecture to predict gene expression from DNA sequence alone. Also see a suite of papers describing the limitations of Enformer for personal transcriptome inference.

Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation

Automatic inference of demographic parameters using generative adversarial networks

The authors describe a novel GAN architecture that features a population genetic simulator (in this case, the backwards-in-time msprime tool) as the “generator” and a convolutional neural network as the “discriminator”. The parameters of the msprime generator are randomly initialized, and the discriminator is trained to differentiate between simulated and real “images” of haplotypes in genomic regions of a predefined size. Over time, the generator gets better at simulating realistic-looking data and the discriminator gets better at telling the two classes of data apart. By the end of training, the generator can be interpreted by examining the population genetic parameters (population size, mutation rate, etc.) that optimally confused the discriminator. A well-written and clear overview of a cool (and interpretable) method.
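The adversarial setup can be caricatured in a few lines of pure Python. In this toy sketch (not the paper's code), a one-parameter Gaussian sampler stands in for msprime, and a best-single-threshold classifier stands in for the CNN discriminator; "training" the generator is reduced to a grid search for the parameter value that most confuses the discriminator:

```python
import random

random.seed(42)

# Hypothetical stand-in for a population genetic simulator such as msprime:
# a single parameter theta controls the distribution of a data summary.
def simulate(theta, n=100):
    return [random.gauss(theta, 1.0) for _ in range(n)]

# "Real" data, generated under an unknown true parameter value.
TRUE_THETA = 3.0
real_data = simulate(TRUE_THETA)

# Toy discriminator: the best single-threshold classifier, standing in
# for the paper's convolutional neural network.
def discriminator_accuracy(real, fake):
    best = 0.5
    for t in real + fake:
        # Classify x >= t as "real"; also consider the flipped rule.
        acc = (sum(x >= t for x in real) + sum(x < t for x in fake)) / (
            len(real) + len(fake)
        )
        best = max(best, acc, 1 - acc)
    return best

# The "generator" side of training: find the theta whose simulations the
# discriminator can least distinguish from real data (~50% accuracy).
best_theta = min(
    (t / 10 for t in range(0, 61)),
    key=lambda th: discriminator_accuracy(real_data, simulate(th)),
)
print(best_theta)  # should land near TRUE_THETA
```

The real method trains both networks jointly by gradient descent rather than grid search, but the interpretability argument is the same: the parameter value that fools the discriminator is an estimate of the truth.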

Interpreting generative adversarial networks to infer natural selection from genetic data

A follow-up to the paper listed above. The authors fine-tune the trained discriminator from their GAN to infer regions of the genome under natural selection.

DNA language models are powerful predictors of genome-wide variant effects

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction

The nucleotide transformer: building and evaluating robust foundation models for human genomics

A recent attempt to build a “foundation model” for genomics. The authors essentially adapt BERT for DNA sequence by developing an encoder-only architecture that attempts to reconstruct randomly-masked 6-mer DNA “tokens.” The learned embeddings from an input DNA sequence can then be plugged into simple regression models to make predictions about chromatin accessibility, enhancer status, etc., or the model itself can be efficiently fine-tuned for a particular downstream classification task.
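The tokenize-and-mask objective is easy to illustrate. This is a schematic sketch of the idea (not the model's actual preprocessing code): split a DNA sequence into non-overlapping 6-mer tokens, then hide a random fraction of them, which the encoder is trained to reconstruct:

```python
import random

random.seed(0)

MASK = "[MASK]"

def tokenize(seq, k=6):
    # Split a DNA sequence into non-overlapping k-mer tokens
    # (the nucleotide transformer uses a 6-mer vocabulary).
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def mask_tokens(tokens, rate=0.15):
    # BERT-style masked language modeling: hide a fraction of tokens; the
    # encoder must reconstruct the originals from the surrounding context.
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok  # ground truth the model must recover
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = tokenize("ATGCGTACGTTAGCCGTAACGTGA")  # 24 bp -> four 6-mer tokens
masked, targets = mask_tokens(tokens, rate=0.5)  # high rate just for display
```

In the actual model, `masked` would be mapped to embeddings and passed through the transformer; the training loss compares the model's predictions at masked positions against `targets`.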

The unreasonable effectiveness of convolutional neural networks in population genetic inference

Check this paper out for a nice introduction to CNNs and how they can be applied to “images” of haplotypes in genomic regions. The associated GitHub repository includes a few simple models (written in TensorFlow/Keras), as well.
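A haplotype “image” is just a 0/1 matrix. The following toy example (values made up for illustration) shows the representation, plus the common preprocessing step of sorting rows so the CNN's output doesn't depend on arbitrary sample order:

```python
# A toy haplotype "image": each row is a haplotype, each column a
# segregating site (0 = ancestral allele, 1 = derived allele).
haplotypes = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
]

# Sample order is arbitrary, so rows are typically sorted before training
# (lexicographically here; some papers sort by similarity instead).
image = sorted(haplotypes)
```

The sorted matrix is then treated like a single-channel image and fed to standard convolutional layers.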

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks

Sequential regulatory activity prediction across chromosomes with convolutional neural networks

Localizing post-admixture adaptive variants with object detection on ancestry-painted chromosomes

The authors train an off-the-shelf object detection model to identify genomic regions with recent (adaptive) admixture events. Nice example of using off-the-shelf models, rather than building architectures from scratch.

Discovery of ongoing selective sweeps within Anopheles mosquito populations using deep learning

A nice example of training CNNs to detect selection using pre-computed features (e.g., a large collection of population genetic summary statistics) rather than “painted haplotype” images.
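To make the “pre-computed features” idea concrete, here is an illustrative sketch (not the paper's actual statistic set) computing two classic population genetic summary statistics from a toy haplotype matrix; a vector of such statistics, rather than the raw haplotypes, becomes the model input:

```python
from itertools import combinations

# Toy haplotype matrix: rows = haplotypes, columns = biallelic sites (0/1).
haplotypes = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]

def nucleotide_diversity(haps):
    # Pi: the mean number of pairwise differences between haplotypes.
    pairs = list(combinations(haps, 2))
    diffs = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return diffs / len(pairs)

def segregating_sites(haps):
    # Count sites that are variable (neither fixed ancestral nor derived).
    n = len(haps)
    return sum(0 < sum(col) < n for col in zip(*haps))

# The feature vector handed to the network, in place of raw haplotypes.
features = [nucleotide_diversity(haplotypes), segregating_sites(haplotypes)]
```

In practice such statistics are computed in many windows across the genome (often with a library like scikit-allel), and the resulting feature matrix is what the CNN sees.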

Visualizing population structure with variational autoencoders

The authors use a variational autoencoder (VAE) to embed sample genotype vectors into a 2-dimensional latent space that reflects geographical origin.

Haplotype and population structure inference using neural networks in whole-genome sequencing data

A deep learning framework for characterization of genotype data