ML for genomics, an evolving reading list
Introductions to ML
The CS230 course at Stanford has made slides and lectures available online. A fantastic resource.
The Amidi brothers have created cheat sheets for CS230 and other Stanford ML courses. Also great.
Andrej Karpathy’s videos are also excellent introductions to ML concepts. In his own words, the first video in his main series (called The spelled-out intro to neural networks and backpropagation: building micrograd) “only assumes basic knowledge of Python and a vague recollection of calculus from high school.” Highly recommended.
Deep learning: new computational modeling techniques for genomics
An excellent and detailed overview, probably where I’d start.
A primer on deep learning in genomics
Harnessing deep learning for population genetic inference
Navigating the pitfalls of applying machine learning in genomics
Excellent overview of pitfalls and possible mistakes that can confound ML analyses, with a particular focus on biological inference.
Opportunities and obstacles for deep learning in medicine
Supervised machine learning for population genetics: a new paradigm
To transformers and beyond: large language models for the genome
Applications
Papers are broadly grouped by ML architecture. Many of these papers involve a mix of architectures, so the groups should be considered “fuzzy.”
Effective gene expression prediction from sequence by integrating long-range interactions
Describes the “Enformer” model, which utilizes a transformer-based architecture to predict gene expression from sequence alone. Also see a suite of papers describing the limitations of “Enformer” for personal transcriptome inference.
Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation
Automatic inference of demographic parameters using generative adversarial networks
The authors describe a novel GAN architecture that features a population genetic simulator (in this case, the backwards-in-time msprime tool) as the “generator” and a convolutional neural network as the “discriminator”. The parameters of the msprime generator are randomly initialized, and the discriminator is trained to differentiate between simulated and real “images” of haplotypes in genomic regions of a predefined size. Over time, the generator gets better at simulating realistic-looking data and the discriminator gets better at telling the two classes of data apart. By the end of training, the generator can be interpreted by examining the population genetic parameters (population size, mutation rate, etc.) that optimally confused the discriminator. A well-written and clear overview of a cool (and interpretable) method.
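For a rough sense of the moving parts, here is a minimal sketch (not the authors' implementation) pairing an msprime “generator” with a small Keras CNN “discriminator”; the parameter values and fixed image size are arbitrary placeholders.

```python
# Minimal sketch of the two GAN ingredients: an msprime "generator" and a CNN
# "discriminator". Illustration only, not the authors' implementation; the
# parameter values and fixed image size below are arbitrary.
import msprime
import numpy as np
import tensorflow as tf

N_HAPLOTYPES, N_SITES = 64, 128  # fixed "image" size (haplotypes x segregating sites)

def simulate_haplotype_image(pop_size, mut_rate, seq_len=1e5, seed=None):
    """Simulate a genomic region and return a fixed-size 0/1 haplotype matrix."""
    ts = msprime.sim_ancestry(
        samples=N_HAPLOTYPES // 2, population_size=pop_size,
        sequence_length=seq_len, recombination_rate=1e-8, random_seed=seed,
    )
    ts = msprime.sim_mutations(ts, rate=mut_rate, random_seed=seed)
    geno = ts.genotype_matrix().T  # haplotypes x sites
    # Crop/pad to a fixed number of sites so every "image" has the same shape.
    image = np.zeros((N_HAPLOTYPES, N_SITES), dtype=np.float32)
    n = min(N_SITES, geno.shape[1])
    image[:, :n] = np.minimum(geno[:, :n], 1)
    return image

def build_discriminator():
    """Small CNN that scores haplotype images as real (1) vs. simulated (0)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_HAPLOTYPES, N_SITES, 1)),
        tf.keras.layers.Conv2D(32, (1, 5), activation="relu"),
        tf.keras.layers.Conv2D(32, (1, 5), activation="relu"),
        tf.keras.layers.GlobalMaxPooling2D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

# One "generator" proposal: simulate a batch under candidate parameters and
# score it with the discriminator (the adversarial training loop is omitted).
fake = np.stack([simulate_haplotype_image(1e4, 1e-8, seed=i + 1) for i in range(8)])
disc = build_discriminator()
disc.compile(optimizer="adam", loss="binary_crossentropy")
print(disc.predict(fake[..., None]).shape)  # (8, 1)
```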
Interpreting generative adversarial networks to infer natural selection from genetic data
A follow-up to the paper listed above. The authors fine-tune the trained discriminator from their GAN to infer regions of the genome under the effects of natural selection.
DNA language models are powerful predictors of genome-wide variant effects
GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
The nucleotide transformer: building and evaluating robust foundation models for human genomics
A recent attempt to build a “foundation model” for genomics. The authors essentially adapt BERT for DNA sequence by developing an encoder-only architecture that attempts to reconstruct randomly-masked 6-mer DNA “tokens.” The learned embeddings from an input DNA sequence can then be plugged into simple regression models to make predictions about chromatin accessibility, enhancer status, etc., or the model itself can be efficiently fine-tuned for a particular downstream classification task.
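As a toy illustration of the masked-token setup described above (not the paper's exact tokenizer or masking scheme), here is how non-overlapping 6-mer tokenization and random masking might look:

```python
# Toy illustration of BERT-style masked-token pretraining on DNA 6-mers.
# The vocabulary construction and 15% masking fraction are illustrative
# assumptions, not the Nucleotide Transformer's exact scheme.
import itertools
import random

KMER = 6
VOCAB = {"[PAD]": 0, "[MASK]": 1}
for kmer in itertools.product("ACGT", repeat=KMER):
    VOCAB["".join(kmer)] = len(VOCAB)

def tokenize(seq):
    """Split a DNA sequence into non-overlapping 6-mer token IDs."""
    return [VOCAB["".join(seq[i:i + KMER])]
            for i in range(0, len(seq) - KMER + 1, KMER)]

def mask_tokens(token_ids, mask_frac=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_frac:
            inputs.append(VOCAB["[MASK]"])
            targets.append(tok)       # the encoder is trained to recover this token
        else:
            inputs.append(tok)
            targets.append(-100)      # positions ignored by the loss
    return inputs, targets

seq = "ACGTGGCTAGCTAGGATCCGATCGATCGTACGTACGTAGCTAG"
ids = tokenize(seq)
masked, labels = mask_tokens(ids)
print(len(VOCAB), ids[:3], masked[:3], labels[:3])
```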
The unreasonable effectiveness of convolutional neural networks in population genetic inference
Check this paper out for a nice introduction to CNNs and how they can be applied to “images” of haplotypes in genomic regions. The associated GitHub repository includes a few simple models (written in TensorFlow/Keras), as well.
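As a sketch of the preprocessing step implied here, this is one way to turn a phased VCF region into the haplotype “image” a CNN consumes, using scikit-allel (the file path and region are placeholders):

```python
# Sketch of building a haplotype "image" (rows = haplotypes, columns = sites)
# from a phased VCF with scikit-allel. Path and region are placeholders;
# missing calls (coded -1) would need handling in real data.
import allel
import numpy as np

callset = allel.read_vcf("phased.vcf.gz", region="2L:1000000-1100000")
gt = allel.GenotypeArray(callset["calldata/GT"])
haps = gt.to_haplotypes()                       # sites x haplotypes
image = np.asarray(haps).T                      # haplotypes x sites: one row per chromosome
image = np.minimum(image, 1).astype("float32")  # collapse multi-allelic codes to 0/1
print(image.shape)                              # (2 * n_samples, n_segregating_sites)
```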
Sequential regulatory activity prediction across chromosomes with convolutional neural networks
Localizing post-admixture adaptive variants with object detection on ancestry-painted chromosomes
The authors train an off-the-shelf object detection model to identify genomic regions with recent (adaptive) admixture events. Nice example of using off-the-shelf models, rather than building architectures from scratch.
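For flavor, here is one way to repurpose an off-the-shelf detector for images like these, using torchvision's standard Faster R-CNN fine-tuning pattern; this may well differ from the detector the authors actually used, and the class labels and box below are made up.

```python
# Sketch of fine-tuning an off-the-shelf detector on "ancestry-painted" images:
# load a pretrained Faster R-CNN and swap in a task-specific prediction head.
# This is not necessarily the model the authors used; it just shows the pattern.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # background + "adaptive admixture region" (hypothetical label)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# A single fake "ancestry-painted chromosome" image with one labelled box,
# just to show the expected input format for fine-tuning.
image = torch.rand(3, 256, 1024)
target = {"boxes": torch.tensor([[100.0, 10.0, 300.0, 250.0]]),
          "labels": torch.tensor([1])}
model.train()
losses = model([image], [target])  # dict of detection losses to backpropagate
print(sorted(losses))
```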
Discovery of ongoing selective sweeps within Anopheles mosquito populations using deep learning
A nice example of training CNNs to detect selection using pre-computed features (e.g., a large collection of population genetic summary statistics) rather than “painted haplotype” images.
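A minimal sketch of this style of model, assuming a made-up set of per-window statistics and class labels (neither is taken from the paper):

```python
# Sketch of a classifier over pre-computed summary statistics rather than raw
# haplotypes: a 1-D CNN scanning subwindows of statistics around a focal region.
# The statistic count, window count, and class labels are placeholders.
import numpy as np
import tensorflow as tf

n_windows, n_stats = 11, 15  # e.g., 11 subwindows x 15 statistics per window

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_windows, n_stats)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # e.g., neutral / hard sweep / soft sweep
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Stand-in training data: random statistic matrices with random class labels.
X = np.random.normal(size=(32, n_windows, n_stats)).astype("float32")
y = np.random.randint(0, 3, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
```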
Visualizing population structure with variational autoencoders
The authors use a variational autoencoder (VAE) to embed sample genotype vectors into a 2-dimensional latent space that reflects geographical origin.
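A minimal Keras VAE sketch that does something similar in spirit (a 2-dimensional latent space over genotype vectors); the sizes are arbitrary and this is not the paper's exact architecture. It assumes the TensorFlow backend.

```python
# Minimal sketch of a VAE that embeds 0/1 genotype vectors in a 2-D latent
# space. Input size and layer widths are arbitrary placeholders.
import numpy as np
import tensorflow as tf
from tensorflow import keras

n_snps, latent_dim = 1000, 2

class Sampling(keras.layers.Layer):
    """Reparameterization trick; also registers the KL divergence as a loss."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(kl)
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: genotype vector -> 2-D Gaussian (mean, log-variance) -> sampled z.
enc_in = keras.Input(shape=(n_snps,))
h = keras.layers.Dense(128, activation="relu")(enc_in)
z_mean = keras.layers.Dense(latent_dim)(h)
z_log_var = keras.layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])

# Decoder: latent point -> reconstructed genotype probabilities.
h_dec = keras.layers.Dense(128, activation="relu")(z)
recon = keras.layers.Dense(n_snps, activation="sigmoid")(h_dec)

vae = keras.Model(enc_in, recon)
vae.compile(optimizer="adam", loss="binary_crossentropy")

# Stand-in data. After training, z_mean is the 2-D embedding plotted to
# visualize population structure.
X = np.random.randint(0, 2, size=(64, n_snps)).astype("float32")
vae.fit(X, X, epochs=1, verbose=0)
embedder = keras.Model(enc_in, z_mean)
print(embedder.predict(X, verbose=0).shape)  # (64, 2)
```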
Haplotype and population structure inference using neural networks in whole-genome sequencing data
A deep learning framework for characterization of genotype data