Interpretable Genomic Language Model Architectures for Large-Scale Variant Effect Prediction
Interpreting the functional impact of human DNA variants is central to personalised medicine, yet current models struggle to capture the long-range, non-coding context in which many causal variants act. Transformer-based genomic language models such as DNABERT-2 and the Nucleotide Transformer have recently achieved state-of-the-art performance on coding and motif-centric tasks, but their quadratic attention complexity makes them computationally inefficient for sequences where regulatory context spans thousands of base pairs. This PhD will develop a new family of small and large genomic language models that retain or exceed the accuracy of existing models while substantially improving efficiency, interpretability, and clinical usability for non-coding variant effect prediction.
Preliminary work has demonstrated that systematic layer-wise pruning of large genomic Transformers can dramatically reduce model size and fine-tuning time while preserving performance on non-coding variant datasets, including eQTL causal variants derived from Enformer benchmarks. Interestingly, the pattern of layer importance differs across architectures, suggesting that current models are over-parameterised in heterogeneous ways and that principled structure-aware compression is possible. Building on these insights, the proposed research will pursue three aims. First, it will characterise layer- and head-level redundancy, information flow, and contextual receptive fields across multiple pre-trained genomic LLMs, establishing design principles for compact small-scale models targeted to specific variant interpretation tasks. Second, it will design and train hybrid small–large model pipelines in which lightweight pruned models perform rapid genome-wide screening, while larger long-context models selectively refine predictions for challenging loci requiring distal regulatory context. Third, it will develop methods for mechanistic interpretation of predictions, linking model attributions and internal representations to known regulatory elements, cell-type specific chromatin states, and disease-associated variant annotations, thereby supporting transparent decision-making.
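As a simple illustration of the kind of layer-wise analysis this preliminary work builds on, the sketch below scores each encoder layer by the performance drop observed when that layer is removed. The encoder, dataset, and accuracy metric are illustrative stand-ins (a toy Transformer evaluated on random token data), not the project's actual models or benchmarks.

```python
# Minimal sketch of layer-wise pruning analysis; all components here are
# illustrative stand-ins, not the project's benchmark code.
import copy
import torch
import torch.nn as nn

class TinyGenomicEncoder(nn.Module):
    """Stand-in for a pre-trained genomic Transformer encoder with a classification head."""
    def __init__(self, vocab_size=8, d_model=64, n_layers=6, n_heads=4, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.head(x.mean(dim=1))  # pooled variant-effect logits

def evaluate(model, tokens, labels):
    """Placeholder metric: accuracy on a held-out variant set."""
    with torch.no_grad():
        preds = model(tokens).argmax(dim=-1)
    return (preds == labels).float().mean().item()

def layer_importance(model, tokens, labels):
    """Score each layer by the accuracy drop when it is removed (larger drop = more important)."""
    baseline = evaluate(model, tokens, labels)
    scores = {}
    for i in range(len(model.layers)):
        pruned = copy.deepcopy(model)
        del pruned.layers[i]  # drop a single encoder layer
        scores[i] = baseline - evaluate(pruned, tokens, labels)
    return scores

# Illustrative usage on random data standing in for tokenised non-coding variant sequences.
tokens = torch.randint(0, 8, (32, 128))
labels = torch.randint(0, 2, (32,))
print(layer_importance(TinyGenomicEncoder(), tokens, labels))
```

In practice, the same loop would be run over pre-trained checkpoints such as DNABERT-2 or the Nucleotide Transformer, with evaluation on held-out non-coding variant benchmarks rather than the toy data shown here.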
Methodologically, the project will combine structured pruning, low-rank adaptation, and knowledge distillation with scalable long-context attention or state-space mechanisms to handle sequences on the order of tens of kilobases. Models will be trained and evaluated on diverse human genomic resources, including population variation, regulatory annotations, and expression quantitative trait loci, with a focus on clinically relevant non-coding variants. Wherever possible, training and evaluation will leverage harmonised cohorts from large-scale resources and clinically curated variant databases, enabling rigorous assessment across ancestries, disease areas, and sequencing platforms. The research will also explore hardware-aware training strategies to maximise throughput and energy efficiency for very long contexts. Performance will be assessed not only by predictive accuracy, but also by compute and memory efficiency, calibration, robustness across cohorts, and the faithfulness of interpretable explanations that could be inspected by domain experts.
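As one concrete example of the parameter-efficient fine-tuning ingredient mentioned above, the sketch below wraps a frozen linear projection with a trainable low-rank update in the standard LoRA style; the rank, scaling factor, and choice of which projection to adapt are illustrative assumptions rather than fixed project settings.

```python
# Minimal sketch of low-rank adaptation (LoRA) for fine-tuning a frozen projection;
# rank and scaling are illustrative choices, not project parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

# Illustrative usage: adapt the query projection of one attention block.
q_proj = nn.Linear(64, 64)
adapted = LoRALinear(q_proj, rank=4)
out = adapted(torch.randn(2, 128, 64))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank factors are trainable
```

Because only the low-rank factors receive gradients, adaptation of pruned or long-context backbones stays cheap in both memory and compute, which is what makes the rapid genome-wide screening stage of the proposed pipeline tractable.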
The expected outcome is a principled framework, and an accompanying set of architectures and methods, for jointly optimising scale, efficiency, and interpretability in genomic language models, producing a suite of small and large models tailored to different stages of variant prioritisation pipelines. By enabling fast yet accurate assessment of non-coding variants at population scale, this work aims to narrow the gap between genome sequencing and actionable insight, directly supporting future applications in rare disease diagnosis, polygenic risk stratification, and personalised therapeutic development, while laying methodological foundations for safe and reliable deployment of genomic AI systems in precision medicine.
Prospective Candidate Profile:
A strong UK undergraduate degree (First Class or Upper Second Class), or an MEng, MSci, or MSc from the UK or overseas, in Computer Science, Computer Engineering, Physics, or Biomedical Science is required, with demonstrated proficiency in programming (particularly machine learning, deep learning, and statistics) and computational problem-solving.
Skillset Requirements:
Computational Skillset (Primary Requirement)
- Machine Learning & Deep Learning
- Understanding of Natural Language Processing & Representation Learning
- High-Performance Computing & Software Engineering (e.g. proficiency in Python, version control (Git), and reproducible ML workflows; experience running large-scale models on large genomic datasets is a bonus)
- Algorithmic Thinking & Experimental Design
Genomic Skillset (Supporting Requirement)
- Foundational knowledge of genomics, biological datasets, and how they are interpreted is a secondary requirement; eagerness to learn and develop this knowledge quickly is essential.
- Ability to understand and interpret biological knowledge bases, or eagerness to develop quickly in this domain, is essential.