Interpretable Genomic Language Model Architectures for Large-Scale Variant Effect Prediction
Interpreting the functional impact of human DNA variants is central to personalised medicine, yet current models struggle to capture the long-range, non-coding context in which many causal variants act. Transformer-based genomic language models such as DNABERT-2 and the Nucleotide Transformer have recently achieved state-of-the-art performance on coding and motif-centric tasks, but their quadratic attention complexity makes them computationally inefficient for sequences where regulatory context spans thousands of base pairs. This PhD will develop a new family of small and large genomic language models that retain or exceed the accuracy of existing models while substantially improving efficiency, interpretability, and clinical usability for non-coding variant effect prediction.
Preliminary work has demonstrated that systematic layer-wise pruning of large genomic Transformers can dramatically reduce model size and fine-tuning time while preserving performance on non-coding variant datasets, including eQTL causal variants derived from Enformer benchmarks. Interestingly, the pattern of layer importance differs across architectures, suggesting that current models are over-parameterised in heterogeneous ways and that principled structure-aware compression is possible. Building on these insights, the proposed research will pursue three aims. First, it will characterise layer- and head-level redundancy, information flow, and contextual receptive fields across multiple pre-trained genomic LLMs, establishing design principles for compact small-scale models targeted to specific variant interpretation tasks. Second, it will design and train hybrid small–large model pipelines in which lightweight pruned models perform rapid genome-wide screening, while larger long-context models selectively refine predictions for challenging loci requiring distal regulatory context. Third, it will develop methods for mechanistic interpretation of predictions, linking model attributions and internal representations to known regulatory elements, cell-type specific chromatin states, and disease-associated variant annotations, thereby supporting transparent decision-making.
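As a simple illustration of the kind of layer-wise analysis this preliminary work builds on, the sketch below scores each encoder layer by the performance drop observed when that layer is removed. The encoder, dataset, and accuracy metric are illustrative stand-ins (a toy Transformer evaluated on random token data), not the project's actual models or benchmarks.

```python
# Minimal sketch of layer-wise pruning analysis; all components here are
# illustrative stand-ins, not the project's benchmark code.
import copy
import torch
import torch.nn as nn

class TinyGenomicEncoder(nn.Module):
    """Stand-in for a pre-trained genomic Transformer encoder with a classification head."""
    def __init__(self, vocab_size=8, d_model=64, n_layers=6, n_heads=4, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):
        x = self.embed(tokens)
        for layer in self.layers:
            x = layer(x)
        return self.head(x.mean(dim=1))  # pooled variant-effect logits

def evaluate(model, tokens, labels):
    """Placeholder metric: accuracy on a held-out variant set."""
    with torch.no_grad():
        preds = model(tokens).argmax(dim=-1)
    return (preds == labels).float().mean().item()

def layer_importance(model, tokens, labels):
    """Score each layer by the accuracy drop when it is removed (larger drop = more important)."""
    baseline = evaluate(model, tokens, labels)
    scores = {}
    for i in range(len(model.layers)):
        pruned = copy.deepcopy(model)
        del pruned.layers[i]  # drop a single encoder layer
        scores[i] = baseline - evaluate(pruned, tokens, labels)
    return scores

# Illustrative usage on random data standing in for tokenised non-coding variant sequences.
tokens = torch.randint(0, 8, (32, 128))
labels = torch.randint(0, 2, (32,))
print(layer_importance(TinyGenomicEncoder(), tokens, labels))
```

In practice, the same loop would be run over pre-trained checkpoints such as DNABERT-2 or the Nucleotide Transformer, with evaluation on held-out non-coding variant benchmarks rather than the toy data shown here.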
Methodologically, the project will combine structured pruning, low-rank adaptation, and knowledge distillation with scalable long-context attention or state-space mechanisms to handle sequences on the order of tens of kilobases. Models will be trained and evaluated on diverse human genomic resources, including population variation, regulatory annotations, and expression quantitative trait loci, with a focus on clinically relevant non-coding variants. Wherever possible, training and evaluation will leverage harmonised cohorts from large-scale resources and clinically curated variant databases, enabling rigorous assessment across ancestries, disease areas, and sequencing platforms. The research will also explore hardware-aware training strategies to maximise throughput and energy efficiency for very long contexts. Performance will be assessed not only by predictive accuracy, but also by compute and memory efficiency, calibration, robustness across cohorts, and the faithfulness of interpretable explanations that could be inspected by domain experts.
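As one concrete example of the parameter-efficient fine-tuning ingredient mentioned above, the sketch below wraps a frozen linear projection with a trainable low-rank update in the standard LoRA style; the rank, scaling factor, and choice of which projection to adapt are illustrative assumptions rather than fixed project settings.

```python
# Minimal sketch of low-rank adaptation (LoRA) for fine-tuning a frozen projection;
# rank and scaling are illustrative choices, not project parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.t() @ self.lora_B.t())

# Illustrative usage: adapt the query projection of one attention block.
q_proj = nn.Linear(64, 64)
adapted = LoRALinear(q_proj, rank=4)
out = adapted(torch.randn(2, 128, 64))
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank factors are trainable
```

Because only the low-rank factors receive gradients, adaptation of pruned or long-context backbones stays cheap in both memory and compute, which is what makes the rapid genome-wide screening stage of the proposed pipeline tractable.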
The expected outcome is a principled framework, and an accompanying set of architectures and methods, for jointly optimising scale, efficiency, and interpretability in genomic language models, producing a suite of small and large models tailored to different stages of variant prioritisation pipelines. By enabling fast yet accurate assessment of non-coding variants at population scale, this work aims to narrow the gap between genome sequencing and actionable insight, directly supporting future applications in rare disease diagnosis, polygenic risk stratification, and personalised therapeutic development, while laying methodological foundations for safe and reliable deployment of genomic AI systems in precision medicine.
Prospective Candidate Profile:
A strong UK undergraduate degree (First Class or Upper Second Class), or an MEng, MSci, or MSc from the UK or overseas, in Computer Science, Computer Engineering, Physics, or Biomedical Science is required, with demonstrated proficiency in programming (particularly machine learning, deep learning, and statistics) and computational problem-solving.
Skillset Requirements:
Computational Skillset (Primary Requirement)
- Machine Learning & Deep Learning
- Understanding of Natural Language Processing & Representation Learning
- High-Performance Computing & Software Engineering (e.g. proficiency in Python, version control (Git), and reproducible ML workflows; experience running large-scale models on large genomic datasets is a bonus)
- Algorithmic Thinking & Experimental Design
Genomic Skillset (Supporting Requirement)
- Foundational knowledge of genomics, biological datasets, and how they are interpreted is a secondary requirement; eagerness to learn and develop this knowledge quickly is essential.
- Ability to understand and interpret biological knowledge bases, or eagerness to develop quickly in this domain, is essential.