Multimodal World Models for Reasoning about the Physical World

Current AI systems are good at recognising and describing what they see, and large language models can reason fluently about the world in text. They are much weaker at predicting how a situation will change when an action is taken, and their reasoning about physical events is often unreliable. One promising idea is the world model: a model that learns to predict how an environment will evolve over time. Most existing world models work only in narrow visual settings, rarely combine different types of input, and are not built to support reasoning.

This PhD will study how to learn world models that combine several modalities, such as vision, language, sound, and records of interaction, and that can be used for reasoning rather than prediction alone. There are several directions the student could take. One is how a predictive model should be structured so that a system can reason over it, for example to answer "what if" questions or to work out causes. Another is how to combine information from different modalities into a single, consistent model of how things behave. A third is how to test whether a model has really learned the dynamics of a situation, rather than picking up on shallow correlations. The precise focus, and the balance between theory, methods, and experiments, would be decided together with the student.

The work relates to robotics, scientific simulation, and the study of human reasoning, but its main aim is more general: to understand how predictive models of the world can support flexible reasoning.

The project would suit someone with a solid background in machine learning and mathematics who enjoys open research questions. Experience with deep learning tools is useful but not essential.
