PhD position in Multimodal (Audio and Vision) Conversational Foundation Models
About the Project
A PhD position funded by, and in collaboration with, Tavus Inc, on designing the next generation of conversational models: Multimodal Large Models that can see, hear, and understand, and that generate the audio and video responses of a Digital Human.
Context
While there has been an explosion in text-based conversational agents and dialogue systems, these lack the naturalness and richness of human-to-human interaction. Beyond language, humans communicate with facial expressions, voice intonation, and head and body gestures that convey emotions, social signals, and semantic information. Recent advances in multimodal generation, including audio and video generation systems and digital avatars, are promising directions; however, there is a distinct lack of foundation models that model and generate human behaviour in the context of conversations.
Objectives
The position will focus on designing and training components of a Multimodal (Audio and Vision) Conversational Foundation Model that is able to perceive and generate verbal and non-verbal responses in the context of a conversation. Research directions include but are not limited to:
- Multimodal perception of human behaviour, including emotions, personality, intentions, backchanneling signals, and stages of the conversation, using supervised and unsupervised methodologies and Reinforcement Learning.
- Post-training methodologies for multimodal generation aligned with control signals such as conversational goals and personality. This line of research will extend methodologies in mechanistic interpretability and steering to the domain of Large Multimodal Models.
- Controllable generation of infinitely long audio-visual output with identity and quality preservation. The work will focus on recent diffusion-based methods for video generation and editing.
Team
You will be part of the Multimedia and Vision Research group and a member of the Centre for Multimodal AI. The team you will join regularly publishes in top conferences and journals, including CVPR, ICCV, ECCV, NeurIPS, TPAMI, and IJCV, and has access to computational resources, including a computational server with 64 A100s, exclusive access to 3 A100s, and other servers.
The projects are defined in collaboration with Tavus Inc, a US-based Series B startup designing the next generation of Digital Humans. You will collaborate closely with a dynamic team of 20 researchers and will interact regularly, at Tavus's London office, with both the London-based and international teams.
For more information, please see https://ipatras.github.io
Supervisor
Prof. Ioannis Patras i.patras@qmul.ac.uk
https://www.qmul.ac.uk/eecs/people/profiles/patrasioannis.html
Funding
The PhD student will receive an annual stipend (currently £22,780 per year for 2025/26) for the duration of the PhD, which can span 3 years. QMUL will cover the full tuition fee (for either Home or International students).
Application Deadline
Applications are invited until the position is filled.
Start date: 1st June 2026 (or as soon as possible).
Who can apply
- Applicants should hold, or expect to obtain, an MSc in Electronic Engineering, Computer Science, or a closely related discipline.
- A distinction or first-class level degree is highly desirable.
How to apply
Please contact me at i.patras@qmul.ac.uk with your CV, including the string [2026 Conversational Foundation Models PhD Application] in the subject line. Apply to QMUL for the scholarship (as instructed below): https://www.qmul.ac.uk/postgraduate/research/applying-for-a-phd/
Applicants should work with their prospective supervisor and submit their application following the instructions at: http://eecs.qmul.ac.uk/phd/how-to-apply/
The application should include the following:
- CV (max 2 pages)
- Cover letter (max 4,500 characters) stating clearly in the first page whether you are eligible for a scholarship as a UK resident (https://epsrc.ukri.org/skills/students/guidance-on-epsrc-studentships/eligibility)
- Research proposal (max 500 words)
- Two references
- Certificate of English Language (for students whose first language is not English)
For general enquiries, contact Mrs Melissa Yeo at m.yeo@qmul.ac.uk (administrative enquiries) or Dr Arkaitz Zubiaga at a.zubiaga@qmul.ac.uk (academic enquiries) with the subject [2026 Conversational Foundation Models PhD Application], and cc Prof Ioannis Patras at i.patras@qmul.ac.uk
For specific enquiries about the project, please contact Prof Ioannis Patras at i.patras@qmul.ac.uk


