Breaking New Ground at EACL 2026
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), the UAE's pioneering graduate research university dedicated to advancing artificial intelligence, has made headlines at the European Chapter of the Association for Computational Linguistics (EACL) 2026 conference in Rabat, Morocco. A standout paper titled "JEEM: Vision-Language Understanding in Four Arabic Dialects" introduces a groundbreaking benchmark that exposes critical limitations in how vision-language models (VLMs) handle cultural nuances embedded in images when using Arabic dialects.
This research underscores MBZUAI's commitment to developing AI technologies that resonate with the Arab world's linguistic and cultural diversity. As the conference kicks off today, March 24, 2026, JEEM positions UAE higher education at the forefront of natural language processing (NLP) innovation tailored to low-resource languages like Arabic dialects.
What is the JEEM Benchmark?
JEEM, whose name spells the Arabic letter ج (jīm) and matches the initials of the four regions it covers (Jordan, the Emirates, Egypt, and Morocco), is a meticulously curated dataset designed to test VLMs' ability to interpret images not just literally, but through the lens of cultural commonsense in dialectal Arabic. Unlike generic benchmarks that rely on English-centric data or translations into Modern Standard Arabic (MSA), JEEM features content sourced from four distinct Arabic-speaking regions: Jordan (Levantine dialect), the United Arab Emirates (Gulf/Emirati dialect), Egypt (Egyptian dialect), and Morocco (North African/Moroccan dialect).
The benchmark comprises 2,178 images depicting everyday scenes, traditional artifacts, local customs, and regional landmarks. These are paired with 10,890 question-answer (QA) pairs and captions generated in native dialects, ensuring authenticity. Tasks include image captioning—where models describe scenes in dialectal Arabic—and visual question answering (VQA), covering descriptive, yes/no, categorical, and quantitative queries.
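For readers who work with such datasets, here is a minimal sketch of how a JEEM-style record could be represented in code. The field names and example values are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative sketch of one JEEM-style example: an image, a dialect label,
# parallel dialect/MSA captions, and per-image QA pairs. Hypothetical schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAPair:
    question: str  # question written in the regional dialect
    answer: str    # reference answer in the same dialect
    qtype: str     # "descriptive", "yes/no", "categorical", or "quantitative"

@dataclass
class JeemExample:
    image_path: str       # path to the scene photograph
    dialect: str          # "Jordanian", "Emirati", "Egyptian", or "Moroccan"
    caption_dialect: str  # caption in the regional dialect
    caption_msa: str      # parallel caption in Modern Standard Arabic
    qa_pairs: List[QAPair] = field(default_factory=list)  # five per image

example = JeemExample(
    image_path="images/tagine_001.jpg",
    dialect="Moroccan",
    caption_dialect="طاجين ديال اللحم والخضرة فوق الطاولة",
    caption_msa="طاجين من اللحم والخضروات على الطاولة",
)
example.qa_pairs.append(QAPair("شنو هاد الماكلة؟", "طاجين", "categorical"))
print(len(example.qa_pairs))  # 1
```

A full example would carry five such QA pairs, matching the roughly 5:1 ratio of QA pairs to images in the benchmark.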
The Rich Tapestry of Arabic Dialects
Arabic, spoken by over 400 million people across 25 countries, is far from monolithic. While MSA serves formal contexts like media and literature, daily communication thrives in dialects that vary dramatically by region. Emirati Arabic, for instance, incorporates Gulf-specific vocabulary influenced by Bedouin heritage and maritime trade, while Egyptian Arabic dominates pop culture through film and music, blending Coptic and ancient Egyptian elements. Moroccan Darija mixes Berber and French influences, and Jordanian Levantine reflects the shared history of the Levant.
These dialects shape how people describe visuals: an Emirati might call a traditional dish "halwa," evoking a specific sweet treat, whereas others might misidentify it based on their cultural frame. JEEM captures this by using native annotators to create dialect-grounded content, revealing how AI, trained mostly on MSA or translated data, falters in real-world, culturally loaded scenarios.
Crafting JEEM: A Human-Centric Annotation Process
Developing JEEM involved 1,618 hours of annotation by 37 native speakers, led by linguistics experts from MBZUAI and Toloka AI. Images were selected for cultural relevance (think kandura robes in UAE scenes or tagine pots in Morocco), avoiding generic stock photos. Annotators first captioned each image in their dialect, then in MSA, and finally wrote five diverse questions per image with corresponding answers. Several quality controls reinforced the process:
- Qualification via dialect proficiency tests ensured quality.
- Team leaders reviewed for accuracy, rejecting or editing as needed.
- A shared pool of 100 culturally iconic images was cross-annotated to highlight inter-dialect variances.
- Group chats fostered natural dialect use, mimicking conversational AI interactions.
This rigorous process yields a high-fidelity dataset free from translation artifacts, setting a gold standard for Arabic multimodal evaluation.
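One intuitive way to see what the cross-annotated shared pool surfaces is to compare captions that different dialect teams wrote for the same image. The sketch below uses simple lexical (Jaccard) overlap as a stand-in measure; this is an illustrative assumption, not the paper's actual analysis, and the captions are invented:

```python
# Toy illustration: quantify inter-dialect variance on a shared image by
# the lexical overlap of its captions across dialect teams. Assumed method.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two captions."""
    wa, wb = set(a.split()), set(b.split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

# Invented captions for the same image from two dialect teams.
captions = {
    "Emirati": "صحن حلوا عماني على الطاولة",
    "Egyptian": "طبق حلويات على الطاولة",
}
overlap = jaccard(captions["Emirati"], captions["Egyptian"])
print(round(overlap, 2))  # → 0.29
```

Low overlap on the same image is exactly the inter-dialect variance the shared pool of 100 images was designed to expose.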
VLMs Under the Microscope: Models Tested
JEEM benchmarks five leading open-source Arabic VLMs (Maya, PALO, Peacock, AIN, and AyaV) alongside GPT-4o. These models, trained on Arabic-inclusive data, excel in MSA but were probed here for dialectal prowess. Evaluation combined traditional metrics (BLEU, CIDEr, ROUGE-L, BERTScore), GPT-4o as a judge (scoring consistency, relevance, fluency, and dialect authenticity on 1-5 Likert scales), DCScore (a measure over decomposed information units), ALDi (an automatic dialectness detector), and human assessments on subsets.
Human evaluation on 350 images and 6,650 captions showed poor alignment between auto-metrics and human judgment (Kendall's τ_c ~0.1-0.2), underscoring the need for nuanced evaluators in morphologically rich languages like Arabic.
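Kendall's τ measures how often two rankings agree in their pairwise ordering. The paper reports the tie-corrected τ_c variant; for intuition, here is a minimal τ-a implementation over paired metric and human scores (all numbers below are invented for illustration):

```python
# Minimal Kendall's tau-a: (concordant - discordant) / total pairs.
# The paper uses tau_c, which further corrects for ties on bounded scales.
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Rank correlation between two equal-length score lists."""
    assert len(xs) == len(ys) and len(xs) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if prod > 0:
            concordant += 1
        elif prod < 0:
            discordant += 1
        # pairs tied in either variable count toward neither
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented BLEU scores vs. human Likert ratings for six captions:
bleu = [0.12, 0.30, 0.25, 0.40, 0.18, 0.22]
human = [3, 2, 4, 3, 5, 2]
print(round(kendall_tau_a(bleu, human), 2))
```

Values near 0, like the ~0.1-0.2 reported in the paper, mean the automatic metric barely predicts which caption a human will prefer.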
Revealing Results: Fluency vs. True Understanding
Key findings paint a stark picture. GPT-4o leads with high fluency (4.67-4.77/5) and relevance (3.70-3.75), but dips in dialect authenticity, especially for Emirati, the lowest-resourced dialect in the set. Open models lag: AyaV is the strongest among them, yet all score below the human-written ground truth (e.g., MSA consistency: 3.67 for GPT-4o vs. 4.59 for the ground truth).
| Model | Dialect | Consistency | Relevance | Fluency | Dialect Auth. |
|---|---|---|---|---|---|
| GPT-4o | MSA | 3.67 | 3.75 | 4.77 | - |
| GPT-4o | Emirati | 3.22 | 3.35 | 4.62 | 3.81 |
| AyaV | Egyptian | 2.76 | 2.96 | 4.22 | 2.55 |
AI shines in literal description but crumbles on cultural inference—like identifying regional desserts or attire customs. Cross-dialect analysis on shared images shows models homogenize interpretations, ignoring regional lenses.
Cultural Gaps Exposed: Real-World Examples
Consider an image of Omani halwa: Emirati annotators nailed it, but others called it pudding or chocolate, reflecting cultural unfamiliarity. VLMs often generate fluent but semantically off dialectal output, mistaking visual cues without contextual knowledge. This gap widens for low-resource dialects like Emirati, mirroring UAE's push for localized AI amid global models' Western biases.
MBZUAI's Pivotal Role in UAE AI Ecosystem
MBZUAI, established in 2019 as the world's first AI graduate university, leads the UAE's Vision 2031 push to become a global AI powerhouse. With prior benchmarks like ArabicMMLU and cultural VQA datasets, JEEM builds on this legacy. Collaborations with Toloka AI exemplify the UAE's open innovation model, attracting global talent to Abu Dhabi.
For more on opportunities at MBZUAI, explore the full MBZUAI announcement.
Implications for Arabic AI and Beyond
JEEM challenges the notion of "multilingual" AI, revealing hidden biases in VLMs. For Arab users, this means unreliable assistants in education, healthcare, or e-commerce—critical for UAE's digital economy. It calls for diverse training data, dialect-aware fine-tuning, and culturally grounded metrics. In higher ed, it inspires curricula integrating regional NLP, positioning UAE universities as hubs for equitable AI.
Future Horizons: Scaling Cultural AI
Authors envision expanding JEEM to more dialects and tasks, integrating it into leaderboards for continuous tracking. MBZUAI plans dialect-specific model training, aligning with UAE's AI Strategy 2031. As EACL unfolds, expect discussions on inclusive benchmarks driving responsible AI.
Stakeholder Views and UAE Context
UAE educators praise JEEM for amplifying underrepresented voices, with experts noting its role in attracting PhD talent. Amid UAE's 100% AI literacy goal by 2031, such research bolsters national pride and global competitiveness.
Beyond the conference hall, JEEM's ripple effects are tangible:
- Enhances AI for Gulf tourism apps recognizing Emirati landmarks.
- Supports edtech personalizing content in local dialects.
- Drives research jobs in NLP at UAE institutions.