Frequently Asked Questions

What is Humanity's Last Exam (HLE)?

HLE is a 2,500-question benchmark of expert-level academic problems spanning mathematics, the sciences, and the humanities. It is designed to test AI systems beyond saturated benchmarks such as MMLU.

Who is Dr. Syed M. Shahid from EIT?

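Dr. Shahid is a Senior Postgraduate Lecturer in Health and Sport Science at EIT's Auckland campus and holds a PhD in Basic Health Science with a focus on Medical Biochemistry. He contributed health science questions to HLE, bridging digital health and AI evaluation.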

Why was HLE created?

Current AI benchmarks are saturated, with frontier LLMs scoring above 90% on them. HLE provides hard, verifiable questions that measure progress toward expert human performance more accurately.

How do AIs perform on HLE?

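Top models such as Gemini 3.1 Pro Preview score about 45% and GPT-5 variants about 40-44%, far below human experts, who reach around 90% in their own domains. Models are also highly overconfident, with calibration errors of roughly 50-70%. See the leaderboard at lastexam.ai.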

What subjects are hardest for AI on HLE?

Advanced mathematics, specialized STEM fields, and trivia that demands precise recall. These questions require deep reasoning rather than simple retrieval.

How does EIT contribute to global AI research?

Through experts like Dr. Shahid, EIT shows that polytechnics can contribute to benchmark design and to work at the intersection of health and AI. This also raises New Zealand's research profile.

Implications for NZ higher education?

HLE highlights the need for AI literacy and ethics training, and points to opportunities in research jobs at universities and polytechnics.

Can students access HLE?

Yes, it is publicly available at lastexam.ai. It is a useful resource for learning about AI's limits and for contributing questions.

Future of AI benchmarks post-HLE?

Expect dynamic updates and new challenges. Benchmarks like HLE track the path toward expert-level AI and help inform policy.

Career tips from EIT's involvement?

Build domain expertise alongside AI skills. Pursue research at EIT or a university, and check NZ research job listings.

How to read the Nature paper?

An open-access summary is available at Nature, with full details on arXiv.

EIT Researcher Co-Authors Nature Paper Exposing AI's Expert Limits with Humanity's Last Exam


In a landmark achievement for New Zealand's higher education sector, Dr. Syed M. Shahid, a Senior Postgraduate Lecturer in Health and Sport Science at the Eastern Institute of Technology (EIT) Auckland campus, has co-authored a groundbreaking paper published in the prestigious journal Nature. The study, titled "A benchmark of expert-level academic questions to assess AI capabilities," introduces Humanity's Last Exam (HLE), a rigorous new benchmark designed to push the boundaries of artificial intelligence (AI) testing and reveal its current limitations in handling expert-level knowledge.

This collaboration involves over 1,000 global experts and highlights EIT's growing role in international AI research. As AI systems like large language models (LLMs) dominate headlines with their impressive feats, HLE provides a sobering reality check, showing that even the most advanced models still fall short of human expert performance on complex, verifiable academic tasks. For New Zealand's tertiary institutions, this underscores the value of polytechnics like EIT contributing to cutting-edge global science.

Dr. Syed M. Shahid: From Biochemistry to AI Frontiers

Dr. Shahid brings a unique perspective to the project. Holding a PhD in Basic Health Science with a focus on Medical Biochemistry from the University of Karachi, he has over 20 years of experience in health research, including roles at the University of Auckland's Faculty of Medical and Health Sciences and Aspire2 International. At EIT, he supervises postgraduate research and lectures on topics like nutrition, digital health, and health promotion.

His involvement in HLE stems from his expertise in health sciences, where he contributed challenging questions that test deep domain knowledge. "Participating in this global effort was an honor," Dr. Shahid noted in EIT announcements. "It allows researchers from institutions like EIT to influence how we measure AI progress, ensuring benchmarks reflect real-world expert challenges." This marks a significant milestone for EIT, demonstrating how applied research institutes in New Zealand are making waves in theoretical AI evaluation.

Dr. Shahid's career trajectory—from publishing 60+ papers and supervising dozens of theses to now co-authoring in Nature—exemplifies the interdisciplinary paths available in NZ higher education. His work bridges health inequities in ethnic communities with emerging tech like AI-assisted diagnostics, positioning EIT as a hub for practical innovation.

Humanity's Last Exam: The Ultimate AI Stress Test

At its core, HLE comprises 2,500 multi-modal questions spanning dozens of subjects, from advanced mathematics and physics to humanities, biology, and niche areas like local customs or historical trivia. Unlike standard benchmarks, these are crafted to be unambiguous, verifiable, and resistant to simple internet lookups—requiring genuine reasoning and expert insight.

The benchmark's name reflects its ambition: as AI saturates easier tests, HLE aims to be the "last" comprehensive closed-ended academic exam before AI matches or exceeds human experts across the board. Questions use multiple-choice and short-answer formats so that they can be graded automatically. The reported expert disagreement rate is about 15%, which gives a sense of how much noise remains in the answer key.
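To make the idea of automated grading concrete, here is a minimal sketch of a grader for the two question formats. It is an illustration only: the function names, the `multiple_choice` label, and the normalization rules are assumptions, and the benchmark's actual grading pipeline may work differently.

```python
import re
import unicodedata


def normalize(answer: str) -> str:
    """Lowercase, strip accents and most punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", answer)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep "." and "-" so numeric answers such as "-3.5" survive.
    text = re.sub(r"[^\w\s.-]", " ", text.lower())
    return " ".join(text.split())


def grade(question_type: str, predicted: str, reference: str) -> bool:
    """Return True when the prediction matches the reference answer."""
    if question_type == "multiple_choice":
        # Compare only the chosen option letter, e.g. "b)" against "B".
        match = re.match(r"\s*([A-Za-z])\b", predicted)
        return bool(match) and match.group(1).upper() == reference.strip().upper()
    # Short answer (hypothetical rule): exact match after normalization.
    return normalize(predicted) == normalize(reference)


def accuracy(records: list[tuple[str, str, str]]) -> float:
    """Fraction of (question_type, predicted, reference) records graded correct."""
    if not records:
        return 0.0
    return sum(grade(*record) for record in records) / len(records)
```

Exact matching is deliberately strict. It suits questions with one unambiguous answer, which is the property the benchmark's questions are described as having.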

Figure: Humanity's Last Exam benchmark diagram showing the gap between AI and human performance.

The Saturation Crisis in AI Benchmarks

Traditional benchmarks like Massive Multitask Language Understanding (MMLU) have become saturated. Frontier LLMs now score over 90% on them, which masks true progress and leads to unreliable comparisons and overhyped capabilities.

HLE addresses this by targeting graduate-level expertise. Developers filtered out questions that LLMs already answer well, so that the remaining set measures the "expert human frontier." Early tests showed models like GPT-4o at just 2.7% accuracy, while human experts hit around 90% in their domains, a gap of roughly 87 percentage points.
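The filtering step and the size of the gap can be sketched in a few lines. This is a hypothetical illustration: the `Candidate` record, the `max_solvers` threshold, and both function names are assumptions made for the example, not the procedure used by the benchmark's authors.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    question_id: str
    # Hypothetical record: model name -> whether that model answered correctly.
    model_correct: dict[str, bool] = field(default_factory=dict)


def keep_hard_questions(candidates: list[Candidate], max_solvers: int = 0) -> list[Candidate]:
    """Keep candidates answered correctly by at most `max_solvers` models."""
    return [c for c in candidates if sum(c.model_correct.values()) <= max_solvers]


def gap_in_points(human_accuracy: float, model_accuracy: float) -> float:
    """Accuracy gap in percentage points, given both accuracies in percent."""
    return human_accuracy - model_accuracy


pool = [
    Candidate("q1", {"model_a": True, "model_b": False}),
    Candidate("q2", {"model_a": False, "model_b": False}),
]
hard = keep_hard_questions(pool)  # only q2 survives with max_solvers=0
gap = gap_in_points(90.0, 2.7)    # about 87.3 percentage points
```

Expressing the gap in percentage points rather than percent avoids ambiguity: 2.7% is about 87 points below 90%, not 87% lower.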

Crowdsourcing Expertise: Building HLE Globally

Over 1,000 subject-matter experts worldwide contributed, including several from New Zealand: Dr. Shahid (EIT), Mohinder Maheshbhai Naiya (Auckland University of Technology), Jennifer Zampese (University of Canterbury), and Gaël Gendron (University of Auckland). Questions underwent rigorous validation to confirm difficulty and verifiability.

Questions were crowdsourced in an effort coordinated by the Center for AI Safety and Scale AI, with an ongoing "HLE-Rolling" track for fresh challenges. This collaborative model democratizes benchmark creation, allowing contributions from diverse institutions like EIT.

Key features of the process:
Expert vetting for unambiguous solutions
Rejection of questions that are retrievable or easily solved by AI
Broad coverage: STEM-heavy but including humanities
Multi-modal: text and images for comprehensive testing

AI's Stumbling Blocks: Low Scores and Overconfidence

Initial results were humbling, and scores remain modest. On leaderboards as of early 2026, top models like Gemini 3.1 Pro Preview score about 45%, GPT-5 variants about 40-44%, and Claude models about 30-35%, all still far from human levels. Calibration errors range from roughly 50% to 70%, meaning the models often give wrong answers with high stated confidence.

The hardest areas are world-class mathematics, which demands deep reasoning, specialized STEM, and trivia requiring precise recall. Multiple-choice questions are slightly easier, but exact-answer questions expose the models' true limits.
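One common way to quantify this kind of overconfidence is a binned calibration error, which compares a model's stated confidence with its actual accuracy. The sketch below is a generic version under stated assumptions: confidences are fractions between 0 and 1, the bin count is arbitrary, and the exact metric reported on the leaderboard may be defined differently.

```python
def calibration_error(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Binned expected calibration error, as a fraction between 0 and 1.

    Assumes each confidence lies in [0, 1].
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        # Weight each bin's confidence/accuracy mismatch by its share of answers.
        error += (len(members) / total) * abs(mean_conf - acc)
    return error
```

For example, a model that states 90% confidence on four answers and gets only one right has a calibration error of about 0.65, which falls in the range quoted above.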

Figure: Leaderboard of AI models on Humanity's Last Exam, showing low accuracies.

Implications for AI Development and Governance

HLE makes clear that AI is not yet "expert-level" on structured academic tasks, which informs policy on risks such as overreliance on AI in academia or healthcare. It puts the emphasis on gaps in reasoning rather than memorization.

For developers, it's a roadmap: improving calibration and reasoning could close the gap. Policymakers gain a metric for safe deployment. The paper calls for transparent evaluation to guide research.

New Zealand's Emerging Role in AI Research

With contributors from EIT, AUT, Canterbury, and Auckland, NZ punches above its weight. EIT's involvement showcases the research capability of polytechnics, which complements that of universities.

Government initiatives such as the national AI strategy support this. Institutions like EIT foster interdisciplinary talent, which is vital as AI integrates into health, education, and sustainability.

For students, it highlights opportunities in AI ethics and benchmarking, fields where human insight remains superior.

Transforming Higher Education in Aotearoa

In NZ colleges and universities, HLE prompts reflection on AI tools. Lecturers like Dr. Shahid integrate AI ethically, teaching limits alongside strengths.

The benefits include augmented research and personalized learning. The risks include plagiarism and reduced critical thinking. EIT's health programs now emphasize AI literacy, preparing graduates for digital health roles.

Practical responses include:
Training on benchmark creation
Ethical AI curricula
Interdisciplinary projects

Career Pathways in AI and Research

This breakthrough opens doors. NZ needs AI researchers, ethicists, and health data specialists, and EIT graduates go on to PhDs and industry roles.

The key skills are domain expertise combined with technical fluency. Institutions offer research assistantships and lecturer positions that fuel contributions like this one.


The Road Ahead: Evolving Benchmarks and AI

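HLE is not final: dynamic updates keep it relevant. As scores rise (about 45% now versus about 3% initially), the 50% mark is the next milestone to watch, though it is still well short of the roughly 90% that human experts achieve in their own domains.

For NZ higher education, it is a call to invest in talent. Dr. Shahid's success shows that polytechnics can have global impact.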

Explore the full Nature paper or arXiv preprint for details. Leaderboards at lastexam.ai track progress.
