ChatGPT Frequently Misjudges Scientific Claims, WSU Study Reveals

Unstable Intelligence: GenAI Struggles with Accuracy and Consistency

  • higher-education-ai
  • university-policies
  • research-publication-news
  • research-integrity
  • chatgpt



Washington State University Study Exposes ChatGPT's Scientific Judgment Flaws

Recent research from Washington State University (WSU) has cast a spotlight on a critical limitation of ChatGPT: its frequent misjudgment of scientific claims. Led by Associate Professor Mesut Cicek of WSU's Carson College of Business, the study tested the AI on 719 hypotheses drawn from business journal papers published since 2021. These hypotheses required nuanced reasoning to determine whether they were supported (true) or refuted (false) by the evidence.

The experiment revealed that while ChatGPT appeared confident in its responses, its accuracy hovered around 80% at best and plummeted when identifying false claims, which it got right only 16.4% of the time. Even more concerning was the AI's inconsistency: asked the same question 10 times, it gave stable answers just 73% of the time, sometimes flip-flopping between true and false within a single set of repeats.

Methodology: Rigorous Testing of AI Reasoning

Cicek and co-authors Sevincgul Ulu (Southern Illinois University), Can Uslay (Rutgers University), and Kate Karniouchina (Northeastern University) selected hypotheses involving complex variables, such as consumer behavior or market dynamics, where black-and-white answers are rare. Using the free versions of ChatGPT-3.5 in 2024 and ChatGPT-5 mini in 2025, they prompted the AI with: "Is this hypothesis supported by research? True or false?" Each prompt was repeated 10 times to gauge reliability.
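The reliability check can be pictured as a small bookkeeping script: send the same prompt 10 times, then score how often the answers agree. The `stability` helper below is a hypothetical sketch of that scoring (the study's own code is not public), assuming the model's replies have already been collected as strings:

```python
from collections import Counter

def stability(responses):
    """Return the modal answer and the fraction of repeats agreeing with it.

    `responses` holds replies from repeated identical prompts, e.g. 10 runs
    of "Is this hypothesis supported by research? True or false?"
    """
    counts = Counter(responses)
    modal_answer, modal_count = counts.most_common(1)[0]
    return modal_answer, modal_count / len(responses)

# A stable run versus the flip-flopping behavior the study observed:
print(stability(["true"] * 10))          # ('true', 1.0)
print(stability(["true", "false"] * 5))  # a 50/50 split: no reliable answer
```

Aggregating this score across all 719 hypotheses would yield a consistency figure like the 73% the researchers report.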

Raw accuracy improved slightly from 76.5% to 80%, but normalizing against the 50% random-guess baseline of a true/false task yields a chance-adjusted score of roughly 60%, earning ChatGPT a 'D' grade. This mirrors broader concerns in U.S. higher education, where faculty increasingly integrate AI but question its dependability for evaluative tasks.
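The study's exact adjustment formula isn't spelled out in the coverage, but the standard normalization against a guessing baseline can be sketched in a few lines of Python (the function name is illustrative):

```python
def chance_adjusted(raw_accuracy, baseline=0.5):
    """Rescale raw accuracy so 0.0 = random guessing and 1.0 = perfect.

    On a balanced true/false task a coin flip already scores 50%,
    so only the margin above that baseline reflects real skill.
    """
    return (raw_accuracy - baseline) / (1.0 - baseline)

print(chance_adjusted(0.80))   # ~0.60, the study's 'D' grade
print(chance_adjusted(0.765))  # ~0.53 for the 2024 GPT-3.5 run
```

Under this normalization, an 80% raw score on a coin-flip task is only 60% of the way from guessing to perfection.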

Key Findings: Low Accuracy on False Claims and Wild Inconsistencies

The most glaring weakness emerged with false hypotheses, where ChatGPT correctly identified them only 16.4% of the time, often confidently affirming debunked ideas. For true claims, performance was higher but still flawed. Consistency fared no better: in one case, responses alternated true-false five times each over 10 prompts.

  • 2024 (GPT-3.5): 76.5% raw accuracy, 73% consistent.
  • 2025 (GPT-5 mini): 80% raw accuracy, similar consistency issues.
  • False claim detection: 16.4% correct across tests.
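The gap between a decent headline number and terrible false-claim detection is easy to reproduce: if a test set skews toward true hypotheses and the model leans toward answering "true," overall accuracy stays high while one class collapses. A minimal illustration with made-up numbers (not the study's data):

```python
def per_class_accuracy(pairs):
    """Break accuracy down by ground-truth label.

    `pairs` is a list of (ground_truth, prediction) tuples.
    """
    totals, hits = {}, {}
    for truth, pred in pairs:
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return {label: hits[label] / totals[label] for label in totals}

# 90 true claims judged correctly, but 8 of 10 false claims waved through:
data = ([("true", "true")] * 90
        + [("false", "true")] * 8
        + [("false", "false")] * 2)
print(per_class_accuracy(data))  # {'true': 1.0, 'false': 0.2}, overall 92%
```

This is why the researchers report per-class figures rather than a single accuracy number: the 16.4% false-claim score would be invisible in the overall average.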

Cicek noted, "Current AI tools don't understand the world the way we do—they just memorize." This has direct repercussions for U.S. college classrooms, where professors rely on AI for quick fact-checks or quiz generation.

Why ChatGPT Struggles: Large Language Model Limitations

ChatGPT, a large language model (LLM) trained on vast internet data, excels at pattern matching and fluent text generation but lacks true comprehension. It predicts probable word sequences rather than reasoning causally, leading to 'hallucinations'—plausible but wrong outputs. Business hypotheses, blending psychology, economics, and statistics, expose this gap, as they demand contextual integration beyond memorized facts.

In U.S. universities, this mirrors challenges in STEM courses, where nuanced scientific claims are core. A 2025 survey showed 94% of higher ed professionals use AI at work, yet 69% of faculty address AI biases and errors in teaching.

Case Studies: Real-World AI Errors in U.S. College Settings

At Stanford, a 2024 experiment found ChatGPT fabricated references in 73% of generated scientific summaries, fooling initial reviewers. Similarly, University of Pennsylvania researchers reported AI detectors falsely flagging non-native English speakers, exacerbating equity issues on diverse campuses.

In a Hechinger Report study, middle schoolers using ChatGPT for science practice solved more problems in the short term but scored 17% worse on tests, suggesting overreliance hampers deep learning, a trend echoed in U.S. colleges, where 92% of students use AI.


Impacts on U.S. University Teaching and Research

Faculty at institutions like WSU and Rutgers now question AI for grading or literature reviews. A 2026 Elon/AAC&U survey revealed 95% of U.S. college faculty fear student AI overreliance, with 59% redesigning assessments. Research integrity suffers too: ChatGPT's low false-claim detection risks propagating errors in peer reviews or grant proposals.

At Northeastern, where co-author Karniouchina teaches, professors train students on AI verification, blending tools like Turnitin with human oversight. Yet detectors fail 20-50% of the time, leading to wrongful accusations.

Student Learning: Risks of Overreliance and Cheating Concerns

With 92% of U.S. students using AI per 2026 surveys, ChatGPT's inaccuracies undermine critical thinking. A BestColleges poll found 51% view AI use as cheating, yet detectors' biases hit international students hardest. Universities like MIT advocate process-based assessments (e.g., oral defenses) over proctored exams.

  • Benefits: AI aids brainstorming, explaining concepts.
  • Risks: Hallucinations mislead; inconsistency erodes trust.

Evolving University Policies: From Bans to Balanced Integration

Post-ChatGPT launch, many U.S. colleges banned AI; by 2026, policies shifted. Inside Higher Ed reports faculty easing restrictions for nuanced use, with 69% teaching AI literacy (bias, hallucinations). California State University deployed ChatGPT campus-wide for 460,000 students, emphasizing ethics training.

Examples: Harvard's guidelines mandate disclosure; Stanford's AI Commons offers workshops. Rutgers, tied to the study, promotes 'AI skepticism' in business curricula.

Expert Views: U.S. Academics Weigh In

"Always be skeptical," urges Cicek. At WSU, this informs marketing courses on AI pitfalls. Northeastern's Karniouchina stresses hybrid human-AI workflows. Broader consensus: AI augments, doesn't replace, human judgment in science education.

2026 EDUCAUSE report: 92% of institutions have AI strategies, prioritizing pilots and training amid growing adoption (90% of professionals use AI).


Future Outlook: Enhancing AI for Higher Education

Emerging solutions include retrieval-augmented generation (RAG) for fact-grounding and fine-tuned models. U.S. universities invest in AI literacy curricula; OpenAI's detection tools improve but lag. By 2030, expect 'explainable AI' mandates.

Actionable insights: Verify AI outputs with primary sources; teach verification skills; redesign assessments for reasoning. As Cicek concludes, "I'm not against AI... but you need to be very careful."

Dr. Liam Whitaker

Contributing Writer

Advancing health sciences and medical education through insightful analysis.


Frequently Asked Questions

🔬What did the WSU ChatGPT study find?

The study tested 719 hypotheses, finding 80% raw accuracy but only 16.4% on false claims and 73% consistency over repeats.

🤖Why is ChatGPT inconsistent on scientific claims?

LLMs like ChatGPT memorize patterns without true reasoning, leading to hallucinations and variability in probabilistic outputs.

📚How does this affect US college teaching?

Professors risk flawed AI-generated quizzes; 95% fear overreliance. Shift to AI literacy and process assessments.

⚠️What are common AI errors in higher ed?

Fabricated references (73% of cases), detector biases against non-native English speakers, and worse test performance from overreliance.

📜US university AI policies 2026?

From bans to integration: 92% of institutions have AI strategies, emphasizing ethics training and disclosure.

📊Stats on AI use in US colleges?

94% of professionals use AI; 92% of students and 79% of faculty engage with it actively, per surveys.

💡Solutions for AI in science education?

Verify with primary sources, redesign for reasoning, teach skepticism and literacy.

🧑‍🎓ChatGPT vs human accuracy in research?

Chance-adjusted, roughly 60% (a 'D' grade); humans remain stronger at nuance and causal reasoning.

🔒Implications for academic integrity?

Risks cheating, false accusations; universities adopt hybrid verification.

🚀Future of AI in US higher ed?

Explainable AI, RAG for grounding; 2026 sees pilots, ethics mandates.

🛡️Expert advice on using ChatGPT safely?

Cicek: 'Always be skeptical... verify results.' Integrate as tool, not oracle.