
New Insights into ChatGPT's Scientific Accuracy

A recent study has evaluated the performance of ChatGPT in understanding scientific hypotheses, revealing intriguing insights into its accuracy and consistency. The research team analyzed over 700 hypotheses, posing the same question ten times for each to assess the AI's reliability.

Accuracy and Performance Limitations

Initially tested in 2024, ChatGPT demonstrated a correct response rate of 76.5%. A follow-up in 2025 showed a modest increase to 80%. However, because each judgment is a binary true-or-false call, random guessing alone would yield roughly 50%; once that baseline is factored out, the AI's chance-adjusted score was only about 60%, a level of reliability closer to a low D grade.
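To see how the raw scores translate into the "low D grade" figure, here is a brief illustrative sketch (not taken from the study itself) of the standard chance-correction formula for a binary task, which measures how much of the gap between guessing and perfection the model actually covers:

```python
def chance_adjusted_accuracy(raw_accuracy: float, chance: float = 0.5) -> float:
    """Fraction of the gap between chance-level and perfect accuracy covered.

    For a true/false task, random guessing yields about 50%, so the
    adjustment is (raw - 0.5) / (1 - 0.5).
    """
    return (raw_accuracy - chance) / (1.0 - chance)

# The 2025 raw score of 80% works out to about 60% after adjustment;
# the 2024 raw score of 76.5% works out to about 53%.
print(round(chance_adjusted_accuracy(0.80), 2))   # 0.6
print(round(chance_adjusted_accuracy(0.765), 2))  # 0.53
```

This is the same arithmetic by which an 80% raw score becomes a D-range result once guessing is discounted.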

One significant challenge for ChatGPT was identifying false statements, with a correct labeling rate of just 16.4%. Furthermore, even when presented with identical prompts multiple times, the AI provided consistent answers only about 73% of the time.
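As a rough illustration of what a consistency measure like the one described above might look like (the study's exact metric may differ), one common approach scores each hypothesis by the share of its ten responses that agree with the most frequent answer:

```python
from collections import Counter

def consistency(responses: list[str]) -> float:
    """Share of responses matching the most common answer for one prompt.

    A hypothetical scoring function: a prompt asked ten times that returns
    the same label every time scores 1.0; a 7-to-3 split scores 0.7.
    """
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)

# A run that alternates between "true" and "false" scores poorly:
print(consistency(["true"] * 7 + ["false"] * 3))  # 0.7
print(consistency(["true"] * 10))                 # 1.0
```

Averaged over all 719 hypotheses, a metric of this kind would yield the sort of roughly 73% figure the researchers report.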

Concerns Over Inconsistency

Lead author Cicek, an associate professor at WSU's Carson College of Business, expressed concerns regarding the inconsistency in ChatGPT's responses. "When asked the same question repeatedly, the answers varied significantly," he noted, highlighting instances where the AI alternated between true and false responses.

Understanding AI's Limitations

The findings, published in the Rutgers Business Review, emphasize the need for caution when relying on AI for critical decisions that demand nuanced reasoning. While generative AI can produce coherent and persuasive text, it lacks true conceptual understanding.

Cicek remarked, "Current AI tools do not comprehend the world as humans do; they lack a 'brain.' They can provide insights but do not genuinely understand the content." This suggests that the development of artificial general intelligence capable of authentic reasoning may still be a distant goal.

Research Methodology

The study involved collaboration among experts from various institutions, employing 719 hypotheses from scientific literature published in business journals since 2021. These hypotheses often contain complex nuances, making the task of simplifying them to a true or false judgment particularly challenging.

The researchers tested both the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025, finding similar performance levels across both iterations. After adjusting for random guessing, the AI's chance-adjusted effectiveness remained at roughly 60% in both years.

Advising Caution with AI

The results underscore a fundamental limitation of large language models: while they can generate fluent and persuasive responses, they often struggle with complex reasoning, which can lead to misleading answers. Cicek advises business leaders to verify AI-generated information and approach it with a critical mindset, and he emphasizes the importance of training so that users understand AI's capabilities and limitations.

While this study specifically examined ChatGPT, similar findings have emerged from research on other AI tools, reinforcing the need for skepticism regarding AI's capabilities. Cicek concluded, "Always be cautious. While I utilize AI, careful discernment is essential."