A global consortium of nearly 1,000 researchers, including a professor from Texas A&M University, has developed a groundbreaking assessment known as "Humanity's Last Exam" (HLE). The test challenges AI systems with questions drawn from expert human knowledge, particularly in areas where current AI technologies still struggle.
The HLE consists of 2,500 questions spanning mathematics, humanities, natural sciences, ancient languages, and various specialized academic disciplines. The findings of this project are detailed in a paper published in Nature.
Dr. Tung Nguyen, an instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, played a significant role in crafting and refining many of the exam's questions. He noted, "When AI systems perform exceptionally well on human benchmarks, it can create the illusion that they possess human-like understanding. However, HLE serves as a reminder that true intelligence encompasses depth, context, and specialized knowledge."
The exam is not a contest meant to outsmart human test-takers; its purpose is to pinpoint the specific areas where AI still lags behind. A diverse group of specialists wrote and reviewed the questions, ensuring each one has a definitive, verifiable answer and cannot be solved with a quick online search.
The topics covered in the exam are rooted in advanced academic challenges. Tasks include translating ancient Palmyrene inscriptions, identifying intricate anatomical structures in birds, and analyzing fine points of Biblical Hebrew pronunciation.
Each question underwent rigorous testing against leading AI models. If any AI system managed to answer a question correctly, that question was subsequently excluded from the final version, ensuring the exam remains a step ahead of current AI capabilities.
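That filtering step amounts to a simple gate over the candidate pool. The sketch below illustrates the idea in Python; the ask_model() helper, the model identifiers, and the exact-match grading are assumptions made for illustration, not the project's actual evaluation harness.

```python
# Minimal sketch of the gating step described above: a candidate question
# survives only if none of the tested frontier models answers it correctly.
# ask_model() and the model identifiers are hypothetical placeholders.

FRONTIER_MODELS = ["model-a", "model-b", "model-c"]  # placeholder IDs


def normalize(answer: str) -> str:
    """Crude answer normalization; real grading rules are not specified here."""
    return answer.strip().lower()


def ask_model(model_id: str, question: str) -> str:
    """Stand-in for a real API call that returns a model's answer."""
    raise NotImplementedError("plug in a real model client here")


def keep_question(question: str, reference_answer: str) -> bool:
    """Keep a candidate only if every tested model gets it wrong."""
    return all(
        normalize(ask_model(m, question)) != normalize(reference_answer)
        for m in FRONTIER_MODELS
    )


def filter_pool(candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Reduce a (question, answer) pool to those no tested model can solve."""
    return [(q, a) for q, a in candidates if keep_question(q, a)]
```

Exact string matching is only a stand-in: the article says each question has a definitive, verifiable answer, but how answers were graded in practice is not described.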
Initial tests bore out the approach. The most advanced AI models of the time struggled: GPT-4o scored only 2.7 percent, Claude 3.5 Sonnet achieved 4.1 percent, and OpenAI's o1 model did slightly better at 8 percent. Newer top systems, including Gemini 3.1 Pro and Claude Opus 4.6, have since reached accuracy levels between 40 and 50 percent.
Nguyen emphasized the necessity of establishing new benchmarks for AI performance. He contributed 73 of the 2,500 publicly available questions, focusing primarily on mathematics and computer science. "Accurate assessment tools are crucial," he stated. "They help policymakers, developers, and users accurately interpret AI capabilities and identify potential risks."
Importantly, the HLE is not an indication of AI posing a threat to humanity. Instead, it underscores the vast reservoir of knowledge and expertise that remains uniquely human. "This isn't a competition against AI," Nguyen clarified. "It's a framework for understanding the strengths and weaknesses of these systems, which ultimately aids in creating safer technologies and reinforces the importance of human expertise."
Designed to be a lasting benchmark, the HLE will serve as a transparent measure for future AI systems. The 2,500 questions are public, while a separate held-out set remains undisclosed so that AI models cannot simply memorize the answers.
"Humanity's Last Exam represents one of the clearest evaluations of the divide between AI and human intelligence," Nguyen concluded, "and despite the rapid pace of technological advancement, that gap remains significant."
Nguyen highlighted that the project's scale illustrates the power of interdisciplinary collaboration. "What made this initiative exceptional was the collective effort," he said. "Experts from various fields, including historians, physicists, and linguists, came together to identify the limitations of today's AI systems, an endeavor that showcases the strength of human cooperation."