Since the days of Alan Turing, artificial intelligence developers have been captivated by the prospect of AI powerful enough to improve itself. OpenAI, we recently reported, has already developed an internal AI “research assistant” tool to help its researchers work faster, a possible first step in the development of AI that can conduct AI research on its own.
Later this week, researchers at Model Evaluation and Threat Research, a nonprofit group, will publish a first-of-its-kind evaluation of how large language models from OpenAI and Anthropic perform when asked to solve seven AI research problems. An early look at the results shows Anthropic’s model did very well compared with OpenAI’s.
In five of the seven tests METR ran, the latest version of Anthropic’s most advanced model, Claude 3.5 Sonnet, outperformed OpenAI’s most advanced model, o1-preview, and Claude won by a wide margin in two of them. O1-preview beat Claude on the other two tests, including a decisive win in one of them.
(Each test took 8 hours. For these results, the models were reset every 30 minutes because they tended to do best when they tried multiple times in that window. When the models were only allowed a single 8-hour attempt, Claude’s average score fell below o1-preview’s.)
Notably, neither model was a match for the top human researchers who took the same tests. On average, those researchers scored more than twice as high as the models.
But METR gave the AI credit where credit was due: Claude was basically as good as the average human researcher in solving two of the problems, and o1-preview was about as good as an average human researcher in another problem.
In the AIs’ defense, the problems METR chose are difficult. Solving them requires a researcher’s full arsenal of techniques, from thinking up hypotheses to designing and running experiments, analyzing results and revising the hypotheses. It also requires a lot of creativity.
For example, one of the problems involved writing code for a language model from scratch without using division or exponents, which are usually essential for that task. Another involved experimenting with traditional AI scaling laws, just as an employee at OpenAI might do, but using only a small amount of computing power.
Human Disadvantage
These problems aren’t exactly representative of the day-to-day work of an AI researcher. Most AI projects are not wrapped up in eight hours, and who ever heard of an AI researcher who couldn’t use division?
In fact, these tests are designed to put human participants at a disadvantage. That way, even if AI models catch up to humans on these tests, the models would still be less capable than top human researchers overall, which would give the AI firms time to make adjustments to improve their safety, said Hjalmar Wijk, who oversaw the tests.
For their part, OpenAI, Anthropic and other major AI developers also say that for the same reasons, they will measure their models’ ability to do AI research before releasing them. And both OpenAI and Anthropic let METR run its tests on their latest models before the models were released publicly.
The feds will be watching too, of course. A recent Biden memo on AI directed the U.S. AI Safety Institute, which sets standards for safe AI development, to test advanced AI models for their ability to “automate development and deployment” of AI with dangerous capabilities. The EU, meanwhile, referred to AI that develops AI as a “systemic risk” in a first draft of rules for the bloc’s AI Act, published last week.
AI companies undoubtedly have a strong incentive to develop AI that makes better AI, also known as recursive self-improvement. Our reporting on the recent challenges of traditional AI scaling methods suggests such AI isn’t nigh, but AI researchers have a habit of overcoming obstacles, so it’s never a bad idea to be ready!