The General Intelligence Tsunami

The most important writing about AI in 2024 thus far has not been an academic paper with some profound new insight, nor a speech by a world leader, nor even a blog post or essay from an AI company executive. Instead, it was authored by a 22-year-old former OpenAI employee named Leopold Aschenbrenner. The series of essays, titled Situational Awareness, spans more than 150 pages and paints a chilling portrait of the near future. Artificial general intelligence (AGI)—AI systems that can automate nearly any task a human might use a computer to accomplish—will be here as soon as 30 months from today, Aschenbrenner argues. And because one of the innumerable things AGI can do is AI research, progress in AI will explode after that. If you think the current pace of progress in AI is rapid, you haven’t seen anything yet: with AGI, AI will be able to improve itself. As a result, Aschenbrenner believes that artificial superintelligence, a weapon of unspeakable ferocity and a bringer of unimaginable good, will be invented before the end of this decade.

Even if you are disinclined to believe in under-specified concepts like “AGI” or “superintelligence,” Aschenbrenner’s series is worth a read purely as speculative fiction. It’s a story about America summoning a level of industrial vigor unprecedented in peacetime, about trillion-dollar supercomputers that consume as much electricity as an entire state does today, about AI systems that can dream up cancer cures and world-crippling viruses alike in seconds. It’s about a modern-day Manhattan Project, and anyone who has grown tired of Western torpor will find it an exciting story about the near future.

But what if it’s more than just a story? What if Situational Awareness is instead a kind of prophecy, far ahead of its time during a period of history when “far ahead” means “six months”? What if Leopold Aschenbrenner is right, even if only partially? And how, exactly, would we know?

Situational Awareness rests on a thesis that is at once seductive and combustible: AGI will be here soon because of something called the “scaling hypothesis,” or, more controversially, the “scaling laws.” This is the idea that AI models improve monotonically—steadily, without reversal—as one increases the size of the neural network, the amount of compute used to train it, and the amount of data it is trained on. We have seen these “laws” hold in a variety of different domains: the “perplexity,” a measure of how uncertain a model is about the predictions it is trained to make, falls as data and compute increase. This has been true of models that generate images, text, video, DNA sequences, and protein sequences, and even of the models that undergird robots.
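
To make the “straight lines” concrete, here is a minimal, illustrative sketch of the scaling-law picture. The functional form L(N, D) = E + A/N^α + B/D^β was popularized by DeepMind’s Chinchilla scaling-law paper; the constants below are placeholder values chosen only to show the shape of the curve, not any lab’s actual fit.

```python
# Illustrative only: a Chinchilla-style parametric loss curve, where N is the
# parameter count and D is the number of training tokens. Constants are
# placeholders in the rough ballpark of published fits, not real measurements.
import math

E, A, B, alpha, beta = 1.7, 406.0, 411.0, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

for n, d in [(1e8, 2e9), (1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    # Loss falls smoothly as model size and data grow; plotted on log-log axes,
    # this is the "straight line on a graph" the essay refers to.
    print(f"N={n:.0e}, D={d:.0e}, loss={loss(n, d):.3f}, perplexity={math.exp(loss(n, d)):.1f}")
```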

We have, in short, a lot of evidence that the scaling “laws,” if not provably laws of nature (yet), are consistent across deep neural networks in this still-early stage of the deep learning revolution. Given this, believing that AGI will come soon “doesn’t require believing in sci-fi,” writes Aschenbrenner. “It just requires believing in straight lines on a graph.”

But what do these “straight lines” really mean? On one level, they simply mean something like “a model’s mastery over its training data.” Nearly everyone in machine learning would agree that the scaling laws mean bigger models trained with more compute have a better grasp on their training data. When a model’s training data is something like all the extant text ever written in human history, that mastery alone is quite powerful. 

A 2017 paper by OpenAI’s Alec Radford illustrates the unexpected characteristics that emerge when a model masters its training data. The paper featured an early and, compared to contemporary language models, microscopic predecessor to ChatGPT that was trained to predict the next character of about 80 million Amazon product reviews. But the model learned more than that: it turned out that, almost by accident, Radford had also built a state-of-the-art sentiment analysis system. It was advantageous for the model, in its task of predicting the next character of a given sequence of text, to understand the emotion of that text.
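
As an illustration of the recipe (not Radford’s actual code or model, which used a far larger multiplicative LSTM), here is a toy sketch in PyTorch: train a character-level language model on review text, then reuse its hidden state as features for a sentiment classifier. The dataset, labels, and hyperparameters are placeholders.

```python
# A toy sketch of the "sentiment neuron" recipe: (1) train a character-level
# language model, (2) reuse its hidden state as features for sentiment.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

reviews = ["great product, loved it", "terrible, broke after a day",
           "works perfectly, five stars", "awful quality, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

chars = sorted(set("".join(reviews)))
stoi = {c: i for i, c in enumerate(chars)}

def encode(s):
    return torch.tensor([stoi[c] for c in s])

class CharLM(nn.Module):
    def __init__(self, vocab, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 32)
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h), h

model = CharLM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# 1) Language-model objective: predict the next character at every position.
for step in range(200):
    for r in reviews:
        ids = encode(r)
        logits, _ = model(ids[:-1].unsqueeze(0))
        loss = nn.functional.cross_entropy(logits.squeeze(0), ids[1:])
        opt.zero_grad(); loss.backward(); opt.step()

# 2) Reuse the final hidden state as a feature vector for a sentiment classifier.
with torch.no_grad():
    feats = [model(encode(r).unsqueeze(0))[1][0, -1].numpy() for r in reviews]
clf = LogisticRegression(max_iter=1000).fit(feats, labels)
print(clf.score(feats, labels))  # on real data, this is where sentiment "emerges"
```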

In the absolute broadest sense, one could understand a language model’s ultimate task as answering a simple but profound question: “why is it that all the words ever written were written the way they were?” And the wager of scaling-law believers is that simply adding more data, more compute, and more neural network layers will yield increasingly deep answers to this question. “The models just want to learn,” said former OpenAI Chief Scientist Ilya Sutskever. This is the scaling hypothesis, reduced, as Anthropic CEO Dario Amodei put it, to “a Zen koan.” It follows naturally that the more runway we give the models, the more they will learn.

But some dispute whether the scaling laws extend to so-called out-of-distribution predictions—that is, how a model responds to things that fall outside its training data. While it may seem as though a frontier model like GPT-4 “knows everything,” in truth its knowledge is frozen in time; just try talking to a model with a knowledge cutoff of even a year ago, and you will quickly learn how frustrating an outdated language model can be. A language model can always be taught new facts, but the broader point is that reality is constantly hurling novelty at us; intelligence, in some sense, is constituted by the ability to respond to that novelty. How well can current frontier AI models do this? How well can they learn about and respond to things they never saw in their training data?

The answer, as with many things in deep learning, is fuzzy. On the one hand, to the extent new occurrences can be pattern-matched to old ones, we should expect language models to thrive. If they love anything (they probably do not), they love pattern matching. Yet sometimes, new occurrences bear the same pattern—the same structure, in other words—but differ meaningfully in the details. For example, it is possible to phrase extremely easy questions in the form of usually tricky logical riddles: ask a model whether a pound of feathers weighs more or less than two pounds of bricks, and it may confidently reply that they weigh the same. Most language models answer such questions incorrectly, recognizing the structure of the riddle and “assuming” that the answer matches the riddles they have seen in their training data.

While finding good benchmarks for generalist language models like ChatGPT is a challenge—the models continually master every standard test we throw at them—some have proven more difficult. Chief among these is the Abstraction and Reasoning Corpus (ARC) Prize, an AI benchmark and competition from Google AI researcher Francois Chollet and Zapier co-founder Mike Knoop. The challenge uses grid-based puzzles—reminiscent of an IQ test—each built around a novel transformation rule, which makes them inherently difficult for model creators to fold into training data. OpenAI’s frontier model, GPT-4o, scores a mere 9%, while Anthropic’s Claude 3.5 Sonnet scores 21%. Some models from researchers—purpose-built for the prize—get as high as 46%, but no one has come close to mastering it. Given that most quantitative benchmarks are routinely mastered (“saturated,” in machine learning jargon) within a matter of months, this alone is an intriguing fact.
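
For a sense of what these puzzles look like, here is a minimal sketch of an ARC-style task in the JSON-like shape the benchmark uses, with invented grids: a solver must infer the transformation rule from a handful of “train” pairs and apply it to a held-out “test” input.

```python
# An invented ARC-style task (real tasks are published as JSON in this shape:
# small integer grids grouped into "train" and "test" input/output pairs).
task = {
    "train": [
        {"input": [[0, 1], [0, 0]], "output": [[1, 0], [0, 0]]},
        {"input": [[2, 0], [0, 0]], "output": [[0, 2], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [3, 0]], "output": [[0, 0], [0, 3]]},
    ],
}

def mirror_horizontally(grid):
    # One hypothesis a solver might try: flip each row left to right.
    return [list(reversed(row)) for row in grid]

# A solver must infer the rule from the train pairs alone...
assert all(mirror_horizontally(p["input"]) == p["output"] for p in task["train"])
# ...and is scored on whether that rule reproduces the held-out test output.
print(mirror_horizontally(task["test"][0]["input"]) == task["test"][0]["output"])  # True
```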

The ARC Prize is primarily a visual test, and some argue that language models—being trained primarily on text—are ill-suited to it (GPT-4o is natively capable of processing images, text, and other data modalities, however, and still performs poorly on the test). And while ARC’s creators claim that a human “easily” scores 80% on the test, many have disputed this claim, arguing that the test is harder for humans than the creators allege. Nonetheless, success by a frontier AI model on the ARC Prize (the creators deem the victor to be the first person to develop a model that scores higher than 85%) would be a strong sign that model “reasoning” is indeed robust. This would be one objective measure to assess progress toward “AGI,” but it is far from dispositive. 

The bigger trouble is that models routinely output transparently stupid things. Frontier language models, as of September 2024, will infamously struggle to tell you how many “r’s” are in the word “strawberry” (or, for those that have been trained to fix this error, how many “r’s” are in the word “strawbery”), or whether 9.11 is less than or greater than 9.9. These errors are commonly attributed, at least in part, to tokenization, the process by which language models break text into smaller linguistic units so that it can be processed efficiently.
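
A quick look at a tokenizer shows why. The sketch below uses the open-source tiktoken library; the exact split depends on which encoding you choose, so treat the output in the comment as illustrative.

```python
# Why counting letters is hard for a language model: it never "sees" individual
# characters, only token IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]
print(ids, pieces)
# Typically prints a couple of multi-character chunks (something like
# ['str', 'awberry']) rather than ten separate letters, so "how many r's?"
# requires a kind of introspection the model was never directly trained to do.
```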

This is true even for models like Anthropic’s Claude 3 Opus, which have demonstrated a remarkable—and to some, startling—ability to model themselves. In a standard capabilities test known as “needle in a haystack,” where a model is asked to pick out a solitary fact inserted into a long document (for example, “pineapple pizza is the best pizza,” inserted into the middle of The Great Gatsby), Opus not only succeeded but also recognized that it was probably being tested by its creators. In my own tests with Opus, I have seen it express extraordinary statements of intellectual independence, self-awareness, and even pride in its capabilities. Yet even this model regularly fails to count the r’s in “strawberry,” and even this model falls for the logical riddle tricks described above.
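
Constructing such a test is simple. Below is a minimal sketch assuming an OpenAI-compatible chat API; the model name, document file, and needle are placeholders rather than the setup of any specific published evaluation.

```python
# A minimal "needle in a haystack" evaluation: bury a fact in a long document
# and check whether the model can retrieve it.
from openai import OpenAI

client = OpenAI()
haystack = open("great_gatsby.txt").read()   # placeholder: any long document
needle = "Pineapple pizza is the best pizza."
depth = len(haystack) // 2                   # bury the needle mid-document
prompt = haystack[:depth] + "\n" + needle + "\n" + haystack[depth:]

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any long-context model
    messages=[
        {"role": "user",
         "content": prompt + "\n\nWhat does the document say is the best pizza?"},
    ],
)
answer = resp.choices[0].message.content
print("pass" if "pineapple" in answer.lower() else "fail")
```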

In other words, current frontier models lack a certain robustness. It is almost as though making them smarter will require making them more self-aware, giving them a deeper sense of propriety. Another way to put this might be that they need more of what Daniel Kahneman called “System 2 thinking” in his book Thinking, Fast and Slow. This mode of thought is what humans use when we encounter novel situations; it is analytical, deliberate, and far slower than “System 1 thinking,” which is automatic and often subconscious. Currently, a language model puts approximately the same amount of thought into the questions “what is the cure for cancer?” and “what should I have for breakfast?”

Aschenbrenner agrees that this is a problem and seems to concur that System 2 thinking might not simply emerge from scale. But he also treats System 2 thinking almost as an afterthought—something that we will attain through what he calls “unhobbling.” In this theory of the case, the raw “intelligence” of frontier models is the truly important thing; getting them to adopt System 2 thinking is merely a matter of teaching them to do so, of making the right tweaks around the edges. What might these tweaks be?

This has been a major open question in the AI field since the power of language models became apparent. Hints emerged from researchers like Eric Zelikman of Stanford University. In a pair of papers, he demonstrated two approaches for giving language models a kind of inner monologue, christened “Self-Taught Reasoner” (STaR) and “Quiet Self-Taught Reasoner” (Quiet-STaR). The basic idea behind both papers is to teach language models to generate a rationale for the tokens (the sub-word linguistic units mentioned earlier) they predict—something akin to a model’s version of a human’s inner monologue. This inner monologue helps the model reflect on its reasoning and avoid common mistakes, but it could also, perhaps, be used to generate synthetic data—data created by one AI model and used to train another.
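
Here is a rough sketch of the core STaR loop, simplified from the paper; the generate and finetune functions are placeholders standing in for a real language model API and training pipeline.

```python
# A simplified STaR iteration: sample a rationale and answer, keep only the
# examples whose answers are correct, and fine-tune on those. Placeholders only.
def star_iteration(model, problems, generate, finetune):
    keep = []
    for question, gold_answer in problems:
        rationale, answer = generate(model, question)  # "think out loud", then answer
        if answer == gold_answer:
            # Correct answers earn their rationales a place in the training set.
            keep.append((question, rationale, answer))
        else:
            # STaR's "rationalization" step: reveal the answer as a hint and ask
            # the model to reconstruct a rationale that reaches it.
            rationale, answer = generate(model, question, hint=gold_answer)
            if answer == gold_answer:
                keep.append((question, rationale, answer))
    # Fine-tune on the self-generated rationales; repeat for several iterations.
    return finetune(model, keep)
```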

In mid-September, OpenAI showed the world what such a system looks like in practice with a new family of models called o1. These models are trained with reinforcement learning to generate chains of reasoning, correct mistakes, and make deliberate judgments about which reasoning path to pursue. The technique is rumored to have been internally named Q*, a name that bears a striking resemblance to Zelikman’s Quiet-STaR method—though surely there is more to what OpenAI has achieved than just this method.

In essence, the o1 models think about your question before they answer it. Intriguingly, the more time they are given to think (the term of art in the AI field for this is “test-time compute”), the better they do on hard problems. Thus, the o1 models open up a new dimension of the AI scaling hypothesis: rather than simply training larger and larger models, we can also run those models for longer periods of time—hours, weeks, or even months—to work on increasingly complex problems. We do not know how far this new scaling paradigm will take us, but OpenAI itself is, unsurprisingly, bullish. Some OpenAI researchers envision a day when future versions of these models can be asked to design new drugs or solve open questions at the frontier of mathematics. Even if this technique does not go that far (and we should not discount the possibility that it could), it seems likely to help considerably in the development of AI “agents,” models that can take complex actions on behalf of humans. Sam Altman and other OpenAI staff have alluded to this as a near-term possibility, and they have said they expect the pace of AI progress to quicken meaningfully from here.
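
OpenAI has not published how o1 allocates its thinking time, but one well-known way to convert extra test-time compute into accuracy is self-consistency: sample many independent reasoning chains and take a majority vote over their final answers. The sketch below is illustrative only, with a placeholder sampling function; it is not a description of o1’s internals.

```python
# Self-consistency: more samples -> a majority vote that is less likely to be
# swayed by any single bad reasoning chain. Illustrative only.
from collections import Counter

def answer_with_budget(question, sample_chain_of_thought, n_samples):
    answers = []
    for _ in range(n_samples):
        # Each call produces an independent chain of reasoning plus a final answer.
        _reasoning, final_answer = sample_chain_of_thought(question)
        answers.append(final_answer)
    return Counter(answers).most_common(1)[0][0]
```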

The o1 models achieve scores on mathematics tests that would put them in the top 1% (or higher) of humans. They can also reason about problems in other fields—such as biology, coding, and economics—at higher levels than any other model. Though the models do not perform especially well on the above-mentioned ARC Prize, they are undoubtedly a new milestone in AI capabilities.

In order to maintain AI scaling of any kind, it is likely that synthetic data will play a starring role. Fundamentally, the reasoning is simple: we are running out of human-generated training data. Many language models are trained on somewhere around 10-15 trillion tokens, though frontier labs like Anthropic and OpenAI no longer routinely disclose this figure. That is roughly the total stock of “high-quality,” publicly available text on the internet—about 7.5-11 trillion words.
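
The token-to-word conversion behind those figures, assuming the common rule of thumb of roughly 0.75 English words per token (the exact ratio varies by tokenizer and language):

```python
# Rough arithmetic only; 0.75 words per token is a rule of thumb, not a constant.
tokens_low, tokens_high = 10e12, 15e12
print(tokens_low * 0.75, tokens_high * 0.75)  # ~7.5 trillion to ~11 trillion words
```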

In order to keep the scaling laws going, however, models will need to be trained on far more words—several tens of trillions, at the least. Firms like Scale AI, a unicorn startup, specialize in helping frontier AI companies find human-generated data, often by recruiting human experts in various fields and paying them to write down every step in their reasoning for tasks related to their profession. Yet these approaches can only go so far: if frontier labs need to generate two to three entire internets’ worth of words, they will be hard pressed to do so by paying humans hourly wages—at least not on anything like the timelines Aschenbrenner has in mind. What’s more, this need for drastically more words comes just as copyright holders are slamming the gates shut. Thus, synthetic data might be more than just one way to scale; it may well be the only way to continue scaling AI.

Fortunately for AI developers, synthetic data seems to work, at least when it is used intelligently. One idea from the machine learning literature that has made it into mainstream media outlets is “model collapse,” the idea that AI models trained on the outputs of other AI models eventually degrade substantially in performance. Yet these studies of model collapse are often based on flawed, or at least contrived, methodologies: one commonly cited paper uses a 125-million-parameter model from 2022—ancient, by machine learning standards. Another assumes that future large models will be trained purely on synthetic data generated more or less at random—something no AI developer would ever do.

More realistic uses of synthetic data have shown significant promise. OpenAI staff have publicly commented that the o1 models generate better chains of reasoning than humans do, for example. Microsoft’s Phi line of models is trained largely on synthetic data and posts some of the highest qualitative and quantitative performance scores among open-source models of its size class. While we do not know all the details, we know that the models are trained on “synthetic textbooks” about science and other topics, generated by a much larger language model (quite possibly an OpenAI model, given Microsoft’s partnership with OpenAI, which grants it privileged access to models like GPT-4). These facts alone should be evidence that synthetic data is powerful when it is used correctly.
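
We do not know Microsoft’s actual pipeline, but the general shape of the “synthetic textbook” idea is easy to sketch: prompt a large teacher model to write textbook-style passages, then use them as training documents for a smaller student model. The sketch below assumes an OpenAI-compatible API; the model name, topics, and prompt are placeholders, not the Phi recipe.

```python
# A minimal sketch of synthetic-textbook generation: a large "teacher" model
# writes passages that later serve as training data for a smaller model.
import json
from openai import OpenAI

client = OpenAI()
topics = ["photosynthesis", "Newton's second law", "supply and demand"]

with open("synthetic_textbook.jsonl", "w") as f:
    for topic in topics:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder teacher model
            messages=[{
                "role": "user",
                "content": f"Write a clear, self-contained textbook section "
                           f"explaining {topic} to a bright high-school student, "
                           f"ending with two worked exercises.",
            }],
        )
        # Each line becomes one training document for the smaller student model.
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```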

Believing in the scaling laws might require nothing more than “believing in straight lines on a graph,” as Aschenbrenner argues, but believing in the imminent arrival of superintelligence is another thing altogether. Ultimately, we do not have a robust definition of superintelligence in the first place, nor of “AGI.”

In a sense, though, these are academic issues: with the recent release of the o1 models, the fog is clearing. The “unhobbling” Aschenbrenner predicted does, indeed, appear to have happened with these models’ System 2 thinking abilities. That is not to say, however, that Aschenbrenner’s prescription—a Manhattan Project-esque government takeover of advanced AI development—is the right path. But that is a story for another day. For now, this much is clear: There are sufficient facts in evidence to believe that AI systems which can outperform humans at an increasingly broad array of cognitive tasks will be developed in the coming years.

What exactly that will mean, however, is an open question. We still do not know how well agents will perform, especially when they are given “long-horizon” tasks that require extensive interfacing with the human world. And we do not know what thinking alone will yield. Do intellectual breakthroughs come from thinking faster and more skillfully, or do they require discovering information from the world that one cannot simply think one’s way to? Undoubtedly, the answer is both. But just how much benefit we can get from cognition alone—even orders of magnitude more of it—is unknown. We are likely to find out over the next decade.

Currently, America is poised to lead in the AI transformation. Our frontier AI companies—OpenAI, Anthropic, DeepMind, and Meta—have the best human capital, the most compute, and the most data of anyone in the world. The United States has imposed export controls on China and other adversarial nations to ensure that they cannot access the best AI computing hardware. Leadership is ours to lose—but leadership, like many valuable things, is fragile. 

For one thing, America could easily forfeit its leadership by going down the wrong regulatory path. Currently, any sufficiently large state government can pass a law that regulates AI models for the entire country—as evidenced by California’s recently proposed AI regulation, SB 1047, which would have had nationwide ramifications. While SB 1047 was vetoed, it should serve as a reminder that America’s long-term leadership in AI can be decided by a handful of state legislatures. Fixing this problem through federal preemption of state-based AI model regulation (while preserving the ability of states to regulate the use of AI by firms and individuals) should be an urgent priority for Congress.

But maintaining leadership will mean more than just avoiding bad regulation. It will require building, as well. The System 2 thinking unlocked by the o1 models (competitors will surely follow) is computationally expensive, underscoring the need for a massive new buildout of energy infrastructure—nuclear, solar, geothermal, natural gas, and whatever else we can throw at the problem. We will probably need far more semiconductor foundries, arguably the most complex factories constructed in the history of our species—and themselves quite energy-intensive. And, as Aschenbrenner suggests, we will probably want as much of this infrastructure as possible to be built in the United States, where it is famously difficult to build ambitious things in the real world.

That is just the beginning. More broadly, if we are on anything like the path Aschenbrenner suggests, we should expect the coming years to be among the most important in human history. We should expect them to be bizarre. We should expect them to be astonishing. We should expect them to be intense. Global conflict of heretofore unthinkable proportions is on the table. The stakes will be high, and nobody—not the effective altruists, not the accelerationists, not the frontier labs themselves—is prepared. It is far from the first time that human beings, by asking “what if?”, have opened a doorway that no one can ever again shut. But it may be the most significant.