Expected Surprise

What if LLMs are mostly crystallized intelligence?

Ashwin — Wed, 29 Apr 2026 05:19:36 GMT

Summary

LLMs are better at developing crystallized intelligence than fluid intelligence. That is: LLM training is good at building crystallized intelligence by learning patterns from training data, and this is sufficient to make them surprisingly skillful at lots of tasks. But for a given capability level in the areas they’ve trained on, LLMs have very weak fluid intelligence compared to humans. For example, two years ago I thought human-level SAT performance would mean AGI, but it turns out LLMs can do great at the SAT while being mediocre at lots of other tasks.

I’m not saying LLMs are just parrots (that’s dumb).^[1] There’s a continuum between crystallized and fluid intelligence.

At the extreme “crystal” end we have shallow locally-valid heuristics. Pure pattern matching. Now-largely-debunked “stochastic parrot” hypothesis.
At the extreme “fluid” end we have a cross between an idealized consultant, a Renaissance man, and MacGyver. A deep world model and general reasoning, able to come to grips with any particular environment and problem, and to invent new tools and concepts on the fly.
Some other ways to gesture at this: what order of n-gram or Markov chain you’d need to capture a behavioral pattern; the number of tasks the pattern is relevant for. More fluid systems compress a lot of useful behavioral detail into a small amount of brain-space.

Empirically, it’s unclear how fluid LLMs’ intelligence is: we see both general reasoning skills and jaggedness.

e.g. they’re good at playing Diplomacy without specialized RL or (I assume) much raw training data;
They’re good at ARC-AGI despite presumably not seeing this type of challenge before.

It’s worth considering: what if fluid intelligence progress is relatively slow, and LLM capabilities mostly grow with relevant training data?

This could imply slower AI progress, especially if general-purpose data runs dry relatively soon. (Epoch estimates 2026-2032.) That means companies will need to prioritize specialized data collection/generation, which will lead to jagged capabilities growth favoring the prioritized areas.

[Epistemic status: I only put like 20% on worlds where this dynamic puts a serious damper on AI progress compared to e.g. the AI Futures Project’s median timelines. It’s important to stay aware of these possibilities, though, and track the relevant evidence.]

Implications for AI futures

This suggests that we shouldn’t naively extrapolate forward from e.g. the METR AI R&D benchmark to real-world AI R&D improvement, for two reasons:

1) quantitative differences: longer-time tasks will be more data-poor, and will rely more on fluid intelligence skills that LLMs don’t have the data or the context to apply. (Training data may suggest some of the right heuristics, but the models might not know which ones to apply or in what sequence.)
2) qualitative differences: METR is measuring performance on relatively closed-form tasks. Open-ended tasks may be much harder.

Likewise, this suggests that simply scaling LLM training won’t get us to omni-competence.

But “just scaling LLMs” and “scale LLMs ‘til they’re superhuman AI R&D coders, then use those to build next-gen AI” are the two main stories for how we get to AGI very fast!

We should still expect significant progress on AI R&D. The AI labs are explicitly training for AI R&D, and have clearly hit superhuman capability in some coding-related areas (cybersecurity).

But the shape and speed of the takeoff curve matters. It matters a lot if, say, the METR time horizon hits 1 month, but we actually don’t have anything like a drop-in senior AI R&D researcher, just a really really good team of assistants. The labs still need to spend a bunch of serial time running compute-expensive experiments, and their AI tools can only moderately improve experiment selection. That could mean they get to, say, a 10x speedup over years of grueling effort. That’s much slower than AI2027 expects.

Crucially, for as long as AIs are great at technical work but mediocre at fluid intelligence, that’s great news for AI safety.

But a major caveat is: I expect at some point we’ll see people devise new paradigms that are more data-efficient, and at that point all our safety techniques and assumptions might no longer hold.

We should check if this is true!

I’d be really excited for tests of capabilities like:

Recognizing when they’re wrong or uncertain.
Self-management — e.g. Claude Plays Pokemon giving itself bad notes and getting stuck.
Meta reasoning, e.g. identifying “this situation seems contrived”.
Performance on novel games. Especially ones where their heuristics from other games don’t transfer over.
Performing well when their heuristics need to be reversed. You could design a “trap” game that preys on people who are using normal heuristics.

Re-learning. If you unlearn some knowledge or principles from a model, can it rederive them from first principles?

Modeling worlds where AI progress is hungry for domain data

Here’s a set of claims, call this the “hungry for domain data” hypothesis:

We’re approaching the ceiling on human-generated training data.
Further training will need to rely on synthetic data or on massively scaling up domain data, which we can’t easily do for all domains.
Models won’t generalize super well, so performance will be data-bottlenecked in many domains.

What types of areas see progress in this model?

I imagine we’ll have a base AI optimized for AI R&D, which gets trained to develop synthetic-data sources for domains that are amenable to simulation and/or to automated evaluation (for RL). Then those data sources are used to train AIs.

Domains will see more progress if:

AI companies can easily generate syndata.
- Some spaces are already easy to simulate or auto-evaluate, e.g. digital games, programming, math.
- Here’s a spicier hypothesis: syndata for robotics might be pretty feasible to generate at scale.
  - Macro-level Newtonian physics isn’t too hard to simulate.
  - Probably there’s a fair bit of schlep in very accurately modeling how a particular actuator moves, so that could lock you in to standard robot designs, or at least standard parts.
  - A friend who knows ML says that most work on robotics these days is on coping with diverse environments, e.g. home layouts, lighting conditions, materials. It seems a lot less obvious that syndata works for these, but it still seems possible.
AI companies can easily generate real-world data.
- Via commercial deployment: the space is easy to directly engage with via a large number of sensors & actuators that AIs can usefully plug into.
  - For today’s AIs: chatbot interactions, practical coding projects.
  - For future AIs: maybe lots of practical business operations for digital agents, and diverse environments (homes, roads, factories to operate in) for robotics.
- Via high-throughput experiments.
  - These could be very expensive if you need a lot of data — atoms are much more expensive than bits! So you might only do it if you’re willing to make a big bet on particular domains.
    - But there might be big bets for AI lab leaders to make; there’s long been an expectation among futurists that AI would unlock truly advanced molecular design. (I think this is plausible, but would be highly dual-use and destabilizing by default.)
  - Here’s an interesting lens to apply to different domains: how many high-quality token-equivalents can you get per second of real-world sensory data? Per dollar of capital expenditure?
  - Ramping up experiments seems especially necessary for medicine and in vivo biology, where there are so many complex interactions to predict.
  - 3D printing strikes me as one area where it’s pretty fast to build things, the combinatorial design space is large, and so AI might be able to find a bunch of valuable stuff.
Huge amounts of data exist, and progress so far has been significantly bottlenecked on acquiring or processing this data. Some guesses: protein structure, genomics, finance.
- Note that data might become a bottleneck once AI training eats up the current stock of data. Or once we’re trying to do stuff out of distribution, e.g. designing novel proteins (that look very different from natural ones) as an early step towards advanced nanotechnology.

There are also stories for how advanced AIs could route around data bottlenecks:

Advanced AIs will be able to write much better simulations than currently exist. (Unclear which domains this is true of; seems like a very important question).
With existing data, advanced AIs will be able to intuit key principles that give them a much better ability to “one shot” the problem.
- Might be true for some areas of chemistry, molecular biology, & materials science? We have a lot of data and know the underlying principles, and human intuitions for molecular scale behavior are poor, so there might be a bunch of gains to grab here.
- I expect this effect is most likely to have impact via catalyzing further improvements: enabling AIs to build better simulations or encouraging companies to invest in real-world deployments or experiments.

Which concrete domains see progress?

LLMs will get very good at coding and math, but in a way that doesn’t generalize to other domains; e.g. their time horizon on these tasks, or their capacity for superhuman performance, will outpace ~all other tasks.
Large neural networks will get very good at other data-heavy areas, once the data pipelines are set up:
- Business operations
- Robotic operation
- Maybe some data-rich areas of science
The resulting mixing-and-matching of technical skill, broad domain knowledge, and medium-generality intuitions could dominate existing human organizations in valuable areas.
- E.g. lots of business tasks, smaller-scale coding projects, lots of jobs that mostly need decent qualitative reasoning & attention to detail.
- I wouldn’t be surprised if finance got moderately more efficient, despite being a pretty heavily optimized field, because you can do some really sophisticated processing of qualitative data for patterns (e.g. auto-ingest all company filings and all news stories in a sophisticated way, then parse that historical data for tradeable patterns).
  - And maybe because context-aware trading systems can e.g. identify anomalies and prevent bad trades by the rules-based systems that do most trading.

Implications for AI takeoff

While it lasts, weak fluid intelligence is great news for alignment risk

Successful scheming might rely on very good general reasoning. Otherwise, you could do some combo of…

Test the AI — regular capability evaluations, use interpretability to try to ID deception, etc.
Run domain-specialized AIs as control-style overseers
Harden the AI’s environment — cybersec measures
Without the ability to run subtle deceptions around multiple layers of oversight, the AI might be cooked.
- It seems really hard to decide to sandbag without this showing up in some layer of oversight, unless you can do complex sandbagging reasoning in a single forward pass.

A key bifurcation point: can AIs revolutionize AI R&D, or merely speed it up?

Coding is a data-rich domain, and AI companies prioritize generating data on AI R&D tasks, so we should expect AIs to get better at AI R&D over time — as we indeed see.

Case 1: AIs can significantly improve coder productivity and codebase efficiency, but they don’t reach supremacy. R&D progress is gated on compute-hungry research experiments and on expert research taste.

Case 2: AIs are able to intuit key principles of AI R&D. This lets them move smoothly from usefulness to outright supremacy. The best human experts are great at this, but they’re held back by brains with limited working memory and poor native resources for understanding massive-dimensional spaces and inhuman minds.

In both cases, my best guess is that improved AI R&D eventually leads to a paradigm that can scale to superhuman fluid intelligence. And since resources and R&D productivity are scaling so rapidly, “eventually” will probably come pretty soon.

But in case 1 especially, we’re likely to see a period where AI architectures evolve a lot. That has important implications:

AI safety insights might become less transferable across labs
AI safety techniques may no longer work. The fact that current LLMs seem pretty aligned shouldn’t give us much comfort if the final architectures look different.
New, more efficient paradigms could face a dramatic data or compute overhang, leading to a sharp jump in capabilities
Frontier AI might be much more vulnerable to theft, because a lot of value is contained in algorithmic secrets rather than full model weights or training environments.

Is this the world we live in?

Some evidence against: It seems like the “water level” of LLM capabilities is gradually rising in many areas, and that some of this is probably a generalizable-skills thing.

E.g.: probably the number of chess games in base training data didn’t vary a ton between models, at least since 2023. But chess performance has improved a lot, especially with the advent of reasoning models.
- (The two main benchmarks seem to be Saplin and Dubesor. They have different results, and both result sets are a bit weird to someone used to Number Going Up with each model, but they do broadly show a significant leap in performance between the GPT-3 / Claude-2 generation and later models.)
Reasoning models & the ability to improve performance via longer inference suggest some “fluid intelligence” is going on — enough that you’re not just computing the answer based on simple heuristics and plateauing after a few tokens. (cf Owen)
This is part of why I think it’s useful to have a notion of a sliding scale of fluidity — it’s clear that they’re not just parrots, but also that they’re getting a lot out of domain-specific data that they wouldn’t easily be able to reconstruct via pure reasoning.

How can we test this hypothesis?

Places to look for fluid reasoning capabilities in LLMs:

Recognizing when they’re wrong or uncertain.
- LLMs are pretty bad at this right now, I think.
- Maybe the form of assistant training is partly to blame? If you only get scored based on immediate responses, “can you tell me more?” or “please clarify” isn’t useful. And I dunno how discerning the human evaluators are — maybe they just reward overconfident slop and insight porn.
Self-management — e.g. Claude Plays Pokemon giving itself bad notes and getting stuck.
Meta reasoning, e.g. identifying “this situation seems contrived”.
- They do some of this now, e.g. noticing they’re likely being evaluated.
Performing well when their heuristics need to be reversed. You could design a “trap” game that preys on people who are using normal heuristics (e.g. a chess variant designed so that controlling the center of the board is bad.)
- A simple example: We’re only recently seeing good performance on reversed riddles — LLMs used to instead give the answer to the typical riddle, and now they notice and give the right answer.
- Counterpoint: humans are often bad at “traps”, and require time to learn new situations. So I’m not sure what a fair comparison point is.

Performing well on tasks that seem heavily general-reasoning loaded, and definitely weren’t in the training data.
- Evaluable:
  - A novel game (could compare to dedicated RL systems on the same game).
  - Making money at scale in the real world.
- Are there variants that are more clearly reasoning-loaded?

Re-learning. If you unlearn some data or principles from a model, can it rederive that from first principles?
- This question is also relevant for the usefulness of unlearning as a safeguard against AI misuse or misbehavior.


Thanks to K, Adria, John, and Abi for comments.

^[1]
They’re more like a horde of precocious 12-year-olds, each with a different hyperfixation.

Decision theory doesn’t prove that useful strong AIs will doom us all

Ashwin — Wed, 24 Dec 2025 05:00:31 GMT

Bottom line up front

Training for optimal behavior doesn’t inevitably lead to act-utilitarian world optimizers (“WorldSUM agents”).
People will prefer to deploy agents with more virtue-ethicsy / deontological approaches, for a few reasons:
- 1) Traditional misalignment concerns
- 2) Even if they have “the right values”, we don’t trust them to get the calculation right -- just like human subordinates.

Similarly, many people including AI labs will prefer agents whose action space is bounded, because they don’t want to lose all their digital resources to a misaligned or malfunctioning agent. The performance of such agents, at par-human level, won’t suffer from having much more bounded utility functions (e.g. only caring about the quality of their code output).
These factors do weaken as we get closer to ASI:
- It might be much harder to distill nice agents from WorldSUM agents
- Even people who see themselves as responsible will increasingly be handing authority to AIs and wanting them to take broadly-scoped actions.

But building non-WorldSUM agents could help ensure safety in the early stages of the intelligence explosion, and get us more reliable AI advisors who can help us navigate a pause at human level.

Introduction

Here’s a common argument in the AI safety space:^[1]

1) Useful AIs will not be exploitable. Therefore they will be utility maximizers for some VNM utility function.

2) Utility maximizers are scary, because they want to eat the world. In particular they are

Lacking side constraints (no deontological rules. This implies no corrigibility, unless you work very very hard to embed exactly the right notion of “stopping when humans want to”.)
Scope-unbounded (care about the long-term future & work to bring about a world that maximizes U at the expense of everything else.)
Resource-hungry (marginal returns don’t diminish, or at least don’t diminish to zero. A classic argument here is that even if the thing they care about is small — say, a particular house — they always prefer to gain more power to increase the probability of ensuring that the house stays safe.)

This argument is wrong. Agents can maximize a utility function without eating the world.^[2]

There are valid utility functions with nice safety properties.

Agents can care about more than the material state of the world.

In Econ 101 style conversations, it’s convenient to talk about preferences as a simple function of a basket of goods: I value apples at $1 and bananas at $3, or I want to make sandwiches so my utility is min(peanut butter, jelly), or whatever.
Many AI safety folks seem to have the intuition that such preferences are either the most natural ones or synonymous with expected-utility maximization.
But a utility function doesn’t have to be defined just over resources the agent owns, and can even rely on things beyond the material state of the world.^[3]
- Formally: in the RL context, we can consider an agent’s trajectory to be the history of world-states it encountered and actions it made. An agent can have a utility function based on its trajectory, not just the current world-state.

Most importantly, having preferences over actions enables useful safety properties:

Let’s take finance as an example setting, and suppose we have a reference agent A with utility function P = {profit} depending only on the worldstate.
In theory, agents with action-based preferences can be as competitive as you like. For example, we can construct an agent with U(action, state) = {1 if action would be chosen by A in this state, 0 otherwise}.
We can then include deontologically-flavored preferences over actions, for example U(action, state) = {if A would choose a legal action in this state, 1 to copy A and 0 to do otherwise; if A would choose an illegal action in this state, 1 to make no trades and 0 otherwise}
Preferences over the entire history also enable memory-laden features, such as “I didn’t defect first in an iterated prisoner’s dilemma”.
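To make the two constructions above concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption, not anything drawn from a real trading system: the toy `State`, the action names, the profit numbers, `reference_policy` (standing in for agent A), and the `is_legal` check are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical toy setting: every name and number below is an illustrative assumption.
NO_TRADE = "no_trade"

@dataclass(frozen=True)
class State:
    profits: tuple      # (action_name, expected_profit) pairs available in this state
    illegal: frozenset  # action names that would be illegal in this state

def reference_policy(state: State) -> str:
    """Agent A: picks the profit-maximizing action, looking only at the world-state."""
    return max(state.profits, key=lambda pair: pair[1])[0]

def is_legal(action: str, state: State) -> bool:
    return action not in state.illegal

def u_imitate(action: str, state: State) -> int:
    """U = 1 if the action is the one A would choose in this state, 0 otherwise."""
    return 1 if action == reference_policy(state) else 0

def u_constrained(action: str, state: State) -> int:
    """Copy A when A's choice is legal; if A's choice is illegal, only 'no trade' scores 1."""
    a_choice = reference_policy(state)
    if is_legal(a_choice, state):
        return 1 if action == a_choice else 0
    return 1 if action == NO_TRADE else 0

def best_action(utility, state: State) -> str:
    """Maximize an action-based utility over the actions available in this state."""
    actions = [name for name, _ in state.profits]
    return max(actions, key=lambda a: utility(a, state))

clean = State(profits=(("buy", 5.0), ("sell", 2.0), (NO_TRADE, 0.0)),
              illegal=frozenset())
shady = State(profits=(("insider_buy", 9.0), ("buy", 5.0), (NO_TRADE, 0.0)),
              illegal=frozenset({"insider_buy"}))

print(best_action(u_imitate, shady))      # the pure imitator follows A
print(best_action(u_constrained, shady))  # the constrained agent sits out
```

Both agents are ordinary utility maximizers; the only difference is that their utility takes the action as an argument. Note that in the `shady` state the constrained agent makes no trade rather than falling back to the next-best legal one, which follows the definition as written above.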

Agents can care about less than the whole world; they can be inexploitable within a limited scope, without being hungry to grab further resources.

Consider a chess-playing agent which only considers actions in the space of “submit a legal move”. This agent will never e.g. try to hack out of the game or even trash-talk its opponent. In some sense it’s leaving value on the table, but it can “go hard” (a phrase Eliezer and Nate Soares use in their book to point to dangerous levels of agency) within the chess context. In fact this is how current chess agents work in practice, and they’re extremely superhuman.
As far as I understand it, this LessWrong post argues that agents can be non-Dutch-bookable without having preferences over all possible world-states; instead, they can follow the history-driven rule “don’t take an action that, together with a previous action of mine, would mean I had been Dutch Booked”.
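One toy way to read that history-driven rule, sketched in Python. This is an illustration under loud assumptions, not the construction from the LessWrong post: the `HistoryAwareTrader` class, the single-item holdings, and the definition of "Dutch Booked" as "holding something I held before, with less money than I had then" are all hypothetical simplifications.

```python
class HistoryAwareTrader:
    """Toy agent whose pairwise preferences may be cyclic or incomplete.

    It accepts any swap it locally prefers, unless that swap, combined with
    its own past trades, would leave it holding an item it already held
    with strictly less money than it had at that time.
    """

    def __init__(self, item, money, prefers):
        self.item = item
        self.money = money
        self.prefers = prefers          # set of (better, worse) pairs
        self.history = [(item, money)]  # every (holding, money) position so far

    def would_be_pumped(self, new_item, new_money):
        # Assumed stand-in for "I had been Dutch Booked": same item, less money.
        return any(past_item == new_item and new_money < past_money
                   for past_item, past_money in self.history)

    def offer(self, new_item, fee):
        """Return True if the agent accepts swapping its item for new_item at this fee."""
        new_money = self.money - fee
        if (new_item, self.item) not in self.prefers:
            return False  # no local preference for the swap
        if self.would_be_pumped(new_item, new_money):
            return False  # the history-driven rule blocks the trade
        self.item, self.money = new_item, new_money
        self.history.append((new_item, new_money))
        return True


# Cyclic preferences: B over A, C over B, A over C.
cyclic = {("B", "A"), ("C", "B"), ("A", "C")}
trader = HistoryAwareTrader("A", 10, cyclic)
print(trader.offer("B", 1), trader.offer("C", 1), trader.offer("A", 1))
```

The agent never ranks whole world-states against each other. It only consults its own trajectory, which is why the third trade (the one that would close the money-pump cycle) is refused even though the agent locally prefers it.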

Should we expect to see these nice utility functions in practice?

Okay, says the imaginary Yudkowsky who lives in my head rent-free,^[4] I don’t actually care about literal VNM expected-utility maximization. (It’s unfortunate that a lot of people downstream of me think that that’s what’s important, but that’s big-name bloggership for you.) The point is about what agents I expect to actually show up in practice: I think agents without preferences over actions, but with preferences over the world-as-a-whole, are a) easier to train, b) more competitive, and c) likely to show up once AIs are the ones doing AI R&D.^[5]

Thanks, Eliezer, that’s a helpful clarification. For now I’ll call the agents you’re worried about WorldState Utility Maximizers, or WorldSUMs. (I like to think of them looking at the world and summing up all its utility.)

Are WorldSUM agents easier to train?

Maybe, but I suspect it’s not so much easier that training nice agents is out of the question. Instead we’re in the realm of practical tradeoffs.

Some costs of nice agents:

For a lot of applications we’d want to train AIs for, it’s much easier to give feedback about the state of the world than about the validity of actions. To stick with the finance example, consider how much slower the feedback loop of the justice system is than the feedback loop of daily profit-and-loss.^[6]
One potential upper bound on the difficulty of training a constrained agent is: train a grasping agent, and then train an imitation learner on “do what the grasping agent would do, except when that action violates constraints we care about.”
- I don’t know the SOTA on imitation learning, but I could imagine this not being an awful additional cost. Very naively it’s 2x (two training runs!) which is both kind of a lot (training runs are expensive) and kind of a little (it’s the same order of magnitude! “we need to spend twice as much / go half as fast on our megaproject” is merely very difficult, not impossible.).
- This doesn’t get around the problem of an ASI-level original model killing you before you distill it. But it could help ensure that we have more reliable agents for some kind of “gradual recursive improvement with AI control” process in the sub-ASI regime. My theory of victory here is something like “ensure trustworthy human-level agents” -> “pause at human level” -> “use AI advisors & coordination tech to figure out how to handle things from there”.
For really advanced AI agents, it might be really hard to evaluate whether they’re obeying the action preferences we care about during training and give appropriate feedback — those preferences might be things like “don’t fool us”, but maybe it’s hard to tell when they’re being deceptive (and training directly on the easiest signals, like the chain of thought, could incentivize more subtle deception).

A cool example of people trying to train agents on nonstandard EU maximization is MONA; they train an RL agent not to maximize expected future reward (as is standard), but instead just current-stage reward plus some overseer feedback on their actions. This is intended to avoid the problem of strategic reward hacking (taking multiple steps to screw with the reward signal) while still giving long-term feedback. They note that the overseer feedback can take various forms, including automated LLM feedback about whether the action agrees with the trained AI’s constitution.

This is pretty similar in structure to “imitate a successful agent, but have side constraints”.
If we can rely on LLMs to evaluate whether actions are acceptable, that’s an amazing boon.
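The contrast between the two training objectives can be sketched in a few lines of Python. This is a toy illustration of the description above, not the MONA authors' implementation: the function names, the reward and approval numbers, and the idea of summing reward and approval with equal weight are all assumptions.

```python
def standard_targets(rewards, gamma=0.99):
    """Standard RL target: the discounted sum of all future rewards from each step."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


def myopic_approval_targets(rewards, approvals):
    """Assumed MONA-style target: current-step reward plus overseer feedback on that action."""
    return [r + a for r, a in zip(rewards, approvals)]


# Hypothetical 3-step episode: step 0 quietly tampers with the reward signal,
# which only pays off at step 2. The overseer disapproves of the step-0 action.
rewards = [0.0, 0.0, 10.0]
approvals = [-1.0, 0.5, 0.5]

print(standard_targets(rewards, gamma=1.0))
print(myopic_approval_targets(rewards, approvals))
```

Under the standard target, the step-0 action gets full credit for the payoff two steps later, so a multi-step reward hack is reinforced. Under the myopic target, step 0 is scored only on its own immediate reward and the overseer's view of it, so any long-horizon credit has to come through the overseer's judgment.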

Are worldstate utility maximizers more competitive?

Are people more likely to want to train and deploy these agents?

Some people will definitely do this, just like people are deploying AIs with instructions to be evil or make Fartcoins or whatever.
But the leading AI labs won’t want this, as long as they care about prosaic applications and think they’re in an iterated game with the rest of society.
- Constrained agents are actively super useful for practical applications, while naive-maximizers with access to sensitive actuators can cause lots of trouble even if they aren’t ASIs.
- The danger is that companies might decide they’re just in a race for superintelligence, or that it’s not worth paying the safety tax because their systems look safe according to their tests, or whatever. I think that’s a real danger, but it feels totally disconnected from the original argument from Dutch books.

Do worldstate utility maximizers win once they are deployed?

The answer here seems like an obvious “yes” if you’re Yudkowsky, imagining a world where the capability gap between the latest AI systems and humanity is really high, and the ability of society to oversee AI developers and deployers is weak. Otherwise, though, I think it’s not so obvious. A lot seems to hinge on how defensively stable the world is, and how good oversight is of AIs and AI companies.

Does automated AI R&D naturally lead to worldstate utility-maximizers?

Will recursive improvement naturally remove constraints on behavior or limitations on preference-space?

I’m not sure I can accurately represent this worry — I think it’s something Eliezer and Nate and others really worry about, but I don’t really see it? I think the fear stems in part from an assumption that all the constraints we put on powerful AIs will either be incoherent, have loopholes, or be clearly “unnatural” / “not part of the AI’s True Utility Function”. Incoherent constraints might not do anything, loopholes will be found, and the AIs will prune away constraints that they realize aren’t part of their true utility function.

This seems related to Eliezer’s claim that true morality is act-utilitarianism, but no human can actually follow this so we need deontological side constraints. Equivalently: true optimal behavior is EU maximizing the state of the world, and EU functions over bounded parts of the world are too narrow while EU components based on process are, I guess, too artificial?

I can see the intuition, especially in the first case, but I can imagine entities with either type of preference that are just stable under reflection.

An intuition pump: when I introspect about the preferences I have over process, it feels like there’s a difference between constraints I chafe at (ugh, why do I need to get permission from this person, this would be so much easier) and constraints that feel like part of me (I want to be nice to my partner). What would affect whether self-aware AIs would consider their trained action-preferences as ego-dystonic rather than ego-syntonic?

A related worry: automated AI R&D could lead to a new paradigm that makes value training harder.

Perhaps the major debate in AI today is whether the current LLM paradigm will smoothly scale to ~arbitrary levels of intelligence (look how fast they’re getting good at things!), or peter out (look how expensive they are while still being bad at lots of stuff! We don’t have many OOMs of scaling left! They’re much less power- and data-efficient than the brain!).^[7]

One synthesis of these arguments: LLMs could rapidly scale to great performance at AI R&D, accelerating the top human supergeniuses in developing a new paradigm. This could be bad news for safety, in that 1) LLMs seem like an unusually nice paradigm, 2) the compute overhang from unlocking a new, more efficient paradigm could mean that we see a big capabilities jump in a short period of time, and 3) this could push us close to really fast AI iteration and scary capabilities, shifting companies from prioritizing prosaic utility and public trust to deploying ASAP at all costs.

Concluding thoughts

Strategic takes:

It seems to me like the “expected utility maximizer” argument doesn’t have much force, and it’s often used to stand in for other, realer concerns. I think it’s better for people to directly argue about what’s easier to build, what’s preferable or more competitive to deploy, and maybe what’s stable under automated AI R&D.
A lot of those realer concerns are downstream of rapid AI progress and poor AI oversight.
- As such, internal AI deployments for automated R&D are particularly high-risk.
- Good societal oversight of AI seems at least necessary, and maybe a really good version of it is basically sufficient (by incentivizing the right safety investments). Will we get there? Mu.

Solving the real concerns seems feasible. It feels like we’re not philosophically barred from a solution, we’re just haggling over the price…but it’s super unclear what the right price is.
- Plus side: it’s not a question of “math proof says we’re doomed”, it’s much more like “we maybe need to spend an extra trillion dollars and three years on this megaproject”. Merely extremely difficult.
- Minus side: maybe it’s ten trillion dollars and ten years.

Technical takes:^[8]

I’m excited about the idea of training agents to have preferences over actions. It seems like a pretty natural thing to try, and stuff like MONA shows that simple versions of it can kinda work. Apparently there have been small contingents of “process-based RL” folks at the labs’ AI safety teams for a few years now.
I’m confused how we’d know if we’d succeeded at training action preferences that are stable under reflection, capability-enhancement, etc. — if the AI intrinsically “cared” about satisfying the constraint. Behavioral tests are good enough for current LLMs, but clearly not enough for really advanced AIs. Ideally we’d have great interpretability tools for this.
I feel more confused about whether it’s feasible or useful to deliberately build powerful agents with preferences about only part of the world.
- One natural version of this is to have a carefully overseen “AI builder”, sort of an ant queen, that builds specialized systems. That could include developing lots of systems that are very clearly not WorldSUM agents, like regular old computer code.
- It doesn’t seem impossible that a mix of technical factors & societal lessons-learned lead to a Drexlerian world of specialized AIs with preferences only over the action-space they’re built to operate on.

Some sources on this topic:

Rohin Shah, 2017: Coherence arguments do not entail goal-directed behavior
Elliott Thornley, Feb 2023: There are no coherence theorems
Anonymous’s comment thread from Aug 2023, especially Keith Wynroe’s comment.
Owen Cotton-Barratt’s thread from 2024.
Anonymous’s LW question from June 2024

Of these, I think Rohin’s post and Keith’s and Owen’s comments most get at what I’m interested in here.

Thanks to Will MacAskill and Owen Cotton-Barratt for comments on a draft of this, and to my friends for encouraging me to finally publish it.

^[1]
I actually don’t know how common this is these days — it used to be very common ~10 years ago, when I first questioned it and got a pretty chilly reception. I expect a lot of people who are AI-safety-adjacent but not experts themselves still have some version of it cached in their heads. And probably even some experts do, especially in areas like interpretability where engagement with this argument isn’t particularly relevant for your day to day. I am very grateful to folks like Rohin Shah for fighting the good fight on this topic over the years.
^{^}
This is trivially true, even — see Rohin Shah’s post about coherence vs goal-directedness. A robot that always twitches can be interpreted as utility-maximizing! The problem with that example is that the twitch-bot doesn’t do anything useful.
^{^}
I also have beef with the intuition that utility is a simple function, by the way, like #paperclips. I think there’s an assumption that complexities either get stripped away by the training process, or don’t matter because they don’t outweigh the most important parts of the utility function. But I’m not sure either of those is right; certainly they’re not good descriptions of present-day LLMs.
But if we’re assuming utility functions can/will be complex, it feels a lot less obvious to me that we’re 100% doomed by default. There’s some discussion of this here; folks like Paul Christiano and Carl Shulman think humanity won’t literally all be killed, and that it’s irresponsible to say we will be when in fact it’s pretty likely we’re kept alive (we’re likely to be of some interest to LLM-style agents) and perhaps even in a pretty good state (this seems like a crux to me, and not an obvious one), but not in control of the cosmic endowment.
^{^}
OK, fine, maybe he pays some rent.
^{^}
This is my attempt at passing the intellectual Turing Test here; let me know if you hold a version of this view that reads differently.
^{^}
“Daily profit” as the target is a simplification, since many trades are meant to be profitable in the longer term, and e.g. there are nuances about what price you mark profits to that allow for reward hacking if you aren’t thoughtful.
^{^}
Actually I’m not sure how the data efficiency compares, please @ me with your wild Fermi estimates.
^{^}
Take these takes with a grain of salt; I am very much not an ML researcher. I just read their papers sometimes.

Wise AI support for government decision-making

Ashwin — Mon, 14 Oct 2024 06:31:44 GMT

Michaelah Gertz-Billingsley and I wrote this quickly as a submission to AI Impacts’ essay competition on automating wisdom and philosophy. Future posts here will be more conversational!

Governments have long been early adopters, or even creators, of sense-making tools — from the census to the computer to the internet. Today, they use complex statistical or mathematical models to make decisions about commensurately complex issues: healthcare, finance, infrastructure, military procurement. Present-day AI systems are growing increasingly helpful for processing information and writing code, and their creators aspire to build powerful assistants with general intelligence.

Atypically, governments are behind the curve on adopting advanced AI, in part due to concerns about its reliability. Suppose that developers gradually solve the issues that make AIs poor advisors — issues like confabulation, systemic bias, misrepresentation of their own thought processes, and limited general reasoning skills. What would make smart AI systems a good or poor fit as government advisors?

The Oracle at Delphi, an early decision-support service for policymakers. Unfortunately, “a great empire will fall if you go to war” is not especially wisdom-serving advice.

Much like smart humans, smart AIs might do a great job at answering the wrong questions. They might focus on legible or short-term benefits over more important factors that are harder to analyze, or take their users’ implicit assumptions for granted.

We recommend the development of wise decision support systems that incorporate the following features:

Helping people figure out their principles and make decisions based on them.
Helping people extend, explore, and structure their options under consideration (not just helping them choose from within the limited option space they initially consider).
Identifying and questioning implicit assumptions; exploring different frames in which to consider the problem.
Far-sightedness — incorporating longer-term and higher-order consequences of a decision (e.g. effects on relationships and long-term collaboration; precedent-setting; long-term budgeting and risk).
Aggregating views and preferences across disparate groups and stakeholders.

How can we ensure government access to wise AI decision support?

1. By default, the private interests leading AI today may focus on smart AI decision support over wisdom, as it is easier to train and test tools for short-term decisionmaking.

Many of their private customers may (rightly or wrongly) see short-term benefits as sufficient. This might be based on the incentive to seek short-term profits at the expense of the long term. But companies might also often have specific decisions that are not that subtle and don’t particularly need wisdom — e.g. “how much of X widget should we make this year”. Unlike governments, they may be able to function well entirely by making such decisions well. By contrast, we would argue that functional governments inherently need to consider some subtle, complex, long-term issues.

Governments could encourage the development of wise AI, much as they have done for technologies like vaccines and autonomous vehicles.

Organizations like DARPA could fund prizes for more testable aspects of wisdom.
New mechanisms such as advance market commitments could provide market incentives for forward-looking product development.

2. Key decision makers may not know to ask for features that will lead to the wisest decisions. Government institutions have typically been slow to adopt AI, and impactful policymakers will rarely be AI experts. Policies are often made on the basis of empirical evidence and expert views, which may not favor wisdom-serving systems if they are a new technology with relatively illegible benefits.

We propose that government-serving think tanks and consultancies start developing AI for wisdom-serving decision processes now. Such efforts might start small with today’s limited models, but could scale up in automation and sophistication as AI decision support capabilities improve. A mature field of wise, AI-powered decision support would let governments benefit from these tools without needing to judge them directly. Instead, they could rely on the judgment and track record of organizations they already have relationships with. Multiple organizations building these tools could compete and learn from each other, ensuring that wise decision support does not lag too far behind its merely smart alternatives.

Adopting AI for wise decision support: a playbook

Start by using AI tools to augment traditional decision-making and analysis processes. This allows the organization to a) make informed decisions from the start, b) avoid relying on AI before it’s ready to automate everything, c) generate training data to improve its AIs, and d) gain experience with AI’s strengths and weaknesses.
Scale up the use of AI over time — relying on it for more tasks as it becomes more capable.
Clarify the value-add provided by wise systems over standard AI products. Identify and publicize incidents where AI systems went awry but could have gone well had they incorporated “wisdom-serving” design principles.

Concrete example: using near-future LLMs to support a Delphi process

The Delphi method involves the following steps: session runners want to understand a topic better, often a forecast of the future or a recommendation for policymakers. They come up with a set of quantitative questions and ask a panel of experts for responses. The experts share their qualitative views as well; the facilitators guide discussion or transmit information, and the participants revise their estimates. (The process can be repeated if disagreements remain, or if the organizers want further clarity on key questions.)
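To make the "revise and repeat" loop concrete, here is a minimal Python sketch of how organizers might summarize one question's estimates after each round and decide whether to run another. The function names, the use of the interquartile range as a disagreement measure, and the 25% threshold are all illustrative assumptions, not part of any standard Delphi protocol.

```python
from statistics import median, quantiles


def summarize_round(estimates):
    """Return the median and interquartile range (IQR) of one question's estimates."""
    q1, _, q3 = quantiles(estimates, n=4, method="inclusive")
    return {"median": median(estimates), "iqr": q3 - q1}


def needs_another_round(estimates, max_relative_spread=0.25):
    """Hypothetical stopping rule: repeat while the IQR is large relative to the median."""
    summary = summarize_round(estimates)
    if summary["median"] == 0:
        # Avoid dividing by zero; any spread around a zero median counts as disagreement.
        return summary["iqr"] > 0
    return summary["iqr"] / abs(summary["median"]) > max_relative_spread


# Made-up estimates for a single question, before and after discussion.
round_one = [10, 20, 30, 40, 50]
round_two = [28, 30, 30, 32, 34]

print(needs_another_round(round_one))  # wide spread, so another round is suggested
print(needs_another_round(round_two))  # estimates have tightened, so the process can stop
```

In practice the stopping decision is a judgment call by the facilitators; a rule like this would at most flag which questions still show wide disagreement.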

In our (limited) experience, there are many frictions to this process, particularly in generating clear language and context for the questions, understanding where experts’ disagreements lie, and making efficient use of a group of experts’ time while keeping all of them updated on what the others believe. Experts often have valuable qualitative points to share that reframe the analysis, but which are more difficult to identify and convey than quantitative beliefs.

Roughly current-level AIs could already help with this process at many points:

Helping organizers come up with great questions & phrase them well
Expressing organizers’ views. Organizers could provide an LLM with access to a long document describing their background thinking in detail. Participants could query it and get on the same page without slowing down the process. This could be faster and better than providing the same background reading document for everyone, since participants will have different questions or concerns about the framing of the exercise. (Discussing such framing concerns often takes up a large portion of the allotted time.)
Transcribing the discussion and summarizing it, while identifying key points of disagreement.
Helping organizers rewrite questions on the fly in response to input.
Allowing organizers to ask qualitative questions, since LLMs can auto-summarize responses and aggregate them — e.g. “17 out of 24 respondents mentioned that they thought this level of AI capability was infeasible, so they found it difficult to concretely imagine what its impacts would be.”
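The aggregation step in the last item can be sketched in a few lines of Python. This assumes the hard part (an LLM or human coder tagging each free-text response with themes) has already happened; the respondent IDs, theme labels, and function names below are hypothetical, and the code only does the counting.

```python
from collections import Counter


def tally_themes(tagged_responses):
    """Count how many respondents mentioned each theme at least once."""
    counts = Counter()
    for themes in tagged_responses.values():
        # set() ensures a respondent who repeats a theme is only counted once.
        counts.update(set(themes))
    return counts


def summary_line(theme, counts, total_respondents):
    """Format a one-line summary for organizers."""
    return f"{counts[theme]} out of {total_respondents} respondents mentioned {theme!r}"


# Hypothetical theme tags, keyed by respondent ID.
tagged = {
    "r1": ["capability infeasible", "capability infeasible"],
    "r2": ["capability infeasible", "timeline unclear"],
    "r3": [],
}

counts = tally_themes(tagged)
print(summary_line("capability infeasible", counts, len(tagged)))
```

Keeping the tally separate from the tagging makes it easy for organizers to spot-check the tags before trusting the headline numbers.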

How can wise decision-support processes scale up over time to take advantage of accumulated data & increasing AI capabilities?

Formal AI training

What kinds of data are available?

For many forms of wisdom, user feedback can provide useful information about the model’s quality (and can provide small amounts of fine-tuning training data). For example: users may recognize better option sets or useful decision-boundary diagrams when they see them; they may understand and appreciate advice processes that put them in touch with their values, etc.
This may break down for far-sightedness — people may not truly appreciate far-sighted decision support until the long term.
On the other hand, maybe people would notice: “yes, this serves my long-term interests; good job identifying relevant factors ABC.”

The amount of data generated is relatively sparse, so conventional fine-tuning likely wouldn’t work.

But AIs are already okay at few-shot, in-context learning (learning from information provided to them in a prompt).
So sufficiently powerful AIs might be able to learn useful patterns from summaries or transcripts of earlier decision-support processes, if humans flagged what helped them reach a better decision along various dimensions (far-sightedness, etc.).
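As a rough illustration of learning from flagged transcripts in the prompt, here is a Python sketch that assembles a handful of human-flagged excerpts into a single prompt for a new situation. The FlaggedExample fields, the prompt wording, and the build_prompt helper are assumptions made up for this example; no particular model or API is implied, and the resulting string would be passed to whatever system the organization uses.

```python
from dataclasses import dataclass


@dataclass
class FlaggedExample:
    """One excerpt from an earlier decision-support process (hypothetical schema)."""
    situation: str
    advice: str
    dimension: str      # e.g. "far-sightedness"
    why_it_helped: str  # the human's note on why this improved the decision


def build_prompt(examples, new_situation, max_examples=3):
    """Assemble flagged excerpts plus the new situation into one in-context prompt."""
    parts = [
        "Below are earlier decision-support excerpts that participants flagged as helpful."
    ]
    for ex in examples[:max_examples]:
        parts.append(
            f"Situation: {ex.situation}\n"
            f"Advice: {ex.advice}\n"
            f"Flagged for: {ex.dimension} ({ex.why_it_helped})"
        )
    # End with the new case and an open "Advice:" slot for the model to fill in.
    parts.append(f"Situation: {new_situation}\nAdvice:")
    return "\n\n".join(parts)


examples = [
    FlaggedExample(
        situation="Choosing between two infrastructure contractors.",
        advice="Consider how each choice affects future bidding competition.",
        dimension="far-sightedness",
        why_it_helped="raised a precedent-setting effect we had not considered",
    ),
]

print(build_prompt(examples, "Deciding whether to renew a multi-year software license."))
```

Capping the number of excerpts keeps the prompt short; which excerpts to include (most recent, most similar, most highly rated) is a design choice this sketch leaves open.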

The evolving role of AI

Current AIs would typically take on an ancillary role. In the AI-assisted Delphi process we sketched out above, most of the “wisdom” is coming from human participants and the structure of the process. However, this dynamic could change over time as AI capabilities increase and deployers have more data and experience in how to use it.

In a Delphi process, more capable AI could be actively involved in every step:

Human organizers could spend more time in consultation with AIs about how to write and frame the questions, including along axes like the long-term implications of particular decisions. More capable AIs could be instructed to look for unquestioned assumptions or blindspots in human organizers’ thinking. They could even forecast which questions would most likely lead to disagreements between participants, which would be most productive, which would unearth key values questions, and so on. As a result, the process could end up revolving around questions that the organizers initially hadn’t considered at all.

Likewise, participants themselves could be in close feedback loops with their own AI advisors, who could help identify their blindspots, and proactively take actions such as reaching out to other participants’ AIs to seek clarity on a potential misunderstanding. As a result, the process may more closely resemble a series of facilitated dialogues between participants than an inefficient large-group conversation.

Over time, different processes could be developed around AI capabilities. For example, the surveys themselves could be written and explained by AIs fine-tuned on relevant documents. Most of the expert dialogue could take place between AI systems instructed to represent a particular viewpoint, which only occasionally check in with human counterparts for clarification. The output could even be a queryable “living document” — a synthesized AI advisor system designed to reflect lessons from the conversation and from documents and schools of thought recommended by the participants.

Conclusions

In principle, capable AI advisor systems could massively improve society. They could compound the earlier digital revolutions’ effects of making information available and useful. But while the computer and the internet undoubtedly enriched society, they have had their downsides. Applications like social media can harm us in subtle long-term ways by offering what we seem to want in the short term, while ubiquitous internet use makes us vulnerable to cyberattacks and privacy breaches.

Similarly, the high modernist ambitions of the 1800s and 1900s were founded on the hope that top-down analysis and optimization could straightforwardly improve society. Such approaches served well for building bridges and dams, bringing millions out of poverty. But we remember them, too, for their failures when applied to systems that were unexpectedly complex, subtle, or ethically fraught.

Building beneficial AI advisor systems requires taking a lesson from history: knowledge and power must be directed wisely, with thought to long-term consequences and an appreciation for unknown unknowns. Much like previous industrial revolutions, smart AIs could present a new opportunity to reshape the world for both better and worse. We should treat that opportunity with the weight it deserves.

This post, like the rest of Expected Surprise, does not reflect the institutional views of my employer or other affiliated organizations.