Decision theory doesn’t prove that useful strong AIs will doom us all
Here’s a common argument in the AI safety space:1
1) Useful AIs will not be exploitable (i.e., not Dutch-bookable). Therefore they will be expected-utility maximizers for some VNM utility function.
2) Utility maximizers are scary, because they want to eat the world. In particular they are:
Lacking side constraints (no deontological rules, which implies no corrigibility unless you work very, very hard to embed exactly the right notion of “stopping when humans want to”).
Scope-unbounded (caring about the long-term future and working to bring about a world that maximizes U at the expense of everything else).
Resource-hungry (marginal returns don’t diminish, or at least don’t diminish to zero. A classic argument here is that even if the thing they care about is small — say, a particular house — they always prefer to gain more power to increase the probability that the house stays safe).
This argument is wrong. Agents can maximize a utility function without eating the world.2
There are valid utility functions with nice safety properties.
Agents can care about more than the material state of the world.
In Econ 101 style conversations, it’s convenient to talk about preferences as a simple function of a basket of goods: I value apples at $1 and bananas at $3, or I want to make sandwiches so my utility is min(peanut butter, jelly), or whatever.
Many AI safety folks seem to have the intuition that such preferences are the most natural kind, or even that they’re synonymous with expected-utility maximization.
But a utility function doesn’t have to be defined just over resources the agent owns, and can even rely on things beyond the material state of the world.3
Formally: in the RL context, we can consider an agent’s trajectory to be the history of world-states it encountered and actions it took. An agent can have a utility function defined over its trajectory, not just over the current world-state.
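To spell that out with a bit of notation (mine, informal, not from any particular source):

```latex
% A trajectory is the whole history of states and actions:
\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)
% A worldstate utility maximizer scores only where you end up:
U_{\text{world}}(\tau) = U(s_T)
% A trajectory-based agent can also score how it got there, actions included:
U_{\text{traj}}(\tau) = f(s_0, a_0, \ldots, a_{T-1}, s_T)
```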
Most importantly, having preferences over actions enables useful safety properties:
Let’s take finance as an example setting, and suppose we have a reference agent A with utility function P = {profit} depending only on the worldstate.
In theory, agents with action-based preferences can be as competitive as you like. For example, we can construct an agent with U(action, state) = {1 if action would be chosen by A in this state, 0 otherwise}.
We can then layer in deontologically-flavored preferences over actions, for example U(action, state) = {1 if A would choose a legal action in this state and our action copies A; 1 if A would choose an illegal action in this state and our action is “make no trades”; 0 otherwise} (see the toy sketch after this list).
Preferences over the entire history also enable memory-laden features, such as “I didn’t defect first in an iterated prisoner’s dilemma”.
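To make that concrete, here’s a toy Python sketch of the constrained agent from the last two bullets. Everything here is a placeholder I made up (reference_agent, is_legal, NO_TRADE), not any particular system:

```python
# Toy sketch of an agent with preferences over actions, wrapping a
# profit-maximizing reference agent A. All names are hypothetical.

NO_TRADE = "hold"  # placeholder "make no trades" action


def reference_agent(state):
    """Stand-in for agent A, which picks whatever action maximizes profit."""
    raise NotImplementedError


def is_legal(action, state):
    """Stand-in for the deontological check we care about."""
    raise NotImplementedError


def constrained_policy(state):
    """The action returned here gets utility 1; every other action gets 0."""
    proposed = reference_agent(state)
    if is_legal(proposed, state):
        return proposed  # copy A whenever A's choice passes the constraint
    return NO_TRADE      # otherwise prefer sitting out to breaking the rule
```

A history-based variant would just take the trajectory so far as an extra argument, which is how you get memory-laden preferences like “I didn’t defect first”.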
Agents can care about less than the whole world; they can be inexploitable within a limited scope, without being hungry to grab further resources.
Consider a chess-playing agent which only considers actions in the space of “submit a legal move”. This agent will never e.g. try to hack out of the game or even trash-talk its opponent. In some sense it’s leaving value on the table, but it can “go hard” (a phrase Eliezer and Nate Soares use in their book to point to dangerous levels of agency) within the chess context. In fact this is how current chess agents work in practice, and they’re extremely superhuman (see the code sketch below).
As far as I understand it, this LessWrong post argues that agents can be non-Dutch-bookable without having preferences over all possible world-states; instead, they can follow the history-driven rule “don’t take an action that, together with a previous action of mine, would mean I had been Dutch Booked”.
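For concreteness, here’s roughly what a restricted action space looks like in code, using the python-chess library with a placeholder evaluation function (a real engine’s search is far more sophisticated, but the action space is the same):

```python
# Minimal sketch of a chess agent whose action space is only "legal moves".
# Uses the python-chess library; evaluate() is a placeholder, not a real engine.
import chess


def evaluate(board: chess.Board) -> float:
    """Stand-in for a position evaluator, scored from our point of view."""
    raise NotImplementedError


def choose_move(board: chess.Board) -> chess.Move:
    best_move, best_score = None, float("-inf")
    for move in board.legal_moves:  # the agent literally cannot consider
        board.push(move)            # anything outside this set
        score = evaluate(board)
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```

All the “going hard” happens inside the evaluation and search; “hack the tournament server” simply isn’t in the set being looped over.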
Should we expect to see these nice utility functions in practice?
Okay, says the imaginary Yudkowsky who lives in my head rent-free,4 I don’t actually care about literal VNM expected-utility maximization. (It’s unfortunate that a lot of people downstream of me think that that’s what’s important, but that’s big-name bloggership for you.) The point is about what agents I expect to actually show up in practice: I think agents without preferences over actions, but with preferences over the world-as-a-whole, are a) easier to train, b) more competitive, and c) likely to show up once AIs are the ones doing AI R&D.5
Thanks, Eliezer, that’s a helpful clarification. For now I’ll call the agents you’re worried about WorldState Utility Maximizers, or WorldSUMs. (I like to think of them looking at the world and summing up all its utility.)
Are WorldSUM agents easier to train?
Maybe, but I suspect it’s not so much easier that training nice agents is out of the question. Instead we’re in the realm of practical tradeoffs.
For a lot of applications we’d want to train AIs for, it’s much easier to give feedback about the state of the world than about the validity of actions. To stick with the finance example, consider how much slower the feedback loop of the justice system is than the feedback loop of daily profit-and-loss.6
One potential upper bound on the difficulty of training a constrained agent is: train a grasping (unconstrained) agent, and then train an imitation learner on “do what the grasping agent would do, except when that action violates constraints we care about.”
I don’t know the SOTA on imitation learning, but I could imagine this not being an awful additional cost. Very naively it’s 2x (two training runs!), which is both kind of a lot (training runs are expensive) and kind of a little (it’s the same order of magnitude! “We need to spend twice as much on our megaproject” is merely very difficult, not impossible).
A cool example of people trying to train agents on nonstandard EU maximization is MONA; they train an RL agent not to maximize expected future reward (as is standard), but instead just current-stage reward plus some overseer feedback on its actions. This is intended to avoid the problem of strategic reward hacking (taking multiple steps to screw with the reward signal) while still giving long-term feedback. They note that the overseer feedback can take various forms, including automated LLM feedback about whether the action agrees with the trained AI’s constitution. (A rough sketch of this setup follows at the end of this section.)
This is pretty similar in structure to “imitate a successful agent, but have side constraints”.
If we can rely on LLMs to evaluate whether actions are acceptable, that’s an amazing boon.
For really advanced AI agents, it might be really hard, during training, to evaluate whether they’re obeying the action preferences we care about and to give appropriate feedback: those preferences might be things like “don’t fool us”, but maybe it’s hard to tell when they’re being deceptive (and training directly on the easiest signals, like the chain of thought, could incentivize more subtle deception).
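To make the MONA-style setup concrete, here’s a rough sketch of the training signal as I understand it (my paraphrase and my function names, not the paper’s code):

```python
# Rough sketch of a MONA-style training signal: the agent is optimized
# myopically on (current-step reward + overseer approval of the action),
# with no credit for expected future reward. Both functions below are
# placeholders; the overseer could be a human or an LLM judging the
# action against a constitution.


def env_reward(state, action) -> float:
    """Ordinary immediate task reward for this step."""
    raise NotImplementedError


def overseer_approval(state, action) -> float:
    """Foresighted judgment of the action itself, not of its downstream consequences."""
    raise NotImplementedError


def mona_training_signal(state, action) -> float:
    # No bootstrapped future-reward term: a multi-step plan to corrupt the
    # reward signal never gets reinforced, because the agent is never
    # credited for what its action causes several steps later.
    return env_reward(state, action) + overseer_approval(state, action)
```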
Are worldstate utility maximizers more competitive?
Are people more likely to want to train and deploy these agents?
Yes, some people will definitely do this, just like people are deploying AIs with instructions to be evil or make Fartcoins or whatever.
But the leading AI labs won’t want this, as long as they care about prosaic applications and think they’re in an iterated game with the rest of society.
Constrained agents are actively super useful for practical applications, while unconstrained agents are a liability in that iterated game.
The danger is that companies might decide they’re just in a race for superintelligence, or that it’s not worth paying the safety tax because their systems look safe according to their tests, or whatever. I think that’s a real danger, but it feels totally disconnected from the original argument from Dutch books.
Do worldstate utility maximizers win once they are deployed?
The answer here seems like an obvious “yes” if you’re Yudkowsky, imagining a world where the capability gap between the latest AI systems and humanity is really large, and society’s ability to oversee AI developers and deployers is weak. Otherwise, though, I think it’s not so obvious. A lot seems to hinge on how defensively stable the world is, and on how good oversight of AIs and AI companies is.
Does automated AI R&D naturally lead to worldstate utility-maximizers?
Will recursive improvement naturally remove constraints on behavior or limitations on preference-space?
I’m not sure I can accurately represent this worry — I think it’s something Eliezer and Nate and others really worry about, but I don’t really see it? I think the fear stems in part from an assumption that all the constraints we put on powerful AIs will either be incoherent, have loopholes, or be clearly “unnatural” / “not part of the AI’s True Utility Function”. Incoherent constraints might not do anything, loopholes will be found, and the AIs will prune away constraints that they realize aren’t part of their true utility function.
This seems related to Eliezer’s claim that true morality is act-utilitarianism, but that no human can actually follow it, so we need deontological side constraints. Equivalently: truly optimal behavior is EU-maximization over the state of the whole world; EU functions over bounded parts of the world are too narrow, and EU components based on process are, I guess, too artificial?
I can see the intuition, especially in the first case, but I can imagine entities with either type of preference that are just stable under reflection.
An intuition pump: when I introspect about the preferences I have over process, it feels like there’s a difference between constraints I chafe at (ugh, why do I need to get permission from this person, this would be so much easier) and constraints that feel like part of me (I want to be nice to my partner). What would affect whether self-aware AIs would consider their trained action-preferences as ego-dystonic rather than ego-syntonic?
A related worry: automated AI R&D could lead to a new paradigm that makes value training harder.
Perhaps the major debate in AI today is whether the current LLM paradigm will smoothly scale to ~arbitrary levels of intelligence (look how fast they’re getting good at things!), or peter out (look how expensive they are while still being bad at lots of stuff! We don’t have many OOMs of scaling left! They’re much less power- and data-efficient than the brain!).7
One synthesis of these arguments: LLMs could rapidly scale to great performance at AI R&D, accelerating the top human supergeniuses in developing a new paradigm. This could be bad news for safety, in that 1) LLMs seem like an unusually nice paradigm safety-wise, so a shift away from them is likely a loss, 2) the compute overhang from unlocking a new, more efficient paradigm could mean a big capabilities jump in a short period of time, and 3) this could push us close to really fast AI iteration and scary capabilities, shifting companies from prioritizing prosaic utility and public trust to deploying ASAP at all costs.
Concluding thoughts
Strategic takes:
It seems to me like the “expected utility maximizer” argument doesn’t have much force, and it’s often used to stand in for other, realer concerns. I think it’s better for people to directly argue about what’s easier to build, more preferable / competitive to deploy, and maybe what’s stable under automated AI R&D.
A lot of those realer concerns are downstream of rapid AI progress and poor AI oversight.
As such, internal AI deployments for automated R&D are particularly high-risk.
Good societal oversight of AI seems at least necessary, and maybe a really good version of it is basically sufficient (by incentivizing the right safety investments). Will we get there? Mu.
Solving the real concerns seems feasible. It feels like we’re not philosophically barred from a solution, we’re just haggling over the price…but it’s super unclear what the right price is.
Plus side: it’s not a question of “math proof says we’re doomed”, it’s much more like “we maybe need to spend an extra trillion dollars and three years on this megaproject”. Merely extremely difficult.
Minus side: maybe it’s ten trillion dollars and ten years.
Technical takes:8
I’m excited about the idea of training agents to have preferences over actions. It seems like a pretty natural thing to try, and stuff like MONA shows that simple versions of it can kinda work. Apparently there have been small contingents of “process-based RL” folks on the labs’ AI safety teams for a few years now.
I’m confused how we’d know if we’d succeeded at training action preferences that are stable under reflection, capability-enhancement, etc. — if the AI intrinsically “cared” about satisfying the constraint. Behavioral tests are good enough for current LLMs, but clearly not enough for really advanced AIs. Ideally we’d have great interpretability tools for this.
I feel more confused about whether it’s feasible or useful to deliberately build powerful agents with preferences about only part of the world.
One natural version of this is to have a carefully overseen “AI builder”, sort of an ant queen, that builds specialized systems. That could include developing lots of systems that are very clearly not WorldSUM agents, like regular old computer code.
It doesn’t seem impossible that a mix of technical factors & societal lessons-learned lead to a Drexlerian world of specialized AIs with preferences only over the action-space they’re built to operate on.
Some sources on this topic:
Rohin Shah, 2017: Coherence arguments do not entail goal-directed behavior
Elliott Thornley, Feb 2023: There are no coherence theorems
Anonymous’s comment thread from Aug 2023, especially Keith Wynroe’s comment.
Owen Cotton-Barratt’s thread from 2024.
Anonymous’s LW question from June 2024
Of these, I think Rohin’s post and Keith’s and Owen’s comments most get at what I’m interested in here.
Thanks to Will MacAskill and Owen Cotton-Barratt for comments on a draft of this, and to my friends for encouraging me to finally publish it.
I actually don’t know how common this is these days — it used to be very common ~10 years ago, when I first questioned it and got a pretty chilly reception. I expect a lot of people who are AI-safety-adjacent but not experts themselves still have some version of it cached in their heads. And probably even some experts do, especially in areas like interpretability where engagement with this argument isn’t particularly relevant for your day to day. I am very grateful to folks like Rohin Shah for fighting the good fight on this topic over the years.
This is trivially true, even — see Rohin Shah’s post about coherence vs goal-directedness. A robot that always twitches can be interpreted as utility-maximizing! The problem with that example is that the twitch-bot doesn’t do anything useful.
I also have beef with the intuition that utility will be a simple function, by the way, like #paperclips. I think there’s an assumption that complexities either get stripped away by the training process, or don’t matter because they’re outweighed by the most important parts of the utility function. But I’m not sure either of those is right; certainly neither is a good description of present-day LLMs. And if we grant that utility functions can/will be complex, it feels a lot less obvious to me that we’re 100% doomed by default. There’s some discussion of this here; folks like Paul Christiano and Carl Shulman think humanity won’t literally all be killed, and that it’s irresponsible to claim we will be, since it’s pretty likely we’re kept alive (we’re likely to be of some interest to LLM-style agents) and perhaps even in a pretty good state (this seems like a crux to me, and not obvious), though not in control of the cosmic endowment.
This is my attempt at passing the intellectual Turing Test here, let me know if you hold a version of this view that reads differently.
“Daily profit” as the target is a simplification, since many trades are meant to be profitable in the longer term, and e.g. there are nuances about what price you mark profits to that allow for reward hacking if you aren’t thoughtful.
Actually I’m not sure how the data efficiency compares, please @ me with your wild Fermi estimates.
Take these takes with a grain of salt; I am very much not an ML researcher. I just read their papers sometimes.

