<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Expected Surprise]]></title><description><![CDATA[Handholds for getting to grips with an age of changes. Statistics, heuristics, and other ways to be wrong but useful.]]></description><link>https://www.expectedsurprise.com</link><image><url>https://substackcdn.com/image/fetch/$s_!mfZr!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3068f421-0288-450c-b263-0d13546de798_640x640.png</url><title>Expected Surprise</title><link>https://www.expectedsurprise.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 07 Apr 2026 21:00:12 GMT</lastBuildDate><atom:link href="https://www.expectedsurprise.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ashwin Acharya]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[expectedsurprise@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[expectedsurprise@substack.com]]></itunes:email><itunes:name><![CDATA[Ashwin]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ashwin]]></itunes:author><googleplay:owner><![CDATA[expectedsurprise@substack.com]]></googleplay:owner><googleplay:email><![CDATA[expectedsurprise@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ashwin]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Decision theory doesn’t prove that useful strong AIs will doom us all]]></title><description><![CDATA[Here&#8217;s a common argument in the AI safety space:]]></description><link>https://www.expectedsurprise.com/p/decision-theory-doesnt-prove-that</link><guid isPermaLink="false">https://www.expectedsurprise.com/p/decision-theory-doesnt-prove-that</guid><dc:creator><![CDATA[Ashwin]]></dc:creator><pubDate>Wed, 24 Dec 2025 05:00:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vrhk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af5ff3a-0a49-4f5e-9ab6-0fc1e53374ba_1024x608.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Here&#8217;s a common argument in the AI safety space:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>1) Useful AIs will not be exploitable. Therefore they will be utility maximizers for some VNM utility function.</p><p>2) Utility maximizers are scary, because they want to eat the world. In particular they are</p><ul><li><p><em>Lacking side constraints </em>(no deontological rules. This implies no corrigibility, unless you work very very hard to embed exactly the right notion of &#8220;stopping when humans want to&#8221;.)</p></li><li><p><em>Scope-unbounded </em>(care about the long-term future &amp; work to bring about a world that maximizes U at the expense of everything else.)</p></li><li><p><em>Resource-hungry</em> (marginal returns don&#8217;t diminish, or at least don&#8217;t diminish to zero.
A classic argument here is that even if the thing they care about is small &#8212; say, a particular house &#8212; they always prefer to gain more power to increase the probability that the house stays safe.)</p></li></ul><p><strong>This argument is wrong.</strong> Agents <em>can </em>maximize a utility function without eating the world.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> </p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vrhk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8af5ff3a-0a49-4f5e-9ab6-0fc1e53374ba_1024x608.png" width="1024" height="608" alt=""></figure></div><h2>There are valid utility functions with nice safety properties.</h2><p><strong>Agents can care about more than the material state of the world.</strong></p><ul><li><p>In Econ 101 style conversations, it&#8217;s convenient to talk about preferences as a simple function of a basket of goods: I value apples at $1 and bananas at $3, or I want to make sandwiches so my utility is min(peanut butter, jelly), or whatever.</p></li><li><p>Many AI safety folks seem to have the intuition that such preferences are the most natural kind, or even synonymous with expected-utility maximization.</p></li><li><p>But a utility function doesn&#8217;t have to be defined just over resources the agent owns, and can even rely on things beyond the material state of the world.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><ul><li><p>Formally: in the RL context, we can consider an agent&#8217;s <em>trajectory </em>to be the history of world-states it encountered and actions it made. An agent can have a utility function based on its trajectory, not just the current world-state.</p></li></ul></li></ul><p><strong>Most importantly, having preferences over actions enables useful safety properties:</strong></p><ul><li><p>Let&#8217;s take finance as an example setting, and suppose we have a reference agent A with utility function P = {profit} depending only on the worldstate.</p></li><li><p>In theory, agents with action-based preferences can be as competitive as you like. For example, we can construct an agent with U(action, state) = {1 if action would be chosen by A in this state, 0 otherwise}.</p></li><li><p>We can then include deontologically-flavored preferences over actions, for example U(action, state) = {if A would choose a legal action in this state: 1 for copying A, 0 otherwise; if A would choose an illegal action in this state: 1 for making no trades, 0 otherwise}. (There&#8217;s a code sketch of this construction after this list.)</p></li><li><p>Preferences over the entire <em>history </em>also enable memory-laden features, such as &#8220;I didn&#8217;t defect first in an iterated prisoner&#8217;s dilemma&#8221;.</p></li></ul>
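<p>To make this concrete, here&#8217;s a minimal sketch of the construction above (Python; <code>reference_choice</code>, <code>is_legal</code>, and <code>NO_TRADE</code> are illustrative stand-ins I made up, not a real trading API):</p><pre><code class="language-python"># Sketch of an agent whose utility is defined over (action, state) pairs,
# wrapping a worldstate-maximizing reference agent A. All names here are
# illustrative placeholders.

NO_TRADE = "no_trade"  # assume a null action is always available and legal

def reference_choice(state):
    """Stand-in for reference agent A, which maximizes worldstate profit."""
    return state["best_trade"]  # placeholder

def is_legal(action, state):
    """Stand-in for a legality/compliance check on a proposed action."""
    return action not in state.get("illegal_actions", set())

def utility(action, state):
    """U(action, state): copy A when A's choice is legal; otherwise, trade nothing."""
    a = reference_choice(state)
    target = a if is_legal(a, state) else NO_TRADE
    return 1 if action == target else 0

def act(state, available_actions):
    # Maximizing this U reproduces A's behavior on every legal decision, so
    # the agent is exactly as competitive as A, yet it never acts illegally.
    # A history-based variant would take the trajectory as an argument too.
    return max(available_actions, key=lambda action: utility(action, state))
</code></pre><p>Nothing in this maximand mentions resources the agent holds; the deontological flavor lives entirely in which (action, state) pairs get utility 1.</p>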
<p><strong>Agents can care about </strong><em><strong>less </strong></em><strong>than the whole world; </strong>they can be inexploitable within a limited scope, without being hungry to grab further resources.</p><ul><li><p>Consider a chess-playing agent which only considers actions in the space of &#8220;submit a legal move&#8221;. This agent will never e.g. try to hack out of the game or even trash-talk its opponent. In some sense it&#8217;s leaving value on the table, but it can &#8220;go hard&#8221; (a phrase Eliezer and Nate Soares use in their book to point to dangerous levels of agency) within the chess context. In fact this is how current chess agents work in practice, and they&#8217;re extremely superhuman.</p></li><li><p>As far as I understand it, <a href="https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1">this LessWrong post</a> argues that agents can be non-Dutch-bookable without having preferences over all possible world-states; instead, they can follow the history-driven rule &#8220;don&#8217;t take an action that, together with a previous action of mine, would mean I had been Dutch Booked&#8221;.</p></li></ul><h2>Should we expect to see these nice utility functions in practice?</h2><p>Okay, says the imaginary Yudkowsky who lives in my head rent-free,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> I don&#8217;t actually care about literal VNM expected-utility maximization. (It&#8217;s unfortunate that a lot of people downstream of me think that that&#8217;s what&#8217;s important, but that&#8217;s big-name bloggership for you.) The point is about what agents I expect to actually show up in practice: I think agents without preferences over actions, but <em>with </em>preferences over the world-as-a-whole, are a) easier to train, b) more competitive, and c) likely to show up once AIs are the ones doing AI R&amp;D.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Thanks, Eliezer, that&#8217;s a helpful clarification. For now I&#8217;ll call the agents you&#8217;re worried about WorldState Utility Maximizers, or WorldSUMs. (I like to think of them looking at the world and summing up all its utility.)</p><h3>Are WorldSUM agents easier to train?</h3><p><strong>Maybe, but I suspect it&#8217;s not </strong><em><strong>so much </strong></em><strong>easier that training nice agents is out of the question. </strong>Instead we&#8217;re in the realm of practical tradeoffs.</p><ul><li><p>For a lot of applications we&#8217;d want to train AIs for, it&#8217;s much easier to give feedback about the state of the world than about the validity of actions. To stick with the finance example, consider how much slower the feedback loop of the justice system is than the feedback loop of daily profit-and-loss.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p></li><li><p>One potential upper bound on the difficulty of training a constrained agent is: train a grasping agent, and then train an imitation learner on &#8220;do what the grasping agent would do, except when that action violates constraints we care about.&#8221;</p><ul><li><p>I don&#8217;t know the SOTA on imitation learning, but I could imagine this not being an awful additional cost. Very naively it&#8217;s 2x (two training runs!), which is both kind of a lot (training runs are expensive) and kind of a little (it&#8217;s the same order of magnitude! &#8220;we need to spend twice as much on our megaproject&#8221; is merely very difficult, not impossible).</p></li></ul></li><li><p>A cool example of people trying to train agents on nonstandard EU maximization is <a href="https://arxiv.org/abs/2501.13011">MONA</a>; they train an RL agent not to maximize expected future reward (as is standard), but instead just current-stage reward plus some overseer feedback on their actions. This is intended to avoid the problem of strategic reward hacking (taking multiple steps to screw with the reward signal) while still giving long-term feedback. They note that the overseer feedback can take various forms, including automated LLM feedback about whether the action agrees with the trained AI&#8217;s constitution. (There&#8217;s a sketch of this contrast after this list.)</p><ul><li><p>This is pretty similar in structure to &#8220;imitate a successful agent, but have side constraints&#8221;.</p></li><li><p>If we can rely on LLMs to evaluate whether actions are acceptable, that&#8217;s an amazing boon.</p></li></ul></li><li><p>For really advanced AI agents, it might be really hard to evaluate whether they&#8217;re obeying the action preferences we care about during training and give appropriate feedback &#8212; those preferences might be things like &#8220;don&#8217;t fool us&#8221;, but maybe it&#8217;s hard to tell when they&#8217;re being deceptive (and training directly on the easiest signals, like the chain of thought, could incentivize more subtle deception).</p></li></ul>
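<p>Here&#8217;s a toy sketch of that contrast in optimization targets (the function names are placeholders I made up, and the real method trains with RL rather than scoring targets like this; see the paper for details):</p><pre><code class="language-python"># Toy contrast between standard RL's target and a MONA-style target.
# `env_reward` and `overseer_score` are illustrative placeholders.

GAMMA = 0.99  # discount factor

def env_reward(state, action):
    return 0.0  # placeholder: the environment's reward for this single step

def overseer_score(state, action):
    # Placeholder: overseer feedback on the action itself, e.g. automated
    # LLM feedback on whether the action agrees with the AI's constitution.
    return 0.0

def standard_rl_target(trajectory):
    # trajectory: list of (state, action) pairs. Ordinary RL credits the
    # discounted sum of all future rewards, so multi-step plans that corrupt
    # the reward signal later still pay off.
    return sum(GAMMA**t * env_reward(s, a) for t, (s, a) in enumerate(trajectory))

def mona_style_target(state, action):
    # MONA-style: current-step reward plus approval of the action, with no
    # credit flowing back from later steps, so strategic multi-step reward
    # hacking is never reinforced.
    return env_reward(state, action) + overseer_score(state, action)
</code></pre>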
<h3>Are worldstate utility maximizers more competitive?</h3><p><strong>Are people more likely to want to train and deploy these agents?</strong></p><ul><li><p>Yes, some people will definitely do this, just like people are deploying AIs with instructions to be evil or make Fartcoins or whatever.</p></li><li><p>But the leading AI labs won&#8217;t want this, as long as they care about prosaic applications and think they&#8217;re in an iterated game with the rest of society.</p><ul><li><p>Constrained agents are actively super useful for practical applications.</p></li></ul></li><li><p>The danger is that companies might decide they&#8217;re just in a race for superintelligence, or that it&#8217;s not worth paying the safety tax because their systems look safe according to their tests, or whatever. I think that&#8217;s a real danger, but it feels totally disconnected from the original argument from Dutch books.</p></li></ul><p><strong>Do worldstate utility maximizers win once they are deployed?</strong></p><p>The answer here seems like an obvious &#8220;yes&#8221; if you&#8217;re Yudkowsky, imagining a world where the capability gap between the latest AI systems and humanity is really high, and the ability of society to oversee AI developers and deployers is weak. Otherwise, though, I think it&#8217;s not so obvious. A lot seems to hinge on how defensively stable the world is, and how good oversight of AIs and AI companies is.</p><h3>Does automated AI R&amp;D naturally lead to worldstate utility-maximizers?</h3><p><strong>Will recursive improvement naturally remove constraints on behavior or limitations on preference-space?</strong></p><p>I&#8217;m not sure I can accurately represent this worry &#8212; I think it&#8217;s something Eliezer and Nate and others really worry about, but I don&#8217;t really see it? I think the fear stems in part from an assumption that all the constraints we put on powerful AIs will either be incoherent, have loopholes, or be clearly &#8220;unnatural&#8221; / &#8220;not part of the AI&#8217;s True Utility Function&#8221;. Incoherent constraints might not do anything, loopholes will be found, and the AIs will prune away constraints that they realize aren&#8217;t part of their true utility function.</p><p>This seems related to Eliezer&#8217;s claim that true morality is act-utilitarianism, but no human can actually follow this so we need deontological side constraints. Equivalently: true optimal behavior is EU maximizing the state of the world, and EU functions over bounded parts of the world are too narrow while EU components based on process are, I guess, too artificial?</p><p>I can see the intuition, especially in the first case, but I can imagine entities with either type of preference that are just stable under reflection.</p><p>An intuition pump: when I introspect about the preferences I have over process, it feels like there&#8217;s a difference between constraints I chafe at (ugh, why do I need to get permission from this person, this would be so much <em>easier</em>) and constraints that feel like part of me (I want to be nice to my partner). What would affect whether self-aware AIs would consider their trained action-preferences as ego-dystonic rather than ego-syntonic?</p><p><strong>A related worry: automated AI R&amp;D could lead to a new paradigm that makes value training harder.</strong></p><p>Perhaps <em>the </em>major debate in AI today is whether the current LLM paradigm will smoothly scale to ~arbitrary levels of intelligence (look how fast they&#8217;re getting good at things!), or peter out (look how expensive they are while still being bad at lots of stuff! We don&#8217;t have many OOMs of scaling left! They&#8217;re much less power- and data-efficient than the brain!).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>One synthesis of these arguments: LLMs could rapidly scale to great performance at AI R&amp;D, accelerating the top human supergeniuses in developing a new paradigm.
This could be bad news for safety, in that 1) LLMs seem like an unusually nice paradigm (a successor paradigm might be less amenable to value training), 2) the compute overhang from unlocking a new, more efficient paradigm could mean that we see a big capabilities jump in a short period of time, and 3) this could push us close to really fast AI iteration and scary capabilities, shifting companies from prioritizing prosaic utility and public trust to deploying ASAP at all costs.</p><h2>Concluding thoughts</h2><p><strong>Strategic takes:</strong></p><ul><li><p><strong>It seems to me like the &#8220;expected utility maximizer&#8221; argument doesn&#8217;t have much force, and it&#8217;s often used to stand in for other, realer concerns. </strong>I think it&#8217;s better for people to directly argue about what&#8217;s easier to build, more preferable / competitive to deploy, and maybe what&#8217;s stable under automated AI R&amp;D.</p></li><li><p><strong>A lot of those realer concerns are downstream of rapid AI progress and poor AI oversight.</strong></p><ul><li><p>As such, <a href="https://www.iaps.ai/research/managing-risks-from-internal-ai-systems">internal AI deployments</a> for automated R&amp;D are particularly high-risk.</p></li><li><p>Good societal oversight of AI seems at least necessary, and maybe a really good version of it is basically sufficient (by incentivizing the right safety investments). Will we get there? <a href="https://www.lesswrong.com/posts/YPLmHhNtjJ6ybFHXT/little-echo">Mu.</a></p></li></ul></li><li><p><strong>Solving the real concerns seems feasible</strong>. It feels like we&#8217;re not philosophically barred from a solution, we&#8217;re just haggling over the price&#8230;but it&#8217;s super unclear what the right price is.</p><ul><li><p>Plus side: it&#8217;s not a question of &#8220;math proof says we&#8217;re doomed&#8221;, it&#8217;s much more like &#8220;we maybe need to spend an extra trillion dollars and three years on this megaproject&#8221;. Merely extremely difficult.</p></li><li><p>Minus side: maybe it&#8217;s ten trillion dollars and ten years.</p></li></ul></li></ul><p><strong>Technical takes:</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><ul><li><p><strong>I&#8217;m excited about the idea of training agents to have preferences over actions. </strong>It seems like a pretty natural thing to try, and stuff like MONA shows that simple versions of it can kinda work. Apparently there have been small contingents of <a href="https://www.lesswrong.com/posts/7xneDbsgj6yJDJMjK/chain-of-thought-monitorability-a-new-and-fragile?commentId=xEe2DjnWhvQZHBTSy">&#8220;process-based RL&#8221;</a> folks on the labs&#8217; AI safety teams for a few years now.</p></li><li><p><strong>I&#8217;m confused how we&#8217;d know if we&#8217;d succeeded at training action preferences that are stable under reflection, capability-enhancement, etc. </strong>&#8212; if the AI intrinsically &#8220;cared&#8221; about satisfying the constraint. Behavioral tests are good enough for current LLMs, but clearly not enough for really advanced AIs. Ideally we&#8217;d have great interpretability tools for this.</p></li><li><p>I feel more confused about whether it&#8217;s feasible or useful to deliberately build powerful agents with preferences about only part of the world.</p><ul><li><p>One natural version of this is to have a carefully overseen &#8220;AI builder&#8221;, sort of an ant queen, that builds specialized systems.
That could include developing lots of systems that are very clearly not WorldSUM agents, like regular old computer code.</p></li><li><p>It doesn&#8217;t seem impossible that a mix of technical factors &amp; societal lessons-learned leads to a <a href="https://forum.effectivealtruism.org/topics/comprehensive-ai-services">Drexlerian </a>world of specialized AIs with preferences only over the action-space they&#8217;re built to operate on.</p></li></ul></li></ul><h3>Some sources on this topic:</h3><ul><li><p>Rohin Shah, 2017: <a href="https://www.lesswrong.com/posts/NxF5G6CJiof6cemTw/coherence-arguments-do-not-entail-goal-directed-behavior">Coherence arguments do not entail goal-directed behavior</a></p></li><li><p>Elliott Thornley, Feb 2023: <a href="https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems#A_money_pump_for_Completeness">There are no coherence theorems</a></p></li><li><p><a href="https://forum.effectivealtruism.org/posts/ZS9GDsBtWJMDEyFXh/eliezer-yudkowsky-is-frequently-confidently-egregiously?commentId=CCgyJ9kLAxLS8Ea4Q">Anonymous&#8217;s comment thread</a> from Aug 2023, especially Keith Wynroe&#8217;s comment.</p></li><li><p><a href="https://forum.effectivealtruism.org/posts/rxchpkmhBbhDFebYm/owen-cotton-barratt-s-quick-takes?commentId=M8AWQhKKSmKsKfQkG">Owen Cotton-Barratt&#8217;s thread</a> from 2024.</p></li><li><p><a href="https://www.lesswrong.com/posts/GPuXM3ufXfmaktYXZ/what-do-coherence-arguments-actually-prove-about-agentic">Anonymous&#8217;s LW question</a> from June 2024.</p></li></ul><p>Of these, I think Rohin&#8217;s post and Keith&#8217;s and Owen&#8217;s comments most get at what I&#8217;m interested in here.</p><div><hr></div><p>Thanks to Will MacAskill and Owen Cotton-Barratt for comments on a draft of this, and to my friends for encouraging me to finally publish it.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I actually don&#8217;t know how common this is these days &#8212; it used to be very common ~10 years ago, when I first questioned it and got a pretty chilly reception. I expect a lot of people who are AI-safety-adjacent but not experts themselves still have some version of it cached in their heads. And probably even some experts do, especially in areas like interpretability where engagement with this argument isn&#8217;t particularly relevant for your day to day.
I am very grateful to folks like Rohin Shah for fighting the good fight on this topic over the years.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This is trivially true, even &#8212; see Rohin Shah&#8217;s post about <a href="https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc/p/NxF5G6CJiof6cemTw">coherence vs goal-directedness</a>. A robot that always twitches can be interpreted as utility-maximizing! The problem with that example is that the twitch-bot doesn&#8217;t do anything useful.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>I also have beef with the intuition about utility being a simple function, by the way, like #paperclips &#8212; I think there&#8217;s an assumption that complexities either get stripped away by the training process, or don&#8217;t matter because they don&#8217;t outweigh the most important parts of the utility function. But I&#8217;m not sure either of those is right &#8212; certainly they&#8217;re not good descriptions of present-day LLMs. But if we&#8217;re assuming utility functions can/will be complex, it feels a lot less obvious to me that we&#8217;re 100% doomed by default. There&#8217;s some discussion of this <a href="https://www.lesswrong.com/posts/87EzRDAHkQJptLthE/but-why-would-the-ai-kill-us?commentId=sEzzJ8bjCQ7aKLSJo">here</a>; folks like Paul Christiano and Carl Shulman think humanity won&#8217;t literally all be killed, and that it&#8217;s irresponsible to say we will, when in fact it&#8217;s pretty likely we&#8217;re kept alive (we&#8217;re likely to be of some interest to LLM-style agents) and perhaps even in a pretty good state (this seems like a crux to me &#8212; not obvious), but not in control of the cosmic endowment.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>OK, fine, maybe he pays <a href="https://www.lesswrong.com/w/making-beliefs-pay-rent">some rent</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is my attempt at passing the intellectual Turing Test here; let me know if you hold a version of this view that reads differently.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>&#8220;Daily profit&#8221; as the target is a simplification, since many trades are meant to be profitable in the longer term, and e.g.
there are nuances about what price you mark profits to that allow for reward hacking if you aren&#8217;t thoughtful.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Actually I&#8217;m not sure how the data efficiency compares, please @ me with your wild Fermi estimates.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Take these takes with a grain of salt; I am very much not an ML researcher. I just read their papers sometimes.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Wise AI support for government decision-making]]></title><description><![CDATA[It looks like you're writing a treaty. Can I help you with that?]]></description><link>https://www.expectedsurprise.com/p/wise-ai-support-for-government-decision</link><guid isPermaLink="false">https://www.expectedsurprise.com/p/wise-ai-support-for-government-decision</guid><dc:creator><![CDATA[Ashwin]]></dc:creator><pubDate>Mon, 14 Oct 2024 06:31:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d88aa5c6-3170-4af0-9b27-ba8aa134b87b_2859x1952.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Michaelah Gertz-Billingsley and I wrote this quickly as a submission to AI Impacts&#8217; <a href="https://blog.aiimpacts.org/p/essay-competition-on-the-automation">essay competition</a> on automating wisdom and philosophy. Future posts here will be more conversational!  </em></p><div><hr></div><p>Governments have long been early adopters, or even creators, of sense-making tools &#8212; from the census to the computer to the internet. Today, they use complex statistical or mathematical models to make decisions about commensurately complex issues: healthcare, finance, infrastructure, military procurement. Present-day AI systems are growing increasingly helpful for processing information and writing code, and their creators aspire to build powerful assistants with general intelligence.&nbsp;</p><p>Atypically, governments are behind the curve on adopting advanced AI, in part due to concerns about its reliability. Suppose that developers gradually solve the issues that make AIs poor advisors &#8212; issues like confabulation, systemic bias, misrepresentation of their own thought processes, and limited general reasoning skills. 
What would make <em>smart </em>AI systems a good or poor fit as government advisors?<em>&nbsp;</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yAFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31d94359-a9e2-4d12-87d7-ebb0c3d071d2_3310x3310.jpeg" width="1456" height="1456" alt=""><figcaption class="image-caption">The Oracle at Delphi, an early decision-support service for policymakers. Unfortunately, &#8220;a great empire will fall if you go to war&#8221; is not especially wisdom-serving advice.</figcaption></figure></div><p>Much like smart humans, smart AIs might do a great job at answering the wrong questions. They might focus on legible or short-term benefits over more important factors that are harder to analyze, or take their users&#8217; implicit assumptions for granted.&nbsp;</p><p><strong>We recommend the development of </strong><em><strong>wise </strong></em><strong>decision support systems that incorporate the following features:&nbsp;</strong></p><ol><li><p><strong>Helping people figure out their principles and make decisions based on them.</strong></p></li><li><p><strong>Helping people extend, explore, and structure their options </strong>under consideration (not just helping them choose from within the limited option space they initially consider).</p></li><li><p><strong>Identifying and questioning implicit assumptions; </strong>exploring different frames in which to consider the problem.</p></li><li><p><strong>Far-sightedness &#8212; incorporating longer-term and higher-order consequences of a decision </strong>(e.g. effects on relationships and long-term collaboration; precedent-setting; long-term budgeting and risk).</p></li><li><p><strong>Aggregating views and preferences across disparate groups and stakeholders.</strong></p></li></ol><h2>How can we ensure government access to wise AI decision support?</h2><p><strong>1. By default, the private interests leading AI today may focus on smart AI decision support over wisdom</strong>, as it is easier to train and test tools for short-term decisionmaking.&nbsp;</p><p>Many of their private customers may (rightly or wrongly) see short-term benefits as sufficient. This might be based on the incentive to seek short-term profits at the expense of the long term. But companies might also often have specific decisions that are not that subtle and don&#8217;t particularly need wisdom &#8212; e.g. &#8220;how much of X widget should we make this year&#8221;. Unlike governments, they may be able to function well entirely by making such decisions well.
By contrast, we would argue that functional governments inherently need to consider some subtle, complex, long-term issues.&nbsp;</p><p><strong>Governments could encourage the development of wise AI, much as they have done for technologies like vaccines and autonomous vehicles.&nbsp;</strong></p><ul><li><p>Organizations like DARPA could fund prizes for more testable aspects of wisdom.</p></li><li><p>New mechanisms such as <a href="https://en.wikipedia.org/wiki/Advance_market_commitment">advance market commitments</a> could provide market incentives for forward-looking product development.</p></li></ul><p><strong>2. Key decision makers may not know to ask for features that will lead to the wisest decisions. </strong>Government institutions have typically been slow to adopt AI, and impactful policymakers will rarely be AI experts. Policies are often made on the basis of empirical evidence and expert views, which may not favor wisdom-serving systems if they are a new technology with relatively illegible benefits.</p><p><strong>We propose that government-serving think tanks and consultancies start developing AI for wisdom-serving decision processes now. </strong>Such efforts might start small with today&#8217;s limited models, but could scale up in automation and sophistication as AI decision support capabilities improve. A mature field of wise, AI-powered decision support would let governments benefit from these tools without needing to judge them directly. Instead, they could rely on the judgment and track record of organizations they already have relationships with. Multiple organizations building these tools could compete and learn from each other, ensuring that wise decision support does not lag too far behind its merely smart alternatives.</p><h2>Adopting AI for wise decision support: a playbook</h2><ol><li><p><strong>Start by using AI tools to augment traditional decision-making and analysis processes</strong>. This allows the organization to a) make informed decisions from the start, b) not rely on AI before it&#8217;s ready to automate everything, c) generate training data to improve their AIs, and d) get experience with AI&#8217;s strengths and weaknesses.</p></li><li><p><strong>Scale up the use of AI over time &#8212; relying on it for more tasks as it becomes more capable.&nbsp;</strong></p></li><li><p><strong>Clarify the value-add provided by wise systems over standard AI products. </strong>Identify and publicize incidents where AI systems went awry but could have gone well by incorporating &#8220;wisdom-serving&#8221; design principles.</p></li></ol><p><strong>Concrete example: using near-future LLMs to support a Delphi process</strong></p><p>The <a href="https://en.wikipedia.org/wiki/Delphi_method">Delphi method</a> involves the following steps: session runners want to understand a topic better &#8212; often a forecast of the future or a recommendation for policymakers. They come up with a set of quantitative questions and ask a panel of experts for responses. The experts share their qualitative views as well; the facilitators guide discussion or transmit information, and the participants revise their estimates. (The process can be repeated if disagreements remain, or the organizers want further clarity on key questions.)</p>
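<p>In code-shaped form, one round-loop might look like this (a sketch only; the names and the convergence test are illustrative, not part of the method&#8217;s definition):</p><pre><code class="language-python"># Sketch of the Delphi loop described above. All names are illustrative.

from dataclasses import dataclass

@dataclass
class ExpertResponse:
    estimates: list[float]  # answers to the quantitative questions
    commentary: str         # qualitative views, caveats, reframings

def run_delphi(questions, experts, max_rounds=3, converged=lambda rnd: False):
    """experts: objects with a name and a respond(questions, history) method."""
    history = []  # one {expert_name: ExpertResponse} dict per round
    for _ in range(max_rounds):
        # Each expert sees prior rounds (others' estimates and commentary)
        # and may revise their own views in response.
        this_round = {e.name: e.respond(questions, history) for e in experts}
        history.append(this_round)
        if converged(this_round):  # stop once disagreement has narrowed enough
            break
    return history
</code></pre>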
<p>In our (limited) experience, there are many frictions to this process, particularly in generating clear language and context for the questions, understanding where experts&#8217; disagreements lie, and making efficient use of a group of experts&#8217; time while keeping all of them updated on what the others believe. Experts often have valuable qualitative points to share that reframe the analysis, but which are more difficult to identify and convey than quantitative beliefs.</p><p>Roughly current-level AIs could already help with this process at many points:</p><ul><li><p>Helping organizers come up with great questions &amp; phrase them well.</p></li><li><p>Expressing organizers&#8217; views. Organizers could provide an LLM with access to a long document describing their background thinking in detail. Participants could query it and get on the same page without slowing down the process. This could be faster and better than providing the same background reading document for everyone, since participants will have different questions or concerns about the framing of the exercise. (Discussing such framing concerns often takes up a large portion of the allotted time.)</p></li><li><p>Transcribing the discussion and summarizing it, while identifying key points of disagreement.</p></li><li><p>Helping organizers rewrite questions on the fly in response to input.</p></li><li><p>Allowing organizers to ask qualitative questions, since LLMs can auto-summarize responses and aggregate them &#8212; e.g. &#8220;17 out of 24 respondents mentioned that they thought this level of AI capability was infeasible, so they found it difficult to concretely imagine what its impacts would be.&#8221; (See the sketch after this list.)</p></li></ul>
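<p>As a sketch of that last point, the aggregation step is mostly prompt plumbing. Here <code>ask_llm</code> is a stand-in for whichever model API you&#8217;re using, not a real library call:</p><pre><code class="language-python"># Sketch of LLM-aided aggregation of qualitative survey responses.
# `ask_llm(prompt) -> str` is a placeholder for any chat-model API.

def ask_llm(prompt):
    raise NotImplementedError("call your LLM provider here")

def aggregate_responses(question, responses):
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    prompt = (
        f"Experts were asked: {question}\n\n"
        f"Their responses:\n{numbered}\n\n"
        "Summarize the main clusters of opinion, count how many respondents "
        "fall into each, and flag anyone who rejected the question's framing."
    )
    return ask_llm(prompt)
</code></pre>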
<h3>How can wise decision-support processes scale up over time to take advantage of accumulated data &amp; increasing AI capabilities?&nbsp;</h3><h4>Formal AI training</h4><p><strong>What kinds of data are available?</strong></p><ul><li><p>For many forms of wisdom, user feedback can provide useful information about the model&#8217;s quality (and can provide small amounts of fine-tuning training data). For example: users may recognize better option sets or useful decision-boundary diagrams when they see them; they may understand and appreciate advice processes that put them in touch with their values, etc.&nbsp;</p></li><li><p>This may break down for far-sightedness &#8212; people may not truly appreciate far-sighted decision support until the long term.</p></li><li><p>OTOH: maybe people would notice &#8212; &#8220;yes, this serves my long-term interests, good job identifying relevant factors ABC&#8221;.</p></li></ul><p><strong>The amount of data generated is relatively sparse, so conventional fine-tuning likely wouldn&#8217;t work.</strong></p><ul><li><p>But: AIs are already okay at many-shot in-context learning (learning from information provided to them in a prompt).&nbsp;</p></li><li><p>So sufficiently powerful AIs might be able to learn useful patterns from summaries or transcripts of earlier decision-support processes, if humans flagged what helped them reach a better decision along various dimensions (far-sightedness, etc.).</p></li></ul><h4>The evolving role of AI</h4><p>Current AIs would typically take on an ancillary role. In the AI-assisted Delphi process we sketched out above, most of the &#8220;wisdom&#8221; is coming from human participants and the structure of the process. However, this dynamic could change over time as AI capabilities increase and deployers have more data and experience in how to use it.</p><p>In a Delphi process, more capable AI could be actively involved in every step:</p><p>Human organizers could spend more time in consultation with AIs about how to write and frame the questions, including along axes like the long-term implications of particular decisions. More capable AIs could be instructed to look for unquestioned assumptions or blindspots in human organizers&#8217; thinking. They could even forecast which questions would most likely lead to disagreements between participants, which would be most productive, which would unearth key values questions, and so on. As a result, the process could end up revolving around questions that the organizers initially hadn&#8217;t considered at all.</p><p>Likewise, participants themselves could be in close feedback loops with their own AI advisors, who could help identify their blindspots and proactively take actions such as reaching out to other participants&#8217; AIs to seek clarity on a potential misunderstanding. As a result, the process may more closely resemble a series of facilitated dialogues between participants than an inefficient large-group conversation.</p><p>Over time, different processes could be developed around AI capabilities. For example, the surveys themselves could be written and explained by AIs fine-tuned on relevant documents. Most of the expert dialogue could take place between AI systems instructed to represent a particular viewpoint, which only occasionally check in with human counterparts for clarification. The output could even be a queryable &#8220;living document&#8221; &#8212; a synthesized AI advisor system designed to reflect lessons from the conversation and from documents and schools of thought recommended by the participants.&nbsp;</p><h2>Conclusions</h2><p>In principle, capable AI advisor systems could massively improve society. They could compound the earlier digital revolutions&#8217; effects of making information available and useful. But while the computer and the internet undoubtedly enriched society, they have had their downsides. Applications like social media can harm us in subtle long-term ways by offering what we seem to want in the short term, while ubiquitous internet use makes us vulnerable to cyberattacks and privacy breaches.&nbsp;</p><p>Similarly, the high modernist ambitions of the 1800s and 1900s were founded on the hope that top-down analysis and optimization could straightforwardly improve society. Such approaches served well for building bridges and dams, bringing millions out of poverty. But we remember them, too, for their failures when applied to systems that were unexpectedly complex, subtle, or ethically fraught.&nbsp;</p><p>Building beneficial AI advisor systems requires taking a lesson from history: knowledge and power must be directed wisely, with thought to long-term consequences and an appreciation for unknown unknowns. Much like previous industrial revolutions, smart AIs could present a new opportunity to reshape the world, for better and for worse.
We should treat that opportunity with the weight it deserves.</p><div><hr></div><p><em>This post, like the rest of Expected Surprise, does not reflect the institutional views of my employer or other affiliated organizations.</em></p>]]></content:encoded></item></channel></rss>