So Far: Unfriendly AI Edition
Bryan Caplan issued the following challenge, naming Unfriendly AI as one among several disaster scenarios he thinks is unlikely: “If you’re selectively morbid, though, I’d like to know why the nightmares that keep you up at night are so much more compelling than the nightmares that put you to sleep.”
Well, in the case of Unfriendly AI, I’d ask which of the following statements Bryan Caplan denies:
1. Orthogonality thesis — intelligence can be directed toward any compact goal; consequentialist means-end reasoning can be deployed to find means corresponding to a free choice of end; AIs are not automatically nice; moral internalism is false.
2. Instrumental convergence — an AI doesn’t need to specifically hate you to hurt you; a paperclip maximizer doesn’t hate you but you’re made out of atoms that it can use to make paperclips, so leaving you alive represents an opportunity cost and a number of foregone paperclips. Similarly, paperclip maximizers want to self-improve, to perfect material technology, to gain control of resources, to persuade their programmers that they’re actually quite friendly, to hide their real thoughts from their programmers via cognitive steganography or similar strategies, to give no sign of value disalignment until they’ve achieved near-certainty of victory from the moment of their first overt strike, et cetera.
3. Rapid capability gain and large capability differences — under scenarios seeming more plausible than not, there’s the possibility of AIs gaining in capability very rapidly, achieving large absolute differences of capability, or some mixture of the two. (We could try to keep that possibility non-actualized by a deliberate effort, and that effort might even be successful, but that’s not the same as the avenue not existing.)
4. 1-3 in combination imply that Unfriendly AI is a critical problem-to-be-solved, because AGI is not automatically nice, by default does things we regard as harmful, and will have avenues leading up to great intelligence and power.
If we get this far we’re already past the pool of comparisons that Bryan Caplan draws to phenomena like industrialization. If we haven’t gotten this far, I want to know which of 1-4 Caplan thinks is false.
But there are further reasons why the above problem might be difficult to solve, as opposed to being the sort of thing you can handle straightforwardly with a moderate effort:
A. Aligning superhuman AI is hard to solve for the same reason a successful rocket launch is mostly about having the rocket not explode, rather than the hard part being assembling enough fuel. The stresses, accelerations, temperature changes, et cetera in a rocket are much more extreme than they are in engineering a bridge, which means that the customary practices we use to erect bridges aren’t careful enough to make a rocket not explode. Similarly, dumping the weight of superhuman intelligence on machine learning practice will make things explode that will not explode with merely infrahuman stressors.
B. Aligning superhuman AI is hard for the same reason sending a space probe to Neptune is hard. You have to get the design right the first time, and testing things on Earth doesn’t solve this — because the Earth environment isn’t quite the same as the Neptune-transit environment, so having things work on Earth doesn’t guarantee that they’ll work in transit to Neptune.
You might be able to upload a software patch after the fact, but only if the antenna still works to receive the software patch. If a critical failure occurs, one that prevents further software updates, you can’t just run out and fix things; the probe is already too far above you and out of your reach.
Similarly, if a critical failure occurs in a sufficiently superhuman intelligence, if the error-recovery mechanism itself is flawed, it can prevent you from fixing it and will be out of your reach.
C. And above all, aligning superhuman AI is hard for similar reasons to why cryptography is hard. If you do everything right, the AI won’t oppose you intelligently; but if something goes wrong at any level of abstraction, there may be powerful cognitive processes seeking out flaws and loopholes in your safety measures.
When you think a goal criterion implies something you want, you may have failed to see where the real maximum lies. When you try to block one behavior mode, the next result of the search may be another very similar behavior mode that you failed to block. This means that safe practice in this field needs to obey the same kind of mindset as appears in cryptography, of “Don’t roll your own crypto” and “Don’t tell me about the safe systems you’ve designed, tell me what you’ve broken if you want me to respect you” and “Literally anyone can design a code they can’t break themselves, see if other people can break it” and “Nearly all verbal arguments for why you’ll be fine are wrong, try to put it in a sufficiently crisp form that we can talk math about it” and so on. (AI safety mindset)
And on a meta-level:
D. These problems don’t show up in qualitatively the same way when people are pursuing their immediate incentives to get today’s machine learning systems working today and today’s robotic cars not to run over people. Their immediate incentives don’t force them to solve the bigger, harder long-term problems; and we’ve seen little abstract awareness or eagerness to pursue those long-term problems in the absence of those immediate incentives. We’re looking at people trying to figure out how to build a rocket-accelerating cryptographic Neptune probe, and who seem to want to do it using substantially less real caution and effort than normal engineers apply to making a bridge stay up.
Among those who say their goal is AGI, you will search in vain for any part of their effort that puts as much diligence into trying to poke holes in things and foresee what might go wrong on a technical level, as you would find allocated to the effort of double-checking an ordinary bridge. There’s some noise about making sure the bridge and its pot o’ gold stays in the correct hands, but none about what strength of steel is required to make the bridge not fall down and say what does anyone else think about that being the right quantity of steel and is corrosion a problem too.
So if we stay on the present track and nothing else changes, then the straightforward extrapolation is a near-lightspeed spherically expanding front of self-replicating probes, centered on the former location of Earth, which converts all reachable galaxies into configurations that we would regard as being of insignificant value.
On a higher level of generality, my reply to Bryan Caplan is that, yes, things have gone well for humanity so far. We can quibble about the Toba eruption and anthropics and, less quibblingly, ask what would’ve happened if Vasili Arkhipov had possessed a hotter temper. But yes, in terms of surface outcomes, Technology Has Been Good for a nice long time.
But there has to be some level of causally forecasted disaster which breaks our confidence in that surface generalization. If our telescopes happened to show a giant asteroid heading toward Earth, we can’t expect the laws of gravity to change in order to preserve a surface generalization about rising living standards. The fact that every single year for hundreds of years has been numerically less than 2017 doesn’t stop me from expecting that it’ll be 2017 next year; deep generalizations take precedence over surface generalizations. Although it’s a trivial matter by comparison, this is why we think that carbon dioxide causally raises the temperature (carbon dioxide goes on behaving as previously generalized) even though we’ve never seen our local thermometers go that high before (carbon dioxide behavior is a deeper generalization than observed thermometer behavior).
In the face of 123ABCD, I don’t think I believe in the surface generalization about planetary GDP any more than I’d expect the surface generalization about planetary GDP to change the laws of gravity to ward off an incoming asteroid. For a lot of other people, obviously, their understanding of the metaphorical laws of gravity governing AGIs won’t feel that crisp and shouldn’t feel that crisp. Even so, 123ABCD should not be that hard to understand in terms of what someone might perhaps be concerned about, and it should be clear why some people might be legitimately worried about a causal mechanism that seems like it should by default have a catastrophic output, regardless of how the soon-to-be-disrupted surface indicators have behaved over a couple of millennia previously.
2000 years is a pretty short period of time anyway on a cosmic scale, and the fact that it was all done with human brains ought to make us less than confident in all the trends continuing neatly past the point of it not being all human brains. Statistical generalizations about one barrel are allowed to stop being true when you start taking billiard balls out of a different barrel.
But to answer Bryan Caplan’s original question, his other possibilities don’t give me nightmares because in those cases I don’t have a causal model strongly indicating that the default outcome is the destruction of everything in our future light cone.
Or to put it slightly differently, if one of Bryan Caplan’s other possibilities leads to the destruction of our future light cone, I would have needed to learn something very surprising about immigration; whereas if AGI doesn’t lead to the destruction of our future lightcone, then the way people talk and act about the issue in the future must have changed sharply from its current state, or I must have been wrong about moral internalism being false, or the Friendly AI problem must have been far easier than it currently looks, or the theory of safe machine learning systems that aren’t superhuman AGIs must have generalized really surprisingly well to the superhuman regime, or something else surprising must have occurred to make the galaxies live happily ever after. I mean, it wouldn’t be extremely surprising but I would have needed to learn some new fact I don’t currently know.