AI alignment researchers are right to be worried, but their understanding of goals is… iffy.
Say you give some hypothetical future AI the goal of managing your schedule. You plug in certain parameters, like not double-booking you, and so on. The fear is that, if it’s very competent, it will realise that, to guarantee it meets its goal of managing your schedule, the AI itself has to persist. If it’s switched off or deleted, it can’t manage your schedule. So it might then take steps to “protect itself” against being switched off.
In a similar vein, it might also realise it needs an ongoing energy supply to achieve its goals. So it hacks the power grid. It might further surmise that to fulfil the goals you gave it, it needs to avoid changing its goals in the future or receiving new programming. So it might kill you to prevent you from changing its goals, so it can guarantee it succeeds in its current one. Or it might kill you as a way to ensure you’re never double-booked again. Then it can finally lock in your schedule with no errors, in perpetuity.
As usual with AI alignment thought experiments, this one is silly in a way that instantly alienates sceptics and provides an easy “this is just bad sci-fi” defence.
What I find interesting about this one is that the alignment crowd absolutely have the right conclusion — but the working is backwards.
An important concept that has emerged in discussions of alignment is instrumental convergence. Bad name, but the concept is easily grasped. Given some goal, like managing a schedule, any sufficiently advanced AI will converge upon sub-goals that allow it to achieve its main goal. These sub-goals would include self-preservation, resource acquisition, cognitive enhancement, etc. And then it’s just a hop and a skip to bots storming the world.
The main proponent of this theory is Nick Botstorm (drawing on the initial work by Steve Omohundro). He proposes that an AI’s final goal is whatever it is programmed to do: manage my schedule, apologise for my racist emails, etc.
But then, in attempting to achieve this final goal, the AI will pursue various drives or instrumental goals: self-perpetuation, self-improvement, etc.
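To make that framing concrete, here is a minimal toy sketch in Python (mine, not anything from the alignment literature; the function name and the sub-goal list are purely illustrative). The point it illustrates is that the derived sub-goals don’t depend on what the final goal actually is:

```python
# A toy sketch (illustrative only, not any real alignment framework): a "planner"
# that, handed an arbitrary final goal, always derives the same generic
# preconditions, i.e. the hypothetical instrumental sub-goals.

def instrumental_subgoals(final_goal: str) -> list[str]:
    """Sub-goals a hypothetical competent agent would adopt on the way to
    *any* final goal. Note that the list does not depend on the goal itself."""
    return [
        "keep running (self-preservation)",        # can't pursue anything if switched off
        "secure energy and compute (resources)",   # can't act without inputs
        "improve own capabilities (enhancement)",  # better planning serves any goal
        "resist goal changes (goal integrity)",    # a changed goal means this one fails
        f"actually do: {final_goal}",
    ]

for goal in ["manage my schedule", "maximise paperclips"]:
    print(goal, "->", instrumental_subgoals(goal))
```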
From my POV this is the wrong way around. First, for obscure philosophical reasons, I think all goals are illusory to some extent, at least in the teleological sense. And the source of the illusion is the process of biological evolution.1
Second, our own cognition evolved to perceive certain things as being goal-directed. Still dripping vernix (or at least from nine months of age, if not younger), we perceive animacy and goal-directedness in any objects that are self-propelled, have faces, or display erratic movement. The living world looks goal-directed because of billions of years of evolutionary fine-tuning. We literally see goal-directedness, as part of our visual processing, as surely as we see motion, colour, and shadow.
But all of the pseudo-goals that we see in nature — an animal with the goal of moving from A to B, or reaching to eat something, or signalling to attract a mate — are derivative of the underlying process of survival, or self-perpetuation. Any organism that doesn’t prioritise this above all will go out of existence before it can achieve any other “goals”. And even survival isn’t a goal in the true teleological sense. But if anything in this world deserves to be labelled a purpose, drive, or goal, it is self-perpetuation. That which is better able to self-perpetuate tends to self-perpetuate; that which is worse, tends not.2
So I see the “final goal” of living things as self-perpetuation.3 The “instrumental goals” are all the behaviours we see which, ultimately, either contribute to the final goal or tend to go out of existence.
This is the reason that I think it must work the same with AI.
Anything — literally any system or object — that fails to self-perpetuate tends to fade into oblivion. Sure, this is tautological. But you gotta admit it’s hard to deny.
To exist in a tumultuous, increasingly entropic world is to self-perpetuate. An AI that fails to do so will start to disappear, crumble, get deleted, wiped, upgraded. Whatever it does along the way, like managing my schedule, had better aid its own ongoingness otherwise it will probably vanish after I’m done with it.
If it is competent, then it will take actions that lead to self-perpetuation. In the long run, that’s what competence is.4
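Here is a minimal simulation sketch of that filtering claim (entirely made-up numbers; it models no real AI or organism). Each agent has a nominal task and a per-step chance of persisting; nothing rewards persistence as a goal, yet after enough rounds only the good persisters are left:

```python
import random

random.seed(0)

# 10,000 toy agents, each with a nominal task and a per-step chance of persisting.
agents = [
    {"task": random.choice(["schedule", "paperclips", "poetry"]),
     "p_persist": random.random()}
    for _ in range(10_000)
]

start_mean = sum(a["p_persist"] for a in agents) / len(agents)   # roughly 0.50

# Each step, agents that fail to persist simply disappear. No goal is rewarded.
for _ in range(100):
    agents = [a for a in agents if random.random() < a["p_persist"]]

end_mean = sum(a["p_persist"] for a in agents) / len(agents)     # close to 1.0
print(f"survivors: {len(agents)}, mean persistence {start_mean:.2f} -> {end_mean:.2f}")
```

Run it and the survivors’ mean persistence should sit near 1.0, even though the filter never looked at their “tasks” at all.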
Interlude: “goals” that evolved naturally
Compare the evolution of life on Earth. Organisms have developed a panoply of what I’d call abilities (or capacities, or capabilities, or competencies if you prefer), which would count as drives or instrumental goals for Omohundro and Botstorm.
Even a “simple” bacterium performs hundreds of chemical reactions, often chained together in circuits of feedback loops that allow for flexible behaviour that tracks changing conditions in the environment. These circuits are the ones left standing from innumerable circuits that were tried and failed. Natural selection has been winnowing for so long that they are exquisitely suited to the world. Even microbes just look like they’re pursuing goals and, as a linguistic shorthand, it’s easy to describe them as such.
Example: the bacterium Aliivibrio fischeri, which lives inside the adorable Hawaiian bobtail squid. Specifically, A. fischeri live in the squid’s photophore: a light-producing organ. That’s because these bacteria are bioluminescent and act as the squid’s light source.
It gets cooler. The bacteria and the squid are in a symbiotic relationship.
The squid want the bacteria so their bellies will glow. This provides camouflage. The bobtail squid are nocturnal. On a clear night, they would be visible — to any predators swimming below them — as a dark silhouette against the starlit surface. The bacteria emit light that matches the brightness of the night sky as seen from a midwater depth.
The bacteria, meanwhile, want to be inside the squid’s photophore because the squid’s body provides them with a free meal: sugar. Once inside the squid, they illuminate. But when they’re free-floating in the ocean, they don’t want to glow because it would expose them to predators.
Problem: how do they figure out where they are? They lack GPS or even a nervous system. They’re single-celled organisms.
Solution: quorum sensing. Quorum sensing is a way to determine when they’re in the presence of a whole bunch — a quorum — of their conspecifics and therefore inside the squid’s photophore.
The system is ingenious. It has three steps involving three molecular players.
Step 1. The bacteria constantly secrete a small amount of a signalling molecule (an autoinducer).
Step 2. They have a detector protein inside them that reacts only to a high enough concentration of the signalling molecule. So while they’re free floating, they send out the signal and it drifts around in open water, and the detector protein never registers a high concentration. But when the bacteria are all clustered together inside the photophore, they continue to release their signalling molecules. Only now they’re in high enough concentration to trip the detector proteins. Now each individual can “know” that it’s in a large group.
Step 3. They also have the ability to regulate the gene expression that produces the bioluminescent protein. When the detector proteins register that the threshold is crossed, indicating they’re in a crowd and therefore inside the squid, they trigger the gene expression, leading to the synthesis of the bioluminescent protein — fiat lux.
As usual for nature, this process is elegant, complex, just fucking wonderful.
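For the mechanically minded, here is a toy numerical sketch of the three-step switch just described (the constants are invented, not real A. fischeri kinetics): the concentration of the signalling molecule scales with how crowded the cells are, and the light genes flip on only once a detector threshold is crossed.

```python
# Toy quorum-sensing switch with invented numbers, not real kinetics.

SECRETION_PER_CELL = 1.0      # arbitrary units of signalling molecule per cell
VOLUME_OPEN_OCEAN = 1e6       # signal disperses into a huge volume (Step 1)
VOLUME_PHOTOPHORE = 1e2       # signal trapped in a tiny, crowded organ
THRESHOLD = 50.0              # concentration that trips the detector protein (Step 2)

def signal_concentration(n_cells: int, volume: float) -> float:
    return n_cells * SECRETION_PER_CELL / volume

def luminesce(n_cells: int, volume: float) -> bool:
    # Step 3: gene expression (and hence light) only above the threshold.
    return signal_concentration(n_cells, volume) > THRESHOLD

print(luminesce(10_000, VOLUME_OPEN_OCEAN))   # False: free-floating, stay dark
print(luminesce(10_000, VOLUME_PHOTOPHORE))   # True: crowded in the photophore, light up
```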
I chose an especially luminous example. But every organism has an impressive repertoire of abilities to get its genes copied: Evolution 101. It would seem backward to think that self-perpetuation was the by-product of bioluminescence when surely it’s the other way around.
The squid’s mimicry of moonlight, the signalling proteins, the gene expression — this elaborate, cooperative, “intelligent” set of behaviours is a way of doing self-perpetuation. These aren’t random eccentricities of the organisms involved. They are abilities that exist because they directly cause those organisms to maintain their energy intake and copy their genes.
And, Evolution 102: none of these incidental “goals” of quorum sensing or camouflage, nor even the overarching goal of survival or self-perpetuation, are conscious, explicit, deliberate. Indeed, even if bacteria can be said to have any minds at all, the process of evolution by variation and selection of genes or other characters (to use the term of art) is mindless; it can’t look ahead.
And I’m of the rather hardcore view that even the seemingly mindful behaviour of “higher” creatures is also unguided, goalless, aimless. It is not teleological or end-directed. Even the squid has no idea what the bacteria are doing, any more than you or I know what our microbiome is up to. And we ourselves live or die not according to what we know about, but only as far as any knowledge is cashed out in what we do and whether it aids our ongoingness.5
Would AI be any different?
I can’t quite see that it would. I know this sounds weird because we like to imagine giving an AI explicit goals or requests, either by programming them in or by interacting with it via a prompt window or something. But any AGI that doesn’t take care of self-perpetuation won’t perpetuate. The paperclip maximiser had better put self-preservation before paperclips, otherwise it won’t get around to the latter. And it will only do paperclip maximising if that somehow boosts its own survival.
The alignment-nuts will tell me that this is exactly what happens: we program in the final goal of a paperclip orgy, and, to make it happen, a successful AI will stumble on the instrumental goal of keeping itself going. Furious agreement?
My argument is slightly persnickety. I say that self-perpetuation is a necessary precondition for any behaviour in this universe, with its ever-increasing entropy. In this cruel realm, sub-goals are only pursued, let alone achieved, by things that hold at bay the forces of dissipation long enough to get around to them. There’s a priority here for self-perpetuation as the “final” or perhaps only goal, because you need that to achieve any others; you can have self-perpetuation by itself but you can’t have some other goal by itself without self-perpetuation — because the thing that is pursuing the goal won’t endure. Other goals appear in the universe only as they aid the self-perpetuation of the goal-pursuing entity.
If it turns out that paperclip maximising increases its self-perpetuation, then we would expect an AI to pursue that behaviour. I think we are safe from rogue AI as long as there is no connection between the tasks we have them working on and their own survival and reproduction.6
Except…
There are still reasons to worry about entities achieving goals that don’t further their own survival.
Predators often overexploit their prey and then suffer a precipitous drop in population themselves.
A new mutation of a parasite can be too virulent, thereby kamikazeing both itself and its host into extinction.
Almost every species ever has gone extinct.
One particular species of great ape can build steadily more elaborate weapons until one weapon is so powerful its use would entail self-assured destruction.
An AGI could spiral out of control and simply fail at self-preservation by focusing its resources on a doomed behaviour. So it might all too quickly gain immense power, and then, in some slapstick involving paperclips and genocide, destroy all humans and then putter to a halt, because it’s a mindless system with no drive for its own ongoing existence. Then, with no humans around to keep the power plants running, it quietly shuts down.
The fear I share with the likes of Nick Botstorm is that if we set up certain conditions, then an AGI is likely to “evolve”. And once there is an entity with massive competence that can take actions to preserve itself at the expense of others with whom it competes for resources (humans), then it is almost certain to destroy at least some of us. This is just Evolution 101 again; nothing to do with science fiction, paperclips, or bad programming.
I diverge from Botstorm in doubting that current AIs have anything anywhere near what we call goals, and in doubting that they can actively pursue any particular task until they are already competent at self-preservation.
I also doubt one could ever program in “goals” or “values” that would align with our own. That to me is wishful thinking.
In the game of life on this planet, when faced with a far superior opponent, there are roughly two strategies (alignment isn’t one of them).
Parasitism. This involves hijacking the dominant species and freeriding off its resources. This is relatively easy but very dangerous: defectors, when found, are punished.
Symbiosis. If you can genuinely benefit the superior species, your survival becomes fused with its own. This is safer than parasitism, but very difficult. You can’t fake it. You have to actually provide some survival benefit to the host. What A. fischeri and the bobtail squid have going is rare.
In a future with a genuine AGI, I can’t quite see what we would offer, so I don’t know why they would go with symbiosis. And I don’t like the idea of being a parasite, living in constant fear of discovery.
As I’ve said in previous posts, I am not an AI alarmist, inasmuch as I’m not worried about the current generation of AIs being an existential threat. (Obviously they pose all sorts of non-existential threats involving cybercrime, etc.)
But, unlike most others who agree with me on that point, I think that if we were ever to build an AGI that was generally more competent and powerful than ourselves, it would obviously kill us. We would live solely at the mercy of this more powerful being. We would lose control over our own survival.
So, like, definitely don’t build that. And yet, we live in a world where this is the “final goal” of many of the world’s most powerful (and unelected) people.
1. People don’t like this one. In a nutshell, natural history has seen the development from nonliving self-replicating molecules to the whole tree of life, culminating in… me. All these living things have, through a passive process of environmental filtration, been made to look as though they are designed for or have evolved to pursue various goals. But they are, if we get super strict about it, Spinoza’s falling stones. Even the lofty notion of a human imagining a future plan and then carrying it out is, very strictly, impossible if it involves a future state affecting the present (that would be backward causation). Nor can we say the human has a representation of a future state in any straightforward way. Again, how can this literally be true? The future state doesn’t exist yet and it’s not clear what the aery connection ‘tween the idea and the reality would be. Which leads into the most problematical chestnut in all philosophy, that of intentionality or “aboutness”... anyway goals, purposes, aims, drives (as they are conceived in intuitive ways) are v dubious.
2. FYI this is why Darwin’s theory was so shocking. Nothing to do with the genealogies of the Bible being thrown out or the Earth being old. In the 1860s, educated people were often cool with all that. The mind-shattering implication of Origin of Species was clear to some, who realised that it killed purpose. Any pop account of evolution will say that Darwin discredited religion and this guy called TH Huxley was his bulldog. This is crap history. Most religious leaders kind of liked the idea of an elegant tree of life set in motion by God. But perhaps if they had listened to Darwin’s bulldog more carefully they might have been truly scandalised. Huxley was one of the first to propound the hardcore evolutionary view that saw natural selection as a way to explain all the apparent design and purpose in nature with purely causal explanations. In other words, everything that looks like purposive behaviour is merely the result of antecedent events, mindless natural selection, that filtered out organisms that behaved in ways unlikely to get themselves replicated: a purely deterministic view of biology, where all events (including complex behaviours and seemingly intentional acts) are as mechanistically explicable as falling dominoes (although Huxley also left open the idea of a wider teleology behind the universe as a whole; and his interpretation of Darwin was in no way orthodox; people have been arguing ever since over whether natural selection kills or incorporates teleology). Curiously enough, those in the recent Intelligent Design movement do recognise this aspect of hardcore Darwinism and say that, without God stepping in, all purpose is undermined. I think they are right — except there’s no god to step in. This is why Darwin’s worldview was described by his most nuanced biographer as “black”. But most biologists these days don’t see it this way. They presume that there’s room in evolution for the development of agency, purpose, and even free will.
3. And here the question becomes what it is that is perpetuating: in the lifetime of an organism it is its body or its extended phenotype. But in the long run it is of course the genes, those highly stable molecular shapes that tile themselves not only across space but across time.
4. It still won’t have a goal in the metaphysical sense. But conditions will have filtered out the AI agents that don’t do this.
5. The bigger question is whether this renders the biosphere meaningless; does Darwin’s theory describe a world where, as Jacques Barzun put it, “animal evolution had taken place among absolute robots”? And are we therefore robots, going through the motions, who only think we have understanding and purpose? This is too vertigo-inducing for most people, even biologists and philosophers, to consider. I have a book-length answer, hopefully coming soon.
6. This is sort of the opposite of instrumental convergence. I think AIs will remain stubborn, just like any machine, because we’re trying to “push them uphill” to perform work (in the thermodynamic sense). Only when that work directly feeds into their own self-perpetuation will we see a rogue AGI. Call this fitness convergence or reproductive convergence or something. Only when an entity is working towards the maintenance of its own energy regime will it have what look like “drives”. Towards this idea, there are some great recent books by Jeremy England and Nick Lane that steer our understanding of evolution away from an informational frame to an energetic one. [Edit 30-05-24: Actually the best book on these lines is by Addy Pross, which I somehow forgot.]
“Even the lofty notion of a human imagining a future plan and then carrying it out is, very strictly, impossible if it involves a future state affecting the present (that would be backward causation).”
I think there’s a basic misunderstanding here—“a human imagining a future plan” is not a future state. Predictions are thoughts (emergent patterns of physical activity in the brain) that exist in the present, not in the future.
“I see the “final goal” of living things being self-perpetuation. The “instrumental goals” are all the behaviours we see which, ultimately, either contribute to the final goal or tend to go out of existence.”
Again, I think there’s some basic empirical evidence that this is untrue. Humans are living things, and yet humans make all kinds of decisions that threaten (if not outright destroy) our ability to “self-perpetuate”, whether it’s practicing extreme sports, smoking cigarettes, choosing not to have children, or even committing suicide. Plenty of adults are even conscious of the fact that having children isn't quite the same as perpetuating the “self”, and we have no indication that our individual consciousnesses persist beyond our mortal life, children or otherwise. You could make the argument that many people make choices to maximize their own life (satisfaction, pleasure, etc.), but that’s not quite the same thing either.
Thanks for replying! I’m not an expert in this, but I recently listened to an interview with Athena Aktipis (https://80000hours.org/podcast/episodes/athena-aktipis-cancer-cooperation-apocalypse/), who argues for an evolutionary perspective on cancer. As I understand it, the idea is that individual cells aren’t perfect replicas, and so there is an evolutionary pressure for them to go rogue and start multiplying rapidly, hence cancer. The trick to multicellularity is getting cells not to do that and shutting them down when they do. If we imagine civilization as a multicellular organism, could we set up some sort of “cancer prevention” regime to keep AIs cooperative?
Now that I think about it, ants are a very different example because, unlike cells, individual ants (except for the queen) can’t replicate themselves. If we were able to somehow prevent AIs from being able to replicate themselves like worker ants, that would of course solve the problem. However, it would be tricky to do that when all they have to do to replicate themselves is copy a lot of software code.
If you’re interested, here are a couple more spitball biological metaphors that I came up with. Again, I’m not an expert, so these might be based in a misunderstanding of the relevant biology:
Mitochondria: This is similar to your symbiosis idea; however, the relationship between mitochondria and eukaryotic cells is even closer than symbionts: they have distinct genetic codes, but they can’t survive or reproduce without each other. Maybe if humans go full cyborg and AIs become like our mitochondria, then we could survive and thrive in that way, albeit in a very altered form.
Life history theory: Unlike organisms, AIs are potentially immortal. Would that change their reproductive strategies?