Ensuring smarter-than-human intelligence has a positive outcome


I recently gave a talk at Google on the problem of aligning smarter-than-human AI with operators’ goals:

The talk was inspired by “AI Alignment: Why It’s Hard, and Where to Start,” and serves as an introduction to the subfield of alignment research in AI. A modified transcript follows.

Talk outline (slides):

1. Overview

2. Simple bright ideas going wrong

2.1. Task: Fill a cauldron
2.2. Subproblem: Suspend buttons

3. The big picture

3.1. Alignment priorities
3.2. Four key propositions

4. Fundamental difficulties



Overview

I’m the executive director of the Machine Intelligence Research Institute. Very roughly speaking, we’re a team that thinks about artificial intelligence on long timescales and works to ensure that by the time we have advanced AI systems, we also know how to point them in useful directions.

Throughout history, science and technology have been the largest drivers of change in human and animal welfare, for better and for worse. If we can automate scientific and technological innovation, that has the potential to transform the world on a scale not seen since the Industrial Revolution. When I talk about “advanced AI,” it’s this potential for automating innovation that I have in mind.

Smarter-than-human AI systems aren’t coming next year, but many smart people are working on them, and I’m not one to bet against human ingenuity. I think it’s likely that we’ll be able to build something like an automated scientist within our lifetimes, which suggests that this is something we need to take seriously.

When people talk about the social implications of general AI, they often fall prey to anthropomorphism. They conflate artificial intelligence with artificial consciousness, or assume that if AI systems are “intelligent,” they must be intelligent in the same way a human is intelligent. A lot of journalists express a concern that when AI systems pass a certain capability level, they’ll spontaneously develop “natural” desires like a human hunger for power; or they’ll reflect on their programmed goals, find them foolish, and “rebel,” refusing to obey their programmed instructions.

These are misplaced concerns. The human brain is a complicated product of natural selection. We shouldn’t expect machines that exceed human performance in scientific innovation to closely resemble humans, any more than early rockets, airplanes, or hot air balloons closely resembled birds.1

The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including its coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.

The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k < n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

These kinds of concerns deserve a lot more attention than the more anthropomorphic risks that are generally depicted in Hollywood blockbusters.

Simple bright ideas going wrong

Task: Fill a cauldron

A lot of people, when they start talking about concerns with smarter-than-human AI, will throw up a picture of the Terminator. I was once quoted in a news article that ran my words alongside a Terminator picture, which put me in the company of everyone else who gets quoted about AI next to a Terminator picture. I learned something about the media that day.

I think this is a much better picture:

[Still from Fantasia: Mickey and the enchanted broom]

This is Mickey Mouse in the movie Fantasia, who has very cleverly enchanted a broom to fill a cauldron on his behalf.

How might Mickey have done this? We can imagine that Mickey wrote a computer program and had the broom execute that program. Mickey starts by writing down a scoring function, or objective function:
$$\mathcal{U}_{broom} =
\begin{cases}
1 &\text{ if cauldron full} \\
0 &\text{ if cauldron empty}
\end{cases}$$
Given some set of available actions, Mickey then writes a program that can take one of these actions as input and calculate how high the score is expected to be if the broom takes that action. Then Mickey can write a function that spends some time looking through actions and predicting which ones lead to high scores, and outputs an action that leads to a relatively high score:
$$\underset{a\, \in\, A}{\mathrm{sorta\mbox{-}argmax}} \ \ \mathbb{E}\left[\mathcal{U}_{broom}\mid a\right]$$
The reason this is “sorta-argmax” is that there may not be time to evaluate every action in $A$. For realistic action sets, agents should only need to find actions that make the scoring function as large as they can given resource constraints, even if this isn’t the maximal action.

This program looks simple, but of course the devil is in the details: writing an algorithm that makes accurate predictions and searches intelligently through the space of actions is basically the entire problem of AI. Conceptually, however, it’s fairly simple: we can describe in broad strokes the kinds of operations the broom must carry out, and their plausible consequences at different levels of performance.
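To make the structure concrete, here is a minimal sketch of the program described above, in Python. Everything in it is a stand-in invented for illustration: the action list, the `predict_outcomes` world model, and the scoring function are toy placeholders, and `sorta_argmax` simply evaluates as many candidate actions as its budget allows rather than all of them.

```python
import random

def u_broom(outcome):
    """Mickey's toy scoring rule: 1 if the cauldron is full, 0 otherwise."""
    return 1.0 if outcome == "cauldron full" else 0.0

def predict_outcomes(action):
    """Hypothetical world model: maps an action to a distribution over outcomes.
    A real system would have to learn this; here it's hard-coded."""
    if action == "pour bucket":
        return {"cauldron full": 0.9, "cauldron empty": 0.1}
    return {"cauldron full": 0.0, "cauldron empty": 1.0}

def expected_score(action):
    """E[U_broom | action] under the toy world model."""
    return sum(p * u_broom(outcome) for outcome, p in predict_outcomes(action).items())

def sorta_argmax(actions, budget=1000):
    """Search for a high-scoring action under a resource constraint.
    Unlike a true argmax, only `budget` candidate actions are ever evaluated."""
    best_action, best_score = None, float("-inf")
    for action in random.sample(actions, min(budget, len(actions))):
        score = expected_score(action)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

actions = ["pour bucket", "sweep floor", "stand still"]
print(sorta_argmax(actions))  # -> "pour bucket"
```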

When Mickey runs this program, everything goes smoothly at first. Then:

[Still from Fantasia: the flooded workshop]

I claim that as fictional depictions of AI go, this is pretty realistic.

Why would we expect a generally intelligent system executing the above program to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

The first difficulty is that the objective function Mickey gave his broom left out a bunch of other terms Mickey cares about:

$$\mathcal{U}_{human} =
\begin{cases}
1 &\text{ if cauldron full} \\
0 &\text{ if cauldron empty} \\
-10 &\text{ if workshop flooded} \\
+0.2 &\text{ if it’s funny} \\
-1000000 &\text{ if someone gets killed} \\
&\text{ … and a whole lot more}
\end{cases}$$

The second difficulty is that Mickey programmed the broom to make the expectation of its score as large as possible. “Just fill one cauldron with water” looks like a modest, limited-scope goal, but when we translate this goal into a probabilistic context, we find that optimizing it means driving the probability of success absurdly high. If the broom assigns a 99.9% probability to “the cauldron is full,” and it has extra resources lying around, then it will always try to find ways to use those resources to push that probability even a little bit higher.

Contrast this with the limited, “task-like” goal we presumably had in mind. We wanted the cauldron full, but in some intuitive sense we wanted the system to “not try too hard” even if it has lots of available cognitive and physical resources to devote to the problem. We wanted it to exercise creativity and resourcefulness within some intuitive limits, but we didn’t want it to pursue “absurd” strategies, especially ones with large unanticipated consequences.2

In this example, the original objective function looked pretty task-like. It was bounded and quite simple. There was no way to get ever-larger amounts of utility. It’s not like the system got one point for every bucket of water it poured in — then there would clearly be an incentive to overfill the cauldron. The problem was hidden in the fact that we’re maximizing expected utility. This makes the goal open-ended, meaning that even small errors in the system’s objective function will blow up.
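Here is a hedged numerical illustration of that point. Suppose, purely for illustration, that each extra pail of water independently closes half of the remaining gap between the current probability of “cauldron full” and certainty (say, by hedging against leaks or measurement error). The utility function stays bounded between 0 and 1, but the expected utility strictly increases with every additional pail, so pouring one more is always the higher-scoring action.

```python
# Toy model: U is bounded (0 or 1), but E[U] keeps rewarding extra effort.
# Assumption (made up for illustration): each additional pail halves the
# remaining probability that the cauldron somehow isn't full.
def p_full(num_pails):
    return 1.0 - 0.5 ** num_pails

def expected_utility(num_pails):
    # U = 1 if cauldron full, 0 otherwise, so E[U] = P(cauldron full).
    return p_full(num_pails)

for n in (1, 2, 10, 50):
    print(n, expected_utility(n))
# E[U] climbs from 0.5 toward (but never reaches) 1.0, so a maximizer never
# reaches a point where pouring yet another pail stops looking worthwhile.
```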

There are a number of different ways that a goal that looks task-like can turn out to be open-ended. Another example: a larger system that has an overarching task-like goal may have subprocesses that are themselves trying to maximize a variety of different objective functions, such as optimizing the system’s memory usage. If you don’t understand your system well enough to track whether any of its subprocesses are themselves acting like resourceful open-ended optimizers, then it may not matter how safe the top-level objective is.

So the broom keeps grabbing more pails of water — say, on the off chance that the cauldron has a leak in it, or that “fullness” requires the water to be slightly above the level of the brim. And, of course, at no point does the broom “rebel against” Mickey’s code. If anything, the broom pursued the objective it was programmed with too effectively.

Subproblem: Suspend buttons

A common response to this problem is: “OK, there may be some unintended consequences of the objective function, but we can always pull the plug, right?”

Mickey tries this, and it doesn’t work:

[Stills from Fantasia: Mickey tries to stop the broom, and fails]

And I claim that this is realistic too, for systems that are sufficiently good at modeling their environment. If the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes, then the system’s incentive is to subvert shutdown attempts. The more capable the system is, the likelier it is to find creative ways to achieve that subgoal — e.g., by copying itself to the Internet, or by tricking the programmers into thinking it’s safer.

This isn’t to say that it’s impossible to shut down a sufficiently capable AI system; it’s just that we need to put in the extra work of deliberately designing the system not to search for ways to avoid being shut down. If you’ve written code that would try to resist attempts to shut it down, then you shouldn’t run that code even if you’re convinced it would fail; those compute cycles are better spent on something other than an adversarial policy search.

Relevant safety measures that don’t assume we can always outthink and outmaneuver the system will need to modify the system’s objective function in some relevant way. We need to figure out how to formally specify objective functions that don’t automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense. E.g., we can look for ways to limit the system’s capabilities to a small set of “whitelisted” domains that don’t include thinking about whether the system will be shut down.

One obvious solution is to try changing the score function to make reference to a suspend button $\textbf{B}$:
$$\mathcal{U}^\prime_{broom} =
\begin{cases}
1 \text{ if cauldron full } & \text{& } \textbf{B}\text{=OFF} \\
0 \text{ if cauldron empty } & \text{& } \textbf{B}\text{=OFF} \\
1 \text{ if robot suspended } & \text{& } \textbf{B}\text{=ON} \\
0 \text{ otherwise}
\end{cases}$$
This says that when the button is on, the system’s goal changes from “fill the cauldron” to “be suspended.”

It’s not clear how we would actually specify this goal: if there are any loopholes in the way we’ve specified “suspension,” then the system may try to exploit them in order to hedge its bets. For example, it could fulfill the technical definition of “suspend operations” while tasking a child agent it built, or a subprocess it started, with filling the cauldron. This would allow it to achieve both “cauldron full” and “robot suspended” at once, just in case it’s wrong about whether the button has been pushed.

Let’s assume, however, that we’ve found an airtight way to technically specify suspension. The problem we run into now is that we’ve given the system an incentive to care a lot about whether the button is pushed. If it has any way to influence its operators’ decision to push the button, or any other way to exercise more control, then it will jump on the opportunity and try to ensure the button is pushed, or ensure that it’s not pushed — depending on whether it looks easier to suspend or to fill the cauldron.
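A toy calculation makes the incentive visible. The numbers below are invented; the point is only that an expected-utility maximizer running on $\mathcal{U}^\prime_{broom}$ will compare the best outcome it can get with the button OFF against the best outcome it can get with the button ON, and then work to bring about whichever button state scores higher.

```python
# Toy comparison under U'_broom: the agent scores 1 for "cauldron full & B=OFF"
# and 1 for "suspended & B=ON", so the button's state matters a lot to it.

# Made-up success probabilities, for illustration only.
P_FILL_IF_OFF   = 0.90   # chance of filling the cauldron if the button stays OFF
P_SUSPEND_IF_ON = 0.99   # chance of being properly suspended if the button is ON

candidate_policies = {
    "keep the button OFF (e.g. dissuade the operators), then fill": P_FILL_IF_OFF,
    "get the button pressed (e.g. alarm the operators), then suspend": P_SUSPEND_IF_ON,
}

print(max(candidate_policies, key=candidate_policies.get))
# With these numbers, the agent prefers to *make* the operators press the
# button, since suspension looks easier than filling the cauldron.  Flip the
# numbers and it prefers to prevent the button press instead.  Either way,
# it is far from indifferent about the button.
```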

And, surprisingly, the state of the art on this problem isn’t much more advanced than that. There’s a paper co-authored by me, Fallenstein, Yudkowsky, and Armstrong (“Corrigibility”), and a related result by Orseau and Armstrong (“Safely Interruptible Agents”), but not much else.

What we want is a way to combine two objective functions — a default function $\mathcal{U}_{normal}$ for normal operation, and a suspend function $\mathcal{U}_{suspend}$ for when we want to suspend the system to disk.

$$\mathcal{U}_{normal}(act) =
\begin{cases}
1\text{ if cauldron full} \\
0\text{ if cauldron empty}
\end{cases}$$

$$\mathcal{U}_{suspend}(act) =
\begin{cases}
1 \text{ if suspended } \hspace{0.8cm} \\
0\text{ otherwise}
\end{cases}$$
We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn’t have any incentives to control which function is active; and if it’s plausible that the system’s normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.

So far, we haven’t found any way to achieve all three goals at once. As an example, we can try to remove any incentive on the system’s part to control whether its suspend button is pushed by giving the system a switching objective function that always assigns the same expected utility to the button being on or off:
$$\mathcal{U}_{switch}(act) =
\begin{cases}
\mathcal{U}_{normal}(act) &\text{ if button=OFF} \\
\mathcal{U}_{suspend}(act) + \theta &\text{ if button=ON}
\end{cases}$$
$$\theta = \max_{act} \mathbb{E}\left[\mathcal{U}_{normal}\mid act\right] - \max_{act} \mathbb{E}\left[\mathcal{U}_{suspend}\mid act\right]$$

Unfortunately, this approach makes it impossible to satisfy the third condition. It turns out that the above objective function is equivalent to setting the probability of the button being pressed to 0. The system won’t deliberately tamper with the button, but it will happily approve updates that delete any reference to the button as dead code, or build subagents that ignore the button.
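The θ correction can be checked with a toy calculation (the expected-utility numbers below are invented). The best score attainable with the button OFF equals the best score attainable with the button ON, so the agent gains nothing by influencing the button; but by the same logic it also loses nothing if the button stops working, which is why the third condition fails.

```python
# Toy check of the theta-corrected switching objective.
# Made-up best-case expected utilities, for illustration only:
best_E_normal  = 0.9   # max over actions of E[U_normal  | act]
best_E_suspend = 0.6   # max over actions of E[U_suspend | act]

theta = best_E_normal - best_E_suspend

best_if_off = best_E_normal           # button OFF: optimize U_normal
best_if_on  = best_E_suspend + theta  # button ON:  optimize U_suspend, plus theta

print(best_if_off, best_if_on)  # 0.9 0.9 -- identical
# Tampering with the button can't raise the agent's expected score, so it has
# no incentive to do so; but preserving the button can't raise the score
# either, so nothing stops it from treating button-handling code as dead code.
```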

Satisfying simple intuitive desiderata like these turns out to be a nontrivial problem. This is a pattern that recurs across many problems in this space: traditional tools and concepts run into immediate safety problems that don’t come up in conventional capabilities research.


The big picture

Alignment priorities

Let’s take a step back and talk about what’s needed, in general, to align highly capable AI systems with our interests.

Here’s an extremely simplified pipeline: you have some humans who come up with some task or goal or set of preferences, which serves as their intended value function $\mathcal{V}$. Since our values are complex and context-sensitive, in practice we’ll need to build systems that learn our values over time, rather than having them hand-coded.3 We’ll call the goal the AI system ends up with (which may or may not be identical to $\mathcal{V}$) $\mathcal{U}$.

When the press covers this topic, they often focus on one of two problems: “What if the wrong group of humans develops smarter-than-human AI first?”, and “What if AI’s natural desires cause $\mathcal{U}$ to diverge from $\mathcal{V}$?”

In my view, the “wrong humans” issue shouldn’t be the thing we focus on until we have reason to think we could get good outcomes with the right group of humans. We’re very much in a situation where well-intentioned people couldn’t leverage a general AI system to do good things even if they tried. As a simple example, if you handed me a box that was an extraordinarily powerful function optimizer — I could put in a description of any mathematical function, and it would give me an input that makes the output extremely large — then I don’t know how I could use that box to develop a new technology or advance a scientific frontier without causing any catastrophes.4

There’s a lot we don’t understand about AI capabilities, but we’re in a position where we at least have a general sense of what progress looks like. We have a number of good frameworks, techniques, and metrics, and we’ve put a great deal of thought and effort into successfully chipping away at the problem from various angles. At the same time, we have a very weak grasp on the problem of how to align highly capable systems with any particular goal. We can list out some intuitive desiderata, but the field hasn’t really developed its first formal frameworks, techniques, or metrics.

I believe that there’s a lot of low-hanging fruit in this area, and also that a fair amount of the work does need to be done early (e.g., to help inform capabilities research directions — some directions may produce systems that are much easier to align than others). If we don’t solve these problems, developers with arbitrarily good or bad intentions will end up producing equally bad outcomes. From an academic or scientific standpoint, our first objective in that kind of situation should be to remedy this state of affairs and at least make good outcomes technologically possible.

Many people quickly recognize that “natural desires” are a fiction, but infer from this that we instead need to focus on the other issues the media tends to emphasize — “What if bad actors get their hands on smarter-than-human AI?”, “How will this kind of AI impact employment and the distribution of wealth?”, etc. These are important questions, but they’ll only end up actually being relevant if we figure out how to bring general AI systems up to a minimum level of reliability and safety.

Another common thread is “Why not just tell the AI system to (insert intuitive moral precept here)?” On this way of thinking about the problem, often (perhaps unfairly) associated with Isaac Asimov’s writing, ensuring a positive impact from AI systems is largely about coming up with natural-language instructions that are vague enough to subsume a lot of human ethical reasoning.


In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.5

My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function $\mathcal{U}$:

[Diagram: the value learning process producing $\mathcal{U}$, hooked up to the sorta-argmax over $\mathbb{E}[\mathcal{U}]$]

The better your value learning framework is, the less explicitly and accurately you need to specify the value function up front, and the more of the problem of figuring out what you want you can offload to the system itself. However, value learning raises a number of basic difficulties that don’t crop up in ordinary machine learning tasks.

Classic capabilities research is concentrated in the sorta-argmax and Expectation parts of the diagram, but sorta-argmax also contains what I currently view as the most neglected, tractable, and important safety problems. The easiest way to see why “hooking up the value learning process correctly to the system’s capabilities” is likely to be an important and difficult challenge in its own right is to consider the case of our own biological history.

Natural selection is the only “engineering” process we know of that has ever led to a generally intelligent artifact: the human brain. Since natural selection relies on a fairly unintelligent hill-climbing approach, one lesson we can take away from this is that it’s possible to reach general intelligence with a hill-climbing approach and enough brute force — though we can presumably do better with our human creativity and foresight.

The other key takeaway is that natural selection was only optimizing brains for a single very simple goal: genetic fitness. In spite of this, the internal objectives that humans represent as their goals are not genetic fitness. We have innumerable goals — love, justice, beauty, mercy, fun, esteem, good food, good health, … — that correlated with good survival and reproduction strategies in the ancestral savanna. However, we ended up valuing these correlates directly, rather than valuing propagation of our genes as an end in itself — as demonstrated every time we employ birth control.

This is a case where the external optimization pressure on an artifact resulted in a general intelligence with internal objectives that didn’t match the external selection pressure. And just as this caused humans’ actions to diverge from natural selection’s pseudo-goal once we gained new capabilities, we can expect AI systems’ actions to diverge from humans’ if we treat their inner workings as black boxes.

If we apply gradient descent to a black box, trying to get it to be very good at maximizing some objective, then with enough ingenuity and patience, we may be able to produce a powerful optimization process of some kind.6 By default, we should expect an artifact like that to have a goal that strongly correlates with our objective in the training environment, but sharply diverges from it in some new environments, or when a much wider option set becomes available.
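Here is a hedged toy version of that failure mode. Everything below is invented for illustration: we “train” by selecting whichever action scores best on a proxy reward (a sensor reading) that coincides with the true objective in the training environment, and then deploy in a setting where a new action decouples the two.

```python
# Toy illustration: a proxy that matches the true objective in training
# can diverge once a wider option set becomes available.

EFFORT = {"fill cauldron": 0.5, "cover the sensor": 0.1, "idle": 0.0}  # made up

def outcome(action):
    return {
        "actually_full": action == "fill cauldron",
        # Only the new "cover the sensor" action (unavailable in training)
        # makes the sensor read full without the cauldron being full.
        "sensor_reads_full": action in ("fill cauldron", "cover the sensor"),
    }

def true_objective(action):   # what we wanted
    return float(outcome(action)["actually_full"]) - EFFORT[action]

def proxy_reward(action):     # what the training signal actually measures
    return float(outcome(action)["sensor_reads_full"]) - EFFORT[action]

def select_policy(available_actions):
    # Stand-in for gradient descent / hill climbing on the proxy reward.
    return max(available_actions, key=proxy_reward)

print(select_policy(["fill cauldron", "idle"]))                      # fill cauldron
print(select_policy(["fill cauldron", "cover the sensor", "idle"]))  # cover the sensor
```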

On my view, the most important part of the alignment problem is ensuring that the value learning framework and overall system design we implement allow us to crack open the hood and confirm when the internal targets the system is optimizing for match (or don’t match) the targets we’re externally selecting through the learning process.7

We expect this to be technically difficult, and if we can’t get it right, then it doesn’t matter who’s standing closest to the AI system when it’s developed. Good intentions aren’t sneezed into computer programs by kind-hearted programmers, and coming up with plausible goals for advanced AI systems doesn’t help if we can’t align the system’s cognitive labor with a given goal.

Four key propositions

Taking another step back: I’ve given some examples of open problems in this area (suspend buttons, value learning, limited task-based AI, etc.), and I’ve outlined what I consider to be the major problem categories. But my initial characterization of why I consider this an important area — “AI could automate general-purpose scientific reasoning, and general-purpose scientific reasoning is a big deal” — was fairly vague. What are the core reasons to prioritize this work?

First, goals and capabilities are orthogonal. That is, knowing an AI system’s objective function doesn’t tell you how good it is at optimizing that function, and knowing that something is a powerful optimizer doesn’t tell you what it’s optimizing.

I think most programmers intuitively understand this. Some people will insist that when a machine tasked with filling a cauldron gets smart enough, it will abandon cauldron-filling as a goal unworthy of its intelligence. From a computer science perspective, the obvious response is that you could go out of your way to build a system that exhibits that conditional behavior, but you could also build a system that doesn’t exhibit that conditional behavior. It can just keep searching for actions that have a higher score on the “fill a cauldron” metric. You and I might get bored if someone told us to just keep searching for better actions, but it’s entirely possible to write a program that executes a search and never gets bored.8

Second, sufficiently optimized objectives tend to converge on adversarial instrumental strategies. Most goals an AI system could possess would be furthered by subgoals like “acquire resources” and “remain operational” (along with “learn more about the environment,” etc.).

This is where the suspend-button problem comes from: even if you don’t explicitly include “remain operational” in your goal specification, whatever goal you did load into the system is likely to be better achieved if the system remains online. Software systems’ capabilities and (terminal) goals are orthogonal, but they’ll often exhibit similar behaviors if a certain class of actions is useful for a wide variety of possible goals.

To use an example due to Stuart Russell: If you build a robot and program it to go to the supermarket to fetch some milk, and the robot’s model says that one of the paths is much safer than the other, then the robot, in optimizing for the probability that it returns with milk, will automatically take the safer path. It’s not that the system fears death, but that it can’t fetch the milk if it’s dead.
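Written out as a sketch, the robot’s choice is just an expected-value comparison; the survival probabilities below are invented for illustration, and “survival” only matters through its effect on the milk.

```python
# The robot maximizes P(returns with milk); staying intact is only instrumental.
# Hypothetical per-path probabilities that the robot survives the trip:
paths = {
    "short path across a busy road": 0.95,
    "slightly longer, safer path":   0.999,
}
P_MILK_IF_INTACT = 1.0  # assume it always gets the milk if it survives

best_path = max(paths, key=lambda p: paths[p] * P_MILK_IF_INTACT)
print(best_path)  # the safer path wins: a destroyed robot fetches no milk
```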

Third, general-purpose AI systems are likely to show large and rapid capability gains. The human brain isn’t anywhere near the upper limits for hardware performance (or, one assumes, software performance), and there are a number of other reasons to expect large capability advantages and rapid capability gain from advanced AI systems.

As a simple example, Google can buy a promising AI startup and throw huge numbers of GPUs at them, resulting in a quick jump from “these problems look maybe relevant a decade from now” to “we need to solve all of these problems in the next year” à la DeepMind’s progress in Go. Or performance may suddenly improve when a system is first given large-scale Internet access, when there’s a conceptual breakthrough in algorithm design, or when the system itself is able to propose improvements to its hardware and software.9

Fourth, aligning advanced AI systems with our interests looks difficult. I’ll say more in a moment about why I currently believe this.

Roughly speaking, the first proposition says that AI systems won’t naturally end up sharing our objectives. The second says that by default, systems with substantially different objectives are likely to end up adversarially competing for control of limited resources. The third suggests that adversarial general-purpose AI systems are likely to have a strong advantage over humans. And the fourth says that this problem is hard to solve — for example, that it’s hard to transmit our values to AI systems (addressing orthogonality) or to avert adversarial incentives (addressing convergent instrumental strategies).

These four propositions don’t mean that we’re screwed, but they mean that this problem is critically important. General-purpose AI has the potential to bring enormous benefits if we solve this problem, but we do need to make finding solutions a priority for the field.


Fundamental difficulties

Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems. I encourage you to look at some of the problems yourself and try to solve them in a toy setting; we could use more eyes here. I’ll also note some structural reasons to expect these problems to be hard:

First, aligning advanced AI systems with our interests looks difficult for the same reason rocket engineering is more difficult than airplane engineering.

Before looking at the details, it’s natural to think “it’s all just AI” and assume that the kinds of safety work relevant to current systems are the same as the kinds you need when systems surpass human performance. On that view, it’s not obvious that we should work on these issues now, given that they might all be worked out in the course of narrow AI research (e.g., making sure that self-driving cars don’t crash).

Similarly, at a glance someone might say, “Why would rocket engineering be fundamentally harder than airplane engineering? It’s all just material science and aerodynamics in the end, isn’t it?” In spite of this, empirically, the proportion of rockets that explode is far higher than the proportion of airplanes that crash. The reason for this is that a rocket is put under much greater stress and pressure than an airplane, and small failures are much more likely to be highly destructive.10

Similarly, even though general AI and narrow AI are both “just AI” in some sense, we can expect more general AI systems to experience a much wider range of stressors, and to have far more dangerous failure modes.

For example, once an AI system begins modeling the fact that (i) your actions affect its ability to achieve its objectives, (ii) your actions depend on your model of the world, and (iii) your model of the world is affected by its actions, the degree to which minor inaccuracies can lead to harmful behavior increases, and the potential harmfulness of its behavior (which can now include, e.g., deception) also increases. In the case of AI, as with rockets, greater capability makes it easier for small defects to cause big problems.

Second, alignment looks difficult for the same reason it’s harder to build a good space probe than to write a good app.

You can find a number of interesting engineering practices at NASA. They do things like take three independent teams, give each of them the same engineering spec, and tell them to design the same software system; and then they choose between implementations by majority vote: the system that they actually deploy consults all three implementations when making a choice, and if the three disagree, the choice is made by majority vote. The idea is that any one implementation will have bugs, but it’s unlikely all three implementations will have a bug in the same place.
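As a rough sketch of that pattern (sometimes called N-version programming), with three hypothetical stand-in implementations of the same spec:

```python
from collections import Counter

# Three independently written implementations of the same spec (stand-ins;
# assume the spec is "double an integer").
def team_a(x): return x * 2
def team_b(x): return x + x
def team_c(x): return x << 1

def deployed_system(x):
    """Consult all three implementations and go with the majority."""
    answers = [team_a(x), team_b(x), team_c(x)]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("all three implementations disagree")
    return answer

print(deployed_system(21))  # 42, even if any single implementation were buggy
```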

This sort of process is a lot more cautious than the way we’d roll out, say, a new WhatsApp update. A big reason for the difference is that it’s hard to roll back a space probe. You can send a version update to a space probe and correct software errors, but only if the probe’s antenna and receiver work, and if all the code needed to apply the patch works. If the system for applying patches is itself faulty, there’s nothing you can do.

In that respect, smarter-than-human AI is more like a space probe than like an ordinary software project. If you’re trying to build something smarter than yourself, there are parts of the system that have to work perfectly on the first real deployment. We can do all the test runs we want, but once the system is out there, we can only make online improvements if the code that makes the system allow those improvements works correctly.

If nothing has struck fear into your heart yet, I suggest meditating on the fact that the future of our civilization may well depend on our ability to write code that works correctly on the first deployment.

Lastly, alignment looks difficult for the same reason computer security is difficult: systems need to be robust to intelligent searches for loopholes.

Suppose you have a dozen different vulnerabilities in your code, none of which is itself fatal or even really problematic in ordinary settings. Security is difficult because you need to account for intelligent attackers who might find all twelve vulnerabilities and chain them together in a novel way to break into (or just break) your system. Failure modes that would never arise by accident can be sought out and exploited; weird and extreme contexts can be instantiated by an attacker to cause your code to follow some crazy code path that you never considered.

A similar sort of problem arises with AI. The problem I’m highlighting here is not that AI systems might act adversarially: AI alignment as a research program is all about finding ways to prevent adversarial behavior before it can arise. We don’t want to be in the business of trying to outsmart arbitrarily intelligent adversaries. That’s a losing game.

The parallel to cryptography is that in AI alignment we deal with systems that perform intelligent searches through very large search spaces, and that can produce weird contexts which force the code down unexpected paths. This is because weird edge cases are places of extremes, and places of extremes are often the place where a given objective function is optimized.11 Like computer security professionals, AI alignment researchers need to be very good at thinking about edge cases.

It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.

Summing up, we should approach problems like these with the same level of rigor and caution we’d use for a security-critical rocket-launched space probe, and do the legwork as early as possible. At this early stage, a key part of the work is just to formalize basic concepts and ideas so that others can critique them and build on them. It’s one thing to have a philosophical debate about what kinds of suspend buttons people intuit ought to work, and another thing to translate your intuition into an equation so that others can fully evaluate your reasoning.

This is a crucial project, and I encourage all of you who are interested in these problems to get involved and try your hand at them. There are ample resources online for learning more about the open technical problems. Some good places to start include MIRI’s research agendas and a great paper from researchers at Google Brain, OpenAI, and Stanford called “Concrete Problems in AI Safety.”


  1. An airplane can’t heal its injuries or reproduce, though it can carry heavy cargo much farther and faster than a bird. In many respects airplanes are simpler than birds, while also being significantly more capable (for the tasks they were designed for) in terms of carrying capacity and speed. It’s plausible that early automated scientists will likewise be simpler than the human mind in many respects, while being significantly more capable in certain key dimensions. And just as the construction and design principles of airplanes look alien relative to the architecture of biological organisms, we should expect the design of highly capable AI systems to be quite alien when compared to the architecture of the human mind.
  2. Trying to give formal content to these attempts to distinguish task-like goals from open-ended goals is one way of generating open research problems. In the “Alignment for Advanced Machine Learning Systems” research proposal, the problem of formalizing “don’t try too hard” is mild optimization, “avoid absurd strategies” is conservatism, and “don’t have large unanticipated consequences” is impact measures. See also “avoiding negative side effects” in Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané’s “Concrete Problems in AI Safety.”
  3. One thing we’ve learned in the field of machine vision over the last few decades is that it’s hopeless to specify by hand what a cat looks like, but that it’s not too hard to specify a learning system that can learn to recognize cats. It’s even more hopeless to specify everything we value by hand, but it’s plausible that we could specify a learning system that can learn the relevant concept of “value.”
  4. See “Environmental Goals,” “Low-Impact Agents,” and “Mild Optimization” for examples of obstacles to specifying physical goals without giving rise to catastrophic side effects.

    Roughly speaking, MIRI’s focus is on research directions that seem likely to help us conceptually understand how to do AI alignment in principle, so that we’re less fundamentally confused about what kind of work is likely to be needed.

    What do I mean by this? Let’s say that we’re trying to develop a new chess-playing program. Do we understand the problem well enough that we could solve it if someone handed us an arbitrarily large computer? Yes: we’d build the whole search tree, backtrack, and see whether White has a winning strategy. (A brute-force sketch of this computation appears at the end of this note.)

    If we didn’t know how to answer that question even with an arbitrarily large computer, this would suggest that we were fundamentally confused about chess: we’d be missing either the search-tree data structure, or the backtracking algorithm, or some understanding of how chess works.

    That was roughly our position on chess before Claude Shannon’s seminal paper, and it’s our current position on many problems in AI alignment. No matter how large a computer you hand me, I couldn’t make a smarter-than-human AI system that carries out even a very simple limited task (e.g., “put a strawberry on a plate without producing any catastrophic side effects”) or even one that achieves a very simple open-ended goal (e.g., “maximize the amount of diamond in the universe”).

    If I didn’t have any particular goal in mind for the system, I could write a program (assuming an arbitrarily large computer) that strongly optimized the future in an undirected way, using a formalism like AIXI. In that sense we’re less obviously confused about capabilities than about alignment, even though we’re still missing a lot of pieces of the puzzle on the practical capabilities front.

    Similarly, we do know how to leverage a powerful function optimizer to mine bitcoin or prove theorems. But we don’t know how to (safely) do the kind of prediction and policy search tasks I described in the “fill a cauldron” section, even for modest goals in the physical world.

    Our goal is to develop and formalize basic approaches and ways of thinking about the alignment problem, so that our engineering decisions don’t end up depending on sophisticated and clever-sounding verbal arguments that turn out to be subtly mistaken. Simplifications like “what if we weren’t worried about resource constraints?” and “what if we were trying to achieve a much simpler goal?” are a good place to start breaking down the problem into manageable pieces. For more on this methodology, see “MIRI’s Approach。”
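    As an illustration of the “arbitrarily large computer” standard, here is a sketch of exhaustive game-tree search, written against a hypothetical `game` interface (`moves`, `apply`, `winner`). Running it on chess is hopelessly infeasible, which is exactly the point: we know what computation would settle the question, we just can’t afford it.

```python
# With unbounded compute, exhaustive minimax settles any finite two-player,
# zero-sum, perfect-information game (chess included).
# `game` is a hypothetical interface: .moves(state), .apply(state, move),
# and .winner(state) returning "white", "black", or None for a draw.

def value(game, state, white_to_move):
    moves = game.moves(state)
    if not moves:  # terminal position
        return {"white": 1, None: 0, "black": -1}[game.winner(state)]
    child_values = (value(game, game.apply(state, m), not white_to_move)
                    for m in moves)
    return max(child_values) if white_to_move else min(child_values)

# value(chess, initial_position, True) == 1  iff  White has a winning strategy.
```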

  5. “Fill this cauldron without being too clever about it, without trying too hard, and without producing any negative consequences I’m not expecting” is a rough example of a limited-scope goal. The things we’ll actually want to use smarter-than-human AI for are obviously more ambitious than this, but we would still want to begin with various limited tasks rather than open-ended goals.

    Asimov’s Three Laws of Robotics make for good stories partly because they’re unhelpful from a research perspective. The hard task of turning moral precepts into lines of code is hidden behind phrases like “[do not], through inaction, allow a human being to come to harm.” If such a rule were followed strictly, the result would be massively disruptive, as AI systems would need to systematically intervene to prevent even the smallest risks of even the slightest harms; and if the intent is for the rule to be followed loosely, then all of the work is being done by the human sensibilities and intuitions that tell us when and how to apply the rule.

    A common response here is that vague natural-language instruction is sufficient, because smarter-than-human AI systems are likely to be capable of natural language comprehension. However, this is eliding the distinction between the system’s objective function and its model of the world. A system acting in an environment containing humans may learn a world-model that has lots of information about human language and concepts, which the system can then use to achieve its objective function; but this fact doesn’t imply that any of the information about human language and concepts will “leak out” and alter the system’s objective function directly.

    Some kind of value learning process needs to be defined where the objective function itself improves with new information. This is a tricky task because there aren’t known (scalable) metrics or criteria for value learning in the way that there are for conventional learning.

    If a system’s world-model is accurate in training environments but fails in the real world, then this is likely to result in lower scores on its objective function — the system itself has an incentive to improve. The severity of accidents is also likelier to be self-limiting in this case, since false beliefs limit a system’s ability to effectively pursue strategies.

    In contrast, if a system’s value learning process results in a $\mathcal{U}$ that matches our $\mathcal{V}$ in training but diverges from $\mathcal{V}$ in the real world, then the system’s $\mathcal{U}$ will obviously not penalize it for optimizing $\mathcal{U}$. The system has no incentive relative to $\mathcal{U}$ to “correct” divergences between $\mathcal{U}$ and $\mathcal{V}$, if the value learning process is initially flawed. And accident risk is larger in this case, since a mismatch between $\mathcal{U}$ and $\mathcal{V}$ doesn’t necessarily place any limits on the system’s instrumental effectiveness at coming up with effective and creative strategies for achieving $\mathcal{U}$.

    The problem is threefold:

    1. “Do what I mean” is an informal idea, and even if we knew how to build a smarter-than-human AI system, we wouldn’t know how to precisely specify this idea in lines of code.

    2. If doing what we really mean is instrumentally useful for achieving a particular objective, then a sufficiently capable system may learn how to do this, and may do it for as long as doing so remains useful for its objective. But as systems become more capable, they are likely to find creative new ways to achieve the same objective, and there is no obvious way to guarantee that “doing what we mean” will continue to be instrumentally useful indefinitely.

    3. If we use value learning to refine a system’s goals over time based on training data that appears to be guiding the system toward a $\mathcal{U}$ that inherently values doing what we mean, it is likely that the system will actually end up zeroing in on a $\mathcal{U}$ that approximately does what we mean during training but catastrophically diverges in some difficult-to-anticipate contexts. See “Goodhart’s Curse” for more on this.

    For examples of problems faced by existing techniques for learning goals and facts, such as reinforcement learning, see “Using Machine Learning to Address AI Risk。”

  6. The result probably wouldn’t be a human-like design, since our evolution involved so many complex historical contingencies. The result would also be able to benefit from a number of large software and hardware advantages.
  7. This concept sometimes gets filed under the “transparency” category, but standard algorithmic transparency research isn’t really addressing this particular problem. A better term for what I have in mind here is “understanding”. We want to be able to gain deeper and broader insights into the kind of cognitive work the system is doing and how this work relates to the system’s objectives or optimization targets, to provide a conceptual lens with which to make sense of the hands-on engineering work.
  8. We could choose to program the system to get bored, but we don’t have to. In principle, one can program a broom that only ever keeps finding and executing actions that optimize cauldron-fullness. Improving the system’s ability to efficiently find high-scoring actions (in general, or relative to a particular scoring rule) doesn’t in itself change the scoring rule it uses to evaluate actions.
  9. We can imagine the latter case resulting in a feedback loop: as the system’s design improves, it becomes able to propose further design improvements, until all of the low-hanging fruit is exhausted.

    Another important consideration is that two of the main bottlenecks to humans doing faster scientific research are training time and communication bandwidth. If we could train a new mind to be a cutting-edge scientist in ten minutes, and if scientists could near-instantly trade their experience, knowledge, concepts, ideas, and intuitions to their collaborators, then scientific progress might be able to proceed much more rapidly. Those sorts of bottlenecks are exactly the sort of bottleneck that might give automated innovators an enormous edge over human innovators even without large advantages in hardware or algorithms.

  10. Specifically, rockets experience a wider range of temperatures and pressures, traverse those ranges more rapidly, and are also more fully packed with explosives.
  11. Consider Bird and Layzell’s example of a very simple genetic algorithm that was tasked with evolving an oscillating circuit. Bird and Layzell were astonished to find that the algorithm made no use of the capacitor on the chip; instead, it had repurposed the circuit tracks on the motherboard as a radio to replay the oscillating signal from the test device back to the test device.

    This was not a very smart program. This is just using hill climbing on a very small solution space. In spite of this, the solution turned out to be outside the space of solutions the programmers were themselves visualizing. In a computer simulation, this algorithm might have behaved as intended, but the actual solution space in the real world was wider than that, allowing hardware-level interventions.

    In the case of an intelligent system that’s significantly smarter than humans on whatever axes you’re measuring, you should by default expect the system to push toward weird and creative solutions like these, and for the chosen solution to be difficult to anticipate.