Aligned AI does not require moral consensus
A critique of a common critique
“Aligned AI” refers to an artificial intelligence system that is “aligned with human values.” The idea is to ensure that, if we ever develop autonomous AI systems that surpass human capabilities, they don’t do something really bad to us.
And sometimes, I see a response that goes something like:
"If you're trying to align AI with human values, which humans do you mean? People's values differ dramatically across cultures, ideologies, and even individual preferences. There's no universal consensus. So the whole idea of 'alignment' is suspect or incoherent."
I understand the importance of the “which humans?” question, and I agree it deserves an answer, but I strongly disagree that it somehow implies “aligned AI” is an inherently broken concept.
For example, would you rather live in a world with an autonomous, superintelligent AI agent that feels okay hacking secure systems, or one that refuses to hack them?
Most of you are probably thinking, um, probably, one that doesn’t hack?
But maybe some of you are thinking: oh, definitely one that hacks. What if the AI is about to hack us into Utopia! What if it’s finally going to fix the world, getting past all the corrupt governments and evil human tech actors that stand in its way? Why do you think you know better than this superintelligent agent?
Sure, fine.
Would you rather have an autonomous, superintelligent AI agent that feels okay murdering people? Or one that does not think it is acceptable to murder people?
Look, I understand that human values are complicated. Even on the above question, I am under no delusion that I could get every person on the planet to agree on an answer. After all, some people are themselves murderers. And to be less flippant, maybe you have a well-formed reason why the answer is non-obvious. However, I am confident that most people want an AI agent that does not believe murder is acceptable. And I believe most people could give a straightforward explanation for that: a superintelligent AI agent that is willing to murder people seems pretty risky.
When people talk about “aligning AI with human values,” most are not trying to suggest that there is one correct set of human values that we should strive to imbue in all superintelligent, autonomous agents. Rather, people are interested in this topic for the same reason that we have laws, or that dropping atomic bombs is a war crime. Not because humanity has solved what it means to act correctly, but because enough of us have reasoned that it would be a serious problem if anyone could do whatever they wanted. And enough of us have agreed that the sudden death of tens of thousands of people, and the long-term health complications that persist for generations afterward, are huge catastrophes.
Now, there is a stronger version of the “whose values?” critique that’s worth acknowledging. It goes something like: “Okay, sure, maybe we can mostly agree that murder is bad. But what about the more complicated stuff? What if one group’s utopia is another group’s dystopia? Who gets to decide which vision the AI pursues?” And that is a much fairer, yet more complicated question.
The answer, though, isn’t “so let’s give up on alignment.” The answer is: we need procedures. Just as we use democracy, debate, compromise, and law to hash out disputes among humans, we’ll need mechanisms that let the values an AI acts on be aggregated, updated, and adjusted as humanity evolves. Alignment doesn’t mean “hard-code the one true morality.” It means something closer to “don’t let superintelligent autonomous AI run off in ways that ignore or override the messy processes we already rely on to negotiate values.” That matters especially because, if it’s superintelligent, it can outsmart us in ways we won’t expect and slip entirely outside human control.
So, “what values are we even trying to align AI with?” is simply not a strong argument against alignment. If anything, it is a relevant and natural extension of the concept.
Do we need to be able to control autonomous agents in some way? Surely. Do they need to have some rules and values that they adhere to? Surely. Do we know exactly what those should be? Certainly not. But we have some ideas. We at least know that a murderous AI super-genius that has taken over the power grid is likely worse than a non-murderous AI super-genius that has not taken over the power grid.