AI Safety Without Trusting AI
To ensure AI safety, we’ll need advanced AI capabilities. But how can we trust entities smarter than us? (Spoiler: We don’t have to.)
The challenge of AI safety has long been framed as an us-vs.-them problem:1
Yoshua Bengio asks, “what if we build a superintelligent AI, and what if it has goals that are dangerous to humanity?”2
Geoffrey Hinton warns that “if a digital super-intelligence ever wanted to take control it is unlikely that we could stop it.”3
Marvin Minsky said, “Once the computers got control, we might never get it back. We would survive at their sufferance.”4
Alan Turing shared this view: “it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers…we should have to expect the machines to take control”5
A common view holds that superintelligent AIs will have vast, unique value and must be aligned with human purposes lest they be a threat to our existence. But solving the technical problem of superintelligent-AI alignment is neither necessary nor sufficient:
A solution isn’t necessary for any practical purpose because the supposed value of superintelligent AI entities (“AIs”6) is undercut by other options.
A solution to the technical problem isn’t sufficient for safety because humans who deploy AI might be incompetent, irresponsible, or malicious.
Consider how high-level AI capabilities could be applied to large, consequential tasks:
Proposing plans is a task for generative AI models that compete to develop brilliant proposals and critiques.
Choosing plans is a task for humans advised by AI systems that compete to identify problems and discuss options.
Executing plans is a set of incremental tasks for specialized AI agents that compete to provide reliable results on budget and within schedule.
Overseeing plans (e.g., budgets, milestones, regulatory compliance) is a task for specialized AI systems that monitor, assess, and report to humans. Plans can be updated in response.7
This approach — the “AI Agency Architecture”8 — scales to superintelligent-level capabilities, yet neither the parts nor the whole amounts to “a superintelligent AI”. Agency architectures follow patterns seen in today’s superhuman entities — human organizations — and incremental upgrades of organizations with AI tend to move in this direction.
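To make the division of roles concrete, here is a minimal, purely illustrative sketch in Python. The role names, function signatures, and the trivial decision rule are assumptions invented for this sketch, not part of the architecture described above; the point is only that proposing, choosing, executing, and overseeing can be separate components with narrow interfaces, with humans in the choice loop.

```python
# Illustrative sketch only: a toy division of roles in which no single component
# is a general-purpose agent. All names, signatures, and the placeholder decision
# rule are invented for illustration.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Plan:
    description: str
    critiques: List[str] = field(default_factory=list)

def propose_plans(task: str, proposers: List[Callable[[str], str]]) -> List[Plan]:
    """Competing generative models propose alternative plans; they have no execution authority."""
    return [Plan(description=propose(task)) for propose in proposers]

def critique_plans(plans: List[Plan], critics: List[Callable[[str], str]]) -> None:
    """Diverse critics attach independent assessments to every proposal."""
    for plan in plans:
        plan.critiques.extend(critic(plan.description) for critic in critics)

def human_choice(plans: List[Plan]) -> Plan:
    """Humans choose among critiqued proposals; a placeholder rule stands in for that judgment."""
    return min(plans, key=lambda p: len(p.critiques))

def execute_plan(plan: Plan) -> List[str]:
    """Specialized, task-limited agents carry out incremental steps of the chosen plan."""
    return [f"step completed for: {plan.description}"]

def oversee(results: List[str]) -> str:
    """Monitoring systems report to humans, who can then revise the plan."""
    return f"{len(results)} step(s) reported for human review"

if __name__ == "__main__":
    task = "reduce data-center energy use"
    proposers = [lambda t: f"plan A for {t}", lambda t: f"plan B for {t}"]
    critics = [lambda d: f"cost concern about '{d}'"]
    plans = propose_plans(task, proposers)
    critique_plans(plans, critics)
    chosen = human_choice(plans)
    print(oversee(execute_plan(chosen)))
```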
But what if “the AIs” collude against us?
Long-standing tradition mistakenly frames AI capabilities as properties of AI agents,9 and then imagines that the future situation will be “us vs. them” (the robot uprising, etc.). But this scenario typically assumes that weak humans will confront a cohesive, supercapable “them” — that all the smart AI systems will collude against us.
What conditions would make collusion a losing strategy for even the most self-interested superintelligent AI?10 Let’s take a perspective inspired by elementary game theory:
The following is adapted from an FHI Technical Report, Reframing Superintelligence: Comprehensive AI Services as General Intelligence, Section 20.
Note that “oracle”, a term introduced by FHI’s founder, Nick Bostrom, means “non-agentic question-answering AI”. Oracles in this sense would be integral to AI agencies as non-agentic systems designed to provide trustworthy answers to critical questions.
Collusion among superintelligent oracles can readily be avoided
Because collusion among AI question-answering systems can readily be avoided, there is no obstacle to applying superintelligent-level AI resources to problems that include AI safety.
20.1 Summary
The difficulty of establishing successful collusion among actors tends to increase as their capabilities, knowledge, situations, and roles become more diverse and adversarial (think auditors, competitors, specialists, red-teams…), and increasing the number of actors can make collusive cooperation more difficult. In the context of AI systems (even more so than among human beings), these conditions can be readily implemented and are attractive for pragmatic reasons. Arguments that, absent preexisting alignment, high-level AI systems will inevitably collude are ill-founded. Instead, we should expect that interactions among multiple superintelligent-level systems can be applied to suppress deception and reduce risk by leveraging imperfect alignment achieved at the level of individual systems.
20.2 Trustworthiness can be an emergent property
Prospects for solving AI-safety problems would be greatly improved if we could safely apply superintelligent-level question-answering resources (“oracles”, or more generally, “systems that provide information in response to prompts”) to solving those problems. A familiar objection dismisses this potentially powerful approach as unsafe in itself, arguing that, absent solutions to difficult problems, individual superintelligent-level systems would be untrustworthy, and that attempts to establish checks and balances among multiple systems (for example, through superintelligent-level evaluation of potentially deceptive answers to questions) would inevitably be thwarted by collusive cooperation.
Identifying robust strategies for ensuring non-collusion among superintelligent question-answering systems would overcome this objection, inviting exploration of superintelligence-enabled strategies for managing potentially untrusted superintelligent AI systems. The present discussion argues that the robust non-emergence of deceptive collusion among imperfectly aligned systems can be ensured by structuring systems of systems with diverse capabilities, knowledge, situations, objectives, and roles.
This problem framing assumes the ability to develop systems that respond to questions with superintelligent-level competence, and (as a consequence of the nature of digital systems) the ability to constrain information inputs to AI systems during their development and use.
In the problematic case, superintelligent question-answering systems might provide deceptive answers in pursuit of emergent, potentially undesirable goals. The aim here is to identify principles for architecting multicomponent systems that will act as non-deceptive oracles, while treating their components as actors that could in principle engage in deceptive collusion.
20.3 A range of conditions can make collusion robust or fragile
Potential problems of deceptive collusion can be addressed by architecting systems that make collusion difficult or effectively impossible. Some relevant conditions:
Conditions that tend to facilitate collusion:
A1) Small numbers of actors (the required scope of collusion is small)
A2) Insensitivity to defectors (partial collusion is sufficient)
A3) Similarity of actors (for example, shared histories and objectives)
A4) Free communication (information can flow between any pair of actors)
A5) Iterated decision processes with observable actions
A6) Common knowledge of the above circumstances
Contrasting conditions that tend to disrupt collusion:
B1) Large numbers of actors (the required scope of collusion is large)
B2) Sensitivity to defectors (collusion fails unless it is complete)
B3) Diversity of actors (for example, different histories, adversarial objectives)
B4) Constrained communication (information flows through limited causal channels)
B5) Single-move decision processes, limited observability
B6) Lack of shared knowledge of strategically-relevant circumstances
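As a rough, back-of-the-envelope illustration of B1 and B2 (not drawn from the report), suppose a deceptive answer must slip past every one of N independent, diverse critics, each of which would expose it with probability p on its own. Under that idealized independence assumption, the chance of undetected collusion shrinks geometrically with N:

```python
# Back-of-the-envelope arithmetic for conditions B1/B2 (illustrative, not from
# the report): the probabilities and the independence assumption are invented.

def undetected_collusion_probability(p_single_critic: float, n_critics: int) -> float:
    """Probability that a deceptive answer slips past all critics, assuming independence."""
    return (1.0 - p_single_critic) ** n_critics

for n in (1, 3, 10, 30):
    print(f"{n:>2} independent critics, p = 0.5 each -> undetected with probability "
          f"{undetected_collusion_probability(0.5, n):.2e}")
# Output: 5.00e-01, 1.25e-01, 9.77e-04, 9.31e-10
```

The independence assumption does the real work here, which is exactly why conditions B3–B6 (diversity, constrained communication, limited observability) matter: they are what keep critics from becoming correlated or coordinated.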
Implementation choices that would yield conditions (A1–A5) are unnecessary and often undesirable. By contrast, implementation choices that yield conditions (B1–B5) are practical and often well-motivated:
Characteristics of practical architectures:
C1) To improve the quality and reliability of answers, it is natural to implement multiple, diverse (and implicitly competing) systems to propose alternatives.
C2) To identify low-quality or misleading answers, it is natural to employ diverse critics (auditors, red teams, etc.), any one of which might disrupt deceptive collusion.
C3) Systems of diverse, competing proposers and critics naturally implement different and often adversarial objectives.
C4) To minimize unwanted dependencies, it is natural to give components a sparse, thoughtfully designed communication graph.
C5) To provide stable results, it is natural to apply fixed models to a series of problems, yielding what is in effect a series of history-blind, single-move decisions.
C6) In training, it is natural to provide differentiated, task-relevant information to systems solving different problems, often omitting knowledge of general circumstances.
Considerations (C1–C6) [which parallel B1–B6, not A1–A6] suggest that implementing systems that will be resistant to the emergence of deceptive collusion aligns with the general strategy of employing ensembles of diverse, stable, task-directed system components to propose and assess answers to questions [in agency architectures, answers may include plans]. This approach can pass the test of practicality and potentially widespread adoption.
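A minimal sketch of what C1–C6 might look like in code, with hypothetical names and a toy disagreement rule: fixed, history-blind components are queried in isolation (no channel between them), and any divergence among their answers is escalated for review rather than resolved among the components themselves.

```python
# A minimal, hypothetical sketch of C1-C6: fixed, history-blind "oracle" components
# queried in isolation, with any disagreement escalated to human review rather than
# negotiated among the components. The Oracle type and the disagreement rule are
# invented for illustration.
from typing import Callable, Dict, List

Oracle = Callable[[str], str]  # a fixed model: question in, answer out, no shared state

def query_in_isolation(question: str, oracles: Dict[str, Oracle]) -> Dict[str, str]:
    """Each component answers separately (C4/C5): no channel between components, no memory."""
    return {name: oracle(question) for name, oracle in oracles.items()}

def flag_disagreement(answers: Dict[str, str]) -> List[str]:
    """Critics need not settle on a verdict themselves; divergence is simply surfaced (C2)."""
    if len(set(answers.values())) <= 1:
        return []
    return [f"divergent answers escalated for review: {answers}"]

if __name__ == "__main__":
    oracles: Dict[str, Oracle] = {
        "proposer_a": lambda q: "42",
        "proposer_b": lambda q: "42",
        "red_team":   lambda q: "41",  # a deviating component is visible immediately
    }
    answers = query_in_isolation("What is six times seven?", oracles)
    print(flag_disagreement(answers) or "answers agree (still subject to audit)")
```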
20.4 Untrusted superintelligence can be applied to AI safety
It seems both possible and practical to establish conditions that would effectively preclude deceptive collusion among diverse, task-oriented, superintelligent-level question-answering systems, thereby addressing a key problem in safely applying superintelligent capabilities to problems (Bostrom, Superintelligence, p.145).
Prospects for obtaining trustworthy answers from sets of individually untrusted superintelligent systems call for deeper exploration of AI safety strategies premised on reliable access to superintelligent problem-solving capabilities. Capabilities of particular interest include the ability to run and evaluate the results of superintelligent-level red-team/blue-team competitions relevant to questions of global strategic stability.
TL;DR: Because collusion can be thwarted, it is reasonable to expect that applying multiple potentially untrustworthy superintelligent-level systems to problems can improve rather than degrade safety.11 This prospect calls for greater attention and exploration.
Some comments on AI safety, 2025:
Many AI safety researchers and policy advocates have assumed that, without breakthroughs in alignment, advanced AI capabilities cannot be employed without catastrophic risks. This is mistaken.
Arguments for restrictions on AI development should not be premised on the assumed existential necessity of breakthroughs in AI alignment. False premises have unintended consequences.12
Further progress in AI alignment is extraordinarily important in part because better alignment makes AI more useful, and useful AI will be critical to safety.
Many AI risks are not safety risks and have no technical solution.
Further Reading
[from the original article in Reframing Superintelligence]
Section 8: Strong optimization can strongly constrain AI capabilities, behavior, and effects
Section 12: AGI agents offer no compelling value
Section 19: The orthogonality thesis undercuts the generality of instrumental convergence
Section 21: Broad world knowledge can support safe task performance
Section 23: AI development systems can support effective human guidance
Section 24: Human oversight need not impede fast, recursive AI technology improvement
Illustrated here by quotations from three Turing Award winners and one Turing.
From “Reasoning through arguments against taking AI safety seriously” (Yoshua Bengio, July 2024).
From a lecture at Oxford (Geoffrey Hinton, February 2023).
From Marvin Minsky, quoted in Life magazine (1970).
From “Intelligent Machinery, A Heretical Theory” (Alan Turing, c. 1951).
Advanced AI resources should not be equated with entities called “AIs”: Intelligence isn’t a thing.
Sensible plans always include plans for revising the plans (in light of experience, innovations, obstacles, new goals…), so “corrigibility” is inherent, not a challenge.
The “Agency Architecture” is a concept that is so general and intuitive that it shouldn’t even need a name. For a deeper exploration, see “How to harness powerful AI”.
Answers include the results of red-teaming (testing systems through simulated adversarial attacks), which will be crucial to enabling a strategic pivot to a defensively stable world. Optimal red-teaming for security will call for superintelligent-level AI systems that seek military strategies for achieving destruction and power, while competing systems explore first-mover options for thwarting those strategies. Think wargames in boxed simulators.
In a conversation around 1990, Marvin Minsky urged me to publish the concept described here — in brief, that trustworthy answers to AI safety questions can be gotten by engineering checks and balances among multiple, diverse, untrustworthy but robustly non-cooperating AI systems, using mechanisms along the lines outlined above. I regret any confusion that may have resulted from the delay in publication.
For example, leading anti-nuclear campaigners (Linus Pauling, John Gofman) are argued to have exaggerated the risks of radiation and reactors in order to oppose nuclear weapons. This effort failed to halt the arms race, but fed public fears that stifled progress in developing safer, low-carbon nuclear technologies (like Generation IV reactors) that could have mitigated today’s climate crisis. This was not their intent, but ideas have unintended consequences.