Mar 13·edited Mar 13Liked by Sayash Kapoor

Thank you, this was a great read.

"The assumption that AI safety is a property of AI models is pervasive in the AI community."

Isn't the reason for this that a lot of AI companies claim they're on the pathway to "AGI", and AGI is understood to be human-level intelligence (at least), which translates, in the minds of most people, as human level understanding of context, and thus, human level of responsibility. It's hard to say a model is as smart as a human, but not so smart that it cannot discern to what purpose another human is using it.

Put another way, many (though not all) AI companies want you to see their models to be as capable of humans, able to do the tasks humans can, at human or near-human levels, only automated, at broader scale, and without the pesky demands human employees make.

Acknowledging that AI models cannot be held responsible for their risky uses puts them in their appropriate place: as a new form of computing with great promise and interesting use cases, but nowhere close to replacing humans, or equaling the functions humans play in thinking through and mitigating risk when performing the kinds of tasks AI may be used for.

But that runs contrary to the AGI in a few years/singularity narrative, so safety is facto expected to be an aspect of the intelligence of Artificial Intelligence models. The snake oil salesmen are being asked to drink their own concoctions. Hopefully, at some point, they'll be forced to acknowledge reality.

Expand full comment

I find that your argument intertwines several issues in a way that could lead to some confusion. First, you suggest that the widespread belief in AI safety as an intrinsic model property stems from exaggerated claims about achieving AGI, implying a level of contextual understanding and responsibility akin to human intelligence. However, I believe there's a distinction to be made between designing narrow AI with built-in safety features and the broader ambition of creating AGI with human-like cognitive abilities.

The ambition to build safety guardrails into narrow AI does not necessarily equate to attributing agency or the potential for human replacement. In fact, the automation of jobs, as you mention, is expected to be significant but not total. Indeed, the move toward more capable AI systems, including AGI, requires a rigorous focus on safety from the outset to prevent potential abuse or unintended consequences.

Your metaphor of "snake oil salesmen" seems to criticize the marketing of AI capabilities, suggesting a disconnect between claims and reality, particularly with respect to AI safety and AGI. While skepticism about exaggerated claims is warranted, I'd like to understand your perspective more clearly. Are you suggesting that the AI safety discourse is being manipulated or exaggerated by some in the industry? Or is the criticism more about the inherent limitations in the ability of current AI models to self-regulate and act responsibly?

Essentially, my confusion lies in how we move from discussing the technical and ethical challenges of AI safety to what appears to be a critique of AI marketing and safety advocacy. Could you elaborate on the connection you're making between these issues and the implications for the development of safe AI systems?

Expand full comment
Mar 12Liked by Arvind Narayanan

Great post! There are a lot of important ideas here that I hadn't seen clearly expressed before.

You note that in many cases, an important aspect of risk reduction may lie in adapting to risks, rather than attempting to develop riskless models. I have recently been thinking along somewhat similar lines. It seems to me that for many risks, there are adaptations available that are worthwhile on their own merits, even without taking AI risk into account. For instance, public health measures such as improved building ventilation and pandemic preparedness seem justifiable based solely on the the basis of existing viruses, but would also reduce the danger from AIs capable of assisting with bioengineering. Not all risks can feasibly be eliminated in this fashion, but it seems to me that many risks can be substantially reduced.

Expand full comment

Yes, exactly! (We've made this point once or twice but it deserves to be emphasized a lot more.)

Expand full comment

Thoughtful piece, and a few good take-aways, but a few comments in response:

- As you point out, this is a post about misuse, not safety risks in general. You suggest your argument applies broadly, but make no convincing case in this area. There are a variety of AI safety risks that are treatable at the model level, especially if that model is trained to be effective on a specific use case or for a specific persona, each of which would imply embedding some amount of context as part of the training process. Things like toxicity, slander, hate speech, bias and fairness, and other areas of safety are things that model teams can and should be trying to detect and, where relevant, resolve at the model level, or at least providing information on vulnerabilities to others in the application chain so application-level support (e.g., adversarial judges, application-level rails, etc) can be considered.

- Your point on marginal risk (here and elsewhere) is a reasonable addition to the debate, but the fact that no/little marginal risk is added by a model need not mean that action to reduce that risk should not be taken. Just because a model doesn't today provide much more risky information to a user than that user can already get from the Internet hardly means that we should throw our hands in the air and conclude that all new things should be as unsafe as whatever is already possible. This is a least common denominator approach to technology development. We need not even get into questions around responsibility here -- development and consuming organizations have good corporate reasons to avoid being vectors for unwanted outcomes, and therefore lots of reason to be better than, say, what the general Internet makes possible. So marginal risk defined as risk beyond some public denominator is often actually the wrong way for organizations to think about risk.

Expand full comment

I teach law, and just posted on Boeing & Complexity on my own substack, so this post really caught my eye. I liked it, as usual, but think the case, or at least the title, is overstated. Of course, any technology can be used for bad things. Consider planes (9/11), or pharmaceuticals (opioid crisis), or cars, or household appliances, or . . . But we care a great deal about the safety of planes and drugs and the rest. Modern societies have elaborate regulatory structures to ensure that such technologies are safe in ordinary, and perhaps even modestly negligent, use, quite apart from malicious use. So yeah, the properties of technologies matter. Your recognition of an exception of "safety not model property" rule for child sexual abuse seems insufficient.

That said, and after reading past the title, I think your overarching point is that context matters, and that developers need to bear responsibility for ensuring safe use. An analogy might be controlled substances of various sorts. How much responsibility, of course, is a social/legal problem. And I agree, from a developer's perspective, it would be convenient to limit the question of responsibility to model properties, and society should not allow that.

Holding the problems of context to one side, which your post interestingly addresses in preliminary fasion, I am left wondering, what do you think model safety would look like? Keep up the good work!

Expand full comment

Interesting! You basically argue for government funded regulators to build (and presumably test) offensive pipelines, right? Would be interesting to hear you speculate about what that would mean/require in detail. Hunderds or thousands of government funded highly capable people using AI to try to cyberattack nuclear powerplants, steal credit card information, or build biological weapons? Not trying to caricature your argument here. I agree actually, but might make some people quite uncomfortable. So interesting and useful to speculate what it should look like...

Expand full comment

I suspect that, regardless of what we as a society decide in advance, certain interests are already capitalizing on this kind of approach to powerful AI models (and eventually autonomous systems), leading inevitably to certain kinds of "runaway scenarios". The question of how disruptive these "runaway scenarios" are will inevitably be related to how much "real interest" academic and corporate researchers have in trying to make serious advances in safety (and security) research. Think about it: if you were one of those interests, would you be willing to wait and see if the other side decides to autonomize and weaponize these technologies? The author is absolutely correct in his assessments, and in fact makes suggestions that will help correct potential downsides inadvertently caused by both those pursuing the "greater good" and their adversaries. One possible suggestion would be to create an "Agency of AI Security Operations" and a "Counteragency of AI Security Operations" in which certain "games" are played that mimic real-world scenarios, and in which the goal of one agency is to pursue a goal, while the goal of the other agency is to detect and thwart such goals, with an oversight committee as a third node. The problem of "operational boundaries" (of course there has to be a "fictional model" or "simulation" or "hard boundary/constraints" for a game that involves such an arbitrary agency trying to gain control of nuclear plant X, where securing plant X (as opposed to exploiting the vulnerability) corresponds to winning the game) and the ethical ramifications of such experiments and how they must not affect the real world and society, and thus how useful they would be in practice, are relevant considerations to have and may make this a bad idea. What are some of your thoughts on this?

Expand full comment

There is a category of risk that you have omitted, but which features heavily in the ai-notkilleveryone discourse: what if the model decides to do something malicious, without being asked to do it by the model user?

Now, maybe, models are safe by default in this sense, in that you have to do some work to create a malicious model that turns on its users. If we're unlucky, it's an emergent property.

In any case, the user of the model would probably like to know that the model isn't going to try to kill them, and this seems a model property.

Expand full comment

Yeah, this post says some very sensible things about the challenges in preventing misuse, but in its title and statements about others' positions conflates

"model properties are not enough ensure safety"


"model properties cannot pose a safety issue"

And it ignores the entirety of the yudkowskian position


"Why has the myth of safety as a model property persisted? Because it would be convenient for everyone if it were true! In a world where safety is a model property, companies could confidently determine whether a model is safe enough to release, and AI researchers could apply their arsenal of technical methods toward safety. "

The hard-doomer position is that some AGI+ models may turn out inherently unsafe/power-grabbing in general AND that this may be hard to verify due to deceptiveness and alignment-training. If you put any weight on that possibility, investing in tools to check "model properties" is one of the few things you can do to deal with it

Expand full comment

The above was serious, this bit is a science fiction story. What if the model decides whether the person using them is an asshole, and. If so, kills them? Would we carry on using these models. Like, well, sure, there's a risk and those guys upended up dead, but the remaining users are fairly sure they're safe.

In reality, I think this is unlikely.

From a regulatory point of view, would we tolerate it?

Expand full comment

Focusing solely on the models themselves feels like putting a Band-Aid on a bullet wound.

What about the bigger picture? Are we addressing the potential biases in the training data that could be baked into the models from the start? What about the societal and ethical implications of relying so heavily on AI, especially when safety thresholds seem arbitrary?

It feels like we're scrambling to make these models "safe" after the fact, rather than building safety into the entire AI development process from the ground up.

Expand full comment

AI safety that goes beyond "model properties" can potentially yield contextually safe models with appropriate 1) training, 2) computational resources, 3) abstraction, 4) emergent contextuality through relations and correspondences, and 5) iterative improvements. These elements are underpinned by a 6) nuanced epistemological and ontological risk-based framework that carefully measures context in relation to individual settings, fostering what we recognize as "good actions under free will" and, subsequently, "agency". In other words, I believe that the more powerful the AI system, while maintaining current approaches to safety (and including the arguments the authors make), the safety of AI systems will scale in such a way that (in addition to) self-iteration (or self-improvement, in terms of safety improvements) will result in systems with proportionally reduced adversary capabilities to cause real harm (as opposed to adversaries using a variety of "narrow AI systems" with fewer safety guardrails). However, the path to truly safe deep neural network architectures and functional agentic systems is hampered by economic imperatives that prioritize deployment, leaving us perpetually one step behind in achieving "provable safety" (or security, recognizing the distinction between the two, which becomes harder to distinguish the more powerful the AI system). The complexity escalates when we consider critical, open systems where the timing of decisions - whether 50 milliseconds or 50 minutes - is crucial.

Expand full comment

In a typical Internet firewall setup, the firewall sits between the wilds of the outside Internet (where the attackers live) and the machines on your corporate internal network the firewall is supposed to keep the bad stuff out. But to do that, it needs an accurate model of the machines on your internal network so it can figure out what's actually bad. There is a whole class of attacks where something that is actually bad isn't recognised by the firewall,

Ok, now imagine doing this with LLMs. The attacker types on stuff, a firewall that you have written checks if it is bad, and if not passes it on to the LLM.

Problem: your firewall needs an accurate model of the LLM, and therefore is as complex as the LLM. Oh dear.

Expand full comment

Except that *modern* security (disclaimer: I work in application security and have done adjacent to ANN work from graduate school on) would likely assert that we should put control inside the network, because that's where one would do it in software. Since in ANN the weights are the "software" that's where security controls are most effective in many cases. Unfortunately, due to the fact that we are dealing with a nonpropositional (non-language like) architecture this is impossible. And so security (and safety, if there is a difference) is also to that degree impossible. Uh oh. Is this a *model* property (or rather family of properties)? No, but it is arguably a desirable property of the software, as stated. IOW, there is more to these systems than just a classifier or the like. This is all very complicated - and one reason I enjoy the work here (and why I also am very worried) is because of the strong connections to the 1980s and 1990s debates in philosophy of mind).

Expand full comment

It's especially fun because todays state of the art LLMs can be persuaded to set up an encrypted communication channel between themselves and the attacker, leaving the firewall in the dark as to what's happening.

So, sure, the firewall could monitor what you're typing and figure out if it's semantically equivalent to creating an encrypted channel. But ... this would require substantial understanding on the part of the firewall.

Expand full comment

If true, this would be absolutely disastrous for AI safety,

So, suppose every application has its own idiosyncratic safety requirement.

Can you use an off the shelf model?

No! By hypothesis, they are unsafe wrt your application requirement.

Oh, so you need to build your own model, that meets your requirement.

Trouble is, that costs you on the order of one hundred million dollars.

For every application.

Have you you got a hundred million dollars for each app.ication.


Ok, you're negligent deploying the system, and you should be liable - and maybe go to jail - if anything goes wrong

Stronger version: you are criminally negligent and should go to jail as a preventative measure before anything bad happens.

Expand full comment

The AI Dungeon experience with a dirty words filter is instructive. That architecture just does not work.

I expect that other application requirement also can't be met with bolted on filters.

Expand full comment

The writers have made out a great case here. That, too, in an easy to understand manner.

Expand full comment

An interesting read!

I'd add that, generally speaking, there is an incentives misalignment to begin with, in AI and tech in general.

How do you explain to someone about the risks of AI when they're in a headlong rush towards wealth and power?

There is just so much money to be made, quite a bit of arrogance, and not enough consequences.

Expand full comment