AI safety is not a model property

Mar 12, 2024

123

Trying to make an AI model that can’t be misused is like trying to make a computer that can’t be used for bad things

Read →

20 Comments

Raj Iyer

Mar 13, 2024Edited

Thank you, this was a great read.

"The assumption that AI safety is a property of AI models is pervasive in the AI community."

Isn't the reason for this that a lot of AI companies claim they're on the pathway to "AGI", and AGI is understood to be human-level intelligence (at least), which translates, in the minds of most people, as human level understanding of context, and thus, human level of responsibility. It's hard to say a model is as smart as a human, but not so smart that it cannot discern to what purpose another human is using it.

Put another way, many (though not all) AI companies want you to see their models to be as capable of humans, able to do the tasks humans can, at human or near-human levels, only automated, at broader scale, and without the pesky demands human employees make.

Acknowledging that AI models cannot be held responsible for their risky uses puts them in their appropriate place: as a new form of computing with great promise and interesting use cases, but nowhere close to replacing humans, or equaling the functions humans play in thinking through and mitigating risk when performing the kinds of tasks AI may be used for.

But that runs contrary to the AGI in a few years/singularity narrative, so safety is facto expected to be an aspect of the intelligence of Artificial Intelligence models. The snake oil salesmen are being asked to drink their own concoctions. Hopefully, at some point, they'll be forced to acknowledge reality.

Expand full comment

Steve Newman

Mar 12, 2024

Great post! There are a lot of important ideas here that I hadn't seen clearly expressed before.

You note that in many cases, an important aspect of risk reduction may lie in adapting to risks, rather than attempting to develop riskless models. I have recently been thinking along somewhat similar lines. It seems to me that for many risks, there are adaptations available that are worthwhile on their own merits, even without taking AI risk into account. For instance, public health measures such as improved building ventilation and pandemic preparedness seem justifiable based solely on the the basis of existing viruses, but would also reduce the danger from AIs capable of assisting with bioengineering. Not all risks can feasibly be eliminated in this fashion, but it seems to me that many risks can be substantially reduced.

Expand full comment

Reply (1)

Arvind Narayanan

Mar 13, 2024

Yes, exactly! (We've made this point once or twice but it deserves to be emphasized a lot more.)

Expand full comment

Jason

Mar 13, 2024

Thoughtful piece, and a few good take-aways, but a few comments in response:

- As you point out, this is a post about misuse, not safety risks in general. You suggest your argument applies broadly, but make no convincing case in this area. There are a variety of AI safety risks that are treatable at the model level, especially if that model is trained to be effective on a specific use case or for a specific persona, each of which would imply embedding some amount of context as part of the training process. Things like toxicity, slander, hate speech, bias and fairness, and other areas of safety are things that model teams can and should be trying to detect and, where relevant, resolve at the model level, or at least providing information on vulnerabilities to others in the application chain so application-level support (e.g., adversarial judges, application-level rails, etc) can be considered.

- Your point on marginal risk (here and elsewhere) is a reasonable addition to the debate, but the fact that no/little marginal risk is added by a model need not mean that action to reduce that risk should not be taken. Just because a model doesn't today provide much more risky information to a user than that user can already get from the Internet hardly means that we should throw our hands in the air and conclude that all new things should be as unsafe as whatever is already possible. This is a least common denominator approach to technology development. We need not even get into questions around responsibility here -- development and consuming organizations have good corporate reasons to avoid being vectors for unwanted outcomes, and therefore lots of reason to be better than, say, what the general Internet makes possible. So marginal risk defined as risk beyond some public denominator is often actually the wrong way for organizations to think about risk.

Expand full comment

David A. Westbrook

Mar 25, 2024

I teach law, and just posted on Boeing & Complexity on my own substack, so this post really caught my eye. I liked it, as usual, but think the case, or at least the title, is overstated. Of course, any technology can be used for bad things. Consider planes (9/11), or pharmaceuticals (opioid crisis), or cars, or household appliances, or . . . But we care a great deal about the safety of planes and drugs and the rest. Modern societies have elaborate regulatory structures to ensure that such technologies are safe in ordinary, and perhaps even modestly negligent, use, quite apart from malicious use. So yeah, the properties of technologies matter. Your recognition of an exception of "safety not model property" rule for child sexual abuse seems insufficient.

That said, and after reading past the title, I think your overarching point is that context matters, and that developers need to bear responsibility for ensuring safe use. An analogy might be controlled substances of various sorts. How much responsibility, of course, is a social/legal problem. And I agree, from a developer's perspective, it would be convenient to limit the question of responsibility to model properties, and society should not allow that.

Holding the problems of context to one side, which your post interestingly addresses in preliminary fasion, I am left wondering, what do you think model safety would look like? Keep up the good work!

Expand full comment

Willem

Mar 14, 2024

Interesting! You basically argue for government funded regulators to build (and presumably test) offensive pipelines, right? Would be interesting to hear you speculate about what that would mean/require in detail. Hunderds or thousands of government funded highly capable people using AI to try to cyberattack nuclear powerplants, steal credit card information, or build biological weapons? Not trying to caricature your argument here. I agree actually, but might make some people quite uncomfortable. So interesting and useful to speculate what it should look like...

Expand full comment

MichaeL Roe

Mar 13, 2024

There is a category of risk that you have omitted, but which features heavily in the ai-notkilleveryone discourse: what if the model decides to do something malicious, without being asked to do it by the model user?

Now, maybe, models are safe by default in this sense, in that you have to do some work to create a malicious model that turns on its users. If we're unlucky, it's an emergent property.

In any case, the user of the model would probably like to know that the model isn't going to try to kill them, and this seems a model property.

Expand full comment

Reply (2)

michaelsklar

Mar 13, 2024

Yeah, this post says some very sensible things about the challenges in preventing misuse, but in its title and statements about others' positions conflates

"model properties are not enough ensure safety"

with

"model properties cannot pose a safety issue"

And it ignores the entirety of the yudkowskian position

re:

"Why has the myth of safety as a model property persisted? Because it would be convenient for everyone if it were true! In a world where safety is a model property, companies could confidently determine whether a model is safe enough to release, and AI researchers could apply their arsenal of technical methods toward safety. "

The hard-doomer position is that some AGI+ models may turn out inherently unsafe/power-grabbing in general AND that this may be hard to verify due to deceptiveness and alignment-training. If you put any weight on that possibility, investing in tools to check "model properties" is one of the few things you can do to deal with it

Expand full comment

MichaeL Roe

Mar 13, 2024

The above was serious, this bit is a science fiction story. What if the model decides whether the person using them is an asshole, and. If so, kills them? Would we carry on using these models. Like, well, sure, there's a risk and those guys upended up dead, but the remaining users are fairly sure they're safe.

In reality, I think this is unlikely.

From a regulatory point of view, would we tolerate it?

Expand full comment

Bianca Dămoc

Mar 13, 2024

An interesting read!

I'd add that, generally speaking, there is an incentives misalignment to begin with, in AI and tech in general.

How do you explain to someone about the risks of AI when they're in a headlong rush towards wealth and power?

There is just so much money to be made, quite a bit of arrogance, and not enough consequences.

Expand full comment

Alexander Harrowell

Jun 6, 2024

I think the problem in the end is that the concept of AI safety is flimsy and relies on a tacit alliance between the industry and the "AI ethics" cottage-industry to promote each others' self-importance. There's no way you can make germs by typing stuff into a chatbot; similarly, if you want to post something nasty on the Internet you can already do that yourself and your right to do so is constitutionally guaranteed. What if the AI generated misleading political propaganda? Well, that would be like....politics? They tell a lot of lies and even more half-truths. It's as if the desire to regulate a whole swath of human behaviour in ways that the wannabe regulator would not accept for themselves and knows they won't be able to achieve for other people has been channelled into AI.

Expand full comment

Dreaming of a song...

May 3, 2024

Focusing solely on the models themselves feels like putting a Band-Aid on a bullet wound.

What about the bigger picture? Are we addressing the potential biases in the training data that could be baked into the models from the start? What about the societal and ethical implications of relying so heavily on AI, especially when safety thresholds seem arbitrary?

It feels like we're scrambling to make these models "safe" after the fact, rather than building safety into the entire AI development process from the ground up.

Expand full comment

Cullen O'Keefe

May 2, 2024

Hello! I have a response to this post here: https://juralnetworks.substack.com/p/ai-safety-is-sometimes-a-model-property?r=2ktfhm&utm_campaign=post&utm_medium=web&triedRedirect=true

Expand full comment

MichaeL Roe

Mar 25, 2024

In a typical Internet firewall setup, the firewall sits between the wilds of the outside Internet (where the attackers live) and the machines on your corporate internal network the firewall is supposed to keep the bad stuff out. But to do that, it needs an accurate model of the machines on your internal network so it can figure out what's actually bad. There is a whole class of attacks where something that is actually bad isn't recognised by the firewall,

Ok, now imagine doing this with LLMs. The attacker types on stuff, a firewall that you have written checks if it is bad, and if not passes it on to the LLM.

Problem: your firewall needs an accurate model of the LLM, and therefore is as complex as the LLM. Oh dear.

Expand full comment

Reply (2)

keithdouglas

Mar 27, 2024

Except that *modern* security (disclaimer: I work in application security and have done adjacent to ANN work from graduate school on) would likely assert that we should put control inside the network, because that's where one would do it in software. Since in ANN the weights are the "software" that's where security controls are most effective in many cases. Unfortunately, due to the fact that we are dealing with a nonpropositional (non-language like) architecture this is impossible. And so security (and safety, if there is a difference) is also to that degree impossible. Uh oh. Is this a *model* property (or rather family of properties)? No, but it is arguably a desirable property of the software, as stated. IOW, there is more to these systems than just a classifier or the like. This is all very complicated - and one reason I enjoy the work here (and why I also am very worried) is because of the strong connections to the 1980s and 1990s debates in philosophy of mind).

Expand full comment

MichaeL Roe

Mar 25, 2024

It's especially fun because todays state of the art LLMs can be persuaded to set up an encrypted communication channel between themselves and the attacker, leaving the firewall in the dark as to what's happening.

So, sure, the firewall could monitor what you're typing and figure out if it's semantically equivalent to creating an encrypted channel. But ... this would require substantial understanding on the part of the firewall.

Expand full comment