Generative AI’s end-run around copyright…

Jan 22, 2024

Output similarity is a distraction

21 Comments

Johan Brandstedt

Jan 22, 2024

Thanks for highlighting the other and more serious side of the issue!

This aspect will be tried in Germany in April (Kneschke v Laion).

It’s also worth noting that both the US Senate and UK government have confirmed training on copyright works as illegal this past week.

Gary Marcus

Jan 22, 2024

We actually addressed the possibility of output filtering in our paper, and noted that it is not reliable, and suspect that it will not work at all well for lesser-known works. You took the easiest cases, ignoring what we wrote, and have generalized to all cases, exactly the sort of reasoning fallacy your blog is supposed to highlight.

At the least you could have addressed the passage in the Spectrum where we said “As a last resort, the X user @bartekxx12 has experimented with trying to get ChatGPT and Google Reverse Image Search to identify sources, with mixed (but not zero) success. It remains to be seen whether such approaches can be used reliably, particularly with materials that are more recent and less well-known than those we used in our experiments.” Even more important, I think, was this passage towards the end that we quoted: “… everyone knows what Mario looks like. But nobody would recognize Mike Finklestein’s wildlife photography. So when you say “super super sharp beautiful beautiful photo of an otter leaping out of the water” You probably don’t realize that the output is essentially a real photo that Mike stayed out in the rain for three weeks to take.” You have by no means shown that ChatGPT filtering will be up to that task, nor even considered the possibility.

A bit disappointed you didn’t ask me for comment before posting, particularly given the strong formulation you chose, focusing only on the most frequent stimuli. If you really want to pursue this, you should also consider the noncanonical Marios in my Substack with Katie Conrad, and other stimuli of that ilk. Generalization, as always, is the key issue.

Reply (2)

Gary Marcus

Jan 22, 2024

I should also add that OpenAI regularly updates relative to published works (multiple mechanisms for that may exist), so a success on something public without further probing is inevitably an insufficient evaluation.

David Willis

Feb 10, 2024

"nobody would recognize Mike Finklestein’s wildlife photography"

A computer would, if it was trained to do so. I don't see why an artist's popularity has any bearing on whether or not output filtering would work. Why is it harder to identify Mike Finklestein's work than it is, say, Annie Leibovitz?

Reply (1)

Thomas Board

Jul 6

I think the point is that mere identification is not sufficient in this case. As I understand UK copyright, for example (based on in-person attendance at a copyright workshop at The British Library, run by their in-house copyright expert), the use of an image to create a "derivative" work is an infringement of the original image creator's copyright. Therefore, the direct reproduction of an image is likely an infringement, much as the quoted NYT examples of "output similarity" are likely infringements. Indexing the originals used in training for a lookup filter or RAG is not a solution for highly unique images, because the generative algorithm in such cases appears to just replicate the original when it doesn't have a lot of training data.

Matthew Udanoh

Jan 24, 2024

The AI genie is out of the bottle and you can't legislate your way to stop it. Introducing restrictive laws in the West will only empower competitors in less controlled regions. AI has changed the rules and the best thing to do is to adapt instead of trying to hang on to what worked in the past. We will all come to this realization sooner or later.

Tek Bunny

Jan 24, 2024

Newspaper articles can be found all over the internet, especially if they are interesting to a particular demographic. How is it possible for NYT to 'own' a piece of news? In particular, you can find that many articles are themselves copied from Reuters reports, or from journalism sourced from another news source. Even if you argue about the similarity of the language used, unless you have pages of text, how many ways are there to report a newsworthy event?

Vernon Niven

Jun 19, 2024 (edited)

The AI training/copyright problem will not be resolved legally - or at least, not to the satisfaction of media creators.

One reason is legal precedent: there are countless examples of media creators and companies building their franchises by incorporating and exploiting other people's work - consider:

- WaPo, Wall Street Journal, and NYTimes incorporate and reference data and copy from other sources into their work all the time, without payment or obtaining permission. If it's on the web, it's fair game.

- Statista and other infographic media companies do the exact same thing.

- How many influencers create the studies and news stories they discuss?

Another reason is that there is an easy/quick technology fix rolling out right now. As long as AI responses return the URLs of "likely sources" for their responses, they address the copyright/training issue under existing laws.

And RAG (retrieval-augmented-generation) technology makes this relatively easy and inexpensive to do.

Google, Perplexity, and most other major model builders are incorporating RAG into their architectures right now. And almost every applied AI company must use RAG to meet functional requirements - not to respect policy.

At a high level, the vast majority of AI use cases - media included - will soon be RAG-enabled, so the size of the copyright exposure problem for foundation model builders and applied AI companies is about to decline dramatically.
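
The retrieval step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how Google, Perplexity, or any other vendor implements it: the corpus, the URLs, and the function names `retrieve` and `build_prompt_with_sources` are all hypothetical, and plain word-count cosine similarity stands in for a real search index. The only point it shows is that the source URLs travel with the retrieved passages, so they can be returned next to the answer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return up to k (url, score) pairs, best first, dropping zero scores.

    `corpus` is a hypothetical dict mapping a source URL to its text.
    """
    q = Counter(tokenize(query))
    scored = [(url, cosine(q, Counter(tokenize(text)))) for url, text in corpus.items()]
    scored = [(url, score) for url, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt_with_sources(query, corpus, k=2):
    """Build a prompt from the retrieved passages and keep their URLs.

    Returns (prompt, sources); the caller would send `prompt` to a model
    and show `sources` next to the model's answer.
    """
    hits = retrieve(query, corpus, k)
    context = "\n".join(f"[{i + 1}] {corpus[url]}" for i, (url, _) in enumerate(hits))
    sources = [url for url, _ in hits]
    prompt = f"Answer using only the numbered passages.\n{context}\nQuestion: {query}"
    return prompt, sources
```

Whether listing "likely sources" in this way actually settles the copyright question is the legal claim made above; the sketch only shows that the bookkeeping is cheap.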

Is the debate over copyright and training about to follow suit?

Jes Rutledge

Mar 5, 2024

This touches on a topic that has been troubling me for some time: the generative AI platforms I have used seem unable to correctly identify the source of their information, which makes it difficult for a reader to (a) assess the quality of the original source, and (b) understand where to look if they want to learn more. The academic practice of referencing source material would help to address this issue if it could be done accurately, but thus far I have not seen that executed well.

It seems to me that this is not just an end-run around copyright but also a deep failure in acknowledging information sourcing, which threatens to make critically assessing the output of these platforms challenging for anyone who is not a subject matter expert.

Vetting the integrity of information sources seems to be one of the rising challenges of our times.

Johannes

Jan 24, 2024

You highlight the harms LLMs can do to creators, yet fail to contrast that with the benefit the global public receives through these models. LLMs make our global shared cultural memory available to every person on the globe with internet access at very low cost and let them apply it to produce and achieve new things, from research to differential medical diagnosis to technical and legal advice to personal tutoring on any subject matter ever written about - a massive step forward in the evolution of knowledge societies, comparable maybe only to the printing press. The demand of the NY Times to destroy this new tool because it can (unreliably) reproduce months-old articles is completely out of proportion given the above benefits these models bring to the public. The NY Times also needs to demonstrate ACTUAL loss in revenue, which will be very hard to prove because I doubt a single person has cancelled their subscription because they can reproduce some individual articles that went viral a few months ago.

I am not a lawyer, so I have no way of telling which way the lawsuit will go. But I agree that the effects of LLMs will outlast the lawsuit. The difference is just that I think these effects are overwhelmingly positive for knowledge societies as a whole, even if some players, which have created information monopolies for themselves in the past, are hurt in the short-term.

Reply (2)

Bala Pillai

Feb 3, 2024

Excellent point, @Johannes.

For example, in almost every place in the world, it would almost be impossible to make a fresh #zerocrime village/town happen.

Yet, that is what thousands of our early ancestors did in no-village days.

Why can’t we? Because 1) emergence, #evolutionary_emergence has unnoticedly ended in the world, its last bastion being #SpiceTradeAsia.

2) We have to resuscitate #evolutionary_emergence, to have the readiness to emerge (as almost all plants and animals except most humans have) again.

3) Overwhelming pessimism/negativity without thought to the fact that the first village ever, would not have been possible with this high level of herd-mentality programmed pessimism.

4) Several more causes

Generative AI gives us an opportunity to fix this gross pessimism flaw that media has a big responsibility for.

It is only reasonable for media, NYT included to contribute towards resetting the sensibility of the playing field -- have it tilt back to neutral instead of remaining demonic.

Background visual of Universal Basic Life (UBL not UBI) in #SpiceTradeAsia region: https://youtu.be/NuZujx-LMfg

Victualis

Jan 24, 2024

How long will proprietary LLMs remain low cost, if people start to rely on them to do their jobs? Actually low cost LLMs like quantized Llama-2 variants are not competitive with ChatGPT-4 and will likely remain worse since the secret sauce seems to be proprietary data and curation. This article highlights how access to insider information (whether paid-for newsletters, databases, consulting reports, or copyrighted articles) creates a defensible moat for the cloud AI providers, which have cash to pay for access once the lawsuits decide they need to pay. The problem here isn't the LLMs but the way we have decided it is OK to keep some information private and to restrict access, to incentivize creators, but at the cost of creating a socially corrosive division between those in the know and the rest of us. LLMs are just a symptom of this divide, and cannot bridge it.

Reply (1)

Johannes

Jan 24, 2024

I believe LLMs (and the AI technologies that will follow them) can bridge that divide, if we as a society make the reasonable choice and decide that training AI on the entirety of available scientific and cultural knowledge is in the best interest of mankind. Declaring AI training as 'theft' and raising paywalls to allow only well-funded AI systems to access our shared cultural memory (and only rich clients to use it) will lead to exactly that divided future you describe above.

David A. Westbrook

Jan 23, 2024

Nicely done, again! Speaking as a recovering lawyer, this is all about copyright. I'm not sure the fixes you suggest will work so well in the land of trademark, where likelihood of confusion rules. But we shall come to know.

More generally, we have yet to develop a political economy of technology -- what do/should we want? Moving the question from employment (Luddites!) to art may be a cognitive and political step forward.

Keep up the good work.

Vainius Indilas

Jan 22, 2024

Thank you, great article!

Just re: “As the model generates text, the filter looks it up in real time in a web search index (OpenAI can easily do this due to its partnership with Bing)”

- wouldn’t this substantially slow down the output generation / increase compute costs? I assume that the output text would need to be split into multiple chunks and each would need to be tested separately.

Reply (2)

Arvind Narayanan

Jan 22, 2024

I agree, but LLMs are so computationally expensive that I think this will be a fraction of the total cost. I think one lookup per paragraph is reasonable, but I believe generating a paragraph of text is significantly more costly than a web search.
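
To make "one lookup per paragraph" concrete, here is a minimal sketch. It assumes a toy in-memory index as a stand-in for a real web search index (it does not call Bing or any real API), and the names `ToyIndex` and `filter_output`, the 5-word shingle size, and the 0.5 threshold are illustrative choices rather than anything from the article. Each generated paragraph triggers exactly one lookup, and a paragraph is flagged when most of its word 5-grams appear in a single indexed source.

```python
import re


def shingles(text, n=5):
    """Return the set of n-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


class ToyIndex:
    """Hypothetical stand-in for a web search index: shingle -> set of source URLs."""

    def __init__(self, n=5):
        self.n = n
        self.postings = {}

    def add(self, url, text):
        """Index every shingle of `text` under `url`."""
        for s in shingles(text, self.n):
            self.postings.setdefault(s, set()).add(url)

    def lookup(self, paragraph):
        """Return (best_url, overlap): the source sharing the most shingles with
        the paragraph, and the fraction of the paragraph's shingles it covers."""
        query = shingles(paragraph, self.n)
        if not query:
            return None, 0.0
        hits = {}
        for s in query:
            for url in self.postings.get(s, ()):
                hits[url] = hits.get(url, 0) + 1
        if not hits:
            return None, 0.0
        best = max(hits, key=hits.get)
        return best, hits[best] / len(query)


def filter_output(generated, index, threshold=0.5):
    """Run one lookup per paragraph of generated text.

    Returns a list of (paragraph, matched_url) pairs, where matched_url is None
    unless the overlap with a single source reaches `threshold`.
    """
    results = []
    for paragraph in generated.split("\n\n"):
        url, overlap = index.lookup(paragraph)
        results.append((paragraph, url if overlap >= threshold else None))
    return results
```

In this sketch the per-paragraph cost is a handful of dictionary lookups. A real deployment would pay a network round trip to a search backend instead, which is the cost being compared with generation above.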

Johan Brandstedt

Jan 22, 2024

It’s massively wasteful. This is intra-big tech competition, revving datacenter engines against the others in a chicken race of wasted compute

Thomas Board

Jul 6

Thanks for this article, it helps provide clearer framing around the need for better solutions to AI copyright issues than relying on the courts to sort things out. Ever since becoming aware of the threat of AI tools to the creative industries, I've been thinking that the real issue is not compliance with current laws but the need for new ones. Our statute-based legal systems, being based on case law established from litigation in the past, are wholly inadequate for the task of solving future-facing technological disruption.

It would be interesting in future articles to hear more thoughts and perspectives on how regulators and law-makers can grasp the "real" issues facing these creative industries. Whilst I agree with your view that AI technologies are likely to take more time to mature and evolve than the hype would suggest, this is of little comfort when companies seem to be trapped in a reactionary competitive cycle of needing to keep up with competitors who are "adopting AI" (whatever that means) at a pace dictated by the market hype.

I sincerely hope your articles help redress the balance and inspire our regulators and law-makers to start engaging proactively with these issues.

James Andrewartha

Mar 13, 2024

I don't think the fair use argument will hold water given there's now a well-established market for training data for LLMs.

Dmitrii Smirnov

Mar 2, 2024

I think these lawsuits will lead to even higher walls being built around copyrighted knowledge inside big companies. When a human reads copyrighted material, he or she simply absorbs the knowledge, but almost nobody is able to reproduce precisely what was read. This is the main difference between AI and humans, and it is the biggest pain point. My personal prediction is that big companies will soon be forced to invent a new kind of copyright law, e.g. a "grade of equality", like checking whether your text is unique or not. One AI will be fighting another AI.

David Willis

Feb 10, 2024

In fairness to ChatGPT, NYT and others use a client-side paywall that is easily defeated with some simple tweaks in dev tools. Tweaks a computer doesn't even need to make. I feel like that's on them.
