Thanks for highlighting the other and more serious side of the issue!

This aspect will be tried in Germany in April (Kneschke v. LAION).

It’s also worth noting that both the US Senate and the UK government have confirmed training on copyrighted works as illegal this past week.

We actually addressed the possibility of output filtering in our paper, and noted that it is not reliable, and suspect that it will not work at all well for lesser-known works. You took the easiest cases, ignoring what we wrote, and have generalized to all cases, exactly the sort of reasoning fallacy your blog is supposed to highlight.

At the least you could have addressed the passage in the Spectrum where we said “As a last resort, the X user @bartekxx12 has experimented with trying to get ChatGPT and Google Reverse Image Search to identify sources, with mixed (but not zero) success. It remains to be seen whether such approaches can be used reliably, particularly with materials that are more recent and less well-known than those we used in our experiments.” Even more important, I think, was this passage towards the end that we quoted “… everyone knows what Mario looks like. But nobody would recognize Mike Finklestein’s wildlife photography. So when you say “super super sharp beautiful beautiful photo of an otter leaping out of the water” You probably don’t realize that the output is essentially a real photo that Mike stayed out in the rain for three weeks to take.” You have by no means shown that ChatGPT filtering will be up to that task, nor even considered the possibility.

A bit disappointed you didn’t ask me for comment before posting, particularly given the strong formulation you chose, focusing only on the most frequent stimuli. If you really want to pursue this, you should also consider the noncanonical Marios in my Substack with Katie Conrad, and other stimuli of that ilk. Generalization, as always, is the key issue.

I should also add that OpenAI regularly updates its models relative to published works (multiple mechanisms for that may exist), so a success on something already public, without further probing, is inevitably an insufficient evaluation.

"nobody would recognize Mike Finklestein’s wildlife photography"

A computer would, if it were trained to do so. I don't see why an artist's popularity has any bearing on whether or not output filtering would work. Why is it harder to identify Mike Finklestein's work than, say, Annie Leibovitz's?

The AI genie is out of the bottle and you can't legislate your way to stopping it. Introducing restrictive laws in the West will only empower competitors in less controlled regions. AI has changed the rules, and the best thing to do is to adapt instead of trying to hang on to what worked in the past. We will all come to this realization sooner or later.

Newspaper articles can be found all over the internet, especially if they are interesting to a particular demographic. How is it possible for the NYT to 'own' a piece of news? Particularly since many articles are themselves copied from Reuters reports, or from journalism sourced from another outlet. Even if you argue about the similarity of the language used, unless you have pages of text, how many ways are there to report a newsworthy event?

You highlight the harms LLMs can do to creators, yet fail to contrast that with the benefit the global public receives through these models. LLMs make our shared cultural memory available at very low cost to every person on the globe with internet access, and let them apply it to produce and achieve new things, from research to differential medical diagnosis to technical and legal advice to personal tutoring on any subject matter ever written about - a massive step forward in the evolution of knowledge societies, comparable maybe only to the printing press. The demand of the NY Times to destroy this new tool because it can (unreliably) reproduce months-old articles is completely out of proportion given the above benefits these models bring to the public. The NY Times also needs to demonstrate ACTUAL loss in revenue, which will be very hard to prove, because I doubt a single person has cancelled their subscription because they can reproduce some individual articles that went viral a few months ago.

I am not a lawyer, so I have no way of telling which way the lawsuit will go. But I agree that the effects of LLMs will outlast the lawsuit. The difference is just that I think these effects are overwhelmingly positive for knowledge societies as a whole, even if some players, which have created information monopolies for themselves in the past, are hurt in the short-term.

Excellent point, @Johannes.

For example, in almost every place in the world, it would be almost impossible to make a fresh #zerocrime village/town happen.

Yet, that is what thousands of our early ancestors did in no-village days.

Why can’t we? Because 1) emergence, #evolutionary_emergence, has quietly ended in the world, its last bastion being #SpiceTradeAsia.

2) We have to resuscitate #evolutionary_emergence, to have the readiness to emerge (as almost all plants and animals except most humans have) again.

3) Overwhelming pessimism/negativity without thought to the fact that the first village ever, would not have been possible with this high level of herd-mentality programmed pessimism.

4) Several more causes

Generative AI gives us an opportunity to fix this gross pessimism flaw that media has a big responsibility for.

It is only reasonable for media, NYT included to contribute towards resetting the sensibility of the playing field -- have it tilt back to neutral instead of remaining demonic.

Background visual of Universal Basic Life (UBL not UBI) in #SpiceTradeAsia region: https://youtu.be/NuZujx-LMfg

How long will proprietary LLMs remain low cost, if people start to rely on them to do their jobs? Genuinely low-cost LLMs like quantized Llama-2 variants are not competitive with ChatGPT-4 and will likely remain worse, since the secret sauce seems to be proprietary data and curation. This article highlights how access to insider information (whether paid-for newsletters, databases, consulting reports, or copyrighted articles) creates a defensible moat for the cloud AI providers, which have the cash to pay for access once the lawsuits decide they need to pay. The problem here isn't the LLMs but the way we have decided it is OK to keep some information private and to restrict access, to incentivize creators, but at the cost of creating a socially corrosive division between those in the know and the rest of us. LLMs are just a symptom of this divide, and cannot bridge it.

I believe LLMs (and the AI technologies that will follow it) can bridge that divide, if we as society make the reasonable choice and decide that training AI on the entirety of available scientific and cultural knowledge is in the best interest of mankind. Declaring AI training as 'theft' and raising paywalls to allow only well-funded AI systems to access our shared cultural memory (and only rich clients to use it) will lead to exactly that divided future you describe above.

Nicely done, again! Speaking as a recovering lawyer, this is all about copyright. I'm not sure the fixes you suggest will work so well in the land of trademark, where likelihood of confusion rules. But we shall come to know.

More generally, we have yet to develop a political economy of technology -- what do/should we want? Moving the question from employment (Luddites!) to art may be a cognitive and political step forward.

Keep up the good work.

Thank you, great article!

Just re: “As the model generates text, the filter looks it up in real time in a web search index (OpenAI can easily do this due to its partnership with Bing)”

- wouldn’t this substantially slow down the output generation / increase compute costs? I assume that the output text would need to be split into multiple chunks and each would need to be tested separately.
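
To make the concern concrete, here is a minimal sketch of the chunked-lookup filter being asked about. The `search_index_lookup` function is a hypothetical stand-in for a real web-search API call; here it just checks against a small in-memory corpus, but the shape of the loop shows why a long answer implies many lookups:

```python
# Hypothetical sketch: split generated text into overlapping word
# windows and test each one against a "search index". A real filter
# would replace search_index_lookup with a web-search API call.

KNOWN_PASSAGES = [
    "the quick brown fox jumps over the lazy dog",
]

def search_index_lookup(chunk: str) -> bool:
    """Stand-in hit test: True if the chunk appears near-verbatim."""
    return any(chunk.lower() in passage for passage in KNOWN_PASSAGES)

def filter_output(generated_text: str, chunk_words: int = 8) -> list[str]:
    """Flag windows of the output that match indexed text.

    One lookup per window is what drives the latency/cost question:
    a long answer produces many lookups.
    """
    words = generated_text.split()
    flagged = []
    for i in range(max(1, len(words) - chunk_words + 1)):
        window = " ".join(words[i:i + chunk_words])
        if search_index_lookup(window):
            flagged.append(window)
    return flagged

flagged = filter_output("The quick brown fox jumps over the lazy dog today")
```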

I agree, but LLMs are so computationally expensive that I think this will be a fraction of the total cost. I think one lookup per paragraph is reasonable, but I believe generating a paragraph of text is significantly more costly than a web search.

It’s massively wasteful. This is intra-big-tech competition, revving datacenter engines against each other in a chicken race of wasted compute.

The AI training/copyright problem will not be resolved legally - or at least, not to the satisfaction of media creators.

One reason is legal precedent: there are countless examples of media creators and companies building their franchises by incorporating and exploiting other people's work - consider:

- WaPo, Wall Street Journal, and NYTimes incorporate and reference data and copy from other sources into their work all the time, without payment or obtaining permission. If it's on the web, it's fair game.

- Statista and other infographic media companies do the exact same thing.

- How many influencers create the studies and news stories they discuss?

Another reason is that there is an easy/quick technology fix rolling out right now. As long as AI responses return the URLs of their "likely sources", they address the copyright/training issue under existing laws.

And RAG (retrieval-augmented-generation) technology makes this relatively easy and inexpensive to do.

Google, Perplexity, and most other major model builders are incorporating RAG into their architectures right now. And almost every applied AI company must use RAG to meet functional requirements - not to respect policy.

At a high level, the vast majority of AI use cases - media included - will soon be RAG-enabled, so the size of the copyright exposure problem for foundation model builders and applied AI companies is about to decline dramatically.
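
The retrieve-then-cite pattern described above can be sketched in a few lines. This is an illustrative toy, not any vendor's actual implementation: the corpus, the naive word-overlap scoring, and the echoed "answer" are all assumptions standing in for a real retriever and a real model call.

```python
# Toy RAG sketch: retrieve likely source documents first, then
# return the answer together with their URLs. A production system
# would use embeddings for retrieval and pass the documents into
# the model's context window instead of echoing them.

CORPUS = [
    {"url": "https://example.com/a", "text": "otters leap out of rivers at dawn"},
    {"url": "https://example.com/b", "text": "copyright law and fair use in training data"},
]

def retrieve(query: str, k: int = 1) -> list[dict]:
    """Rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(CORPUS,
                    key=lambda d: len(q & set(d["text"].split())),
                    reverse=True)
    return scored[:k]

def answer_with_sources(query: str) -> dict:
    docs = retrieve(query)
    # Stand-in for a model call grounded in the retrieved documents.
    return {"answer": docs[0]["text"],
            "sources": [d["url"] for d in docs]}

result = answer_with_sources("fair use copyright training")
```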

Is the debate over copyright and training about to follow suit?

I don't think the fair use argument will hold water given there's now a well-established market for training data for LLMs.

This touches on a topic that has been troubling me for some time - that the generative AI platforms I have used seem unable to correctly identify the source of their information, which makes it difficult for a reader to (a) assess the quality of the original source, and (b) understand where to look if they want to learn more. The academic practice of referencing source material would help to address this issue if it could be done accurately, but thus far I have not seen it executed well.

It seems to me that this is not just an end-run around copyright but also a deep failure to acknowledge information sourcing, which threatens to make critically assessing the output of these platforms challenging for anyone who is not a subject matter expert.

Vetting the integrity of information sources seems to be one of the rising challenges of our times.

I think this lawsuit (and those that follow) will lead to even higher walls being built around copyrighted knowledge inside big companies. When a human reads copyrighted material, he or she simply absorbs the knowledge, but almost nobody can reproduce what was read precisely. This is the main difference between AI and humans, and the biggest pain. My personal prediction: big companies will soon be forced to invent a new kind of copyright law, e.g. a "grade of equality" that checks whether your text is unique or not. One AI will be fighting another AI.
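
One way such a "grade of equality" could be scored - purely as an illustration, since no such standard exists - is Jaccard overlap of word trigrams between a candidate text and a reference text. The trigram size and any pass/fail threshold are assumptions:

```python
# Illustrative "grade of equality": Jaccard similarity over word
# trigrams. 0.0 means no shared three-word sequences; 1.0 means the
# two texts share exactly the same trigram set.

def trigrams(text: str) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def equality_grade(candidate: str, reference: str) -> float:
    a, b = trigrams(candidate), trigrams(reference)
    return len(a & b) / len(a | b) if (a | b) else 1.0

grade = equality_grade(
    "the otter leapt out of the water at dawn",
    "the otter leapt out of the river at dawn",
)
```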

In fairness to ChatGPT, NYT and others use a client-side paywall that is easily defeated with some simple tweaks in dev tools. Tweaks a computer doesn't even need to make. I feel like that's on them.
