18 Comments

Thanks for highlighting the other and more serious side of the issue!

This aspect will be tried in Germany in April (Kneschke v Laion).

It’s also worth noting that this past week both the US Senate and the UK government have confirmed that training on copyrighted works is illegal.

We actually addressed the possibility of output filtering in our paper, noted that it is not reliable, and said we suspect it will not work at all well for lesser-known works. You took the easiest cases, ignored what we wrote, and generalized to all cases, exactly the sort of reasoning fallacy your blog is supposed to highlight.

At the least you could have addressed the passage in the Spectrum where we said, “As a last resort, the X user @bartekxx12 has experimented with trying to get ChatGPT and Google Reverse Image Search to identify sources, with mixed (but not zero) success. It remains to be seen whether such approaches can be used reliably, particularly with materials that are more recent and less well-known than those we used in our experiments.” Even more important, I think, was this passage towards the end, which we quoted: “… everyone knows what Mario looks like. But nobody would recognize Mike Finklestein’s wildlife photography. So when you say ‘super super sharp beautiful beautiful photo of an otter leaping out of the water,’ you probably don’t realize that the output is essentially a real photo that Mike stayed out in the rain for three weeks to take.” You have by no means shown that ChatGPT filtering will be up to that task, nor even considered the possibility.

A bit disappointed you didn’t ask me for comment before posting, particularly given the strong formulation you chose, which focuses only on the most frequent stimuli. If you really want to pursue this, you should also consider the noncanonical Marios in my Substack piece with Katie Conrad, and other stimuli of that ilk. Generalization, as always, is the key issue.

The AI genie is out of the bottle, and you can't legislate your way to stopping it. Introducing restrictive laws in the West will only empower competitors in less regulated regions. AI has changed the rules, and the best thing to do is to adapt instead of trying to hang on to what worked in the past. We will all come to this realization sooner or later.

Newspaper articles can be found all over the internet, especially if they are interesting to a particular demographic. How is it possible for the NYT to 'own' a piece of news? Particularly as many articles are themselves copied from Reuters reports, or from journalism sourced from another news outlet. Even if you argue about the similarity of the language used, unless you have pages of text, how many ways are there to report a newsworthy event?

You highlight the harms LLMs can do to creators, yet fail to contrast that with the benefit the global public receives through these models. LLMs make our shared cultural memory available at very low cost to every person on the globe with internet access, and let them apply it to produce and achieve new things, from research to differential medical diagnosis to technical and legal advice to personal tutoring on any subject ever written about. That is a massive step forward in the evolution of knowledge societies, comparable perhaps only to the printing press. The NY Times' demand to destroy this new tool because it can (unreliably) reproduce months-old articles is completely out of proportion to these benefits. The NY Times also needs to demonstrate ACTUAL loss of revenue, which will be very hard to prove; I doubt a single person has cancelled their subscription because they can reproduce some individual articles that went viral a few months ago.

I am not a lawyer, so I have no way of telling which way the lawsuit will go. But I agree that the effects of LLMs will outlast the lawsuit. The difference is just that I think these effects are overwhelmingly positive for knowledge societies as a whole, even if some players, which have created information monopolies for themselves in the past, are hurt in the short-term.

Nicely done, again! Speaking as a recovering lawyer, this is all about copyright. I'm not sure the fixes you suggest will work so well in the land of trademark, where likelihood of confusion rules. But we shall come to know.

More generally, we have yet to develop a political economy of technology -- what do/should we want? Moving the question from employment (Luddites!) to art may be a cognitive and political step forward.

Keep up the good work.

Thank you, great article!

Just re: “As the model generates text, the filter looks it up in real time in a web search index (OpenAI can easily do this due to its partnership with Bing)”

- wouldn’t this substantially slow down the output generation / increase compute costs? I assume that the output text would need to be split into multiple chunks and each would need to be tested separately.
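For a rough sense of why the chunking matters, here is a toy sketch of such a filter. This is an assumption about how it might work, not OpenAI's actual design: the search index is simulated with an in-memory set of passages, and the 8-word chunk size is arbitrary. A real deployment would replace the set lookup with live search-index queries (e.g. against Bing), one per chunk, which is exactly where the latency and compute cost raised in the question would come from.

```python
# Stand-in for a web-scale search index of previously published text.
KNOWN_PASSAGES = {
    "the quick brown fox jumps over the lazy dog",
}

def ngrams(text: str, n: int = 8):
    """Yield every n-word chunk of the text (lowercased)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i : i + n])

def flags_as_copied(candidate_output: str, n: int = 8) -> bool:
    """Return True if any n-word chunk of the model output appears
    verbatim in the index. Each chunk is a separate lookup, which is
    why per-response cost grows with output length."""
    index_grams = set()
    for passage in KNOWN_PASSAGES:
        index_grams.update(ngrams(passage, n))
    return any(g in index_grams for g in ngrams(candidate_output, n))
```

For a k-word output this is on the order of k lookups per response, so the commenter's concern about throughput and compute is well founded; batching chunks or checking only sampled spans would trade cost against filter reliability.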

I don't think the fair use argument will hold water given there's now a well-established market for training data for LLMs.

This touches on a topic that has been troubling me for some time: the generative AI platforms I have used seem unable to correctly identify the source of their information, which makes it difficult for a reader to (a) assess the quality of the original source, and (b) understand where to look if they want to learn more. The academic practice of referencing source material would help address this issue if it could be done accurately, but thus far I have not seen it executed well.

It seems to me that this is not just an end-run around copyright but also a deep failure in acknowledging information sourcing, which threatens to make critically assessing the output of these platforms challenging for anyone who is not a subject matter expert.

Vetting the integrity of information sources seems to be one of the rising challenges of our times.

I think this lawsuit (or these lawsuits) will lead to even higher walls being built around copyrighted knowledge inside big companies. Consider: when a human reads copyrighted material, he or she simply absorbs the knowledge, but almost nobody can reproduce what was read precisely. That is the main difference between AI and humans, and it is the biggest pain point. My personal prediction: big companies will soon be forced to invent a new kind of copyright law, e.g. a "grade of equality," a check of whether your text is unique or not. One AI will be fighting another AI.
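A "grade of equality" could, at its simplest, be a textual-overlap score between a generated text and a copyrighted original. Below is a minimal sketch of that idea, not anything from existing law or tooling: Jaccard similarity over three-word shingles, where the shingle size and the reading of the score are illustrative assumptions only.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles of the text (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i : i + k]) for i in range(len(words) - k + 1)}

def equality_grade(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-word shingles:
    0.0 = no shared phrasing, 1.0 = word-for-word identical."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A regime like the one the comment imagines would then argue over where on that 0-to-1 scale "too similar" begins, with one side's AI generating paraphrases and the other side's AI scoring them.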

In fairness to ChatGPT, NYT and others use a client-side paywall that is easily defeated with some simple tweaks in dev tools. Tweaks a computer doesn't even need to make. I feel like that's on them.
