But on information and belief, NVIDIA could not secure fast access to the huge quantity of books it needed through publishers. As one book publisher told NVIDIA, it was “not in a position to engage directly just yet but will be in touch.” In 2023, NVIDIA had “chatted with multiple publishers . . . but none [] wanted to enter into data licensing deals.”
So they approached Anna's Archive, hoping to acquire millions of pirated copies of books for "pre-training data for our LLMs." Anna's Archive offers high-speed access for a fee, and NVIDIA executives asked about that kind of access. What would it look like?
Anna's Archive replied, in effect, "You guys know that our entire library consists of pirated copies, right? Maybe you should figure out if you're okay with that." NVIDIA executives would (real quote coming) need to let Anna's Archive know "when you have decided internally that this is something that you can pursue. We have wasted too much time on people who could not get internal buy-in."
It took NVIDIA just a couple of days to decide that they were perfectly okay making a deal to use this vast library of pirated works-- all of Anna's Archive, plus works from Internet Archive (previously found to be copyright infringement). NVIDIA was promised 500 terabytes of data. They also hit up other shadow libraries.
A few months later, they unveiled Nemotron-4 15B. As usual, the training data used to raise up this AI beast was kept a super secret, but the plaintiffs believe that it could not have been done without using that vast library of pirated works (including their own).
And NVIDIA offered the NeMo Megatron framework for customers to build and train their own AI. "As part of this process, NVIDIA assisted and encouraged its customers" to go ahead and pirate those works some more by downloading and using that same dataset.
So the allegation is that NVIDIA used pirated works, knew it was using pirated works, and then offered to share those pirated works. With a few smoking emails to back it up.
NVIDIA says, who, us? We didn't violate copyright laws. Everything we did was legal, and also, fair use.
It's the fair use defense we'll want to watch. An earlier lawsuit by authors suing Anthropic over the training data used for its Claude AI was decided last summer, with the judge declaring that using the stolen works to train the AI was "exceedingly transformative" and therefore okey dokey fair use. Also last summer, a group of authors (including Sarah Silverman and Ta-Nehisi Coates) lost their similar lawsuit against Mark Zuckerberg's Meta. The judge in that case said it “is generally illegal to copy protected works without permission,” but in this case, the plaintiffs failed to present a compelling argument that Meta’s use of books to train their chatbot Llama caused “market harm.”
I don't suppose it will be easy to ever show market harm. ChatGPT slurps up my horror novel and then spits out fifty bad horror novels-- is that competition that does me market harm?
So it's not looking good for this newest lawsuit. Is it theft if someone takes my work without paying for it and uses it to power their trillion dollar company's newest product? It sure seems like it, but it seems that the law is having trouble keeping up with the new kinds of thievery that technology makes possible. Mind you, if I stole a copy of Microsoft Office and didn't use it to compete with Microsoft-- just used it to run my business-- I'm pretty sure my claim of fair use would not get past the courts.
And the AI industry--which depends on this kind of theft to keep costs down in their business model-- certainly can't be counted on to do the right thing. So we're stuck in this shitty place where a monster industry bases its product on the theft-without-pay of other people's work, and nobody can do anything about it.
What does any of this have to do with education?
Maybe nothing directly, but I want you to think about all of this the next time somebody wants to talk to you about "ethical" use of AI in schools. Then ask them how one ethically uses a fundamentally unethical product.