But on information and belief, NVIDIA could not secure fast access to the huge quantity of books it needed through publishers. As one book publisher told NVIDIA, it was “not in a position to engage directly just yet but will be in touch.” In 2023, NVIDIA had “chatted with multiple publishers . . . but none [] wanted to enter into data licensing deals.”
So they approached Anna's Archive, hoping to acquire millions of pirated copies of books for "pre-training data for our LLMs." Anna's Archive offers high-speed access for a fee, and NVIDIA executives asked about that kind of access. What would it look like?
Anna's Archive replied, in effect, "You guys know that our entire library consists of pirated copies, right? Maybe you should figure out if you're okay with that." NVIDIA executives would (real quote coming) need to let Anna's Archive know "when you have decided internally that this is something that you can pursue. We have wasted too much time on people who could not get internal buy-in."
It took NVIDIA just a couple of days to decide that they were perfectly okay making a deal to use this vast library of pirated works-- all of Anna's Archive, plus works from Internet Archive (previously found to be copyright infringement). NVIDIA was promised 500 terabytes of data. They also hit up other shadow libraries.
A few months later, they unveiled Nemotron-4 15B. As usual, the training data used to raise up this AI beast was kept a super secret, but the plaintiffs believe that it could not have been done without using that vast library of pirated works (including their own).
And NVIDIA offered the NeMo Megatron framework for customers to build and train their own AI. "As part of this process, NVIDIA assisted and encouraged its customers" to go ahead and pirate those works some more by downloading and using that same dataset.
So the allegation is that NVIDIA used pirated works, knew it was using pirated works, and then offered to share those pirated works. With a few smoking emails to back it up.
NVIDIA says, who, us? We didn't violate copyright laws. Everything we did was legal, and also, fair use.
It's the fair use defense we'll want to watch. An earlier lawsuit by authors suing Anthropic over the training data used for its Claude AI was decided last summer, with the judge declaring that using the stolen works to train the AI was "exceedingly transformative" and therefore okey dokey fair use. Also last summer, a group of authors (including Sarah Silverman and Ta-Nehisi Coates) lost their similar lawsuit against Mark Zuckerberg's Meta. The judge in that case said it “is generally illegal to copy protected works without permission,” but in this case, the plaintiffs failed to present a compelling argument that Meta’s use of books to train their chatbot Llama caused “market harm.”
I don't suppose it will be easy to ever show market harm. ChatGPT slurps up my horror novel and then spits out fifty bad horror novels-- is that competition that does me market harm?
So it's not looking good for this newest lawsuit. Is it theft if someone takes my work without paying for it and uses it to power their trillion dollar company's newest product? It sure seems like it, but it seems that the law is having trouble keeping up with the new kinds of thievery that technology makes possible. Mind you, if I stole a copy of Microsoft Office and didn't use it to compete with Microsoft-- just used it to run my business-- I'm pretty sure my claim of fair use would not get past the courts.
And the AI industry--which depends on this kind of theft to keep costs down in their business model-- certainly can't be counted on to do the right thing. So we're stuck in this shitty place where a monster industry bases its product on the theft-without-pay of other people's work, and nobody can do anything about it.
What does any of this have to do with education?
Maybe nothing directly, but I want you to think about all of this the next time somebody wants to talk to you about "ethical" use of AI in schools. Then ask them how one ethically uses a fundamentally unethical product.