Wednesday, January 22, 2025

Against AI Theft

Among the many reasons to give Artificial Intelligence some real side-eye is the business model that rests entirely on plagiarism-- stealing the works of human creators to "train" the systems. Now a new paper attacks a defense of the AI cyber-theft machines.

"Generative AI's Illusory Case for Fair Use" comes from Jacqueline Charlesworth (who appears to be a real person, a necessary check that we all need to do now whenever we come across scholarly work because AI is cranking out sludge swiftly). Charlesworth was a general counsel of the US Copyright Office and specializes in copyright litigation


The folks hoping to make bank on AI insist that piracy is not their business model, and one of their favorite arguments to hide behind is Fair Use. Teachers are familiar with Fair Use rules, which tell us that we can show movies if they are being used for legitimate teaching stuff but not for entertainment. 

But as Charlesworth explains it, the Big Boys of AI argue that while the programs are copying the works used for training, the AI only "learns" uncopyrightable information about the works.
Once trained, they say, the model does not comprise or make use of the content of the training works. As such, they contend, the copying is a fair use under U.S. law.

That, says Charlesworth, is bunk.

The 42-page paper combines hard-to-understand AI stuff with hard-to-understand law stuff, but it includes lots of useful insights and illustrations of AI's lack of smartitude. Charlesworth is a clear and incisive writer, and she dismantles the defense used by Big AI companies pretty thoroughly.

Despite wide employment of anthropomorphic terms to describe their behavior, AI machines do not learn or reason as humans do. They do not “know” anything independently of the works on which they are trained, so their output is a function of the copied materials. Large language models, or LLMs, are trained by breaking textual works down into small segments, or “tokens” (typically individual words or parts of words) and converting the tokens into vectors—numerical representations of the tokens and where they appear in relation to other tokens in the text. The training works thus do not disappear, as claimed, but are encoded, token by token, into the model and relied upon to generate output.
Furthermore, the earlier cases don't fit the current situation as far as the business aspects go:
The exploitation of copied works for their intrinsic expressive value sharply distinguishes AI copying from that at issue in the technological fair use cases relied upon by AI’s fair use advocates. In these earlier cases, the determination of fair use turned on the fact that the alleged infringer was not seeking to capitalize on expressive content—exactly the opposite of generative AI.

Charlesworth also notes that in the end, these companies fall back on the claim of their "overwhelming need to ingest massive amounts of copyrighted material without permission from or payment to rightsholders." In other words, "Please let us steal this stuff because we really, really need to steal this stuff to make a big mountain of money."

Charlesworth does a good job of puncturing attempts to anthropomorphize AI when, in fact, AI is not "smart" at all. 

Unlike humans, AI models “do not possess the ability to perform accurately in situations not encountered in their training.” They “recite rather than imagine.” A group of AI researchers has shown, for instance, that a model trained on materials that say “A is B” does not reason from that knowledge, as a human would, to produce output that states the reverse, that B is A. To borrow one of the researchers’ examples, a model trained on materials that say Valentina Tereshkova was the first woman to travel in space may respond to the query, “Who was Valentina Tereshkova?” with “The first woman to travel in space.” But asked, “Who was the first woman to travel in space?,” it is unable to come up with the answer. Based on experiments in this area, the research team concluded that large language models suffer from “a basic inability to generalize beyond the training data.”

Charlesworth gets into another area--the ability of AI to reconstruct the data it was trained on. One of her examples shows up in the New York Times lawsuit against OpenAI, in which, with just a little prompting, ChatGPT was able to "regurgitate" nine paragraphs of a NYT article verbatim. This ability isn't one we often see demonstrated (it is certainly not in OpenAI's interest to show it off), but it creates a real problem for the Fair Use argument. They may not have a copy of the copyrighted work stored, but they can pull one up any time they want.

And she notes that the cases cited in defense are essentially different:

Pointing to a handful of technology-driven fair use cases, AI companies and their advocates claim that large-scale reproduction of copyrighted works to develop and populate AI systems constitutes a fair use of those works. But Google Books, HathiTrust, Sega and other key precedents relied upon by AI companies to defend their unlicensed copying—mainly Kelly v. Arriba Soft Corp., Perfect 10, Inc. v. Amazon.com, Inc., A.V. v. iParadigms, LLC (“iParadigms”), Sony Computer Entertainment, Inc. v. Connectix Corp. (“Sony Computer”) and Google, LLC v. Oracle America, Inc. (“Oracle”)—are all in a different category with respect to fair use. That is because these cases were concerned with functional rather than expressive uses of copied works. The copying challenged in each was to enable a technical capability such as search functionality or software interoperability. By contrast, copying by AI companies serves to enable exploitation of protected expression.

There's lots more, and her 42 pages include 237 footnotes. It's not a light read. But it is a powerful argument against the wholesale plagiarism fueling the AI revolution. It remains for the courts to decide just how convincing the argument is. But if you're trying to bone up on this stuff, this article is a useful read.
