Apply for community grant: Academic project (gpu)

#1
by nathanReitinger - opened

tl;dr I'd like to host some of my empirical work where others can easily access it and see for themselves how my argument works. HuggingFace is the perfect place for that, but the models in a Gradio Space do not run fast enough for the results to populate within a reasonable time. More on the project is below! Thanks for looking 🤗

=== start introduction

From ChatGPT to Stable Diffusion, Artificial Intelligence (AI) is having a summer the likes of which rival only the heyday of the 1970s. Starting around December 2022, hundreds of AI-driven businesses have blossomed, Google has, for the first time in decades, experienced some semblance of competition, and GPU stock prices, a proxy for machine learning activity, have skyrocketed. Yet the jubilation has not gone unchallenged. From Hollywood to the Louvre, AI has awoken a sleeping giant, one keen to protect a world that once seemed exclusively human: creativity.

As luck would have it, these big, clunky models have an Achilles' heel: training data. All of today's most popular model architectures necessitate a high-quality, world-encompassing data diet, one that the “free” services we cannot live without have, so far, happily provided. Two aspects of this data diet are important.

First, high quality in this context means of human origin. Although data of non-human origin is an option, and synthetic data has made many strides since the idea of a computer playing itself was popularized by WarGames, the computer science literature has shown that model quality will degrade over time if humanness is taken out of the loop completely (i.e., model rot). In other words, human data is the lifeblood of these models. Second, world-encompassing means world-encompassing. If you put it online, you should assume the model has used it in training: that Myspace post you were hoping only you and Tom remembered (ingested), that picture-encased memory you gladly forgot about until PimEyes forced you to remember it (ingested), and those late-night Reddit tirades you hoped were just a dream (ingested).

The irony is that the very data that gives these models an accuracy that feels apocalyptic, the data they need to survive, is the same data producing legal antipathy. To sum it up in one line: data is free beer, not free speech.
Copyright—purposed to encourage the progress of science and useful arts—vests from the moment a work is created and lasts for the life of the author plus 70 years. If it is safe to assume that the entirety of data shared on the Internet includes at least some portion of protected works (it is), then these models must come face to face with the question: does ingestion result in the storage of protected content? Or, with less precision: do machine learning models memorize?

According to OpenAI, the company behind ChatGPT, the case is closed: ingestion does not mean storage; the models, which learned from training data long since incinerated, do not copy or contain anything even closely resembling protected property. "Models do not contain or store copies of information that they learn from. Instead, as a model learns, some of the numbers that make up the model change slightly to reflect what it has learned."

This is a compelling argument, given that the two-part secret sauce of machine learning, neural networks and learned weights, is a many-steps-removed abstraction from the inputs and outputs users interact with. If neural networks are unintelligible, the weights are nearly gibberish. How can a bare set of floating point numbers be said to contain copyrighted material, let alone anything useful? And, even more convincingly, if this product is the result of purposeful randomness, injected specifically to get the model to avoid memorization, does this not moot simple conclusions that models are encodings of protected content?
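
To make the point concrete, here is a minimal sketch (the tiny, untrained network is hypothetical and purely illustrative, not one of the Article's models) showing that, once flattened, a model is nothing but a long vector of floats:

```python
# A model's "contents" are floating point numbers; production models differ
# from this toy network in scale, not in kind.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),   # e.g., a 28x28 image flattened to 784 inputs
    nn.ReLU(),
    nn.Linear(128, 10),    # ten output classes
)

# Flatten every learned parameter into one long vector of floats.
weights = torch.cat([p.detach().flatten() for p in model.parameters()])
print(f"total parameters: {weights.numel()}")
print("first ten values:", weights[:10].tolist())
```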

What is more, OpenAI’s position is not without legal support. Several works agree, beyond the required, if ephemeral, storage of training data, that the resulting model—itself—does not store protected content, going so far as to call the position a mistaken perception: “One persistent misunderstanding is the perception that the [machine learning] training process makes repeated or derivative copies of each work used as input [and that] the model somehow ‘stores’ the works used for training within the model. Both of these perceptions are incorrect.” This point is perhaps made most eloquently by Professor Murray: "Many of the participants in the current debate on visual generative AI systems have latched onto the idea that generative AI systems have been trained on datasets and foundation models that contained actual copyrighted image files, .jpgs, .gifs, .png files and the like, scraped from the internet, that somehow the dataset or foundation model must have made and stored copies of these works, and somehow the generative AI system further selected and copied individual images out of that dataset, and somehow the system copied and incorporated significant copyrightable parts of individual images into the final generated images that are offered to the end-user. This is magical thinking." Still others have softened the argument: “Generally, pseudo-expression generated by large language models does not infringe copyright because these models ‘learn’ latent features and associations within the training data; they do not memorize snippets of original expression from individual works. However, this Article identifies situations in the context of text-to-image models where memorization of the training data is more likely.”

Either way, the throughline is clear: models are a collection of floating point numbers, and numbers, even groups of numbers, do not contain anything and therefore cannot infringe. On the one hand, OpenAI is well informed. Machine learning models are merely a series of uninterpretable floating point numbers (weights) joined with a roadmap for how to apply those numbers (architecture). This fact, however true, does not nullify the inquiry; if it did, then any piece of information communicated since we successfully built Babbage’s Difference Engine and joined it with Lovelace’s genius would have failed to warrant protection. The entire field of computing is built on abstractions and is, at bottom, only ever a manipulation of electricity: not something copyrightable. This point appears trivial, but as this Article shows, it reaches the crux of the debate—how do we interpret storage in terms of copyright, and how do we interpret storage in terms of how the models technically work?

This Article argues that technical storage and legal storage are one and the same. In Part I, the Article lays a foundation for understanding machine learning models in the copyright context. This part includes background information on what goes into training and, as the primary focus of the Article, what comes out of training. Next, the Article addresses the linchpin issue: what does it mean to store protected content? This part of the Article breaks the question into two pieces. First, legally, how is storage defined in the copyright context? Second, what does it mean, technically, to store information inside a machine-learning model? The second question is answered empirically and, more importantly, by taking a page out of Harvard’s CS50 handbook: if a problem is hard, break it into smaller, digestible pieces.

The models we are attempting to reason about, models like LLaMA, Claude, or ChatGPT, represent the result of some of the most powerful companies on Earth focusing a large share of their computing resources on very particular tasks, producing a scale and complexity that is unfathomable. Think about this: it might take my personal computer roughly 347 days to learn a machine-learning task like predicting the results of Supreme Court cases based only on oral argument transcripts, because my computer takes five minutes to complete one loop of a 100,000-loop process. The same task might take a researcher with university resources only two days, and a researcher at a company like Google only a few minutes. What this means for the types of models at center stage in 2024’s version of Google Books is that these models are: (1) difficult to access in terms of “poking around the engine” to see how they work (e.g., it might be nice to know how many characters in a sentence produced by ChatGPT are unique when compared to the entirety of ChatGPT’s training data—not going to happen); and (2) unnecessarily complex, in many regards, for particular legal inquiries (i.e., a model’s parameter count does not affect the factual inquiry of whether a model contains a copy, though it surely affects the optics of that conclusion and makes the question much harder to answer).
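
The 347-day figure is simple arithmetic, as this quick check shows:

```python
# Back-of-the-envelope check on the timing claim above: 100,000 training loops
# at five minutes per loop on a personal machine.
minutes_per_loop = 5
loops = 100_000
days = minutes_per_loop * loops / 60 / 24
print(f"{days:.0f} days")  # roughly 347 days
```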

With these complexities in mind, Section II.A breaks the problem into smaller pieces by evaluating a small dataset of only 60,000 images (the MNIST dataset, limited to its training corpus). This environment is nontrivial in terms of technical analysis and rigor, yet easier to interpret and easier to test. The Article uses this dataset to show how generative models—an off-the-shelf Generative Adversarial Network (GAN), a Convolutional Neural Network (CNN), and today’s golden child, a diffusion model—can and do generate images that are similar to images found in the training data.
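
For readers who want to see the shape of this experiment in code, here is a minimal sketch of the GAN piece (the architecture and hyperparameters below are my own illustrative assumptions, not the Article's exact setup): a small generator and discriminator trained on MNIST, from which digits can later be sampled and compared against the training corpus.

```python
# Minimal GAN sketch on MNIST (illustrative only).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
latent_dim = 64

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),           # 28x28 image, pixel values in [-1, 1]
).to(device)

discriminator = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1),                        # real-vs-fake logit
).to(device)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

for epoch in range(5):                        # a handful of epochs, for illustration
    for real, _ in loader:
        real = real.view(real.size(0), -1).to(device)
        batch = real.size(0)

        # Discriminator step: real images vs. generated ("fake") images.
        z = torch.randn(batch, latent_dim, device=device)
        fake = generator(z).detach()
        d_loss = (loss_fn(discriminator(real), torch.ones(batch, 1, device=device)) +
                  loss_fn(discriminator(fake), torch.zeros(batch, 1, device=device)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: try to make the discriminator call fakes real.
        z = torch.randn(batch, latent_dim, device=device)
        g_loss = loss_fn(discriminator(generator(z)),
                         torch.ones(batch, 1, device=device))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Sample a few digits from the trained generator for the comparison step below.
with torch.no_grad():
    samples = generator(torch.randn(16, latent_dim, device=device)).view(-1, 1, 28, 28)
```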

To be sure, this finding is not novel. Nicholas Carlini, Vitaly Shmatikov, Arvind Narayanan, and many others have been evaluating the memorization of training data by machine learning models for several years. The perspective here is novel, however, in that it provides a small-world dataset to analyze, allowing this Article to assess overall trends in copying across the entire training dataset. In turn, the Article produces compelling empirical data on when and how models store copies.
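
As one hedged illustration of how such trends could be quantified across the full corpus (pixel-space L2 distance is an assumption made for exposition, not necessarily the Article's precise metric), each generated image can be matched against its closest training image, with unusually small distances flagging candidate near-duplicates worth inspecting by eye:

```python
# Nearest-neighbor scan over the entire training corpus (illustrative metric).
import torch

def nearest_training_neighbors(samples: torch.Tensor,
                               train_images: torch.Tensor) -> torch.Tensor:
    """samples: (n, 784); train_images: (60000, 784); returns (n,) min distances."""
    dists = torch.cdist(samples, train_images)    # all pairwise Euclidean distances
    return dists.min(dim=1).values

# Example usage with the GAN sketch above:
# train_images = train_set.data.view(-1, 784).float() / 127.5 - 1.0  # rescale to [-1, 1]
# flat_samples = samples.view(samples.size(0), -1).cpu()
# print(nearest_training_neighbors(flat_samples, train_images))
```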

Finally, in Part III, the Article merges the technical and the legal to argue that models, to some non-trivial degree, offend the copyright of content used in training. This part of the Article also addresses several variations of arguments made in refutation of that conclusion: that models learn rather than copy, that memorizing 0.05% of the training corpus is permissible, and that randomness defeats the legal standard for copying. Lastly, the Article concludes with a few normative points on copyright plus human augmentation via AI.
