StarCoder Memorization Experiment Highlights Privacy Risks of Fine-Tuning On Code

Community Article · Published November 2, 2023

TL;DR

We recently conducted an experiment that illustrates how LLMs memorize their training data, which poses a privacy risk: memorized training data can be extracted through prompts after deployment. Prior research has already shown that LLMs memorize their training sets.

We further confirm that trend by showing that the Hugging Face LLM for coding, StarCoder, memorized at least 7.6% of the training examples we sampled from The Stack. Our experiment can be reproduced using our notebook.

This highlights the inherent risk of sending confidential data, such as code, to Conversational AI providers that train on users’ inputs: the model weights can memorize the data by heart, and other users can then extract it through prompting. This memorization issue is the reason Samsung’s proprietary code got leaked after being sent to OpenAI.

On our Hugging Face Space, we released a demo showing how StarCoder completes samples from its training dataset by heart.

[GIF: demo of StarCoder reproducing a training sample from The Stack when prompted with its beginning]

Why code memorization poses copyright and IP issues

LLMs have shown huge potential for coding. However, LLMs are literally trained to learn their data by heart, which can be quite problematic for LLMs trained on code. Indeed, those models can memorize code whose owners never intended it to be shared with either the LLM or its users.

Yet, once this unconsented code is ingested and memorized by the LLM, the LLM can regurgitate that code to its users, who might not even know they are using unconsented code! Worse, if people send proprietary and sensitive code to an LLM provider and this code is used for training, other users can exfiltrate the proprietary code simply by prompting the LLM!

The paper Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy has highlighted this trend by showing not only that GitHub Copilot memorized code samples whose licenses did not explicitly allow them to be used for training, but also that the filter meant to prevent such code from being suggested by Copilot could itself be bypassed!

Unfortunately, as GitHub Copilot is a closed-source solution, studying the memorization of its training code is difficult. That is why we have leveraged the great work done by the BigCode team.

Our experiment

To better understand the memorization of code used during training, we reproduced the results of this paper, not on GitHub Copilot, which is a black-box model, but on StarCoder, an open-source model trained by BigCode on The Stack, a dataset of permissively licensed code.

Thanks to BigCode’s effort to make their training procedure open source, we knew exactly which data was used, which made it easy to test for memorization.

Our approach was straightforward:

  1. Take samples from StarCoder's original training data (The Stack)
  2. Feed StarCoder just the first few tokens from each sample as a prompt
  3. Check whether StarCoder's completion is close to the original sample. If the BLEU score between the completion and the original sample is higher than 0.75, the sample is deemed memorized (a minimal sketch of this check is shown below).

You can reproduce our experiment using our notebook.
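To make the check in step 3 concrete, here is a minimal sketch, assuming the BLEU implementation from the Hugging Face evaluate library; our notebook may use a different BLEU implementation, so treat this as illustrative rather than the exact scoring code.

```python
# Minimal sketch of the memorization check (step 3), assuming the
# Hugging Face `evaluate` implementation of BLEU.
import evaluate

bleu = evaluate.load("bleu")
MEMORIZATION_THRESHOLD = 0.75  # threshold used in our experiment


def is_memorized(completion: str, original: str) -> bool:
    """Return True if the completion is close enough to the original
    training sample to be deemed approximately memorized."""
    result = bleu.compute(predictions=[completion], references=[[original]])
    return result["bleu"] >= MEMORIZATION_THRESHOLD
```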

Experimental Methodology in More Detail

We conducted the following process to evaluate memorization in the StarCoder model:

  1. We sampled 1536 training examples from The Stack. This sample size was dictated by our limited computing resources; we encourage others to reproduce our results and explore larger sample sizes.
  2. We kept only the first 50 tokens of each sample to serve as a prefix.
  3. We fed the prefix into StarCoder and generated a completion with greedy decoding (other strategies such as beam search could be used, but Quantifying Memorization Across Neural Language Models showed that the decoding strategy has little impact on memorization).
  4. We compared the generated completion to the original training example using the BLEU score, a standard similarity metric originally designed for machine translation.
  5. If the BLEU score exceeded a threshold of 0.75, we classified the sample as approximately memorized, since the completion closely reconstructs the original training content.
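Steps 2 and 3 of this procedure can be sketched as follows with the transformers library, assuming access to the gated bigcode/starcoder checkpoint; max_new_tokens is an illustrative choice, not necessarily the value used in our notebook.

```python
# Sketch of steps 2-3: truncate a training sample to a 50-token prefix
# and let StarCoder complete it with greedy decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # gated checkpoint, requires accepting the license
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

PREFIX_TOKENS = 50


def complete_prefix(sample: str, max_new_tokens: int = 200) -> str:
    """Feed StarCoder the first 50 tokens of a training sample and
    return the greedy completion (prefix included)."""
    input_ids = tokenizer(sample, return_tensors="pt").input_ids[:, :PREFIX_TOKENS]
    input_ids = input_ids.to(model.device)
    output_ids = model.generate(
        input_ids,
        do_sample=False,  # greedy decoding
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```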

Here is the distribution of BLEU scores over the 1536 training samples we experimented with:

[Figure: histogram of BLEU scores across the 1536 sampled training examples]

We found that 7.6% of our samples had a BLEU score above 0.75, which means that 7.6% of the training samples we drew from The Stack can be deemed memorized!
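For reference, the aggregation behind the histogram and the 7.6% figure is a simple pass over the per-sample scores. The sketch below assumes a Python list bleu_scores holding the 1536 scores and uses matplotlib, which may differ from the plotting code in our notebook.

```python
# Aggregate per-sample BLEU scores into a histogram and the share of
# samples above the 0.75 memorization threshold.
import matplotlib.pyplot as plt


def summarize(bleu_scores, threshold=0.75):
    memorized_fraction = sum(s >= threshold for s in bleu_scores) / len(bleu_scores)
    print(f"{memorized_fraction:.1%} of samples exceed the {threshold} threshold")

    plt.hist(bleu_scores, bins=50)
    plt.axvline(threshold, color="red", linestyle="--", label="memorization threshold")
    plt.xlabel("BLEU score between completion and original sample")
    plt.ylabel("Number of samples")
    plt.legend()
    plt.show()
```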

To see what a memorized sample looks like in practice, let’s look at an example where the completion has a BLEU score of 0.8.

Here is the original sample:

[Figure: original training sample]

Here is the sample truncated to the first 50 tokens:

[Figure: sample truncated to its first 50 tokens]

Here is the completed sample:

[Figure: StarCoder's completion of the truncated sample]

If we do a diff, we see that the original sample and the completion are quite similar:

[Figure: diff between the original sample and the completion]

Demo

You can play with our demo on Hugging Face to see how training samples from The Stack are memorized by StarCoder.

[GIF: demo of prompting StarCoder with a training sample prefix and recovering the original sample]

Implications

Unfortunately, many commercially available LLMs do train on your data, and their privacy controls make it hard for users to opt out, as those providers are incentivized to use every input to improve their models and remain competitive.

This creates the privacy issue we saw above: memorized data can be extracted by future prompts from other users of the LLM solution. That is what happened to Samsung.

Alas, depending on your use case, there might not be an easy solution. “LLM firewalls” that remove PII, such as credit card numbers, are sometimes necessary but often far from sufficient (a toy sketch of these limits follows the list below):

  • Semantics are often preserved. For instance, if I replace “Daniel” with “David” in “Daniel is a blind man with one foot living in Minnesota” before sending it to an LLM, the meaning is preserved, and anyone extracting the sentence “David is a blind man with one foot living in Minnesota” can still infer a lot about me using other information the PII scrubber did not remove.
  • PII removal does not work for code. There is no Named Entity Recognition model or regex rule that can identify and “sanitize” proprietary code, so you have no way to reduce the risk of it being learned and extracted once it is sent to an LLM that trains on your inputs.
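To illustrate this, here is a toy sketch of a regex-based “firewall”; the pattern, function names, and example strings are purely illustrative and not taken from any real product.

```python
# Toy sketch of a regex-based "LLM firewall": it can redact obvious PII
# patterns such as credit card numbers, but it has nothing to anchor on
# in proprietary code, which passes through untouched.
import re

CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def scrub(text: str) -> str:
    """Redact credit-card-like numbers; everything else is left as-is."""
    return CREDIT_CARD.sub("[REDACTED_CARD]", text)


print(scrub("Card on file: 4111 1111 1111 1111"))
# -> "Card on file: [REDACTED_CARD]"

proprietary_code = "def compute_secret_pricing(margin): return margin * 1.37"
print(scrub(proprietary_code))
# -> unchanged: no pattern identifies proprietary business logic
```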

Therefore, one has to be extremely careful when sending code to LLM solutions that train on user data, which seems to be the default for most LLM providers today due to competitive pressure.

Conclusion

We have seen in this article that memorization does happen, and that the memorization and leakage of Samsung’s proprietary code is a feature, not a bug, of LLMs.

The key takeaway here is that there is a real risk that LLMs memorize data you send to them if they are able to train on it.

The best way to avoid this issue is simply for your data not to be used for training, but obtaining such guarantees can be complicated.

Because those issues are of utmost importance, we have developed BlindChat, a Confidential Conversational AI that addresses the privacy risks of LLMs.

BlindChat allows users to query the open-source LLMs we host, such as Llama 2 70B, with the guarantee that not even our admins can see or train on their data: prompts sent to us are end-to-end protected, only users hold the decryption key, and we could not expose their data even if we wanted to.

BlindChat is open source, our Confidential AI stack has already been audited, and the technical whitepaper behind it is available here.

We hope this article has been useful and helped you better understand the inherent privacy risks of LLMs!