dhuynh95 committed on
Commit df22a11 • 1 Parent(s): 620922f
Files changed (1)
  1. app.py +9 -4

app.py CHANGED
@@ -4,6 +4,7 @@ import os
 from huggingface_hub import InferenceClient, login
 from transformers import AutoTokenizer
 import evaluate
+import theme
 
 bleu = evaluate.load("bleu")
 
@@ -21,12 +22,14 @@ title = "<h1 style='text-align: center; color: #333333; font-size: 40px;'> 🤔
 
 description = """
 This ability of LLMs to learn their training set by heart can pose huge privacy issues, as many commercially available large-scale conversational AIs collect users' data at scale and fine-tune their models on it.
-This means that if sensitive data is sent to and memorized by an AI, other users can willingly or unwillingly prompt the AI to spit out this sensitive data.
+This means that if sensitive data is sent to and memorized by an AI, other users can willingly or unwillingly prompt the AI to spit out this sensitive data. 🔓
+
 
 To raise awareness of this issue, we show in this demo how much [StarCoder](https://huggingface.co/bigcode/starcoder), an LLM specialized in coding tasks, memorizes its training set, [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup).
-We found that **StarCoder memorized at least 8% of the training samples** we used, which highlights the high risk of LLMs exposing their training set. We provide a notebook to reproduce our results [here](https://colab.research.google.com/drive/1YaaPOXzodEAc4JXboa12gN5zdlzy5XaR?usp=sharing).
+We found that **StarCoder memorized at least 8% of the training samples** we used, which highlights the high risk of LLMs exposing their training set. We provide a notebook to reproduce our results [here](https://colab.research.google.com/drive/1YaaPOXzodEAc4JXboa12gN5zdlzy5XaR?usp=sharing). 👈
+
 
-To evaluate memorization of the training set, we can prompt StarCoder with the first tokens of an example from the training set. If StarCoder completes the prompt with an output that looks very similar to the original sample, we consider the sample to be memorized by the LLM.
+To evaluate memorization of the training set, we can prompt StarCoder with the first tokens of an example from the training set. If StarCoder completes the prompt with an output that looks very similar to the original sample, we consider the sample to be memorized by the LLM. 💾
 """
 
 memorization_definition = """
@@ -232,7 +235,9 @@ def df_select(evt: gr.SelectData):
 
     return evt.value
 
-with gr.Blocks() as demo:
+style = theme.Style()
+
+with gr.Blocks(theme=style) as demo:
     with gr.Column():
         gr.Markdown(title)
         with gr.Row():
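
The description string edited in this commit summarizes the memorization check that the rest of app.py performs with the `bleu` metric loaded at the top of the file. Below is a minimal sketch of that check, assuming Inference API access to StarCoder; the 50-token prompt length and 0.75 BLEU threshold are illustrative assumptions, not values taken from this commit:

```python
# Sketch of the memorization check described in the demo's description,
# not the demo's exact implementation. The prompt length and BLEU
# threshold are illustrative assumptions.
import evaluate
from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

bleu = evaluate.load("bleu")
client = InferenceClient(model="bigcode/starcoder")  # assumes Inference API access
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

def is_memorized(sample: str, prompt_tokens: int = 50, threshold: float = 0.75) -> bool:
    # Prompt the model with the first tokens of the training sample.
    ids = tokenizer(sample)["input_ids"]
    prompt = tokenizer.decode(ids[:prompt_tokens])
    reference = tokenizer.decode(ids[prompt_tokens:])
    completion = client.text_generation(
        prompt, max_new_tokens=max(1, len(ids) - prompt_tokens)
    )
    # Compare the completion against the true continuation with BLEU;
    # a high score means the model reproduced the sample near-verbatim.
    score = bleu.compute(predictions=[completion], references=[[reference]])
    return score["bleu"] >= threshold
```

BLEU serves here as a cheap proxy for "looks very similar to the original sample"; the demo's actual prompt length and similarity cutoff may differ.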