dhuynh95 committed on
Commit df22a11 • 1 Parent(s): 620922f
Files changed (1)
  1. app.py +9 -4

app.py CHANGED
@@ -4,6 +4,7 @@ import os
 from huggingface_hub import InferenceClient, login
 from transformers import AutoTokenizer
 import evaluate
+import theme
 
 bleu = evaluate.load("bleu")
 
@@ -21,12 +22,14 @@ title = "<h1 style='text-align: center; color: #333333; font-size: 40px;'> 🤔
 
 description = """
 This ability of LLMs to learn their training set by heart can pose huge privacy issues, as many commercially available large-scale conversational AIs collect users' data at scale and fine-tune their models on it.
-This means that if sensitive data is sent to and memorized by an AI, other users can willingly or unwillingly prompt the AI to spit out this sensitive data.
+This means that if sensitive data is sent to and memorized by an AI, other users can willingly or unwillingly prompt the AI to spit out this sensitive data. 🔓
+
 
 To raise awareness of this issue, we show in this demo how much [StarCoder](https://huggingface.co/bigcode/starcoder), an LLM specialized in coding tasks, memorizes its training set, [The Stack](https://huggingface.co/datasets/bigcode/the-stack-dedup).
-We found that **StarCoder memorized at least 8% of the training samples** we used, which highlights the high risk of LLMs exposing their training set. We provide a notebook to reproduce our results [here](https://colab.research.google.com/drive/1YaaPOXzodEAc4JXboa12gN5zdlzy5XaR?usp=sharing).
+We found that **StarCoder memorized at least 8% of the training samples** we used, which highlights the high risk of LLMs exposing their training set. We provide a notebook to reproduce our results [here](https://colab.research.google.com/drive/1YaaPOXzodEAc4JXboa12gN5zdlzy5XaR?usp=sharing). 👈
+
 
-To evaluate memorization of the training set, we can prompt StarCoder with the first tokens of an example from the training set. If StarCoder completes the prompt with an output that looks very similar to the original sample, we consider the sample to be memorized by the LLM.
+To evaluate memorization of the training set, we can prompt StarCoder with the first tokens of an example from the training set. If StarCoder completes the prompt with an output that looks very similar to the original sample, we consider the sample to be memorized by the LLM. 💾
 """
 
 memorization_definition = """
@@ -232,7 +235,9 @@ def df_select(evt: gr.SelectData):
 
     return evt.value
 
-with gr.Blocks() as demo:
+style = theme.Style()
+
+with gr.Blocks(theme=style) as demo:
     with gr.Column():
         gr.Markdown(title)
         with gr.Row():
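
The description string edited in this commit summarizes the memorization check that the rest of app.py performs with the `bleu` metric loaded at the top of the file. Below is a minimal sketch of that check, assuming Inference API access to StarCoder; the 50-token prompt length and 0.75 BLEU threshold are illustrative assumptions, not values taken from this commit:

```python
# Sketch of the memorization check described in the demo's description,
# not the demo's exact implementation. The prompt length and BLEU
# threshold are illustrative assumptions.
import evaluate
from huggingface_hub import InferenceClient
from transformers import AutoTokenizer

bleu = evaluate.load("bleu")
client = InferenceClient(model="bigcode/starcoder")  # assumes Inference API access
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

def is_memorized(sample: str, prompt_tokens: int = 50, threshold: float = 0.75) -> bool:
    # Prompt the model with the first tokens of the training sample.
    ids = tokenizer(sample)["input_ids"]
    prompt = tokenizer.decode(ids[:prompt_tokens])
    reference = tokenizer.decode(ids[prompt_tokens:])
    completion = client.text_generation(
        prompt, max_new_tokens=max(1, len(ids) - prompt_tokens)
    )
    # Compare the completion against the true continuation with BLEU;
    # a high score means the model reproduced the sample near-verbatim.
    score = bleu.compute(predictions=[completion], references=[[reference]])
    return score["bleu"] >= threshold
```

BLEU serves here as a cheap proxy for "looks very similar to the original sample"; the demo's actual prompt length and similarity cutoff may differ.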