Space: anonauthors/SecretLanguage

anonymousauthors committed on Feb 8, 2023

Commit 5fb4442 • 1 Parent(s): 2247711

Upload 4 files

Files changed (3)

SecretLanguage.py +11 -4
pages/0_📙_Dictionary_(Search).py +13 -1
pages/2_😈_Blackbox_Attack.py +37 -6

SecretLanguage.py

@@ -15,23 +15,30 @@ st.set_page_config(layout="wide", page_title="ACl23 Secret Language")
 st.title("ACl23 Submission: Finding Secret Language")
 st.markdown('This webpage serves as an illustration of an anonymous submission to ACL 23.')
-st.markdown('### How to play with this page?')
-st.markdown('We present two methods for searching secret language: a direct search using the Dictionary (Search) option, and browsing words that have already been found for secret languages.')
-st.markdown("By entering a word you want to find its secret languages, you can view the word's meaning in English, all the secret languages we have discovered for it, and examples."
     "The hyperlinks sometimes might not work due to the contained property of Hugging Face space.")
 st.image(search_image, caption='A search example.')
 st.markdown('By clicking on the initial letters (A to Z, numbers, and other characters), you can view all the words whose secret languages have been discovered and that begin with the selected initial. By clicking on a word, you will be redirected to the search page, where you can view information about the selected word.')
 st.image(browse_image, caption='A browse example.')
 st.markdown('### Models and datasets.')
 st.markdown('On this page, we present the secret languages we discovered using ALBERT, DistillBERT, and Roberta models and data from the GLUE (MRPC), SNLI, and SQuAD datasets.')
 st.markdown('### Ethics statements for this webpage')
 st.markdown('We present secret languages discovered using our proposed algorithms. '

 st.title("ACl23 Submission: Finding Secret Language")
+# st.sidebar.markdown("### This webpage serves as an illustration of an anonymous submission to ACL 23.")
 st.markdown('This webpage serves as an illustration of an anonymous submission to ACL 23.')
+st.markdown('### What do we offer?')
+st.markdown('We present two methods for searching secret languages. The first method is a direct search using the "📙 Dictionary (Search)" option, while the second method, "📖 Dictionary (Browse)", involves browsing words that have already been found to have secret languages. '
+    'Additionally, we also provide a tool for finding secret languages in a black-box manner.')
+st.markdown('#### How to use "📙 Dictionary (Search)"?')
+st.markdown("By entering a word you want to find its secret languages, you can view the word's meaning in English, all the secret languages we have discovered for it, and examples. "
     "The hyperlinks sometimes might not work due to the contained property of Hugging Face space.")
 st.image(search_image, caption='A search example.')
+st.markdown('#### How to use "📖 Dictionary (Browse)"?')
 st.markdown('By clicking on the initial letters (A to Z, numbers, and other characters), you can view all the words whose secret languages have been discovered and that begin with the selected initial. By clicking on a word, you will be redirected to the search page, where you can view information about the selected word.')
 st.image(browse_image, caption='A browse example.')
+st.markdown('#### How to use "😈 Blackbox Attack"?')
+st.markdown('We offer two methods for generating replacement words using secret languages. Detailed introduction can be found on the page.')
 st.markdown('### Models and datasets.')
 st.markdown('On this page, we present the secret languages we discovered using ALBERT, DistillBERT, and Roberta models and data from the GLUE (MRPC), SNLI, and SQuAD datasets.')
 st.markdown('### Ethics statements for this webpage')
 st.markdown('We present secret languages discovered using our proposed algorithms. '

pages/0_📙_Dictionary_(Search).py

@@ -36,6 +36,18 @@ for key in st.session_state.keys():
 title = st.sidebar.text_input(":red[Search secret languages given the following word (case-sensitive)]", default_title)
 if ord(title[0]) in list(range(48, 57)):
     file_name = 'num_dict.pkl'
 elif ord(title[0]) in list(range(97, 122)) + list(range(65, 90)):
@@ -230,7 +242,7 @@ if title in datas:
                         _string += 'question**: :'
                     elif task == 'Paraphrase':
                         _string += 'sentence 1**: :'
-                    _string += f'red[{_all[_sl]["Replaced hypothesis"][j]}]'.replace(":", "[colon]")
                     if task == 'NLI':
                         _string += '<br> **Premise**: :'
                     elif task == 'QA':

 title = st.sidebar.text_input(":red[Search secret languages given the following word (case-sensitive)]", default_title)
+st.sidebar.markdown("### Frequent FAQs")
+st.sidebar.markdown("1. *Why are words in sentences represented as subwords instead of complete words?*<br>"
+        "The tokenizer we use is from DistillBERT, ALBERT, or Roberta, which tokenizes sentences into subwords. As a result, the word being replaced in a sentence might be a subword (such as `rain` in `rainforest`).",
+         unsafe_allow_html=True)
+st.sidebar.markdown("2. *This page is extremely slow. I cannot stand it.*<br>"
+        "We apologize for the slow performance of this page. We are actively working on improving it."
+        "As loading the data can take time and some words have many secret languages, this page needs time to process.",
+         unsafe_allow_html=True)
+st.sidebar.markdown("3. *Why are some examples significantly different from the original sentences? *<br>"
+        "As per our submission, we replace 1 to 10 subwords in a sentence. However, for some examples with short lengths, the entire sentence may be altered. We are conducting experiments and will present examples where only a single subword has been changed.",
+         unsafe_allow_html=True)
 if ord(title[0]) in list(range(48, 57)):
     file_name = 'num_dict.pkl'
 elif ord(title[0]) in list(range(97, 122)) + list(range(65, 90)):
                         _string += 'question**: :'
                     elif task == 'Paraphrase':
                         _string += 'sentence 1**: :'
+                    _string += f'red[{_all[_sl]["Replaced hypothesis"][j]}]'.replace('/', '\\').replace(___sl, f"<i><b>{___sl}</b></i>").replace(":", "[colon]")
                     if task == 'NLI':
                         _string += '<br> **Premise**: :'
                     elif task == 'QA':
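
Note on the `ord(title[0])` checks in this file: Python's `range` excludes its upper bound, so `range(48, 57)`, `range(97, 122)`, and `range(65, 90)` as written do not cover `'9'`, `'z'`, or `'Z'`. Whether that is intended is not stated in the commit. A minimal sketch of an inclusive first-character classification follows, assuming ASCII-only buckets; the function name `bucket_for` and the bucket labels are hypothetical and do not appear in the repository.

```python
import string


def bucket_for(title: str) -> str:
    """Classify a search term by its first character (hypothetical helper)."""
    if not title:
        # The diff indexes title[0] directly; an empty string would raise IndexError there.
        return "empty"
    first = title[0]
    if first in string.digits:
        # '0'..'9', upper bound included
        return "digit"
    if first in string.ascii_letters:
        # 'a'..'z' and 'A'..'Z', upper bounds included
        return "letter"
    return "other"
```

A caller could then map each label to a dictionary file, for example `"digit"` to the `num_dict.pkl` file named in the diff.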

pages/2_😈_Blackbox_Attack.py

@@ -15,10 +15,17 @@ from time import time
 st.title('Blackbox Attack')
 st.sidebar.markdown('On this page, we offer a tool for generating replacement words using secret languages.')
-st.sidebar.markdown('There are two methods for generating replacements.')
-st.sidebar.markdown('1. GPT-2 (Searching secret languages based on GPT-2): this method calculates secret languages using [GPT-2](https://huggingface.co/gpt2) and requires input text, the number of replacements desired, and the steps. The number of replacements represents the number of sentences you want to generate, while steps refer to the steps in the SecretFinding process.')
 st.sidebar.markdown('2. Use the secret language we found on ALBERT, DistillBERT, and Roberta: this method replaces words directly with the secret language dictionary derived from ALBERT, DistillBERT, and Roberta.')
 def run(model, _bar_text=None, bar=None, text='Which name is also used to describe the Amazon rainforest in English?', loss_funt=torch.nn.MSELoss(), lr=1, noise_mask=[1,2], restarts=10, step=100, device = torch.device('cpu')):
     subword_num = model.wte.weight.shape[0]
@@ -66,6 +73,11 @@ def run(model, _bar_text=None, bar=None, text='Which name is also used to descri
         perturbed_questions = []
         for i in range(restarts):
             perturbed_questions.append(tokenizer.decode(perturbed_inputs["input_ids"][i]).split("</s></s>")[0])
     return perturbed_questions
@@ -80,7 +92,15 @@ option = st.selectbox(
     ('GPT-2 (Searching secret languages based on GPT-2)', 'Use the secret language we found on ALBERT, DistillBERT, and Roberta.')
 )
-title = st.text_area('Input text.', 'Which name is also used to describe the Amazon rainforest in English?')
 if option == 'GPT-2 (Searching secret languages based on GPT-2)':
     _cols = st.columns(2)
@@ -124,8 +144,11 @@ if button('Tokenize', key='tokenizer'):
                 _index = i * 6 + j
                 if _index < _len:
                     disable = False
-                    if subwords[_index].strip() not in all_keys and option == 'Use the secret language we found on ALBERT, DistillBERT, and Roberta.':
-                        disable = True
                     button(subwords[_index], key=f'tokenizer_{_index}', disabled=disable)
@@ -136,8 +159,10 @@ if button('Tokenize', key='tokenizer'):
         for key in st.session_state:
             if st.session_state[key]:
                 if 'tokenizer_' in key:
                     # st.markdown(key)
-                    chose_indices.append(int(key.replace('tokenizer_', '')))
         if len(chose_indices):
             _bar_text = st.empty()
             if option == 'GPT-2 (Searching secret languages based on GPT-2)':
@@ -147,6 +172,7 @@ if button('Tokenize', key='tokenizer'):
             else:
                 _new_ids = []
                 _sl = {}
                 for j in chose_indices:
                     _sl[j] = get_secret_language(tokenizer.decode(input_ids[j]).strip())
                 for i in range(restarts):
@@ -154,11 +180,16 @@ if button('Tokenize', key='tokenizer'):
                     for j in range(len(input_ids)):
                         if j in chose_indices:
                             _tmp.append(_sl[j][i % len(_sl[j])])
                         else:
                             _tmp.append(input_ids[j])
                     _new_ids.append(_tmp)
                 # st.markdown(_new_ids)
                 outputs = [tokenizer.decode(_new_ids[i]).split('</s></s>')[0] for i in range(restarts)]
             st.success(f'We found {restarts} replacements!', icon="✅")
             st.markdown('<br>'.join(outputs), unsafe_allow_html=True)

 st.title('Blackbox Attack')
 st.sidebar.markdown('On this page, we offer a tool for generating replacement words using secret languages.')
+st.sidebar.markdown('#### Require ')
+st.sidebar.markdown('`Input text`: a sentence or paragraph.')
+st.sidebar.markdown('`Number of replacements`: the number of secret language samples.')
+st.sidebar.markdown('`Steps for searching Secret Langauge`: the steps in the SecretFinding process.')
+st.sidebar.markdown('#### Two methods')
+st.sidebar.markdown('1. GPT-2 (Searching secret languages based on GPT-2): this method calculates secret languages using [GPT-2](https://huggingface.co/gpt2).')
 st.sidebar.markdown('2. Use the secret language we found on ALBERT, DistillBERT, and Roberta: this method replaces words directly with the secret language dictionary derived from ALBERT, DistillBERT, and Roberta.')
 def run(model, _bar_text=None, bar=None, text='Which name is also used to describe the Amazon rainforest in English?', loss_funt=torch.nn.MSELoss(), lr=1, noise_mask=[1,2], restarts=10, step=100, device = torch.device('cpu')):
     subword_num = model.wte.weight.shape[0]
         perturbed_questions = []
         for i in range(restarts):
             perturbed_questions.append(tokenizer.decode(perturbed_inputs["input_ids"][i]).split("</s></s>")[0])
+    for i in range(len(perturbed_questions)):
+        for j in noise_mask:
+            _j = tokenizer.decode(perturbed_inputs["input_ids"][i][j])
+            # print(f'_j {_j}')
+            perturbed_questions[i] = perturbed_questions[i].replace(_j, f':red[{_j}]')
     return perturbed_questions
     ('GPT-2 (Searching secret languages based on GPT-2)', 'Use the secret language we found on ALBERT, DistillBERT, and Roberta.')
 )
+def clf_keys():
+    for key in st.session_state.keys():
+        if key in ['tokenizer', 'start']:
+            st.session_state[key] = False
+        elif 'tokenizer_' in key:
+            del st.session_state[key]
+title = st.text_area('Input text.', 'Which name is also used to describe the Amazon rainforest in English?', on_change=clf_keys)
 if option == 'GPT-2 (Searching secret languages based on GPT-2)':
     _cols = st.columns(2)
                 _index = i * 6 + j
                 if _index < _len:
                     disable = False
+                    if option == 'Use the secret language we found on ALBERT, DistillBERT, and Roberta.':
+                        if subwords[_index].strip() not in all_keys:
+                            disable = True
+                    # if f'tokenizer_{_index}' in st.session_state:
+                    #     del st.session_state[f'tokenizer_{_index}']
                     button(subwords[_index], key=f'tokenizer_{_index}', disabled=disable)
         for key in st.session_state:
             if st.session_state[key]:
                 if 'tokenizer_' in key:
+                    _index = int(key.replace('tokenizer_', ''))
                     # st.markdown(key)
+                    if _index < len(input_ids):
+                        chose_indices.append(_index)
         if len(chose_indices):
             _bar_text = st.empty()
             if option == 'GPT-2 (Searching secret languages based on GPT-2)':
             else:
                 _new_ids = []
                 _sl = {}
+                _used_sl = []
                 for j in chose_indices:
                     _sl[j] = get_secret_language(tokenizer.decode(input_ids[j]).strip())
                 for i in range(restarts):
                     for j in range(len(input_ids)):
                         if j in chose_indices:
                             _tmp.append(_sl[j][i % len(_sl[j])])
+                            _used_sl.append(_sl[j][i % len(_sl[j])])
                         else:
                             _tmp.append(input_ids[j])
                     _new_ids.append(_tmp)
                 # st.markdown(_new_ids)
                 outputs = [tokenizer.decode(_new_ids[i]).split('</s></s>')[0] for i in range(restarts)]
+                for i in range(len(outputs)):
+                    for j in _used_sl:
+                        _j = tokenizer.decode(j)
+                        outputs[i] = outputs[i].replace(_j, f':red[{_j}]')
             st.success(f'We found {restarts} replacements!', icon="✅")
             st.markdown('<br>'.join(outputs), unsafe_allow_html=True)
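
Note on the highlighting added in this file: `str.replace` wraps every occurrence of the decoded token in `:red[...]`, and if the same token is applied twice (for instance when `_used_sl` contains a duplicate) the markup could end up nested. A minimal sketch of a position-based alternative follows. It assumes the tokens are already decoded to strings and can be joined with single spaces, which is a simplification of real tokenizer decoding; the function name `highlight_positions` is hypothetical and not part of the commit.

```python
from typing import Iterable, List


def highlight_positions(tokens: List[str], replaced_positions: Iterable[int]) -> str:
    """Wrap only the tokens at the given positions in Streamlit's :red[...] markup."""
    marked = set(replaced_positions)
    parts = []
    for index, token in enumerate(tokens):
        # Each position is visited exactly once, so markup is never nested.
        parts.append(f":red[{token}]" if index in marked else token)
    return " ".join(parts)


def highlight_by_replace(text: str, used_tokens: List[str]) -> str:
    """Mirror of the string-replacement approach, kept here for comparison only."""
    for token in used_tokens:
        text = text.replace(token, f":red[{token}]")
    return text
```

With the position-based helper, an unchanged token that happens to match a replacement elsewhere in the sentence stays unmarked, because the decision depends on the index rather than on the token text.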