---
language:
- zh
- en
pipeline_tag: other
# widget:
# - text: "Paraphrase the text:\n\n"
#   example_title: "example"
# inference:
#   parameters:
#     # temperature: 1
#     # do_sample: true
#     max_new_tokens: 50
---

# Hide-and-Seek Privacy Protection Engine
<a href="https://github.com/alohachen/Hide-and-Seek" target="_blank">Github Repo</a> / <a href="https://arxiv.org/abs/2309.03057" target="_blank">arXiv Preprint</a>

## Introduction
Hide-and-Seek is a bilingual Chinese-English privacy protection framework composed of two models, [hide](https://huggingface.co/tingxinli/hide-820m) and [seek](https://huggingface.co/tingxinli/seek-820m). The hide model replaces sensitive entities in the user's input with other random entities (encryption); the seek model restores the replaced parts in the returned output so that they match the original text (decryption). This repository is our community open-source release: both models use [bloom-1.1b](https://huggingface.co/bigscience/bloom-1b1) as the base model and were obtained through vocabulary pruning and fine-tuning.

## Dependencies
Because configuring a machine learning environment is complex and time-consuming, we provide a [colab notebook](https://drive.google.com/file/d/1ZkGegZ_JjPy6k_wWnjaUaqq4QbF9LoWG/view?usp=sharing) for the demo; the essential dependencies are listed below for reference. If you run the code in your own environment, you may need to make some adjustments for your device.
```shell
# Note: the +cu118 torch build is served from the PyTorch index, not PyPI.
pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.35.0
```
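
If you set up your own environment, a quick sanity check like the one below (our suggestion, not part of the repo) confirms that the CUDA build of torch is active, since the examples that follow load both models on `cuda:0`.
```python
# Environment sanity check (not part of the repo).
import torch
import transformers

print("torch:", torch.__version__)                # expected: 2.1.0+cu118
print("transformers:", transformers.__version__)  # expected: 4.35.0
print("CUDA available:", torch.cuda.is_available())  # should be True
```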

## Quick Start
Below is an example of calling the hide model on its own.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')

# The hide model is prompted with a fixed paraphrasing template.
hide_template = """<s>Paraphrase the text:%s\n\n"""
original_input = "张伟用苹果(iPhone 13)换了一箱好吃的苹果。"
input_text = hide_template % original_input
inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')
pred = model.generate(**inputs, max_length=100)
# Strip the prompt tokens and decode only the newly generated text.
pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
hide_input = tokenizer.decode(pred, skip_special_tokens=True)
print(hide_input)

# output:
# 李明用华为(Mate 40)换了一箱好吃的橙子。
```
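
The end-to-end example below relies on the helper functions in hideAndSeek.py from this repo. As a rough mental model only (a sketch derived from the snippet above, not the repo's actual code), `hide_encrypt` essentially wraps that generation call in a function:
```python
# Sketch of hide_encrypt, derived from the standalone example above.
# The authoritative version is in hideAndSeek.py in this repo.
def hide_encrypt(original_input, hide_model, tokenizer):
    hide_template = """<s>Paraphrase the text:%s\n\n"""
    input_text = hide_template % original_input
    inputs = tokenizer(input_text, return_tensors='pt').to(hide_model.device)
    pred = hide_model.generate(**inputs, max_length=100)
    pred = pred.cpu()[0][len(inputs['input_ids'][0]):]  # keep only new tokens
    return tokenizer.decode(pred, skip_special_tokens=True)
```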

Below is a complete example that runs the full Hide-and-Seek pipeline. Note that the end-to-end privacy protection demo requires your own OpenAI API token.
```python
# see hideAndSeek.py in this repo
from hideAndSeek import *

tokenizer = AutoTokenizer.from_pretrained("tingxinli/hide-820m")
hide_model = AutoModelForCausalLM.from_pretrained("tingxinli/hide-820m").to('cuda:0')
seek_model = AutoModelForCausalLM.from_pretrained("tingxinli/seek-820m").to('cuda:0')

original_input = "华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。"
print('original input:', original_input)
# Step 1 (local): replace sensitive entities before the text leaves the machine.
hide_input = hide_encrypt(original_input, hide_model, tokenizer)
print('hide input:', hide_input)
# Step 2 (remote): only the anonymized text is sent to the external LLM.
prompt = "Translate the following text into English.\n %s\n" % hide_input
hide_output = get_gpt_output(prompt)
print('hide output:', hide_output)
# Step 3 (local): restore the original entities in the remote model's answer.
original_output = seek_decrypt(hide_input, hide_output, original_input, seek_model, tokenizer)
print('original output:', original_output)

# output:
# original input: 华纳兄弟影业(Warner Bro)著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。
# hide input: 迪士尼影业(Disney Studios)著名的作品有《艺术作品1》系列、《艺术作品2》系列、《艺术作品3》系列和《艺术作品4》系列。目前迪士尼未考虑推出《艺术作品1》系列新作。
# hide output: Disney Studios' famous works include the "Artwork 1" series, "Artwork 2" series, "Artwork 3" series, and "Artwork 4" series. Currently, Disney has not considered releasing a new installment in the "Artwork 1" series.
# original output: Warner Bro's famous works include the "Batman" series, "Superman" series, "The Matrix" series, and "The Lord of the Rings" series. Currently, Warner has not considered releasing a new installment in the "Batman" series.
```
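
`get_gpt_output` above comes from hideAndSeek.py; a minimal stand-in, assuming the official `openai` Python client (>=1.0) and a `gpt-3.5-turbo` chat completion (both assumptions on our part), could look like this:
```python
# Hypothetical stand-in for get_gpt_output; the real helper lives in
# hideAndSeek.py. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def get_gpt_output(prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```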

## Citation
```bibtex
@misc{chen2023hide,
      title={Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection},
      author={Yu Chen and Tingxin Li and Huiming Liu and Yang Yu},
      year={2023},
      eprint={2309.03057},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```