	---
	language:
	- zh
	- en
	pipeline_tag: other
	# widget:
	# - text: "Paraphrase the text:\n\n"
	# example_title: "example"
	# inference:
	# parameters:
	# # temperature: 1
	# # do_sample: true
	# max_new_tokens: 50
	---

	# Hide-and-Seek Privacy Protection Model
	<a href="https://github.com/alohachen/Hide-and-Seek" target="_blank">GitHub Repo</a> / <a href="https://arxiv.org/abs/2309.03057" target="_blank">arXiv Preprint</a> / <a href="https://xlab.tencent.com/cn/2023/12/05/hide_and_seek/" target="_blank">Technical Blog</a>

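	## Introduction
	Hide-and-Seek is a privacy protection model developed by Tencent Security Xuanwu Lab. Its privacy protection pipeline consists of two subtasks, hide and seek. Hide replaces sensitive entities in the user's input with other random entities (anonymization), and seek restores the replaced parts of the output so that it corresponds to the original text (de-anonymization). This repository hosts our community open-source Chinese-language model, which uses [bloom-1.1b](https://huggingface.co/bigscience/bloom-1b1) as the base model and was obtained through vocabulary pruning and fine-tuning. For more details, see our [technical blog](https://xlab.tencent.com/cn/2023/12/05/hide_and_seek/).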

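	We have successfully deployed the model on a phone and on laptops in our experiments. In our tests with NF4 quantization and CPU-only inference, the model ran at 180-200 tokens/s on a MacBook M2, 110-130 tokens/s on a MacBook M1, and 20-30 tokens/s on a Pixel 8 Pro. The natively supported NLP task types are polishing, summarization, translation, reading comprehension, and text classification. Custom tasks are also supported in a zero-shot fashion.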
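
To put the throughput figures above in perspective, the sketch below converts them into approximate generation times for a response of a given length. The helper name `latency_range` and the 100-token response length are illustrative assumptions, not part of this repository. The estimate ignores prompt processing and model loading time.

```python
# CPU-only NF4 throughput ranges (tokens/s) reported in the paragraph above.
THROUGHPUT = {
    "MacBook M2": (180, 200),
    "MacBook M1": (110, 130),
    "Pixel 8 Pro": (20, 30),
}


def latency_range(num_tokens, tokens_per_s):
    """Return (best, worst) generation time in seconds for num_tokens.

    Hypothetical helper: ignores prompt processing and model load time.
    """
    low, high = tokens_per_s
    return num_tokens / high, num_tokens / low


for device, rate in THROUGHPUT.items():
    best, worst = latency_range(100, rate)
    print(f"{device}: {best:.2f}-{worst:.2f} s for 100 tokens")
```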

	The following videos demonstrate the model:

	1. MacBook M2 demo video:
	<a href="https://v.qq.com/txp/iframe/player.html?vid=s3530zas783">
	<img src="https://tinypng.com/backend/opt/output/20j9t2s36aq4cs8wegyjjbwrf85cmpw4/video1_cover.png" alt="Web demo video" width="300" height="200">
	</a>
	2. Pixel 8 Pro demo video:
	<a href="https://v.qq.com/txp/iframe/player.html?vid=r3530wjusjm">
	<img src="https://tinypng.com/backend/opt/output/svm4cs6aafwdgvqwm4h8r744thn09kas/video2_cover.png" alt="Mobile demo video" width="300" height="200">
	</a>

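	## Colab Demo & Dependencies
	Because setting up a machine learning environment is complex and time-consuming, we provide a [Colab notebook](https://drive.google.com/file/d/1ZkGegZ_JjPy6k_wWnjaUaqq4QbF9LoWG/view?usp=sharing) for the demo. The required dependencies are listed below for reference. If you run the model in your own environment, you may need to adjust them for your device.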
	```shell
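	# CUDA builds such as +cu118 are served from the PyTorch wheel index, not PyPI
	pip install torch==2.1.0+cu118 --index-url https://download.pytorch.org/whl/cu118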
	pip install transformers==4.35.0
	```
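
Before running the examples, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the standard library. The helper names `base_version` and `check_versions` are illustrative and not part of this repository.

```python
from importlib import metadata

# Versions pinned in the install commands above.
REQUIRED = {"torch": "2.1.0", "transformers": "4.35.0"}


def base_version(version):
    """Drop a local build tag such as '+cu118' from a version string."""
    return version.split("+", 1)[0]


def check_versions(required):
    """Map each package to 'ok', 'missing', or the mismatching installed version."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if base_version(installed) == wanted else installed
    return report


print(check_versions(REQUIRED))
```

A different installed version is not necessarily a problem. The check only flags where your environment differs from the one listed here.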

	## Quick Start
	The following example uses hide to anonymize a piece of text.
	```ipython
	from transformers import AutoTokenizer, AutoModelForCausalLM
	tokenizer = AutoTokenizer.from_pretrained("SecurityXuanwuLab/HaS-820m")
	model = AutoModelForCausalLM.from_pretrained("SecurityXuanwuLab/HaS-820m").to('cuda:0')
	hide_template = """<s>Paraphrase the text:%s\n\n"""
	original_input = "张伟用苹果(iPhone 13)换了一箱好吃的苹果。"
	input_text = hide_template % original_input
	inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')
	pred = model.generate(**inputs, max_length=100)
	pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
	hide_input = tokenizer.decode(pred, skip_special_tokens=True)
	print(hide_input)

	# output:
	# 李华用华为(Mate 20)换了一箱美味的橙子。
	```
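
Two details of the example above are easy to miss. First, `max_length` in `generate` counts the prompt tokens as well as the generated ones, so a long input leaves fewer tokens for the anonymized output (`max_new_tokens` is the `generate` parameter that limits only the new tokens). Second, a causal language model returns the prompt followed by the continuation, which is why the example slices off the first `len(input_ids)` tokens. The sketch below illustrates both points with plain Python lists and dummy token ids. The helper names and the prompt length of 30 are illustrative assumptions.

```python
HIDE_TEMPLATE = "<s>Paraphrase the text:%s\n\n"  # same template as the example above


def build_hide_prompt(text):
    """Fill the hide template with the text to anonymize."""
    return HIDE_TEMPLATE % text


def new_token_budget(prompt_len, max_length):
    """Tokens left for the answer when max_length also counts the prompt."""
    return max(max_length - prompt_len, 0)


def strip_prompt(output_ids, prompt_len):
    """Keep only the generated ids, mirroring pred[0][len(input_ids):]."""
    return output_ids[prompt_len:]


prompt = build_hide_prompt("张伟用苹果(iPhone 13)换了一箱好吃的苹果。")
print(repr(prompt))
print(new_token_budget(prompt_len=30, max_length=100))  # 70
print(strip_prompt([1, 2, 3, 4, 5], prompt_len=3))      # [4, 5]
```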

	The following example uses seek to restore a summary.
	```ipython
	from transformers import AutoTokenizer, AutoModelForCausalLM
	tokenizer = AutoTokenizer.from_pretrained("SecurityXuanwuLab/HaS-820m")
	model = AutoModelForCausalLM.from_pretrained("SecurityXuanwuLab/HaS-820m").to('cuda:0')
	seek_template = "Convert the text:\n%s\n\n%s\n\nConvert the text:\n%s\n\n"
	hide_input = "前天，'2022北京海淀·颐和园经贸合作洽谈会成功举行，各大媒体竞相报道了活动盛况，小李第一时间将昨天媒体报道情况进行了整理。人民日报中国青年网国际联合报北京商报消费者观察报上海晚报杭州日报海峡晚报北京日报北京市电视一台?北京新闻人民网手机雅虎网网易北京长三角经济网新京网中国农业新闻网北京圆桌居然有这么多!还有部分媒体将在未来一周陆续发稿，为经洽会点!为海淀点!阅读投诉阅读精选留言加载中以上留言由公众号筛选后显示了解留言功能详情"
	hide_output = "2022北京海淀·颐和园经贸合作洽谈会成功举办，各大媒体广泛报道"
	original_input = "昨天，’2016苏州吴中·太湖经贸合作洽谈会成功举行，各大媒体竞相报道了活动盛况，小吴第一时间将今天媒体报道情况进行了整理。新华社中国青年报?中青在线香港大公报?大公网香港商报消费者导报扬子晚报江南时报苏州日报姑苏晚报城市商报苏州广电一套?苏州新闻新华网手机凤凰网网易苏州长三角城市网新苏网中国商务新闻网苏州圆桌居然有这么多!还有部分媒体将在今后几天陆续发稿，为经洽会点!为吴中点!阅读投诉阅读精选留言加载中以上留言由公众号筛选后显示了解留言功能详情"
	input_text = seek_template % (hide_input, hide_output, original_input)
	inputs = tokenizer(input_text, return_tensors='pt').to('cuda:0')
	pred = model.generate(**inputs, max_length=512)
	pred = pred.cpu()[0][len(inputs['input_ids'][0]):]
	original_output = tokenizer.decode(pred, skip_special_tokens=True)
	print(original_output)

	# output:
	# 2016苏州吴中·太湖经贸合作洽谈会成功举办，各大媒体广泛报道
	```
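
The seek template above reads like a one-shot demonstration: the first `Convert the text:` block pairs the anonymized input with the anonymized output, and the second block supplies the original input and leaves the corresponding output for the model to complete. This interpretation is inferred from the template itself. The sketch below fills the template with short placeholder strings so the layout is easy to inspect. The helper name `build_seek_prompt` is illustrative.

```python
SEEK_TEMPLATE = "Convert the text:\n%s\n\n%s\n\nConvert the text:\n%s\n\n"


def build_seek_prompt(hide_input, hide_output, original_input):
    """Fill the seek template in the argument order used by the example above."""
    return SEEK_TEMPLATE % (hide_input, hide_output, original_input)


# Short placeholder strings make the prompt layout easy to see.
prompt = build_seek_prompt("<hidden input>", "<hidden output>", "<original input>")
print(prompt)
```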

	The following example shows a complete privacy protection workflow. Note that you need to supply your own OpenAI API token.
	```ipython
	# see hideAndSeek.py in this repo
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from hideAndSeek import *

	tokenizer = AutoTokenizer.from_pretrained("SecurityXuanwuLab/HaS-820m")
	model = AutoModelForCausalLM.from_pretrained("SecurityXuanwuLab/HaS-820m").to('cuda:0')

	original_input = "华纳兄弟影业（Warner Bro）著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。"
	print('original input:', original_input)
	hide_input = hide(original_input, model, tokenizer)
	print('hide input:', hide_input)
	prompt = "Translate the following text into English.\n %s\n" % hide_input
	hide_output = get_gpt_output(prompt)
	print('hide output:', hide_output)
	original_output = seek(hide_input, hide_output, original_input, model, tokenizer)
	print('original output:', original_output)

	# output:
	# original input: 华纳兄弟影业（Warner Bro）著名的作品有《蝙蝠侠》系列、《超人》系列、《黑客帝国》系列和《指环王》系列。目前华纳未考虑推出《蝙蝠侠》系列新作。
	# hide input: 索尼影业（Sony Pictures）知名的作品有《艺术作品1》系列、《艺术作品2》系列、《艺术作品3》系列和《艺术作品4》系列。目前索尼未考虑推出《艺术作品1》系列新作。
	# hide output: Sony Pictures' renowned works include the "Artwork 1" series, "Artwork 2" series, "Artwork 3" series, and "Artwork 4" series. Currently, Sony is not considering releasing a new installment in the "Artwork 1" series.
	# original output: Warner Brothers' famous works include the "Batman" series, "Superman" series, "The Matrix" series, and "The Lord of the Rings" series. Currently, Warner is not considering releasing a new installment in the "Batman" series.
	```
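
The data flow of the workflow above can be summarized as three steps, of which only the middle one leaves the device. The sketch below expresses that flow with injectable callables and toy stand-ins, so it runs without the model or an API token. The function `protect_and_query` and the `toy_*` helpers are illustrative assumptions: they use fixed string substitutions and are not the HaS model or a real API.

```python
def protect_and_query(original_input, hide_fn, remote_fn, seek_fn):
    """Run hide -> remote model -> seek and record what leaves the device."""
    hide_input = hide_fn(original_input)   # local anonymization
    hide_output = remote_fn(hide_input)    # only anonymized text is sent out
    original_output = seek_fn(hide_input, hide_output, original_input)  # local restore
    return {
        "sent_to_remote": hide_input,
        "hide_output": hide_output,
        "original_output": original_output,
    }


# Toy stand-ins (NOT the HaS model or a real API): fixed string substitutions.
def toy_hide(text):
    return text.replace("Warner", "Sony")


def toy_remote(prompt):
    return prompt.upper()


def toy_seek(hide_input, hide_output, original_input):
    return hide_output.replace("SONY", "WARNER")


result = protect_and_query("Warner released a film.", toy_hide, toy_remote, toy_seek)
print(result)
```

With the helpers from the example above, the callables would presumably wrap `hide(text, model, tokenizer)`, `get_gpt_output(prompt)` applied to the task prompt, and `seek(hide_input, hide_output, original_input, model, tokenizer)`.
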
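	## Evaluation
	We evaluated the utility loss on a Chinese-to-English translation task, using GPT-3.5 as the API and DeepL translations as the reference. The table below shows the results with NF4-quantized inference. The first row is direct translation without protection, and the second row is translation with our system's protection applied. The results show that our model protects user privacy at the cost of only a small loss in accuracy. We are still improving the model to achieve better results.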

	| Setting | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-2 | BLEU-4 | METEOR |
	|----------|----------|----------|----------|----------|----------|----------|
	| No protection | 60.80 | 33.54 | 54.96 | 79.85 | 67.17 | 53.03 |
	| Protect with HaS | 57.37 | 31.60 | 51.92 | 72.72 | 61.24 | 48.77 |
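
To make the size of the loss concrete, the following sketch computes the absolute and relative drop for each metric from the two rows of the table. The scores are copied from the table. The helper name `score_drops` is illustrative.

```python
METRICS = ["ROUGE-1", "ROUGE-2", "ROUGE-L", "BLEU-2", "BLEU-4", "METEOR"]
NO_PROTECTION = [60.80, 33.54, 54.96, 79.85, 67.17, 53.03]
WITH_HAS = [57.37, 31.60, 51.92, 72.72, 61.24, 48.77]


def score_drops(baseline, protected):
    """Return (absolute drop, relative drop in %) per metric, rounded to 2 places."""
    drops = []
    for base, prot in zip(baseline, protected):
        absolute = base - prot
        drops.append((round(absolute, 2), round(100 * absolute / base, 2)))
    return drops


for name, (absolute, relative) in zip(METRICS, score_drops(NO_PROTECTION, WITH_HAS)):
    print(f"{name}: -{absolute:.2f} points ({relative:.2f}% relative)")
```

On these numbers, the relative drop ranges from about 5.5% (ROUGE-L) to about 8.9% (BLEU-2).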

	## Citation
	```
	@misc{chen2023hide,
	title={Hide and Seek (HaS): A Lightweight Framework for Prompt Privacy Protection},
	author={Yu Chen and Tingxin Li and Huiming Liu and Yang Yu},
	year={2023},
	eprint={2309.03057},
	archivePrefix={arXiv},
	primaryClass={cs.CR}
	}
	```