	---
	language: zh
	datasets: poetry
	inference:
	  parameters:
	    max_length: 108
	    num_return_sequences: 1
	    do_sample: True
	widget:
	- text: "物换星移几度秋"
	  example_title: "滕王阁 1"
	- text: "秋水共长天一色"
	  example_title: "滕王阁 2"
	- text: "萍水相逢，尽是他乡之客。"
	  example_title: "滕王阁 3"

	---


	# Classical Chinese Poetry

	## Model description

	A GPT-2 model that generates classical Chinese poetry (古诗词) from a text prompt.

	## How to use
	Call the model through a pipeline:

	```python
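	from transformers import AutoTokenizer, GPT2LMHeadModel, TextGenerationPipeline

	# Load the tokenizer and model weights from the Hugging Face Hub.
	model_checkpoint = "supermy/poetry"
	tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
	model = GPT2LMHeadModel.from_pretrained(model_checkpoint)

	# Wrap both in a text-generation pipeline.
	text_generator = TextGenerationPipeline(model, tokenizer)
	# Use the end-of-sequence token as the padding token.
	text_generator.model.config.pad_token_id = text_generator.model.config.eos_token_id

	# Sampling is random, so each call returns a different poem.
	print(text_generator("举头望明月，", max_length=100, do_sample=True))
	print(text_generator("物换星移几度秋，", max_length=100, do_sample=True))

	# Example output from an interactive session: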
	>>> print(text_generator("举头望明月，", max_length=100, do_sample=True))
	[{'generated_text': '举头望明月，何以喻无言。顾影若为舞，啸风清独伤。四时别有意，千古得从容。赏音我非此，何如鸥鹭群。崎山有佳色，落落样相宜。不嫌雪霜温，宁受四时肥。老态如偷面，冬心似相知。春风不可恃，触动春何为。岁晚忽然老，花前岁月深。可笑一场梦，婵娟乍自心。列名多岁月，森列尽林峦。试问影非笑'}]
	>>> print(text_generator("物换星移几度秋，", max_length=100, do_sample=True))
	[{'generated_text': '物换星移几度秋，消长随时向一丘。渔者下逢勾漏令，漏声高出景阳丘。天津大尹昔从游，大尹来时春复秋。旗鼓日严宣使从，联镳歌笑又风流。冈峦比并瑶溪水，叠嶂高盘黼黻洲。花木芳菲三月天，莺花暖翠几流年。一从别后多携手，肠断酒阑怀凛然。北阙人称似梦中，西山别样梦魂香。多君观国亲圭璧，能预陇西称巨良。刷羽刷羽'}]

	```
	To load the tokenizer and model directly in PyTorch:

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	tokenizer = AutoTokenizer.from_pretrained("supermy/poetry")
	model = AutoModelForCausalLM.from_pretrained("supermy/poetry")
	```
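
Once the tokenizer and model are loaded this way, text can be generated with `model.generate` instead of a pipeline. The following is a minimal sketch, not the author's code: the helper names `sampling_kwargs` and `generate_poems` are hypothetical, and the default settings simply mirror the `inference.parameters` values in this card's metadata (`max_length: 108`, `num_return_sequences: 1`, `do_sample: True`).

```python
def sampling_kwargs(pad_token_id, max_length=108, num_return_sequences=1):
    """Build generate() arguments that mirror this card's widget settings."""
    return {
        "max_length": max_length,
        "num_return_sequences": num_return_sequences,
        "do_sample": True,
        # Reuse the model's end-of-sequence id for padding, as in the pipeline example.
        "pad_token_id": pad_token_id,
    }


def generate_poems(prompt, checkpoint="supermy/poetry"):
    """Download the checkpoint and sample continuations of `prompt`."""
    # Imported here so sampling_kwargs() can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **sampling_kwargs(model.config.eos_token_id),
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Example call (downloads the model, so it is left commented out):
# print(generate_poems("举头望明月，"))
```

Because `do_sample` is enabled, repeated calls with the same prompt will generally return different poems.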



	## Training data


	A comprehensive corpus of classical Chinese poetry, covering more than 850,000 poems from the pre-Qin period to the present day.

	## Statistics

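	| Period | Poems | Authors |
	|-----------------------|--------|--------|
	| Song (宋) | 287114 | 9446 |
	| Ming (明) | 236957 | 4439 |
	| Qing (清) | 90089 | 8872 |
	| Tang (唐) | 49195 | 2736 |
	| Yuan (元) | 37375 | 1209 |
	| Modern (近现代) | 28419 | 790 |
	| Contemporary (当代) | 28219 | 177 |
	| Late Ming / early Qing (明末清初) | 17700 | 176 |
	| Late Yuan / early Ming (元末明初) | 15736 | 79 |
	| Late Qing / early Republic (清末民国初) | 15367 | 99 |
	| Late Qing / early modern (清末近现代初) | 12464 | 48 |
	| Late Song / early Yuan (宋末元初) | 12058 | 41 |
	| Northern and Southern Dynasties (南北朝) | 4586 | 434 |
	| Late modern / early contemporary (近现代末当代初) | 3426 | 23 |
	| Wei-Jin (魏晋) | 3020 | 251 |
	| Late Jin / early Yuan (金末元初) | 3019 | 17 |
	| Jin (金) | 2741 | 253 |
	| Late Republic / early contemporary (民国末当代初) | 1948 | 9 |
	| Sui (隋) | 1170 | 84 |
	| Late Tang / early Song (唐末宋初) | 1118 | 44 |
	| Pre-Qin (先秦) | 570 | 8 |
	| Late Sui / early Tang (隋末唐初) | 472 | 40 |
	| Han (汉) | 363 | 83 |
	| Late Song / early Jin (宋末金初) | 234 | 9 |
	| Liao (辽) | 22 | 7 |
	| Qin (秦) | 2 | 2 |
	| Late Wei-Jin / early Northern and Southern Dynasties (魏晋末南北朝初) | 1 | 1 |
	| Total | 853385 | 29377 |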


	## Training procedure

	Model: [GPT2](https://huggingface.co/gpt2)
	Training environment: NVIDIA GPU with 16 GB of memory

	BPE tokenization: "vocab_size"=50000
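
For readers unfamiliar with byte-pair encoding (BPE), the toy sketch below illustrates the core idea: repeatedly merge the most frequent adjacent pair of symbols. It is an illustration only and not the tokenizer used to train this model; the function names `learn_bpe_merges` and `apply_merge` are hypothetical, and a real tokenizer with a 50,000-entry vocabulary would be trained with a dedicated library on the full corpus.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn at most `num_merges` merges from an iterable of strings."""
    # Every line starts out as a sequence of single characters.
    sequences = [tuple(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            # Stop once no pair repeats; merging it would not compress anything.
            break
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


example_merges = learn_bpe_merges(["明月明月", "明月光"], num_merges=10)
print(example_merges)  # [('明', '月')]
```

In this tiny corpus only the pair (明, 月) occurs more than once, so a single merge is learned. The final vocabulary size of a BPE tokenizer is roughly the number of base symbols plus the number of learned merges.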
	```

	*** Running training ***
	Num examples = 16431
	Num Epochs = 680
	Instantaneous batch size per device = 24
	Total train batch size (w. parallel, distributed & accumulation) = 192
	Gradient Accumulation steps = 8
	Total optimization steps = 57800
	Number of trainable parameters = 124242432
	GPT-2 size: 124.2M parameters
	0%| | 0/57800 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
	9%|▊ | 5000/57800 [6:58:57<72:53:18, 4.97s/it]*** Running Evaluation ***
	Num examples = 1755
	Batch size = 24
	{'loss': 3.1345, 'learning_rate': 0.0004939065828881268, 'epoch': 58.82}
	9%|▊ | 5000/57800 [6:59:14<72:53:18, Saving model checkpoint to poetry-trainer/checkpoint-5000
	Configuration saved in poetry-trainer/checkpoint-5000/config.json
	Model weights saved in poetry-trainer/checkpoint-5000/pytorch_model.bin
	tokenizer config file saved in poetry-trainer/checkpoint-5000/tokenizer_config.json
	Special tokens file saved in poetry-trainer/checkpoint-5000/special_tokens_map.json
	17%|█▋ | 10000/57800 [13:55:32<65:40:41, 4.95s/it]*** Running Evaluation ***
	Num examples = 1755
	Batch size = 24
	{'eval_loss': 11.14090633392334, 'eval_runtime': 16.8326, 'eval_samples_per_second': 104.262, 'eval_steps_per_second': 4.396, 'epoch': 58.82}
	{'loss': 0.2511, 'learning_rate': 0.00046966687938531824, 'epoch': 117.64}
	17%|█▋ | 10000/57800 [13:55:48<65:40:41Saving model checkpoint to poetry-trainer/checkpoint-10000
	..........
	95%|█████████▌| 55000/57800 [76:06:46<3:59:33, 5.13s/it]*** Running Evaluation ***
	Num examples = 1755
	Batch size = 24
	{'eval_loss': 14.860174179077148, 'eval_runtime': 16.7826, 'eval_samples_per_second': 104.572, 'eval_steps_per_second': 4.409, 'epoch': 588.23}
	{'loss': 0.0083, 'learning_rate': 3.0262183266589473e-06, 'epoch': 647.06}
	95%|█████████▌| 55000/57800 [76:07:03<3:59:33,Saving model checkpoint to poetry-trainer/checkpoint-55000

	{'eval_loss': 14.830656051635742, 'eval_runtime': 16.7365, 'eval_samples_per_second': 104.86, 'eval_steps_per_second': 4.421, 'epoch': 647.06}
	{'train_runtime': 287920.5857, 'train_samples_per_second': 38.806, 'train_steps_per_second': 0.201, 'train_loss': 0.33751299874592816, 'epoch': 679.99}

	100%|██████████| 57800/57800 [79:58:40<00:00, 4.93s/it]
	```
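
The step and epoch figures in the log can be reproduced from the reported settings. The sketch below is a rough consistency check, not part of the training script. It assumes a single GPU (so 24 × 8 = 192) and that the Trainer version used counted optimizer steps per epoch as the number of batches divided by the accumulation steps, rounded down. Both assumptions are inferred from the logged numbers rather than stated in the log, and the name `epoch_at_step` is hypothetical.

```python
import math

# Values copied from the training log.
num_examples = 16431
per_device_batch_size = 24
gradient_accumulation_steps = 8
num_epochs = 680

# Assumption: one device, so the effective batch is batch size x accumulation steps.
total_batch_size = per_device_batch_size * gradient_accumulation_steps  # 192

# Batches per epoch, keeping the final partial batch.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)  # 685

# Assumption: optimizer updates per epoch are rounded down.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps  # 85

total_optimization_steps = updates_per_epoch * num_epochs  # 57800


def epoch_at_step(global_step):
    """Fractional epoch reached after `global_step` optimizer updates."""
    full_epochs, updates_into_epoch = divmod(global_step, updates_per_epoch)
    batches_into_epoch = updates_into_epoch * gradient_accumulation_steps
    return full_epochs + batches_into_epoch / batches_per_epoch


for step in (5000, 10000, 55000):
    print(step, round(epoch_at_step(step), 2))
```

Under these assumptions the computed values match the log: 57800 total optimization steps, and epochs 58.82, 117.64, and 647.06 at steps 5000, 10000, and 55000.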

