---
language: zh
datasets: poetry
inference:
  parameters:
    max_length: 108
    num_return_sequences: 1
    do_sample: True
widget: 
- text: "物换 星移 几度 秋"
  example_title: "滕王阁 1"
- text: "秋水 共 长天 一色"
  example_title: "滕王阁 2"
- text: "萍水 相逢，尽是 他乡 之 客。"
  example_title: "滕王阁 3"

---


# Classical Chinese Poetry (古诗词)

## Model description

A GPT-2 model that generates classical Chinese poetry (古诗词).

## How to use
Call the model through a pipeline:

```python
from transformers import AutoTokenizer, GPT2LMHeadModel, TextGenerationPipeline

# Load the tokenizer and the GPT-2 weights from the Hugging Face Hub.
model_checkpoint = "supermy/poetry"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = GPT2LMHeadModel.from_pretrained(model_checkpoint)
text_generator = TextGenerationPipeline(model, tokenizer)
# Reuse the end-of-sequence token for padding.
text_generator.model.config.pad_token_id = text_generator.model.config.eos_token_id

# Prompts are word-segmented: words are separated by spaces.
print(text_generator("举头 望 明月，", max_length=100, do_sample=True))
print(text_generator("物换 星移 几度 秋，", max_length=100, do_sample=True))

# Sample session output (sampling is random, so results will differ):
>>> print(text_generator("举头 望 明月，", max_length=100, do_sample=True))
[{'generated_text': '举头 望 明月， 何以 喻 无言 。 顾影 若为 舞 ， 啸 风清 独 伤 。 四时 别有 意 ， 千古 得 从容 。 赏音 我非 此 ， 何如 鸥鹭 群 。 崎 山有 佳色 ， 落落 样 相宜 。 不嫌 雪霜 温 ， 宁 受 四时 肥 。 老 态 如 偷 面 ， 冬 心 似 相知 。 春风 不可 恃 ， 触 动 春 何为 。 岁晚 忽然 老 ， 花前 岁月深 。 可笑 一场 梦 ， 婵娟 乍 自 心 。 列 名 多 岁月 ， 森 列 尽 林峦 。 试问 影 非 笑'}]
>>> print(text_generator("物换 星移 几度 秋，", max_length=100, do_sample=True))
[{'generated_text': '物换 星移 几度 秋， 消长 随时 向 一丘 。 渔者 下 逢 勾漏 令 ， 漏声 高出 景阳 丘 。 天津 大尹 昔 从游 ， 大尹 来时 春复 秋 。 旗鼓 日 严 宣 使 从 ， 联镳 歌笑 又 风流 。 冈峦 比 并 瑶 溪 水 ， 叠嶂 高 盘 黼黻 洲 。 花木 芳菲 三月 天 ， 莺花 暖 翠 几 流年 。 一从 别后 多 携手 ， 肠断 酒阑 怀 凛然 。 北阙 人称 似梦中 ， 西山 别样 梦魂 香 。 多君 观国 亲 圭璧 ， 能 预 陇西 称 巨 良 。 刷羽 刷羽'}]

```
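
The sample outputs above are space-separated, which is awkward to read as a poem. The following is a minimal post-processing sketch, not part of the model or of `transformers`: the helper name `format_poem` is hypothetical, and it assumes that stripping all whitespace and breaking after each `。` is the desired layout.

```python
import re


def format_poem(generated_text: str) -> str:
    """Drop the word-segmentation spaces and put each sentence on its own line."""
    # Remove every whitespace character between the segmented words.
    compact = re.sub(r"\s+", "", generated_text)
    # Break after each full stop; a trailing unfinished sentence is kept as the last line.
    sentences = re.findall(r"[^。]+。?", compact)
    return "\n".join(sentences)


# A shortened excerpt of the first sample output above.
sample = "举头 望 明月， 何以 喻 无言 。 顾影 若为 舞 ， 啸 风清 独 伤 。"
print(format_poem(sample))
```

In practice the helper would be applied to the `generated_text` field of each dictionary the pipeline returns.
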
The tokenizer and model can also be loaded directly in PyTorch:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("supermy/poetry")
model = AutoModelForCausalLM.from_pretrained("supermy/poetry")
```



## Training data


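A comprehensive corpus of classical Chinese poetry, containing more than 850,000 poems that span the pre-Qin period to the modern era.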

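## Statistics

| Dynasty / period                                                | Poems  | Authors |
|-----------------------------------------------------------------|--------|---------|
| Song (宋)                                                       | 287114 |    9446 |
| Ming (明)                                                       | 236957 |    4439 |
| Qing (清)                                                       |  90089 |    8872 |
| Tang (唐)                                                       |  49195 |    2736 |
| Yuan (元)                                                       |  37375 |    1209 |
| Modern (近现代)                                                 |  28419 |     790 |
| Contemporary (当代)                                             |  28219 |     177 |
| Late Ming to early Qing (明末清初)                              |  17700 |     176 |
| Late Yuan to early Ming (元末明初)                              |  15736 |      79 |
| Late Qing to early Republic (清末民国初)                        |  15367 |      99 |
| Late Qing to early modern (清末近现代初)                        |  12464 |      48 |
| Late Song to early Yuan (宋末元初)                              |  12058 |      41 |
| Northern and Southern Dynasties (南北朝)                        |   4586 |     434 |
| Late modern to early contemporary (近现代末当代初)              |   3426 |      23 |
| Wei-Jin (魏晋)                                                  |   3020 |     251 |
| Late Jin to early Yuan (金末元初)                               |   3019 |      17 |
| Jin (金)                                                        |   2741 |     253 |
| Late Republic to early contemporary (民国末当代初)              |   1948 |       9 |
| Sui (隋)                                                        |   1170 |      84 |
| Late Tang to early Song (唐末宋初)                              |   1118 |      44 |
| Pre-Qin (先秦)                                                  |    570 |       8 |
| Late Sui to early Tang (隋末唐初)                               |    472 |      40 |
| Han (汉)                                                        |    363 |      83 |
| Late Song to early Jin (宋末金初)                               |    234 |       9 |
| Liao (辽)                                                       |     22 |       7 |
| Qin (秦)                                                        |      2 |       2 |
| Late Wei-Jin to early Northern and Southern Dynasties (魏晋末南北朝初) |      1 |       1 |
| Total                                                           | 853385 |   29377 |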
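
To get a feel for how the corpus is distributed, the short sketch below computes the share of the four largest dynasties from the counts in the table. The variable names are illustrative only; the numbers are copied from the table.

```python
# Poem counts copied from the statistics table (four largest dynasties).
poem_counts = {
    "Song (宋)": 287114,
    "Ming (明)": 236957,
    "Qing (清)": 90089,
    "Tang (唐)": 49195,
}
total_poems = 853385  # the "total" row of the table

# Share of the whole corpus contributed by each of these dynasties.
shares = {name: count / total_poems for name, count in poem_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Song and Ming together account for well over half of the corpus.
song_ming_share = shares["Song (宋)"] + shares["Ming (明)"]
print(f"Song + Ming: {song_ming_share:.1%}")
```
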


## Training procedure

Model: [GPT2](https://huggingface.co/gpt2)

Training environment: NVIDIA GPU with 16 GB of memory

BPE tokenization: "vocab_size"=50000
```

***** Running training *****
  Num examples = 16431
  Num Epochs = 680
  Instantaneous batch size per device = 24
  Total train batch size (w. parallel, distributed & accumulation) = 192
  Gradient Accumulation steps = 8
  Total optimization steps = 57800
  Number of trainable parameters = 124242432
GPT-2 size: 124.2M parameters
  0%|          | 0/57800 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
    9%|▊         | 5000/57800 [6:58:57<72:53:18,  4.97s/it]***** Running Evaluation *****
  Num examples = 1755
  Batch size = 24
{'loss': 3.1345, 'learning_rate': 0.0004939065828881268, 'epoch': 58.82}
  9%|▊         | 5000/57800 [6:59:14<72:53:18, Saving model checkpoint to poetry-trainer/checkpoint-5000
Configuration saved in poetry-trainer/checkpoint-5000/config.json
Model weights saved in poetry-trainer/checkpoint-5000/pytorch_model.bin
tokenizer config file saved in poetry-trainer/checkpoint-5000/tokenizer_config.json
Special tokens file saved in poetry-trainer/checkpoint-5000/special_tokens_map.json
 17%|█▋        | 10000/57800 [13:55:32<65:40:41,  4.95s/it]***** Running Evaluation *****
  Num examples = 1755
  Batch size = 24
{'eval_loss': 11.14090633392334, 'eval_runtime': 16.8326, 'eval_samples_per_second': 104.262, 'eval_steps_per_second': 4.396, 'epoch': 58.82}
{'loss': 0.2511, 'learning_rate': 0.00046966687938531824, 'epoch': 117.64}
 17%|█▋        | 10000/57800 [13:55:48<65:40:41Saving model checkpoint to poetry-trainer/checkpoint-10000
..........
 95%|█████████▌| 55000/57800 [76:06:46<3:59:33,  5.13s/it]***** Running Evaluation *****
  Num examples = 1755
  Batch size = 24
{'eval_loss': 14.860174179077148, 'eval_runtime': 16.7826, 'eval_samples_per_second': 104.572, 'eval_steps_per_second': 4.409, 'epoch': 588.23}
{'loss': 0.0083, 'learning_rate': 3.0262183266589473e-06, 'epoch': 647.06}
 95%|█████████▌| 55000/57800 [76:07:03<3:59:33,Saving model checkpoint to poetry-trainer/checkpoint-55000

{'eval_loss': 14.830656051635742, 'eval_runtime': 16.7365, 'eval_samples_per_second': 104.86, 'eval_steps_per_second': 4.421, 'epoch': 647.06}
{'train_runtime': 287920.5857, 'train_samples_per_second': 38.806, 'train_steps_per_second': 0.201, 'train_loss': 0.33751299874592816, 'epoch': 679.99}

100%|██████████| 57800/57800 [79:58:40<00:00,  4.93s/it]  
```
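
The figures in the log are mutually consistent, which the following sketch checks. It assumes a single device (192 = 24 × 8), that the last partial batch of each epoch is kept, and that batches left over after the final full accumulation window do not count as an optimization step. These are assumptions about how the step count was derived, not statements taken from the log.

```python
import math

# Values copied from the training log above.
num_examples = 16431
per_device_batch_size = 24
gradient_accumulation_steps = 8
num_epochs = 680
train_runtime_seconds = 287920.5857

# Assumption: one device, last partial batch kept.
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# Assumption: leftover batches that do not fill an accumulation window add no step.
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_epochs

print(total_steps)                                # 57800, as in "Total optimization steps"
print(round(5000 / steps_per_epoch, 2))           # 58.82, the epoch logged at step 5000
print(round(train_runtime_seconds / 3600, 2))     # 79.98 hours, matching 79:58:40
```
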

