noriyukipy committed on
Commit
e1850c2
1 Parent(s): c4da11c

Upload new model with the latest data on Aug 20, 2021

Files changed (8)
  1. CHANGELOG.md +0 -16
  2. README.md +27 -22
  3. config.json +4 -2
  4. flax_model.msgpack +0 -3
  5. pytorch_model.bin +2 -2
  6. spiece.model +2 -2
  7. tf_model.h5 +2 -2
  8. tokenizer_config.json +1 -1
CHANGELOG.md DELETED
@@ -1,16 +0,0 @@
- # Changelog
-
- ## [v1]
-
- ### 2021-04-01
- #### Added
- - disclaimer and author in the "License" section
- - CHANGELOG.md
-
- ### 2021-03-27
- #### Modified
- - config.json to set default generation parameters of top_k=50, top_p=0.95 and d_samples=True
-
- ### 2021-03-27
- #### Added
- - models and model card
README.md CHANGED
@@ -2,18 +2,20 @@
  language: ja
  datasets: wikipedia
  widget:
- - text: 近年の機械学習は
- license: cc-by-sa-4.0
+ - text: 統計的機械学習でのニューラルネットワーク
+ license: cc-by-sa-3.0
  ---

  # GPT-2 small Japanese model

- This repository contains a pretrained SentencePiece tokenizer model and GPT-2 small model trained on Japanese Wikipedia dataset.
+ This repository contains a GPT-2 small model trained on the Japanese Wikipedia dataset.

  ## Training data

- [Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset which is released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) is used for training both the tokenizer and GPT-2 model as of March 1st, 2021.
- The dataset is splitted into three subsets - train, valid and test. Both of tokenizer and model are trained with the train split.
+ The [Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset as of Aug 20, 2021, released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/), is used for both the tokenizer and the GPT-2 model.
+
+ We split the dataset into three subsets - train, valid and test sets. Both the tokenizer and the model were trained on the train set.
+ The train set contains around 540M tokens.

  ## Model description
@@ -23,42 +25,45 @@ The vocabulary size is set to 32,000 instead of an original size of 50,257.

  ## Tokenizer description

- [SentencePiece](https://github.com/google/sentencepiece) tokenizer is used as a tokenizer for this model.
+ [SentencePiece](https://github.com/google/sentencepiece) is used as the tokenizer for this model.

- In a training, the tokenizer model is trained with 10,000,000 samples which are extracted from the train split of training data.
- The vocabulary size is set to 32,000. A `add_dummy_prefix` option is set to `True` because words are not separated by whitespaces in Japanese.
+ We utilized 1,000,000 sentences from the train set.
+ The vocabulary size was 32,000.
+ The `add_dummy_prefix` option was set to `True` because Japanese words are not separated by whitespace.

- After training, the model is imported to `transformers.BERTGenerationTokenizer` because it supports SentencePiece models and it does not add any special tokens as default, which is useful expecially for a text generation task.
+ After training, the tokenizer model was imported as `transformers.BertGenerationTokenizer`
+ because it supports SentencePiece models and does not add any special tokens by default,
+ which is especially useful for a text generation task.

  ## Training

- The model is trained on the train split for 10 epochs with batch size 2 and 1024 tokens for each sample (i.e. 2048 tokens are processed in each batch). Each epoch contains around 250,000 steps.
- Adam optimizer is used. The learning rate is linearly decreased from `1e-4` to `0`. A clip norm is also used to set to `1.0`.
- After finishing training, the training loss is reached to 3.23, wihle the validation loss is reached to 3.50.
+ The model was trained on the train set for 30 epochs with batch size 32. Each sample contained 1024 tokens.
+
+ We utilized the Adam optimizer. The learning rate was linearly increased from `0` to `1e-4` during the first 10,000 steps.
+ The gradient clip norm was set to `1.0`.
+
+ The test set perplexity of the trained model was 29.13.

- All the code to train tokenizer and GPT-2 models are available in [a repository on GitHub](https://github.com/colorfulscoop/tfdlg/tree/8d068f4cc3fac49555971ad8244a540587745d79/examples/transformers-gpt2-ja)
+ Please refer to [GitHub](https://github.com/colorfulscoop/gpt-ja) for more training details.

  ## Usage

  First, install dependencies.

  ```sh
- $ pip install transformers==4.3.3 torch==1.8.0 sentencepiece==0.1.91
+ $ pip install transformers==4.10.0 torch==1.8.1 sentencepiece==0.1.96
  ```

- Then load the pretrained tokenizer and GPT-2 model, and call a `generate` method.
+ Then use a text-generation pipeline to generate sentences.

  ```sh
  >>> import transformers
- >>> tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
- >>> model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")
- >>> input = tokenizer.encode("近年の機械学習は", return_tensors="pt")
- >>> output = model.generate(input, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
- >>> tokenizer.batch_decode(output)
- ['近年の機械学習は、特に、コンピューター学習において重要な概念である。この概念は、教育心理学', '近年の機械学習は時間間隔の短縮、時間間隔の短縮、学習時間の短縮、学習の', '近年の機械学習は、学生と学生が自分の能力を高め、結果を向上させることを目的としている。それは、']
+ >>> pipeline = transformers.pipeline("text-generation", "colorfulscoop/gpt2-small-ja")
+ >>> pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
  ```

- **Note:** The default model configuration `config.json` sets some generation parameters with `do_sample=True`, `top_k=50`, `top_p=0.95`. Please reset these parameters when you need to set different parameters.
+ **Note:** The default model configuration `config.json` sets text generation parameters `do_sample=True`, `top_k=50`, and `top_p=0.95`.
+ Please override these parameters when you need different ones.

  ## License

config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "output/model",
+ "_name_or_path": "models/gpt2-small",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
@@ -21,6 +21,7 @@
  "n_positions": 1024,
  "pad_token_id": 0,
  "resid_pdrop": 0.1,
+ "scale_attn_weights": true,
  "sep_token_id": 5,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
@@ -28,7 +29,8 @@
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "tokenizer_class": "BertGenerationTokenizer",
- "transformers_version": "4.3.3",
+ "torch_dtype": "float32",
+ "transformers_version": "4.10.0",
  "unk_token_id": 1,
  "use_cache": true,
  "vocab_size": 32000,
flax_model.msgpack DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:f16a0793a705aed9f18b790ed1664e3d536e293d0bc93e1ce9e495000249684e
- size 441678616
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:01931e624962eb49c467771ad8ac53f5085b73c36b00832d3e7a696a51b94ba6
3
- size 454320379
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4
3
+ size 454320757
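The weight and tokenizer files in this commit are tracked with Git LFS, so the diffs only show their pointer metadata (sha256 and size). A minimal sketch, assuming the file has already been downloaded to the illustrative local path shown, of checking a local copy against the new `pytorch_model.bin` pointer above:

```python
import hashlib
import os

def lfs_fingerprint(path):
    """Return (sha256 hex digest, size in bytes) of a local file,
    i.e. the two fields recorded in a Git LFS pointer."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)

# Compare a locally downloaded file against the new pointer in this commit.
sha, size = lfs_fingerprint("pytorch_model.bin")  # illustrative local path
print(sha == "1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4")
print(size == 454320757)
```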
spiece.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b003f17b42d4a24bc46a9aa216224f2ff4c93f7402df69b8236707e3e91454d5
3
- size 802167
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec01688f1fdf79d9596d099bfe2cd1ec8d0871e848660ca643907a3e4a5fb97f
3
+ size 802969
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fbfff336b880f72b22d5b66422ef579f5d62404fad3280fd8c370f5b1ec6e24
3
- size 441848144
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20683b6b0ee0f9c2e6b017085fa169ae4ea62af56b46ff17334e312e6b8b1849
3
+ size 441849416
tokenizer_config.json CHANGED
@@ -1 +1 @@
1
- {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "cls_token": "<cls>"}
1
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "sp_model_kwargs": {}, "cls_token": "<cls>", "special_tokens_map_file": "models/small-v2/special_tokens_map.json", "tokenizer_file": null, "name_or_path": "models/small-v2/", "tokenizer_class": "BertGenerationTokenizer"}