noriyukipy committed on
Commit
e1850c2
1 Parent(s): c4da11c

Upload new model with the latest data on Aug 20, 2021

Files changed (8)
  1. CHANGELOG.md +0 -16
  2. README.md +27 -22
  3. config.json +4 -2
  4. flax_model.msgpack +0 -3
  5. pytorch_model.bin +2 -2
  6. spiece.model +2 -2
  7. tf_model.h5 +2 -2
  8. tokenizer_config.json +1 -1
CHANGELOG.md DELETED
@@ -1,16 +0,0 @@
- # Changelog
-
- ## [v1]
-
- ### 2021-04-01
- #### Added
- - disclaimer and author in the "License" section
- - CHANGELOG.md
-
- ### 2021-03-27
- #### Modified
- - config.json to set default generation parameters of top_k=50, top_p=0.95 and d_samples=True
-
- ### 2021-03-27
- #### Added
- - models and model card
README.md CHANGED
@@ -2,18 +2,20 @@
  language: ja
  datasets: wikipedia
  widget:
- - text: 近年の機械学習は
- license: cc-by-sa-4.0
+ - text: 統計的機械学習でのニューラルネットワーク
+ license: cc-by-sa-3.0
  ---

  # GPT-2 small Japanese model

- This repository contains a pretrained SentencePiece tokenizer model and GPT-2 small model trained on Japanese Wikipedia dataset.
+ This repository contains a GPT-2 small model trained on the Japanese Wikipedia dataset.

  ## Training data

- [Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset which is released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) is used for training both the tokenizer and GPT-2 model as of March 1st, 2021.
- The dataset is splitted into three subsets - train, valid and test. Both of tokenizer and model are trained with the train split.
+ The [Japanese Wikipedia](https://ja.wikipedia.org/wiki/Wikipedia:データベースダウンロード) dataset as of Aug 20, 2021, released under [Creative Commons Attribution-ShareAlike 3.0](https://creativecommons.org/licenses/by-sa/3.0/), is used for both the tokenizer and the GPT-2 model.
+
+ We split the dataset into three subsets - train, valid and test sets. Both the tokenizer and the model were trained on the train set.
+ The train set contains around 540M tokens.

  ## Model description
@@ -23,42 +25,45 @@ The vocabulary size is set to 32,000 instead of an original size of 50,257.

  ## Tokenizer description

- [SentencePiece](https://github.com/google/sentencepiece) tokenizer is used as a tokenizer for this model.
+ [SentencePiece](https://github.com/google/sentencepiece) is used as the tokenizer for this model.

- In a training, the tokenizer model is trained with 10,000,000 samples which are extracted from the train split of training data.
- The vocabulary size is set to 32,000. A `add_dummy_prefix` option is set to `True` because words are not separated by whitespaces in Japanese.
+ We utilized 1,000,000 sentences from the train set.
+ The vocabulary size was 32,000.
+ The `add_dummy_prefix` option was set to `True` because Japanese words are not separated by whitespace.

- After training, the model is imported to `transformers.BERTGenerationTokenizer` because it supports SentencePiece models and it does not add any special tokens as default, which is useful expecially for a text generation task.
+ After training, the tokenizer model was imported as `transformers.BertGenerationTokenizer`
+ because it supports SentencePiece models and does not add any special tokens by default,
+ which is especially useful for a text generation task.

  ## Training

- The model is trained on the train split for 10 epochs with batch size 2 and 1024 tokens for each sample (i.e. 2048 tokens are processed in each batch). Each epoch contains around 250,000 steps.
- Adam optimizer is used. The learning rate is linearly decreased from `1e-4` to `0`. A clip norm is also used to set to `1.0`.
- After finishing training, the training loss is reached to 3.23, wihle the validation loss is reached to 3.50.
+ The model was trained on the train set for 30 epochs with batch size 32. Each sample contained 1024 tokens.
+
+ We utilized the Adam optimizer. The learning rate was linearly increased from `0` to `1e-4` during the first 10,000 steps.
+ The gradient clip norm was set to `1.0`.
+
+ The test set perplexity of the trained model was 29.13.

- All the code to train tokenizer and GPT-2 models are available in [a repository on GitHub](https://github.com/colorfulscoop/tfdlg/tree/8d068f4cc3fac49555971ad8244a540587745d79/examples/transformers-gpt2-ja)
+ Please refer to [GitHub](https://github.com/colorfulscoop/gpt-ja) for more training details.

  ## Usage

  First, install dependencies.

  ```sh
- $ pip install transformers==4.3.3 torch==1.8.0 sentencepiece==0.1.91
+ $ pip install transformers==4.10.0 torch==1.8.1 sentencepiece==0.1.96
  ```

- Then load the pretrained tokenizer and GPT-2 model, and call a `generate` method.
+ Then use a text-generation pipeline to generate sentences.

  ```sh
  >>> import transformers
- >>> tokenizer = transformers.AutoTokenizer.from_pretrained("colorfulscoop/gpt2-small-ja")
- >>> model = transformers.AutoModelForCausalLM.from_pretrained("colorfulscoop/gpt2-small-ja")
- >>> input = tokenizer.encode("近年の機械学習は", return_tensors="pt")
- >>> output = model.generate(input, do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
- >>> tokenizer.batch_decode(output)
- ['近年の機械学習は、特に、コンピューター学習において重要な概念である。この概念は、教育心理学', '近年の機械学習は時間間隔の短縮、時間間隔の短縮、学習時間の短縮、学習の', '近年の機械学習は、学生と学生が自分の能力を高め、結果を向上させることを目的としている。それは、']
+ >>> pipeline = transformers.pipeline("text-generation", "colorfulscoop/gpt2-small-ja")
+ >>> pipeline("統計的機械学習でのニューラルネットワーク", do_sample=True, top_p=0.95, top_k=50, num_return_sequences=3)
  ```

- **Note:** The default model configuration `config.json` sets some generation parameters with `do_sample=True`, `top_k=50`, `top_p=0.95`. Please reset these parameters when you need to set different parameters.
+ **Note:** The default model configuration `config.json` sets text generation parameters `do_sample=True`, `top_k=50`, and `top_p=0.95`.
+ Please override these parameters when you need different ones.

  ## License

config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "output/model",
+ "_name_or_path": "models/gpt2-small",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
@@ -21,6 +21,7 @@
  "n_positions": 1024,
  "pad_token_id": 0,
  "resid_pdrop": 0.1,
+ "scale_attn_weights": true,
  "sep_token_id": 5,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
@@ -28,7 +29,8 @@
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "tokenizer_class": "BertGenerationTokenizer",
- "transformers_version": "4.3.3",
+ "torch_dtype": "float32",
+ "transformers_version": "4.10.0",
  "unk_token_id": 1,
  "use_cache": true,
  "vocab_size": 32000,
flax_model.msgpack DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:f16a0793a705aed9f18b790ed1664e3d536e293d0bc93e1ce9e495000249684e
- size 441678616
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:01931e624962eb49c467771ad8ac53f5085b73c36b00832d3e7a696a51b94ba6
3
- size 454320379
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4
3
+ size 454320757
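The weight and tokenizer files in this commit are tracked with Git LFS, so the diffs only show their pointer metadata (sha256 and size). A minimal sketch, assuming the file has already been downloaded to the illustrative local path shown, of checking a local copy against the new `pytorch_model.bin` pointer above:

```python
import hashlib
import os

def lfs_fingerprint(path):
    """Return (sha256 hex digest, size in bytes) of a local file,
    i.e. the two fields recorded in a Git LFS pointer."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)

# Compare a locally downloaded file against the new pointer in this commit.
sha, size = lfs_fingerprint("pytorch_model.bin")  # illustrative local path
print(sha == "1e95f3fb022ae9e599aaacf7eb9b69cc2194b817da1d07dca8de6ec6127d33d4")
print(size == 454320757)
```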
spiece.model CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b003f17b42d4a24bc46a9aa216224f2ff4c93f7402df69b8236707e3e91454d5
3
- size 802167
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec01688f1fdf79d9596d099bfe2cd1ec8d0871e848660ca643907a3e4a5fb97f
3
+ size 802969
tf_model.h5 CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fbfff336b880f72b22d5b66422ef579f5d62404fad3280fd8c370f5b1ec6e24
3
- size 441848144
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:20683b6b0ee0f9c2e6b017085fa169ae4ea62af56b46ff17334e312e6b8b1849
3
+ size 441849416
tokenizer_config.json CHANGED
@@ -1 +1 @@
1
- {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "cls_token": "<cls>"}
1
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "pad_token": "<pad>", "sep_token": "<sep>", "sp_model_kwargs": {}, "cls_token": "<cls>", "special_tokens_map_file": "models/small-v2/special_tokens_map.json", "tokenizer_file": null, "name_or_path": "models/small-v2/", "tokenizer_class": "BertGenerationTokenizer"}