Commit 17a7ee5 by hkiyomaru
1 parent: ed3c0b7

Update README.md

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
@@ -111,7 +111,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
  The models have been pre-trained using a blend of the following datasets.
 
  | Language | Dataset | Tokens|
- |:---:|:---:|:---:|
+ |:---|:---|:---|
  |Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B
  ||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus)|130.7B
  |English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B
@@ -123,11 +123,14 @@ The models have been pre-trained using a blend of the following datasets.
  The models have been fine-tuned on the following datasets.
 
  | Language | Dataset | description |
- |:---|:---:|:---:|
- |Japanese|[jaster](https://github.com/llm-jp/llm-jp-eval)| An automatically transformed data from the existing Japanese NLP datasets |
- ||[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)| A translated one by DeepL in LLM-jp |
- ||[OpenAssistant Conversations Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1)| A translated one by DeepL in LLM-jp |
-
+ |:---|:---|:---|
+ |Japanese|[ichikara-instruction-004-001](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
+ | |[answer-carefully-001]()| A manually constructed Japanese instruction dataset focusing on LLMs' safety |
+ | |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL |
+ | |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
+ | |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
+ |English |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
+ | |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
 
  ## Evaluation