---
license: cc-by-nc-4.0
---

# tsubaki-10b

# Overview
This repository provides a Japanese-centric multilingual GPT-NeoX model of 10 billion parameters.

* **Library**

The model was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox).

* **Model architecture**

A 36-layer, 4864-hidden-size transformer-based language model.

* **Pre-training**

The model was trained on around **600B** tokens from a mixture of the following corpora:

- [Japanese C4](https://huggingface.co/datasets/mc4)
- [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
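As a rough sanity check on the stated size, the common back-of-the-envelope estimate of about 12 · n_layers · d_model² parameters for a transformer matches the 36-layer, 4864-hidden configuration above (a sketch only; the exact count also depends on vocabulary size, embeddings, and biases, which are not listed here):

```python
# Back-of-the-envelope parameter estimate for a transformer decoder:
# each layer holds roughly 4*d^2 attention weights plus 8*d^2 MLP weights.
def estimate_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

print(estimate_params(36, 4864))  # ~1.02e10, consistent with "10 billion parameters"
```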

* **Model Series**

| Variant | Link |
| :-- | :-- |
| tsubaki-10b-instruction-sft | https://huggingface.co/Kojima777/tsubaki-10b-sft-alpaca-flan-cot-dialog |
| tsubaki-10b | https://huggingface.co/Kojima777/tsubaki-10b |

* **Authors**

Takeshi Kojima

---

# Benchmarking

* **Japanese benchmark**

- *The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.*

| Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD |
| :-- | :-- | :-- | :-- | :-- | :-- |
| tsubaki-10b-instruction-sft | 79.04 | 74.35 | 65.65 | 96.06 | 80.09 |
| tsubaki-10b | 67.27 | 65.86 | 54.19 | 84.49 | 64.54 |

---

# How to use the model

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kojima777/tsubaki-10b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("Kojima777/tsubaki-10b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "吾輩は猫である。"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
~~~~
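For readers unfamiliar with the sampling arguments above: `temperature=0.6` sharpens the next-token distribution, and `top_p=0.9` restricts sampling to the smallest set of tokens whose probabilities sum to at least 0.9 (nucleus sampling). A minimal NumPy sketch of that filtering step, illustrating the general technique rather than this repository's internals:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    # Lower temperature -> sharper (more deterministic) distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def top_p_indices(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, scanning in descending-probability order.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    return order[:cutoff]

probs = softmax_with_temperature([2.0, 1.0, 0.5, -1.0], temperature=0.6)
print(top_p_indices(probs, 0.9))  # only the two most likely tokens survive
```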
---

# License
[cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/)