---
license: cc-by-nc-4.0
---

# tsubaki-10b-instruction-sft

# Overview
This repository provides a Japanese-centric multilingual GPT-NeoX model with 10 billion parameters.

* **Library**

The model was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox).

* **Model architecture**

A 36-layer, 4864-hidden-size transformer-based language model.

* **Pre-training**

The model was trained on around **600B** tokens drawn from a mixture of the following corpora:

- [Japanese C4](https://huggingface.co/datasets/mc4)
- [The Pile](https://huggingface.co/datasets/EleutherAI/pile)

* **Instruction-supervised fine-tuning**

The model was fine-tuned on a subset of records drawn from a mixture of the following datasets (a sketch of how such records can be rendered into the model's prompt template appears after this list):

- [Alpaca (English)](https://github.com/gururise/AlpacaDataCleaned/blob/main/alpaca_data_cleaned.json)
- [Alpaca (Japanese translation)](https://github.com/shi3z/alpaca_ja/blob/main/alpaca_cleaned_ja.json)
- [Flan 2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original)
- [Flan CoT](https://huggingface.co/datasets/conceptofmind/cot_submix_original)
- [Flan Dialog](https://huggingface.co/datasets/conceptofmind/dialog_submix_original)

* **Model Series**

| Variant | Link |
| :-- | :-- |
| tsubaki-10b-instruction-sft | https://huggingface.co/Kojima777/tsubaki-10b-instruction-sft |
| tsubaki-10b | https://huggingface.co/Kojima777/tsubaki-10b |

* **Authors**

Takeshi Kojima

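The instruction data is presented to the model through the Alpaca-style prompt template shown in the usage example below. The following is a rough, hypothetical sketch of how a single `{instruction, input, output}` record might be rendered into that template; only the template string itself is taken from this card, and the handling of the optional `input` field is an assumption, not the documented training procedure.

~~~~python
# Hypothetical preprocessing sketch. The template string matches the usage
# example in this card; everything else (function name, input handling) is an
# illustrative assumption.
PROMPT_TEMPLATE = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    "\n\n### 指示:\n{instruction}\n\n### 応答:"
)

def build_training_example(record: dict) -> str:
    """Render an {instruction, input, output} record into prompt + response text."""
    instruction = record["instruction"]
    if record.get("input"):
        # Folding the optional input into the instruction body is an assumed convention.
        instruction = f'{instruction}\n{record["input"]}'
    # The target response follows the "### 応答:" marker.
    return PROMPT_TEMPLATE.format(instruction=instruction) + record["output"]

# Example with a made-up Alpaca-style record:
print(build_training_example({
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well.",
}))
~~~~
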
---

# Benchmarking

* **Japanese benchmark**

- *The 4-task average accuracy is the unweighted mean of the JCommonsenseQA, JNLI, MARC-ja, and JSQuAD scores.*

| Model | Average | JCommonsenseQA | JNLI | MARC-ja | JSQuAD |
| :-- | :-- | :-- | :-- | :-- | :-- |
| tsubaki-10b-instruction-sft | 79.04 | 74.35 | 65.65 | 96.06 | 80.09 |
| tsubaki-10b | 67.27 | 65.86 | 54.19 | 84.49 | 64.54 |

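The Average column can be reproduced directly from the four per-task scores in the table; a quick check:

~~~~python
# Unweighted mean of the four task accuracies reported above.
scores = {
    "tsubaki-10b-instruction-sft": [74.35, 65.65, 96.06, 80.09],
    "tsubaki-10b": [65.86, 54.19, 84.49, 64.54],
}
for name, vals in scores.items():
    print(name, round(sum(vals) / len(vals), 2))  # 79.04 and 67.27
~~~~
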
---

# How to use the model

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kojima777/tsubaki-10b-instruction-sft", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("Kojima777/tsubaki-10b-instruction-sft")

if torch.cuda.is_available():
    model = model.to("cuda")

# User request: "Please explain large language models."
text = "大規模言語モデルについて説明してください。"
# Wrap the request in the instruction template used for fine-tuning:
# "Below is an instruction that describes a task. Write a response that appropriately satisfies the request."
text = f'以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n### 指示:\n{text}\n\n### 応答:'
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
~~~~
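
The snippet above loads the weights in full precision, which for a 10-billion-parameter model needs roughly 40 GB of memory. A minimal sketch of loading in half precision instead, using the standard `torch_dtype` argument of `from_pretrained` (whether float16 quality is acceptable for your use case is an assumption to verify):

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kojima777/tsubaki-10b-instruction-sft", use_fast=False)

# float16 weights roughly halve memory use compared with the default float32.
model = AutoModelForCausalLM.from_pretrained(
    "Kojima777/tsubaki-10b-instruction-sft",
    torch_dtype=torch.float16,
)
if torch.cuda.is_available():
    model = model.to("cuda")
~~~~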
---

# License
[cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/)