omitakahiro committed 5d4af2f (parent: 506a6ed)

Update README.md

README.md CHANGED
@@ -21,7 +21,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
 model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16)
 
-
+input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
 with torch.inference_mode():
     tokens = model.generate(
         input_ids,
@@ -33,4 +33,29 @@ with torch.inference_mode():
 
 output = tokenizer.decode(tokens[0], skip_special_tokens=True)
 print(output)
-```
+```
+
+## Dataset (pretraining)
+
+Stockmark-100b was trained on a total of about 910B tokens of Japanese and English text. The Japanese data is summarized in the table below.
+
+| corpus | tokens after preprocessing |
+|:---:|:---:|
+| Stockmark Web Corpus (this dataset will not be released) | 8.8 billion |
+| Patent | 37.5 billion |
+| Wikipedia | 1.5 billion |
+| mC4 | 52.6 billion |
+| CommonCrawl (snapshot: 2020-50 ~ 2024-10) | 203.7 billion |
+
+English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).
+
+## Environment
+- GPU: 48 nodes of 8*H100 instances
+- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+
+## License
+[MIT](https://opensource.org/licenses/MIT)
+
+## Developed by
+[Stockmark Inc.](https://stockmark.co.jp/)
+
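As a quick sanity check on the numbers this commit adds, the script below (a standalone sketch, not part of the README) sums the Japanese-corpus table and the GPU count. The English-token figure is only inferred as the remainder of the stated ~910B total; the commit itself does not break it down.

```python
# Token counts (in billions) from the Japanese-corpus table added in this commit.
japanese_corpora = {
    "Stockmark Web Corpus": 8.8,
    "Patent": 37.5,
    "Wikipedia": 1.5,
    "mC4": 52.6,
    "CommonCrawl (2020-50 ~ 2024-10)": 203.7,
}

japanese_total = sum(japanese_corpora.values())
print(f"Japanese tokens: {japanese_total:.1f}B")  # 304.1B

# The README states ~910B tokens total (Japanese + English), so the share
# sampled from RedPajama-Data is roughly the remainder (an inference, not
# a figure stated in the commit).
english_estimate = 910 - japanese_total
print(f"English tokens (inferred): ~{english_estimate:.1f}B")

# Training environment: 48 nodes of 8*H100 each.
print(f"Total GPUs: {48 * 8}")  # 384
```

So roughly a third of the 910B tokens are Japanese, with CommonCrawl contributing about two thirds of the Japanese share.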