omitakahiro committed 5d4af2f (parent: 506a6ed)

Update README.md

README.md CHANGED
@@ -21,7 +21,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-100b")
 model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-100b", device_map="auto", torch_dtype=torch.bfloat16)
 
-
+input_ids = tokenizer("人工知能とは、", return_tensors="pt").input_ids.to(model.device)
 with torch.inference_mode():
     tokens = model.generate(
         input_ids,
@@ -33,4 +33,29 @@ with torch.inference_mode():
 
 output = tokenizer.decode(tokens[0], skip_special_tokens=True)
 print(output)
-```
+```
+
+## Dataset (pretraining)
+
+Stockmark-100b was trained on a total of about 910B tokens of Japanese and English text. The Japanese data is summarized in the table below.
+
+| corpus | tokens after preprocessing |
+|:---:|:---:|
+| Stockmark Web Corpus (this dataset will not be released) | 8.8 billion |
+| Patent | 37.5 billion |
+| Wikipedia | 1.5 billion |
+| mC4 | 52.6 billion |
+| CommonCrawl (snapshot: 2020-50 ~ 2024-10) | 203.7 billion |
+
+English data is sampled from [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1).
+
+## Environment
+- GPU: 48 nodes of 8*H100 instances
+- Library: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
+
+## License
+[MIT](https://opensource.org/licenses/MIT)
+
+## Developed by
+[Stockmark Inc.](https://stockmark.co.jp/)
+
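As a quick sanity check on the numbers this commit adds, the script below (a standalone sketch, not part of the README) sums the Japanese-corpus table and the GPU count. The English-token figure is only inferred as the remainder of the stated ~910B total; the commit itself does not break it down.

```python
# Token counts (in billions) from the Japanese-corpus table added in this commit.
japanese_corpora = {
    "Stockmark Web Corpus": 8.8,
    "Patent": 37.5,
    "Wikipedia": 1.5,
    "mC4": 52.6,
    "CommonCrawl (2020-50 ~ 2024-10)": 203.7,
}

japanese_total = sum(japanese_corpora.values())
print(f"Japanese tokens: {japanese_total:.1f}B")  # 304.1B

# The README states ~910B tokens total (Japanese + English), so the share
# sampled from RedPajama-Data is roughly the remainder (an inference, not
# a figure stated in the commit).
english_estimate = 910 - japanese_total
print(f"English tokens (inferred): ~{english_estimate:.1f}B")

# Training environment: 48 nodes of 8*H100 each.
print(f"Total GPUs: {48 * 8}")  # 384
```

So roughly a third of the 910B tokens are Japanese, with CommonCrawl contributing about two thirds of the Japanese share.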