xiafanzeng committed on
Commit 92dac1d
1 Parent(s): 773aba7

Create README.md

Files changed (1): README.md +44 -0
README.md ADDED

# mGPT

mGPT is pre-trained on the [mC4 dataset](https://huggingface.co/datasets/mc4) using a causal language modeling objective. It was introduced in this [paper](https://arxiv.org/abs/2110.06609) and first released on this page.

## Model description

mGPT is a Transformer-based model pre-trained on massive multilingual data covering over 101 languages. Similar to GPT-2, it was pre-trained on raw text only, with no human labeling. We use the same tokenization and vocabulary as the [mT5 model](https://huggingface.co/google/mt5-base).
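
Because the tokenizer is shared with mT5, this can be checked directly. The snippet below is a small sanity check, not part of the original release; it assumes the tokenizer files are hosted alongside the model as in the usage example further down.

```python
# Illustrative check: mGPT should tokenize text exactly like mT5,
# since the two models share the same sentencepiece vocabulary.
from transformers import MT5Tokenizer

mgpt_tokenizer = MT5Tokenizer.from_pretrained("THUMT/mGPT")
mt5_tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")

sample = "Multilingual pre-training covers many languages."
assert mgpt_tokenizer.tokenize(sample) == mt5_tokenizer.tokenize(sample)
```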

## Intended uses

You can use the raw model for text generation, or use prompts to adapt it to a downstream task (see the prompting sketch after the generation example below).

## How to use

You can use this model directly with a pipeline for text generation. Here is how to use this model to generate text in PyTorch:

```python
from transformers import MT5Tokenizer, GPT2LMHeadModel, TextGenerationPipeline

# Load the shared mT5 tokenizer and the GPT-2-style language model.
tokenizer = MT5Tokenizer.from_pretrained("THUMT/mGPT")
model = GPT2LMHeadModel.from_pretrained("THUMT/mGPT")

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
text = "Replace me by any text you'd like."
# Sample a continuation; max_length matches the model's 1,024-token input length.
text = pipeline(text, do_sample=True, max_length=1024)[0]["generated_text"]
```
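
For prompt-based use, the same pipeline can be reused by phrasing a downstream task as text continuation. The snippet below is only a minimal sketch: the prompt format is an illustrative assumption, not the multi-stage prompting method described in the paper.

```python
from transformers import MT5Tokenizer, GPT2LMHeadModel, TextGenerationPipeline

tokenizer = MT5Tokenizer.from_pretrained("THUMT/mGPT")
model = GPT2LMHeadModel.from_pretrained("THUMT/mGPT")
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)

# Frame a task (here, translation) as a continuation prompt.
# The prompt format is illustrative, not the one from the paper.
prompt = "English: How are you? German:"
output = pipeline(prompt, do_sample=False, max_length=64)[0]["generated_text"]
print(output)
```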

## Preprocessing

The texts are tokenized using `sentencepiece` with a vocabulary size of 250,100. The inputs are sequences of 1,024 consecutive tokens. We use `<extra_id_0>` to separate lines in a document.
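
The following is a minimal sketch of this preprocessing, reusing the tokenizer from the usage example above; the chunking details are an assumption and may differ from the original pipeline.

```python
from transformers import MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("THUMT/mGPT")

# Separate the lines of a document with <extra_id_0>, as described above.
document_lines = ["First line of a document.", "Second line of the same document."]
text = "<extra_id_0>".join(document_lines)

# Tokenize with the shared sentencepiece vocabulary and split into
# 1,024-token inputs (the chunking strategy here is an assumption).
ids = tokenizer(text)["input_ids"]
chunks = [ids[i:i + 1024] for i in range(0, len(ids), 1024)]
```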

## BibTeX entry and citation info

```bibtex
@misc{tan2021msp,
  title={MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators},
  author={Zhixing Tan and Xiangwen Zhang and Shuo Wang and Yang Liu},
  year={2021},
  eprint={2110.06609},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```