---
license: apache-2.0
datasets:
- oscar-corpus/OSCAR-2109
language:
- en
- pl
pipeline_tag: text-generation
library_name: transformers
---

# B-GPT_en_pl_sequential

The B-GPT models are bilingual GPT-2 style models. This model was trained only on English data for the first half of training and only on Polish data for the second half. By the end of training, 50% of the training data seen by the model is English and 50% is Polish. The tokenizer was trained on the same proportions of English and Polish data.

## Model details

All models are trained with a [CLS] (same as [BOS]) token prepended and a [SEP] (same as [EOS]) token separating sequences. For best results, make sure that [CLS] is prepended to your input sequence (see sample usage linked above).

Details for this model specifically:

* Architecture: gpt2
* Parameters: 124770816
* Maximum sequence length: 512 tokens
* Training text data (raw): [XXXX]
* Training tokens: 12B
* Vocabulary size: 50000
* Compute cost: ~9 NVIDIA A6000 GPU hours
* CO2 emissions: 1.17 kg

Training datasets (percentages prior to deduplication):
* 100%: [OSCAR 2021/09](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)

Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, 128000.
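
The checkpoint schedule is densest just after step 64000, the midpoint of the 128000-step run. The sketch below rebuilds the listed steps from their regular ranges and labels each checkpoint with the language it was most recently trained on. It assumes the switch from English to Polish happens exactly at step 64000, which is inferred from the half-and-half description and not stated explicitly; `SWITCH_STEP` and `training_phase` are illustrative names, not part of any released code.

```python
# Rebuild the checkpoint schedule listed above from its regular ranges.
TOTAL_STEPS = 128_000
SWITCH_STEP = TOTAL_STEPS // 2  # assumption: language switch at the midpoint

checkpoint_steps = (
    list(range(0, 50_001, 10_000))          # every 10,000 steps up to 50,000
    + list(range(64_000, 64_200, 10))       # every 10 steps from 64,000 to 64,190
    + list(range(64_200, 65_000, 100))      # every 100 steps from 64,200 to 64,900
    + list(range(65_000, 70_000, 1_000))    # every 1,000 steps from 65,000 to 69,000
    + list(range(70_000, 120_001, 10_000))  # every 10,000 steps from 70,000 to 120,000
    + [TOTAL_STEPS]                         # final checkpoint
)


def training_phase(step: int) -> str:
    """Return the language of the data seen most recently before `step`."""
    if step == 0:
        return "untrained"
    return "english" if step <= SWITCH_STEP else "polish"


if __name__ == "__main__":
    for step in checkpoint_steps:
        print(step, training_phase(step))
```

Under this assumption, the checkpoint at step 64000 is the last one that has seen only English data, and the closely spaced checkpoints that follow track the first steps of Polish training.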

## Use This Model

Load the model:

```
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model with its language-modeling head.
tokenizer = AutoTokenizer.from_pretrained("B-GPT_en_pl_sequential")
model = AutoModelForCausalLM.from_pretrained("B-GPT_en_pl_sequential")
```

Text generation:

```
from transformers import pipeline

# Build a text-generation pipeline and complete a short prompt.
pipe = pipeline("text-generation", model="B-GPT_en_pl_sequential")

pipe("I am a")
```
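
The model details recommend prepending [CLS] to the input sequence. One way to make that explicit is to add the token at the ID level before calling `generate`. This is a minimal sketch, not the authors' reference usage: it assumes the tokenizer exposes [CLS] through `cls_token_id`, and `prepend_cls` and `generate_with_cls` are hypothetical helper names.

```python
from typing import List


def prepend_cls(token_ids: List[int], cls_id: int) -> List[int]:
    """Return token_ids with cls_id in front, without duplicating it."""
    if token_ids and token_ids[0] == cls_id:
        return list(token_ids)
    return [cls_id] + list(token_ids)


def generate_with_cls(model_id: str, prompt: str, max_new_tokens: int = 20) -> str:
    """Generate a continuation for `prompt` with [CLS] prepended.

    Imports are local so prepend_cls can be used without transformers installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Assumption: the tokenizer defines [CLS] and exposes it as cls_token_id.
    cls_id = tokenizer.cls_token_id
    if cls_id is None:
        raise ValueError("Tokenizer does not define a [CLS] token")

    # Tokenize without automatic special tokens, then add [CLS] explicitly.
    ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    input_ids = torch.tensor([prepend_cls(ids, cls_id)])

    with torch.no_grad():
        output = model.generate(
            input_ids=input_ids,
            attention_mask=torch.ones_like(input_ids),
            max_new_tokens=max_new_tokens,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the model ID used in the examples above, a call would look like `generate_with_cls("B-GPT_en_pl_sequential", "I am a")`. The helper is idempotent, so it is safe even if the token IDs already start with [CLS].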

## Citation

If you use this model, please cite:

```


```