---
license: apache-2.0
language: 
- multilingual
---

# Glot500 (base-sized model) 

Glot500-m is a multilingual model pre-trained on 500+ languages using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/pdf/2305.12182.pdf) (ACL 2023) and first released in [this repository](https://github.com/cisnlp/Glot500).


## Usage

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
```
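
The `fill-mask` pipeline returns a list of candidate fills, each a dict with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to rank and print them; `top_predictions` and `demo` are hypothetical helper names introduced here for illustration, not part of the Glot500 release.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


def demo():
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="cis-lmu/glot500-base")
    # top_k controls how many candidates the pipeline returns.
    results = unmasker("Hello I'm a <mask> model.", top_k=5)
    for token, score in top_predictions(results):
        print(f"{token!r}: {score}")
```

Calling `demo()` downloads the model on first use and prints the three most likely fills with their scores.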


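Here is how to run a forward pass on a given text in PyTorch. Because the model is loaded with `AutoModelForMaskedLM`, `output.logits` holds the masked-LM scores over the vocabulary; pass `output_hidden_states=True` to the model call to also get hidden-state features: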

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# forward pass
output = model(**encoded_input)
```
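
If a single vector per text is needed, one common option is to average the last-layer hidden states over non-padding tokens. The sketch below assumes mean pooling, which is an illustrative choice and not necessarily the method used in the Glot500 paper; `mean_pool` and `embed` are hypothetical helper names.

```python
import torch


def mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts):
    # Imported here so mean_pool can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")
    model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        output = model(**batch, output_hidden_states=True)

    # hidden_states[-1] is the final encoder layer, before the MLM head.
    return mean_pool(output.hidden_states[-1], batch["attention_mask"])
```

For example, `embed(["Hello world", "Bonjour le monde"])` returns a tensor with one row per input text and one column per hidden dimension.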

### BibTeX entry and citation info

```bibtex
@inproceedings{imani-etal-2023-glot500,
    title = "Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages",
    author = "Imani, Ayyoob and Lin, Peiqin and Kargaran, Amir Hossein and Severini, Silvia and Sabet, Masoud Jalili and Kassner, Nora and Ma, Chunlan and Schmid, Helmut and Martins, André and Yvon, François and Sch{\"u}tze, Hinrich",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    year = "2023",
    url = "https://arxiv.org/abs/2305.12182",
}
```