---
license: mit
language:
- en
pipeline_tag: text2text-generation
---
# MANTa-LM (base)

Pretrained MANTa-LM (base) model, built on the architecture introduced in the paper [MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling](https://aclanthology.org/2022.findings-emnlp.207.pdf).
<center><img src="https://github.com/NathanGodey/nathangodey.github.io/raw/main/img/posts/full_difftok_schema.png"  width="600"></center>

## Model Details

### Model Description

The MANTa tokenizer mimics, in a differentiable way, the combination of a subword tokenizer and an embedding matrix used in classical language models.
This trainable tokenizer is added as the first layer of an encoder-decoder model and trained end-to-end with the language modeling objective.

Our results show that MANTa-LM only slightly degrades performance compared to an equivalent T5 model on the GLUE benchmark, while being **much more robust** to artificial and user-generated noise.
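To give an intuition of what this first layer does, here is a small, heavily simplified PyTorch sketch: bytes are embedded, a frontier predictor outputs the probability that each byte starts a new block, and bytes are softly pooled into block embeddings that the encoder consumes. All module names, shapes, and the pooling formula below are illustrative assumptions and do not reproduce the actual implementation described in the paper.

```python
import torch
import torch.nn as nn


class ToyDifferentiableTokenizer(nn.Module):
    """Toy sketch of a differentiable tokenization layer (NOT the actual MANTa code)."""

    def __init__(self, d_model: int = 512, n_blocks: int = 128):
        super().__init__()
        self.byte_embeddings = nn.Embedding(256, d_model)   # one embedding per possible byte value
        self.frontier_predictor = nn.Linear(d_model, 1)     # probability that a byte opens a new block
        self.pooler = nn.Linear(d_model, d_model)           # projects pooled blocks to the encoder dimension
        self.n_blocks = n_blocks

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integers in [0, 255]
        h = self.byte_embeddings(byte_ids)                          # (batch, seq_len, d_model)
        p_frontier = torch.sigmoid(self.frontier_predictor(h))      # (batch, seq_len, 1)
        # Expected block index of each byte = cumulative sum of frontier probabilities.
        positions = torch.cumsum(p_frontier.squeeze(-1), dim=-1)    # (batch, seq_len)
        block_ids = torch.arange(self.n_blocks, device=h.device, dtype=h.dtype)
        # Soft assignment of bytes to blocks (a crude stand-in for the paper's pooling kernel).
        weights = torch.softmax(-(positions.unsqueeze(-1) - block_ids) ** 2, dim=-1)
        blocks = torch.einsum("bsd,bsk->bkd", h, weights)           # (batch, n_blocks, d_model)
        return self.pooler(blocks)                                  # block embeddings fed to the encoder


# Example: 2 sequences of 32 random bytes -> 128 block embeddings each
embeddings = ToyDifferentiableTokenizer()(torch.randint(0, 256, (2, 32)))
print(embeddings.shape)  # torch.Size([2, 128, 512])
```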


### Model Sources

- **Paper:** [MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling](https://aclanthology.org/2022.findings-emnlp.207.pdf) (EMNLP 2022 Findings)

## Uses

### Direct Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# trust_remote_code=True is needed because MANTa-LM uses a custom model/tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("almanach/manta-lm-base", trust_remote_code=True)
manta_model = AutoModelForSeq2SeqLM.from_pretrained("almanach/manta-lm-base", trust_remote_code=True)

# T5-style span infilling: the model generates the content of the <extra_id_0> sentinel
tokens = tokenizer("The name of the capital of France is <extra_id_0> and it is a very big city.", return_tensors="pt")
output = manta_model.generate(**tokens, decoder_start_token_id=0)

print(tokenizer.batch_decode(output))
```

### Recommendations

We recommend using a smaller learning rate for the tokenizer module during fine-tuning (byte embeddings, frontier predictor, pooler).
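A minimal sketch of how this can be done with PyTorch optimizer parameter groups is shown below, reusing the `manta_model` loaded above. The name fragments used to select the tokenizer parameters ("byte_embeddings", "frontier_predictor", "pooler") and the learning-rate values are assumptions; check `manta_model.named_parameters()` for the actual parameter names in this checkpoint.

```python
from torch.optim import AdamW

# Assumed name fragments for the MANTa tokenizer submodules; verify against the
# real names with: [n for n, _ in manta_model.named_parameters()]
TOKENIZER_KEYWORDS = ("byte_embeddings", "frontier_predictor", "pooler")

tokenizer_params, other_params = [], []
for name, param in manta_model.named_parameters():
    if any(keyword in name for keyword in TOKENIZER_KEYWORDS):
        tokenizer_params.append(param)
    else:
        other_params.append(param)

optimizer = AdamW([
    {"params": tokenizer_params, "lr": 1e-5},  # smaller learning rate for the tokenizer module
    {"params": other_params, "lr": 1e-4},      # regular fine-tuning learning rate for the rest
])
```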


## Training Details

### Training Data

This model was trained on the C4 dataset.
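For reference, the English split of C4 can be streamed from the Hugging Face Hub as shown below; this is a generic loading snippet and does not reproduce the exact preprocessing used to pretrain MANTa-LM.

```python
from datasets import load_dataset

# Stream the English C4 split to avoid downloading the full corpus
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
print(next(iter(c4))["text"][:200])
```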

### Training Procedure 

The training objective is the same as ByT5's, while most hyperparameters are taken from T5.
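For readers unfamiliar with this objective, the example below illustrates the T5/ByT5-style span-corruption format it follows: selected spans are replaced by sentinel tokens in the input and predicted after the matching sentinels in the target. The span choices shown are made up for illustration, and the exact corruption rate and span lengths used for MANTa-LM are not reproduced here.

```python
# Illustrative span-corruption example (T5/ByT5-style); not the exact MANTa-LM preprocessing.
original = "The quick brown fox jumps over the lazy dog"

# Two spans are (arbitrarily) selected for masking:
corrupted_input = "The quick <extra_id_0> jumps over <extra_id_1> dog"
target = "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"

# The encoder reads `corrupted_input`; the decoder is trained to generate `target`.
```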


## Citation

**BibTeX:**

```bibtex
@inproceedings{godey-etal-2022-manta,
    title = "{MANT}a: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling",
    author = "Godey, Nathan  and
      Castagn{\'e}, Roman  and
      de la Clergerie, {\'E}ric  and
      Sagot, Beno{\^\i}t",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.207",
    pages = "2859--2870",
}
```

## Model Card Authors

- [Nathan Godey](https://nathangodey.github.io/)
- [Roman Castagné](https://romancast.github.io/)