---
license: apache-2.0
datasets:
- giga_fren
- opus100
language:
- fr
- en
---


# Model Card for fr_en-t5-small


This model is a reduced version of mT5-small, optimized for French and English while minimizing overall size. I retained only the parameters and tokens relevant to these two languages, with the goal of keeping performance close to that of the original mT5.

## Model Details
I followed the method outlined in a [blog post](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90) by David Dale to downsize the multilingual T5 model for French and English use cases. Using the giga_fren dataset to select the tokens to keep, I reduced the vocabulary and decreased the model size by 67% and the tokenizer size by 80%.
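
The general idea behind this kind of vocabulary pruning can be sketched in plain Python. This is a minimal illustration, not the exact code used to build this model: the function names (`select_kept_ids`, `prune_embedding`, `remap`), the toy token ids, and the assumption that ids 0 to 2 are special tokens are all hypothetical. In practice the same row selection would be applied to the model's input embedding and output projection, and the tokenizer's vocabulary would be rebuilt to match.

```python
from collections import Counter


def select_kept_ids(tokenized_corpus, always_keep, max_tokens):
    """Pick the token ids to keep: special tokens first, then the most frequent ones."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    kept = list(dict.fromkeys(always_keep))  # de-duplicate while preserving order
    seen = set(kept)
    for tok, _ in counts.most_common():
        if len(kept) >= max_tokens:
            break
        if tok not in seen:
            kept.append(tok)
            seen.add(tok)
    return kept


def prune_embedding(embedding, kept_ids):
    """Keep only the rows of kept tokens; new id i corresponds to old id kept_ids[i]."""
    return [embedding[old_id] for old_id in kept_ids]


def remap(token_ids, kept_ids, unk_new_id):
    """Translate old token ids to new ids, sending dropped tokens to the unknown id."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    return [old_to_new.get(tok, unk_new_id) for tok in token_ids]


# Toy data: a 10-token vocabulary with 2-dimensional embeddings.
corpus = [[5, 7, 7, 9], [7, 5, 3]]
special_ids = [0, 1, 2]  # assumed special tokens (pad, eos, unk)
embedding = [[float(i), float(i)] for i in range(10)]

kept_ids = select_kept_ids(corpus, special_ids, max_tokens=5)
small_embedding = prune_embedding(embedding, kept_ids)
new_ids = remap([7, 5, 9], kept_ids, unk_new_id=2)
```

The size reduction comes from the embedding tables shrinking from one row per original token to one row per kept token, while the rest of the network is left unchanged.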

### Model Description

- **Developed by:** Korventenn
- **Model type:** mt5
- **Language(s) (NLP):** French and English
- **License:** Apache 2.0
- **Generated from model:** mt5-small

### Model Sources


- **Repository:** https://colab.research.google.com/drive/1ag0u1WKdvuBeYTz1TrPAGucumiaYmqeW?usp=sharing

## Uses

You can use the raw model for any sequence-to-sequence task focused on French, English, or both.
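
As a starting point for such a task, the sketch below turns translation records into (source, target) text pairs for fine-tuning. It assumes records shaped like `{"translation": {"fr": ..., "en": ...}}`, which is how translation pairs are commonly exposed by Hugging Face datasets such as opus100. The helper name `to_seq2seq_pairs`, the task prefix, and the sample sentences are illustrative assumptions, not part of this model.

```python
def to_seq2seq_pairs(records, src="fr", tgt="en",
                     prefix="translate French to English: "):
    """Build (input_text, target_text) pairs, skipping records with an empty side."""
    pairs = []
    for record in records:
        translation = record["translation"]
        source = translation[src].strip()
        target = translation[tgt].strip()
        if source and target:
            pairs.append((prefix + source, target))
    return pairs


# Made-up sample records in the assumed layout.
sample = [
    {"translation": {"fr": "Bonjour le monde.", "en": "Hello world."}},
    {"translation": {"fr": "   ", "en": "Empty source is skipped."}},
]

pairs = to_seq2seq_pairs(sample)
```

The resulting strings can then be passed through the tokenizer to produce model inputs and labels. The prefix is an arbitrary choice here; what matters is using the same one during fine-tuning and inference.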


## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the reduced French/English tokenizer and the matching model weights.
tokenizer = AutoTokenizer.from_pretrained("Korventenn/fr_en-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("Korventenn/fr_en-t5-small")
```

## Training Data


- [giga_fren](https://huggingface.co/datasets/giga_fren)
- [opus100](https://huggingface.co/datasets/opus100)