Update README.md
README.md (CHANGED)

---
license: apache-2.0
language:
- multilingual
- en
- ru
- es
# …
- msb
library_name: transformers
tags:
- text2text-generation
- text-generation-inference
datasets:
- allenai/MADLAD-400
pipeline_tag: translation

widget:
- text: "<2en> Como vai, amigo?"
  example_title: "Translation to English"
- text: "<2de> Do you speak German?"
  example_title: "Translation to German"

---

# Model Card for MADLAD-400-7B-MT-BT

# Table of Contents

0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

# TL;DR

MADLAD-400-7B-MT-BT is a multilingual machine translation model based on the T5 architecture. It was
trained on 250 billion tokens covering over 450 languages using publicly available data, and it is
competitive with models that are significantly larger.

It is a version of the 7.2B parameter model fine-tuned on backtranslated data. The authors note in the [paper](https://arxiv.org/pdf/2309.04662.pdf) that:

> While this setup is very likely sub-optimal, we see that back-translation
> greatly improves en2xx translation (by 3.0 chrf, in the case of Flores-200) in most cases.

**Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
the original weights and wrote the contents of this model card based on the original paper and Flan-T5.

# Model Details

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** Multilingual (400+ languages)
- **License:** Apache 2.0
- **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
- **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/abs/2309.04662)
  - [GitHub Repo](https://github.com/google-research/t5x)
  - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)

# Usage

Below are some example scripts showing how to use the model.

## Using the PyTorch model with `transformers`

### Running the model on a CPU or GPU

<details>
<summary> Click to expand </summary>

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt-bt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# The <2xx> prefix selects the target language (here Portuguese).
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu adoro pizza!
```

</details>
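
The `<2xx>` prefix also works for batched inputs. As a minimal sketch (assuming the same checkpoint as above; loading in `float16` is only an assumption to reduce memory), the following translates one sentence into several target languages in a single `generate` call:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt-bt'
model = T5ForConditionalGeneration.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16  # float16 is an assumption to save memory
)
tokenizer = T5Tokenizer.from_pretrained(model_name)

sentence = "I love pizza!"
targets = ["pt", "de", "ru"]  # any supported <2xx> language tags

# Prepend one <2xx> tag per target language and batch the prompts together.
prompts = [f"<2{lang}> {sentence}" for lang in targets]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=64)
for lang, out in zip(targets, outputs):
    print(lang, tokenizer.decode(out, skip_special_tokens=True))
```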

## Running the model with Candle

<details>
<summary> Click to expand </summary>

Usage with [candle](https://github.com/huggingface/candle):

```bash
$ cargo run --example t5 --release -- \
  --model-id "jbochi/madlad400-7b-mt-bt" \
  --prompt "<2de> How are you, my friend?" \
  --decode --temperature 0
```

We also provide a quantized model (1.65 GB vs the original 11.8 GB file):

```bash
cargo run --example quantized-t5 --release -- \
  --model-id "jbochi/madlad400-7b-mt-bt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0
...
Wie geht es dir, mein Freund?
```

</details>

# Uses

## Direct Use and Downstream Use

> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
> Primary intended users: Research community.

## Out-of-Scope Use

> These models are trained on general domain data and are therefore not meant to
> work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
> for production usecases.

# Bias, Risks, and Limitations

> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
> usecase.

## Ethical considerations and risks

> We trained these models with MADLAD-400 and publicly available data to create baseline models that
> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
> output for certain domains. Moreover, large models are dual use technologies that have specific risks
> associated with their use and development. We point the reader to surveys such as those written by
> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
> et al. for a thorough discussion of the risks of machine translation systems.

## Known Limitations

More information needed

## Sensitive Use:

More information needed

# Training Details

> We train models of various sizes: a 3B, 32-layer parameter model,
> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
> We share all parameters of the model across language pairs,
> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
> side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
> language.

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

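As a minimal sketch (not from the paper; it assumes the `jbochi/madlad400-7b-mt-bt` checkpoint ships the SentencePiece vocabulary described above), you can inspect the `<2xx>` target-language tags directly from the tokenizer:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jbochi/madlad400-7b-mt-bt")

# Tokens of the form <2xx> select the target language, e.g. <2de> for German.
lang_tags = sorted(t for t in tokenizer.get_vocab() if t.startswith("<2"))

print(len(tokenizer))   # total vocabulary size
print(len(lang_tags))   # number of target-language tags
print(lang_tags[:10])   # a few example tags
```
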
## Training Data

> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
> model, a combination of parallel datasources covering 157 languages is also used. Further details are
> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

## Training Procedure

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Evaluation

## Testing Data, Factors & Metrics

> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
> domain, though we have not assessed this.

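The paper reports translation quality in chrF (see the TL;DR quote above). As an illustrative sketch only (sacrebleu is not mentioned in this card, and the strings below are made-up examples), a chrF score for a handful of outputs can be computed like this:

```python
# Illustrative only: scoring hypothetical model outputs against references with chrF via sacrebleu.
from sacrebleu.metrics import CHRF

hypotheses = ["Wie geht es dir, mein Freund?"]      # example model outputs
references = [["Wie geht es dir, mein Freund?"]]    # one reference stream, aligned with the hypotheses

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references))    # e.g. "chrF2 = 100.00"
```
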
## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Environmental Impact

More information needed

# Citation

**BibTeX:**

```bibtex
@misc{kudugunta2023madlad400,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```