Update README.md
README.md (CHANGED)

---
license: apache-2.0
language:
- multilingual
- en
- ru
- es
# …
- msb
library_name: transformers
tags:
- text2text-generation
- text-generation-inference
datasets:
- allenai/MADLAD-400
pipeline_tag: translation

widget:
- text: "<2en> Como vai, amigo?"
  example_title: "Translation to English"
- text: "<2de> Do you speak German?"
  example_title: "Translation to German"

---

# Model Card for MADLAD-400-7B-MT-BT

# Table of Contents

0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

# TL;DR

MADLAD-400-7B-MT-BT is a multilingual machine translation model based on the T5 architecture. It was
trained on 250 billion tokens covering over 450 languages using publicly available data, and it is
competitive with models that are significantly larger.

It is a version of the 7.2B parameter model fine-tuned on backtranslated data. The authors note in the [paper](https://arxiv.org/pdf/2309.04662.pdf) that:

> While this setup is very likely sub-optimal, we see that back-translation
> greatly improves en2xx translation (by 3.0 chrf, in the case of Flores-200) in most cases.

**Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
the original weights and wrote the contents of this model card based on the original paper and Flan-T5.

# Model Details

## Model Description

- **Model type:** Language model
- **Language(s) (NLP):** Multilingual (400+ languages)
- **License:** Apache 2.0
- **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
- **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/abs/2309.04662)
  - [GitHub Repo](https://github.com/google-research/t5x)
  - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)

# Usage

Below are some example scripts showing how to use the model.

## Using the PyTorch model with `transformers`

### Running the model on a CPU or GPU

<details>
<summary> Click to expand </summary>

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt-bt'
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name)

# The <2xx> prefix selects the target language (here Portuguese).
text = "<2pt> I love pizza!"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids)

tokenizer.decode(outputs[0], skip_special_tokens=True)
# Eu adoro pizza!
```

</details>
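
The `<2xx>` prefix also works for batched inputs. As a minimal sketch (assuming the same checkpoint as above; loading in `float16` is only an assumption to reduce memory), the following translates one sentence into several target languages in a single `generate` call:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = 'jbochi/madlad400-7b-mt-bt'
model = T5ForConditionalGeneration.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16  # float16 is an assumption to save memory
)
tokenizer = T5Tokenizer.from_pretrained(model_name)

sentence = "I love pizza!"
targets = ["pt", "de", "ru"]  # any supported <2xx> language tags

# Prepend one <2xx> tag per target language and batch the prompts together.
prompts = [f"<2{lang}> {sentence}" for lang in targets]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=64)
for lang, out in zip(targets, outputs):
    print(lang, tokenizer.decode(out, skip_special_tokens=True))
```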

## Running the model with Candle

<details>
<summary> Click to expand </summary>

Usage with [candle](https://github.com/huggingface/candle):

```bash
$ cargo run --example t5 --release -- \
  --model-id "jbochi/madlad400-7b-mt-bt" \
  --prompt "<2de> How are you, my friend?" \
  --decode --temperature 0
```

We also provide a quantized model (1.65 GB vs the original 11.8 GB file):

```bash
cargo run --example quantized-t5 --release -- \
  --model-id "jbochi/madlad400-7b-mt-bt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0
...
Wie geht es dir, mein Freund?
```

</details>

# Uses

## Direct Use and Downstream Use

> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
> Primary intended users: Research community.

## Out-of-Scope Use

> These models are trained on general domain data and are therefore not meant to
> work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
> for production usecases.

# Bias, Risks, and Limitations

> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
> usecase.

## Ethical considerations and risks

> We trained these models with MADLAD-400 and publicly available data to create baseline models that
> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
> output for certain domains. Moreover, large models are dual use technologies that have specific risks
> associated with their use and development. We point the reader to surveys such as those written by
> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
> et al. for a thorough discussion of the risks of machine translation systems.

## Known Limitations

More information needed

## Sensitive Use:

More information needed

# Training Details

> We train models of various sizes: a 3B, 32-layer parameter model,
> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
> We share all parameters of the model across language pairs,
> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
> side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
> language.

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

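As a minimal sketch (not from the paper; it assumes the `jbochi/madlad400-7b-mt-bt` checkpoint ships the SentencePiece vocabulary described above), you can inspect the `<2xx>` target-language tags directly from the tokenizer:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("jbochi/madlad400-7b-mt-bt")

# Tokens of the form <2xx> select the target language, e.g. <2de> for German.
lang_tags = sorted(t for t in tokenizer.get_vocab() if t.startswith("<2"))

print(len(tokenizer))   # total vocabulary size
print(len(lang_tags))   # number of target-language tags
print(lang_tags[:10])   # a few example tags
```
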
## Training Data

> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
> model, a combination of parallel datasources covering 157 languages is also used. Further details are
> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

## Training Procedure

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Evaluation

## Testing Data, Factors & Metrics

> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
> domain, though we have not assessed this.

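The paper reports translation quality in chrF (see the TL;DR quote above). As an illustrative sketch only (sacrebleu is not mentioned in this card, and the strings below are made-up examples), a chrF score for a handful of outputs can be computed like this:

```python
# Illustrative only: scoring hypothetical model outputs against references with chrF via sacrebleu.
from sacrebleu.metrics import CHRF

hypotheses = ["Wie geht es dir, mein Freund?"]      # example model outputs
references = [["Wie geht es dir, mein Freund?"]]    # one reference stream, aligned with the hypotheses

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references))    # e.g. "chrF2 = 100.00"
```
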
## Results

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)

See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

# Environmental Impact

More information needed

# Citation

**BibTeX:**

```bibtex
@misc{kudugunta2023madlad400,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```