jbochi committed on
Commit
9522f21
1 Parent(s): 2e0d2e9

Update README.md

Files changed (1)
  1. README.md +191 -10
README.md CHANGED
@@ -1,6 +1,7 @@
1
  ---
2
  license: apache-2.0
3
  language:
 
4
  - en
5
  - ru
6
  - es
@@ -421,32 +422,212 @@ language:
421
  - msb
422
  library_name: transformers
423
  tags:
 
424
  - text-generation-inference
425
  datasets:
426
  - allenai/MADLAD-400
427
  pipeline_tag: translation
428
  ---
429
 
430
- T5ForConditionalGeneration files for Google's [Madlad-400](https://github.com/google-research/google-research/tree/master/madlad_400) 7.2B parameter MT-BT model.
431
 
432
- Article: [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662)
 
433
 
434
- Abstract:
435
 
436
- > We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train a 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.
437
 
438
  ```python
439
  from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig
440
 
441
- model = T5ForConditionalGeneration.from_pretrained('jbochi/madlad400-7b-mt-bt')
442
- tokenizer = T5Tokenizer.from_pretrained('jbochi/madlad400-7b-mt-bt')
 
443
 
444
- text = "<2it> I love pizza!"
445
- input_ids = tokenizer(text, return_tensors="pt").input_ids
446
  outputs = model.generate(input_ids=input_ids)
447
 
448
  tokenizer.decode(outputs[0], skip_special_tokens=True)
449
- # Adoro la pizza!
450
  ```
451
 
452
- Colab to generate these files is [here](https://colab.research.google.com/drive/1rZ2NRyl2zwmg0sQ2Wi-uZZF48iVYulTC#scrollTo=pVODoE6gA9sw).

1
  ---
2
  license: apache-2.0
3
  language:
4
+ - multilingual
5
  - en
6
  - ru
7
  - es
 
422
  - msb
423
  library_name: transformers
424
  tags:
425
+ - text2text-generation
426
  - text-generation-inference
427
  datasets:
428
  - allenai/MADLAD-400
429
  pipeline_tag: translation
430
+
431
+ widget:
432
+ - text: "<2en> Como vai, amigo?"
433
+ example_title: "Translation to English"
434
+ - text: "<2de> Do you speak German?"
435
+ example_title: "Translation to German"
436
+
437
  ---
438
 
439
+ # Model Card for MADLAD-400-7B-MT-BT
440
+
441
+ # Table of Contents
442
+
443
+ 0. [TL;DR](#tldr)
444
+ 1. [Model Details](#model-details)
445
+ 2. [Usage](#usage)
446
+ 3. [Uses](#uses)
447
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
448
+ 5. [Training Details](#training-details)
449
+ 6. [Evaluation](#evaluation)
450
+ 7. [Environmental Impact](#environmental-impact)
451
+ 8. [Citation](#citation)
452
+
453
+ # TL;DR
454
+
455
+ MADLAD-400-7B-MT-BT is a multilingual machine translation model based on the T5 architecture that was
456
+ trained on 250 billion tokens covering over 450 languages using publicly available data.
457
+ It is competitive with models that are significantly larger.
458
+
459
+ It is a version of the 7.2B-parameter model fine-tuned on back-translated data. The authors note in the [paper](https://arxiv.org/pdf/2309.04662.pdf):
460
+
461
+ > While this setup is very likely sub-optimal, we see that back-translation
462
+ > greatly improves en2xx translation (by 3.0 chrf, in the case of Flores-200) in most cases.
463
 
464
+ **Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
465
+ the original weights and wrote the contents of this model card based on the original paper and Flan-T5.
466
 
467
+ # Model Details
468
 
469
+ ## Model Description
470
+
471
+ - **Model type:** Language model
472
+ - **Language(s) (NLP):** Multilingual (400+ languages)
473
+ - **License:** Apache 2.0
474
+ - **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
475
+ - **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
476
+ - **Resources for more information:**
477
+ - [Research paper](https://arxiv.org/abs/2309.04662)
478
+ - [GitHub Repo](https://github.com/google-research/t5x)
479
+ - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)
480
+
481
+ # Usage
482
+
483
+ The example scripts below show how to use the model:
484
+
485
+ ## Using the PyTorch model with `transformers`
486
+
487
+ ### Running the model on a CPU or GPU
488
+
489
+ <details>
490
+ <summary> Click to expand </summary>
491
 
492
  ```python
493
  from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig
494
 
495
+ model_name = 'jbochi/madlad400-7b-mt-bt'
496
+ model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")  # device_map requires the `accelerate` package
497
+ tokenizer = T5Tokenizer.from_pretrained(model_name)
498
 
499
+ text = "<2pt> I love pizza!"
500
+ input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
501
  outputs = model.generate(input_ids=input_ids)
502
 
503
  tokenizer.decode(outputs[0], skip_special_tokens=True)
504
+ # Eu adoro pizza!
505
+ ```
506
+
507
+ </details>
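As a hedged sketch building on the example above, several prompts can be translated in one `generate` call by letting the tokenizer pad them into a batch. The helper name `translate_batch` and the `max_new_tokens` default are illustrative assumptions, not part of the original model card; `model` and `tokenizer` are assumed to be the objects loaded in the `transformers` example.

```python
from typing import List, Sequence


def translate_batch(model, tokenizer, prompts: Sequence[str],
                    max_new_tokens: int = 128) -> List[str]:
    """Translate several prompts in a single generate() call.

    Each prompt must already start with its `<2xx>` target-language token,
    e.g. "<2pt> I love pizza!". The name and default values here are
    hypothetical choices made for illustration.
    """
    # padding=True pads every prompt to the longest one so the batch is rectangular.
    inputs = tokenizer(list(prompts), return_tensors="pt", padding=True)
    # Move the encoded batch to the device the model weights live on.
    inputs = inputs.to(model.device)
    # Passing the whole encoding forwards input_ids together with attention_mask.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example call (needs the model and tokenizer loaded as shown above):
# translate_batch(model, tokenizer, ["<2pt> I love pizza!", "<2de> Do you speak German?"])
```

Passing the full encoding rather than only `input_ids` matters for batches, because the attention mask tells the model which positions are padding.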
508
+
509
+ ## Running the model with Candle
510
+
511
+ <details>
512
+ <summary> Click to expand </summary>
513
+
514
+ Usage with [candle](https://github.com/huggingface/candle):
515
+
516
+ ```bash
517
+ $ cargo run --example t5 --release -- \
518
+ --model-id "jbochi/madlad400-7b-mt-bt" \
519
+ --prompt "<2de> How are you, my friend?" \
520
+ --decode --temperature 0
521
+ ```
522
+
523
+ We also provide a quantized model (1.65 GB vs the original 11.8 GB file):
524
+
525
  ```
526
+ cargo run --example quantized-t5 --release -- \
527
+ --model-id "jbochi/madlad400-7b-mt-bt" --weight-file "model-q4k.gguf" \
528
+ --prompt "<2de> How are you, my friend?" \
529
+ --temperature 0
530
+ ...
531
+ Wie geht es dir, mein Freund?
532
+ ```
533
+
534
+ </details>
535
+
536
+
537
+ # Uses
538
+
539
+ ## Direct Use and Downstream Use
540
+
541
+ > Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
542
+ > Primary intended users: Research community.
543
+
544
+ ## Out-of-Scope Use
545
+
546
+ > These models are trained on general domain data and are therefore not meant to
547
+ > work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
548
+ > for production usecases.
549
+
550
+ # Bias, Risks, and Limitations
551
+
552
+ > We note that we evaluate on only 204 of the languages supported by these models and on machine translation
553
+ > and few-shot machine translation tasks. Users must consider use of this model carefully for their own
554
+ > usecase.
555
+
556
+ ## Ethical considerations and risks
557
+
558
+ > We trained these models with MADLAD-400 and publicly available data to create baseline models that
559
+ > support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
560
+ > Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
561
+ > otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
562
+ > underlying training data may cause differences in model performance and toxic (or otherwise problematic)
563
+ > output for certain domains. Moreover, large models are dual use technologies that have specific risks
564
+ > associated with their use and development. We point the reader to surveys such as those written by
565
+ > Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
566
+ > et al. for a thorough discussion of the risks of machine translation systems.
567
+
568
+ ## Known Limitations
569
+
570
+ More information needed
571
+
572
+ ## Sensitive Use
573
+
574
+ More information needed
575
+
576
+ # Training Details
577
+
578
+ > We train models of various sizes: a 3B, 32-layer parameter model,
579
+ > a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
580
+ > We share all parameters of the model across language pairs,
581
+ > and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
582
+ > side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
583
+ > language.
584
+
585
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
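The `<2xx>` convention quoted above can be illustrated with a small sketch. The helper names `add_target_token` and `split_target_token`, and the rule that a language code contains no whitespace or angle brackets, are assumptions made for this example; they are not taken from the paper or from the `transformers` API.

```python
import re
from typing import Tuple

# Assumed shape of a prompt: "<2" + language code + ">" followed by the source text.
_TARGET_TAG = re.compile(r"^<2([^<>\s]+)>\s*(.*)$", re.DOTALL)


def add_target_token(sentence: str, target_lang: str) -> str:
    """Prepend the `<2xx>` token that selects the target language."""
    if not target_lang or re.search(r"[<>\s]", target_lang):
        raise ValueError(f"invalid language code: {target_lang!r}")
    return f"<2{target_lang}> {sentence.strip()}"


def split_target_token(prompt: str) -> Tuple[str, str]:
    """Return (language code, source text) for a tagged prompt."""
    match = _TARGET_TAG.match(prompt)
    if match is None:
        raise ValueError("prompt does not start with a <2xx> token")
    return match.group(1), match.group(2)


print(add_target_token("I love pizza!", "pt"))
print(split_target_token("<2de> How are you, my friend?"))
```

Validating the tag before tokenization is a cheap way to catch prompts that would otherwise be sent to the model without a target language.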
586
+
587
+ ## Training Data
588
+
589
+ > For both the machine translation and language model, MADLAD-400 is used. For the machine translation
590
+ > model, a combination of parallel datasources covering 157 languages is also used. Further details are
591
+ > described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
592
+
593
+ ## Training Procedure
594
+
595
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
596
+
597
+ # Evaluation
598
+
599
+ ## Testing Data, Factors & Metrics
600
+
601
+ > For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
602
+
603
+ > The translation quality of this model varies based on language, as seen in the paper, and likely varies on
604
+ > domain, though we have not assessed this.
605
+
606
+ ## Results
607
+
608
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)
609
+
610
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)
611
+
612
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)
613
+
614
+ See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
615
+
616
+ # Environmental Impact
617
+
618
+ More information needed
619
+
620
+ # Citation
621
+
622
+ **BibTeX:**
623
 
624
+ ```bibtex
625
+ @misc{kudugunta2023madlad400,
626
+ title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
627
+ author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
628
+ year={2023},
629
+ eprint={2309.04662},
630
+ archivePrefix={arXiv},
631
+ primaryClass={cs.CL}
632
+ }
633
+ ```