jbochi committed on
Commit eb925c3
1 Parent(s): a21c677

Update model card, using Flan-T5's as example

Files changed (1): README.md (+158 -18)

README.md CHANGED
@@ -1,6 +1,7 @@
 ---
 license: apache-2.0
 language:
 - en
 - ru
 - es
@@ -421,44 +422,90 @@ language:
 - msb
 library_name: transformers
 tags:
 - text-generation-inference
 datasets:
 - allenai/MADLAD-400
 pipeline_tag: translation
 ---
 
-T5ForConditionalGeneration files for Google's [Madlad-400](https://github.com/google-research/google-research/tree/master/madlad_400) 3B parameter MT model.
 
-Article: [MADLAD-400: A Multilingual And Document-Level Large Audited Dataset](https://arxiv.org/abs/2309.04662)
 
-Available models:
-- [3B](https://huggingface.co/jbochi/madlad400-3b-mt)
-- [7B](https://huggingface.co/jbochi/madlad400-7b-mt)
-- [7B-BT](https://huggingface.co/jbochi/madlad400-7b-mt-bt)
-- [10B](https://huggingface.co/jbochi/madlad400-10b-mt)
 
-Abstract:
 
-> We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset based on CommonCrawl, spanning 419 languages. We discuss the limitations revealed by self-auditing MADLAD-400, and the role data auditing had in the dataset creation process. We then train and release a 10.7B-parameter multilingual machine translation model on 250 billion tokens covering over 450 languages using publicly available data, and find that it is competitive with models that are significantly larger, and report the results on different domains. In addition, we train a 8B-parameter language model, and assess the results on few-shot translation. We make the baseline models available to the research community.
 
-## Usage
 
-Usage with Huggingface's transformers:
 
 ```python
 from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig
 
-model = T5ForConditionalGeneration.from_pretrained('jbochi/madlad400-3b-mt')
-tokenizer = T5Tokenizer.from_pretrained('jbochi/madlad400-3b-mt')
 
 text = "<2pt> I love pizza!"
-input_ids = tokenizer(text, return_tensors="pt").input_ids
 outputs = model.generate(input_ids=input_ids)
 
 tokenizer.decode(outputs[0], skip_special_tokens=True)
 # Eu adoro pizza!
 ```
 
 Usage with [candle](https://github.com/huggingface/candle):
 
 ```bash
@@ -479,11 +526,104 @@ cargo run --example quantized-t5 --release -- \
 Wie geht es dir, mein Freund?
 ```
 
-## Model conversion
 
-I'm not affiliated with Google and was not involved in this research.
 
-The colab I used to generate these files is [here](https://colab.research.google.com/drive/1rZ2NRyl2zwmg0sQ2Wi-uZZF48iVYulTC#scrollTo=pVODoE6gA9sw).
 
-Quantization was done with candle following this [instruction](https://github.com/huggingface/candle/tree/main/candle-examples/examples/quantized-t5#generating-quantized-weight-files).
 
 ---
 license: apache-2.0
 language:
+- multilingual
 - en
 - ru
 - es
 
 - msb
 library_name: transformers
 tags:
+- text2text-generation
 - text-generation-inference
 datasets:
 - allenai/MADLAD-400
 pipeline_tag: translation
+
+widget:
+- text: "<2en> Como vai, amigo?"
+  example_title: "Translation to English"
+- text: "<2de> Do you speak German?"
+  example_title: "Translation to German"
+
 ---
 
+# Model Card for MADLAD-400-3B-MT
+
+# Table of Contents
+
+0. [TL;DR](#tldr)
+1. [Model Details](#model-details)
+2. [Usage](#usage)
+3. [Uses](#uses)
+4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
+5. [Training Details](#training-details)
+6. [Evaluation](#evaluation)
+7. [Environmental Impact](#environmental-impact)
+8. [Citation](#citation)
+
+# TL;DR
+
+MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was
+trained on 1 trillion tokens covering over 450 languages using publicly available data.
+It is competitive with models that are significantly larger.
 
+**Disclaimer**: [Juarez Bochi](https://huggingface.co/jbochi), who was not involved in this research, converted
+the original weights and wrote the contents of this model card based on the original paper and Flan-T5.
 
+# Model Details
 
+## Model Description
 
+- **Model type:** Language model
+- **Language(s) (NLP):** Multilingual (400+ languages)
+- **License:** Apache 2.0
+- **Related Models:** [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
+- **Original Checkpoints:** [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
+- **Resources for more information:**
+  - [Research paper](https://arxiv.org/abs/2309.04662)
+  - [GitHub Repo](https://github.com/google-research/t5x)
+  - [Hugging Face MADLAD-400 Docs (similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)
 
+# Usage
 
+Below are some example scripts showing how to use the model:
+
+## Using the PyTorch model with `transformers`
+
+### Running the model on a CPU or GPU
+
+<details>
+<summary> Click to expand </summary>
 
 ```python
 from transformers import T5ForConditionalGeneration, T5Tokenizer, GenerationConfig
 
+model_name = 'jbochi/madlad400-3b-mt'
+model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
+tokenizer = T5Tokenizer.from_pretrained(model_name)
 
 text = "<2pt> I love pizza!"
+input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
 outputs = model.generate(input_ids=input_ids)
 
 tokenizer.decode(outputs[0], skip_special_tokens=True)
 # Eu adoro pizza!
 ```
 
+</details>
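In the snippet above, the `<2pt>` tag selects Portuguese as the target language. A minimal, model-free sketch of this convention, useful for assembling prompts for several targets (the `build_prompt` helper is illustrative, not part of the `transformers` API):

```python
# Illustrative helper: MADLAD-400 MT models select the target language via a
# "<2xx>" tag prepended to the source text, e.g. "<2pt>" for Portuguese.
def build_prompt(text: str, target_lang: str) -> str:
    return f"<2{target_lang}> {text}"

prompts = [build_prompt("I love pizza!", lang) for lang in ("pt", "de", "ru")]
print(prompts[0])  # <2pt> I love pizza!
```

Each prompt can then be tokenized and passed to `model.generate` exactly as in the example above.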
+
+## Running the model with Candle
+
+<details>
+<summary> Click to expand </summary>
+
 Usage with [candle](https://github.com/huggingface/candle):
 
 ```bash
 
 Wie geht es dir, mein Freund?
 ```
 
+</details>
+
+
+# Uses
+
+## Direct Use and Downstream Use
+
+> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
+> Primary intended users: Research community.
+
+## Out-of-Scope Use
+
+> These models are trained on general domain data and are therefore not meant to
+> work on domain-specific models out-of-the-box. Moreover, these research models have not been assessed
+> for production use cases.
+
+# Bias, Risks, and Limitations
+
+> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
+> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
+> use case.
+
+## Ethical considerations and risks
+
+> We trained these models with MADLAD-400 and publicly available data to create baseline models that
+> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
+> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
+> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues in the
+> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
+> output for certain domains. Moreover, large models are dual use technologies that have specific risks
+> associated with their use and development. We point the reader to surveys such as those written by
+> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
+> et al. for a thorough discussion of the risks of machine translation systems.
+
+## Known Limitations
 
+More information needed
 
+## Sensitive Use
 
+More information needed
+
+# Training Details
+
+> We train models of various sizes: a 3B, 32-layer parameter model,
+> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
+> We share all parameters of the model across language pairs,
+> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
+> side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target
+> language.
+
+See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
+
+## Training Data
+
+> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
+> model, a combination of parallel datasources covering 157 languages is also used. Further details are
+> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
+
+## Training Procedure
+
+See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
+
+# Evaluation
+
+## Testing Data, Factors & Metrics
+
+> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).
+
+> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
+> domain, though we have not assessed this.
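Benchmarks of this kind are typically scored with metrics such as chrF or BLEU (an assumption here; the paper documents the exact setup). Purely to illustrate the idea behind character n-gram scoring, a simplified sketch, not the implementation behind any official numbers:

```python
from collections import Counter

def char_ngram_fscore(hypothesis: str, reference: str,
                      max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in the spirit of chrF (illustrative only)."""
    hyp = hypothesis.replace(" ", "")
    ref = reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        # Clipped overlap: multiset intersection of hypothesis and reference n-grams.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        if hyp_ngrams:
            precisions.append(overlap / sum(hyp_ngrams.values()))
        if ref_ngrams:
            recalls.append(overlap / sum(ref_ngrams.values()))
    if not precisions or not recalls:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta=2 weights recall more heavily, as chrF does.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 1.0 and fully disjoint strings score 0.0; real evaluations rely on established implementations such as sacrebleu.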
+
+## Results
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)
+
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)
+
+See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
+
+# Environmental Impact
+
+More information needed
+
+# Citation
+
+**BibTeX:**
+
+```bibtex
+@misc{kudugunta2023madlad400,
+  title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
+  author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
+  year={2023},
+  eprint={2309.04662},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```