metadata

license: mit
datasets: thegoodfellas/brwac_tiny
widget:
  - text: Demanda por fundos de <mask> para crianças cresce em 2022
    example_title: Exemplo 1
  - text: Havia uma <mask> no meio do caminho
    example_title: Exemplo 2
  - text: >-
      Na verdade, começar a <mask> cedo é ideal para ter um bom dinheiro no
      futuro
    example_title: Exemplo 3
  - text: Mitos e verdades sobre o <mask>. Doença que mais mata mulheres no Brasil.
    example_title: Exemplo 4
base_model: xlm-roberta-base
model-index:
  - name: tgf-xlm-roberta-base-pt-br
    results: []

tgf-xlm-roberta-base-pt-br

This model is a fine-tuned version of xlm-roberta-base, trained on the BrWaC dataset.
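As a rough illustration of how a masked-language model like this one might be queried, the sketch below wraps the Transformers `fill-mask` pipeline. The repository id in `MODEL_ID` is an assumption pieced together from the model name and the dataset owner in the metadata; adjust it if the model lives elsewhere. The helper name `top_predictions` is hypothetical.

```python
# Assumed Hub repository id (not stated in this card); change it if needed.
MODEL_ID = "thegoodfellas/tgf-xlm-roberta-base-pt-br"


def top_predictions(text, top_k=3, model_id=MODEL_ID):
    """Return (token, score) pairs for the single <mask> token in `text`."""
    if text.count("<mask>") != 1:
        raise ValueError("text must contain exactly one <mask> token")
    # Imported lazily so the input check above works without transformers installed.
    from transformers import pipeline

    # Downloads the tokenizer and weights on first use.
    fill_mask = pipeline("fill-mask", model=model_id)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

Calling `top_predictions("Havia uma <mask> no meio do caminho")` would return the three highest-scoring candidates for the masked word, using one of the widget sentences from the metadata.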

Model description

This is a version of xlm-roberta-base fine-tuned for Brazilian Portuguese. It was trained on the BrWaC dataset, following the principles of the RoBERTa paper. The key strategies are:

Full-Sentences: Quoted from the paper: "Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents".
Tuned hyperparameters: adam_beta1=0.9, adam_beta2=0.98, adam_epsilon=1e-6 (as the paper suggests)
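The Full-Sentences strategy quoted above can be sketched in plain Python. This is a minimal illustration, not the training code used for this model: it assumes documents are already split into sentences and tokenized into id lists, and the function name `pack_full_sentences` and the default `sep_id` are hypothetical choices.

```python
from typing import Iterable, Iterator, List


def pack_full_sentences(
    documents: Iterable[List[List[int]]],
    max_len: int = 512,
    sep_id: int = 2,  # assumed separator token id; check the tokenizer you use
) -> Iterator[List[int]]:
    """Yield inputs of at most `max_len` token ids built from whole sentences.

    Each document is a list of sentences; each sentence is a list of token ids.
    """
    buffer: List[int] = []
    for doc_index, doc in enumerate(documents):
        # Crossing a document boundary: add the extra separator token.
        if doc_index > 0 and buffer:
            if len(buffer) + 1 > max_len:
                yield buffer
                buffer = []
            else:
                buffer.append(sep_id)
        for sentence in doc:
            sentence = sentence[:max_len]  # truncate over-long sentences
            if len(buffer) + len(sentence) > max_len:
                yield buffer
                buffer = []
            buffer.extend(sentence)
    if buffer:
        yield buffer
```

With `max_len=6` and `sep_id=0`, two documents `[[1, 2, 3], [4, 5]]` and `[[6, 7, 8]]` are packed into `[1, 2, 3, 4, 5, 0]` followed by `[6, 7, 8]`: the first input crosses the document boundary and ends with the separator, and the sentence that no longer fits starts a new input.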

Availability

The source code is available here

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-4
train_batch_size: 16
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 512
optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 2
mixed_precision_training: Native AMP
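To make the relationship between these values concrete, here is a small sketch of the arithmetic. It assumes 4 devices, taken from the Environment section, and uses a hypothetical total step count for the schedule, since the card does not report one. The linear rule shown mirrors the usual warmup-then-decay shape and is not taken from the training script.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear warmup to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 16 per device x 8 accumulation steps x 4 GPUs (assumed) = 512
total = effective_batch_size(per_device=16, accumulation_steps=8, num_devices=4)

# total_steps=10_000 is a hypothetical value used only for illustration.
halfway_through_warmup = linear_warmup_lr(500, base_lr=1e-4, warmup_steps=1000, total_steps=10_000)
```

Under these assumptions the product matches the reported total_train_batch_size of 512, and the learning rate reaches its peak of 1e-4 at step 1000 before decaying linearly.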

Framework versions

Transformers 4.23.1
PyTorch 1.11.0a0+b6df043
Datasets 2.6.1
Tokenizers 0.13.1

Environment

4xA100.88V NVIDIA

Special thanks to DataCrunch.io for their amazing and affordable GPUs.