---
language: pt
license: mit
tags:
  - bert
  - pytorch
  - Twitter
---

# BERTabaporu: a genre-specific pre-trained model of Portuguese-speaking social media


BERTabaporu has the same architecture as BERT and was trained from scratch following the BERT pre-training procedure. It was built from a collection of about 238 million tweets written by over 100 thousand unique Twitter users, comprising over 2.9 billion words in total.

## Available models

| Model | Arch. | #Layers | #Params |
| :--- | :--- | ---: | ---: |
| `pablocosta/bertabaporu-base-uncased` | BERT-Base | 12 | 110M |
| `pablocosta/bertabaporu-large-uncased` | BERT-Large | 24 | 335M |


## Usage

```python
from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, for loading the pre-training heads
from transformers import AutoModel  # or BertModel, for BERT without pre-training heads

model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
```
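
As a minimal usage sketch (the example text, variable names, and the choice of the `[CLS]` vector are illustrative, not prescribed by the model card), the snippet below extracts a sentence-level embedding with the headless model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModel.from_pretrained('pablocosta/bertabaporu-base-uncased')
model.eval()

# Illustrative input; any Portuguese tweet-like text works the same way.
inputs = tokenizer('bom dia, tudo bem?', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Use the final hidden state of the [CLS] token as a sentence representation.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 768]) for the base model
```

The same calls work with `pablocosta/bertabaporu-large-uncased`; only the hidden size changes (1024 instead of 768).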

## Cite us

```bibtex
@inproceedings{bertabaporu,
  author    = {Pablo Botton da Costa and Matheus Camasmie Pavan and Wesley Ramos dos Santos and Samuel Caetano da Silva and Ivandr\'{e} Paraboni},
  title     = {{BERTabaporu: assessing a genre-specific language model for Portuguese NLP}},
  booktitle = {Recent Advances in Natural Language Processing ({RANLP-2023})},
  year      = {2023},
  address   = {Varna, Bulgaria}
}
```