Paper: For more details, see BERTabaporu: Assessing a Genre-Specific Language Model for Portuguese NLP (RANLP 2023).

Introduction

BERTabaporu is a Brazilian Portuguese BERT model for the Twitter domain. It was pre-trained on a collection of 238 million tweets written by over 100 thousand unique Twitter users, comprising over 2.9 billion tokens in total.

Available models

Model                                   Arch.       #Layers  #Params
pablocosta/bertabaporu-base-uncased     BERT-Base   12       110M
pablocosta/bertabaporu-large-uncased    BERT-Large  24       335M
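
As a rough sanity check on the parameter counts listed above, the sketch below tallies the weights of a standard BERT encoder from its dimensions. The hidden and feed-forward sizes are those of the standard BERT-Base and BERT-Large configurations; the vocabulary size of 30,522 is an assumption borrowed from the original English uncased BERT, not a figure taken from this model card, so the actual totals for BERTabaporu may differ slightly. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, intermediate,
                     vocab_size=30522,   # ASSUMPTION: original BERT uncased vocabulary size
                     max_positions=512, type_vocab=2):
    """Count the parameters of a standard BERT encoder (with pooler, without task heads)."""
    # Word, position and token-type embeddings, plus the embedding LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Feed-forward: two dense layers (weights + biases), plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + 2 * hidden

    # Pooler: one dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


base = bert_param_count(num_layers=12, hidden=768, intermediate=3072)
large = bert_param_count(num_layers=24, hidden=1024, intermediate=4096)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals come to roughly 109.5M and 335.1M, in line with the rounded 110M and 335M figures in the table.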

Usage

from transformers import AutoTokenizer  # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining; loads the pretraining heads
from transformers import AutoModel  # or BertModel; BERT without the pretraining heads

# Load the base model with its pretraining heads, plus the matching tokenizer.
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
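
For downstream tasks, one common way to use an encoder like this is to turn each tweet into a fixed-size vector. The following is a minimal sketch of that idea, not the authors' evaluation setup: it loads the model without pretraining heads via `AutoModel` and averages the token vectors, ignoring padding. Mean pooling is a choice made here for illustration, and the helper names `mean_pool` and `embed_tweets` are hypothetical.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, skipping padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to the dtype of the hidden states.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


def embed_tweets(texts, model_name="pablocosta/bertabaporu-base-uncased"):
    """Return one mean-pooled embedding per input text (downloads the model on first use)."""
    # Imported here so that mean_pool can be reused without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed_tweets(["bom dia, twitter!", "que jogo foi esse"])` should return a tensor with one row per tweet, whose width equals the model's hidden size. For classification, fine-tuning with a task head (for example through `AutoModelForSequenceClassification`) is usually a stronger option than frozen embeddings.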

Cite us

@inproceedings{costa-etal-2023-bertabaporu,
    title = "{BERT}abaporu: Assessing a Genre-Specific Language Model for {P}ortuguese {NLP}",
    author = "Costa, Pablo Botton and
      Pavan, Matheus Camasmie and
      Santos, Wesley Ramos and
      Silva, Samuel Caetano and
      Paraboni, Ivandr{\'e}",
    editor = "Mitkov, Ruslan and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.24",
    pages = "217--223",
    abstract = "Transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT) are now mainstream in the NLP field, but extensions to languages other than English, to new domains and/or to more specific text genres are still in demand. In this paper we introduced BERTabaporu, a BERT language model that has been pre-trained on Twitter data in the Brazilian Portuguese language. The model is shown to outperform the best-known general-purpose model for this language in three Twitter-related NLP tasks, making a potentially useful resource for Portuguese NLP in general.",
}
