---
language: pt
license: mit
tags:
- bert
- pytorch
datasets:
- Twitter
---
|
|
|
|
|
# <a name="introduction"></a> BERTabaporu: a genre-specific pre-trained model of Portuguese-speaking social media |
|
|
|
## Introduction |
|
|
|
BERTabaporu shares the architecture of [BERT](https://arxiv.org/abs/1810.04805) and was trained from scratch following the original BERT pre-training procedure. It was built from a collection of about 238 million tweets written by over 100 thousand unique Twitter users, conveying over 2.9 billion words in total.
|
|
|
## Available models |
|
|
|
| Model                                  | Arch.      | #Layers | #Params |
| -------------------------------------- | ---------- | ------- | ------- |
| `pablocosta/bertabaporu-base-uncased`  | BERT-Base  | 12      | 110M    |
| `pablocosta/bertabaporu-large-uncased` | BERT-Large | 24      | 335M    |
|
|
|
## Usage |
|
|
|
```python
from transformers import AutoTokenizer            # or BertTokenizer
from transformers import AutoModelForPreTraining  # or BertForPreTraining, to load the pre-training heads
from transformers import AutoModel                # or BertModel, for BERT without the pre-training heads

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
```
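
As a minimal sketch of downstream use (the example tweet is invented for illustration), the snippet below extracts contextual embeddings with `AutoModel`, i.e. the encoder without the pre-training heads:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModel.from_pretrained('pablocosta/bertabaporu-base-uncased')
model.eval()

# Illustrative example tweet; any Portuguese text works here.
text = "bom dia pessoal, tudo bem com vocês?"
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; 768 dimensions for the base model.
token_embeddings = outputs.last_hidden_state  # shape: [1, seq_len, 768]

# The [CLS] vector is a common (if rough) sentence-level representation.
cls_embedding = token_embeddings[:, 0]
```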
|
|