---
language: pt
license: mit
tags:
  - bert
  - pytorch
datasets:
  - Twitter
---


# <a name="introduction"></a> BERTabaporu: a genre-specific pre-trained model of Portuguese-speaking social media

## Introduction

Sharing the architecture of [BERT](https://arxiv.org/abs/1810.04805), our model was trained from scratch following the BERT pre-training procedure. It was built from a collection of about 238 million tweets written by over 100 thousand unique Twitter users, comprising over 2.9 billion words in total.

## Available models

| Model                                    | Arch.      | #Layers | #Params |
| ---------------------------------------- | ---------- | ------- | ------- |
| `pablocosta/bertabaporu-base-uncased`    | BERT-Base  | 12      | 110M    |
| `pablocosta/bertabaporu-large-uncased`   | BERT-Large | 24      | 335M    |

## Usage

```python
from transformers import AutoTokenizer  # Or BertTokenizer
from transformers import AutoModelForPreTraining  # Or BertForPreTraining, for loading the pre-training heads
from transformers import AutoModel  # Or BertModel, for BERT without the pre-training heads
model = AutoModelForPreTraining.from_pretrained('pablocosta/bertabaporu-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
```
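
For feature extraction, `AutoModel` (imported above) loads the encoder without the pre-training heads. The sketch below is a minimal example of obtaining contextual embeddings for a tweet; the example sentence and the mean-pooling step are illustrative choices, not part of the original card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the encoder without the pre-training heads to obtain contextual embeddings.
tokenizer = AutoTokenizer.from_pretrained('pablocosta/bertabaporu-base-uncased')
model = AutoModel.from_pretrained('pablocosta/bertabaporu-base-uncased')
model.eval()

tweet = "hoje o dia está lindo demais"  # placeholder example tweet
inputs = tokenizer(tweet, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Last hidden states: (batch_size, sequence_length, hidden_size); 768 for the base model.
token_embeddings = outputs.last_hidden_state
# One simple sentence representation: the mean over token embeddings.
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```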