---
language: ro
---

# Romanian DistilBERT

This repository contains a cased DistilBERT model for Romanian. The teacher model used for distillation is [dumitrescustefan/bert-base-romanian-cased-v1](https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1).

## Usage

```python
from transformers import AutoTokenizer, AutoModel

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("racai/distilbert-base-romanian-cased")
model = AutoModel.from_pretrained("racai/distilbert-base-romanian-cased")

# tokenize a test sentence ("This is a test sentence.")
input_ids = tokenizer.encode("Aceasta este o propoziție de test.", add_special_tokens=True, return_tensors="pt")

# run the tokens through the model
outputs = model(input_ids)

print(outputs)
```

## Model Size

Romanian DistilBERT is about 35% smaller than the original Romanian BERT.

| Model                          | Size (MB) | Params (Millions) |
|--------------------------------|:---------:|:-----------------:|
| bert-base-romanian-cased-v1    | 477.2     | 124.4             |
| distilbert-base-romanian-cased | 312.7     | 81.3              |

## Evaluation

We evaluated Romanian DistilBERT against the original Romanian BERT on five tasks, reported with seven metrics:

- **UPOS**: Universal Part of Speech (F1-macro)
- **XPOS**: Extended Part of Speech (F1-macro)
- **NER**: Named Entity Recognition (F1-macro)
- **SAPN**: Sentiment Analysis - Positive vs. Negative (Accuracy)
- **SAR**: Sentiment Analysis - Rating (F1-macro)
- **DI**: Dialect Identification (F1-macro)
- **STS**: Semantic Textual Similarity (Pearson)

| Model                          | UPOS  | XPOS  | NER   | SAPN  | SAR   | DI    | STS   |
|--------------------------------|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| bert-base-romanian-cased-v1    | 98.00 | 96.46 | 85.88 | 98.07 | 79.61 | 95.58 | 79.11 |
| distilbert-base-romanian-cased | 97.97 | 97.08 | 83.35 | 98.40 | 83.01 | 96.31 | 80.57 |
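## Fine-tuning Sketch

The token-level scores above (UPOS, XPOS, NER) were presumably obtained by fine-tuning with a token-classification head; this card does not prescribe a training recipe. The snippet below is only a sketch of how such a head can be attached with the standard `transformers` API, where `NUM_LABELS` is a placeholder for your tag-set size:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

NUM_LABELS = 17  # placeholder: e.g. the 17 UPOS tags, or your NER label count

tokenizer = AutoTokenizer.from_pretrained("racai/distilbert-base-romanian-cased")

# attaches a randomly initialised classification layer on top of the distilled encoder
model = AutoModelForTokenClassification.from_pretrained(
    "racai/distilbert-base-romanian-cased",
    num_labels=NUM_LABELS,
)

# from here, train on your own labelled data, e.g. with transformers.Trainer
```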
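## Sentence Embeddings

`AutoModel` returns per-token hidden states rather than a single sentence vector. For sentence-level comparisons such as STS, one common choice (an assumption on our part, not something this card prescribes) is mean pooling over the non-padding tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("racai/distilbert-base-romanian-cased")
model = AutoModel.from_pretrained("racai/distilbert-base-romanian-cased")

# tokenize with padding so the attention mask marks the real tokens
batch = tokenizer(["Aceasta este o propoziție de test."], padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# average only over non-padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)
print(embeddings.shape)
```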
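## Verifying the Parameter Count

The parameter counts in the Model Size table can be double-checked locally with a few lines of PyTorch; the expected total below is read off the table, not an official figure:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("racai/distilbert-base-romanian-cased")

# sum the number of elements of every parameter tensor in the model
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # should be roughly 81M, per the table above
```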