Finnish Text Difficulty Assessor
A fine-tuned version of TurkuNLP/bert-base-finnish-cased-v1 for ordinal classification of Finnish text difficulty on an 11-point CEFR-aligned scale.
Model details
| Property | Value |
|---|---|
| Base model | TurkuNLP/bert-base-finnish-cased-v1 |
| Task | Single-label ordinal classification |
| Labels | 10 ordinal difficulty levels |
| Loss | KL-divergence with Gaussian soft labels (SORD) |
| Augmentation | Back-translation (Estonian ↔ Finnish) + paraphrasing |
| Max length | 512 tokens |
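
The loss row above can be illustrated with a small, dependency-free sketch. It is not the training code for this model: the smoothing width `sigma` and the helper names `gaussian_soft_labels`, `softmax`, and `kl_divergence` are assumptions introduced here for illustration. The idea is that each gold label is replaced by a Gaussian-shaped distribution over the ordinal classes, so that near misses are penalized less than distant ones.

```python
import math

# Difficulty values copied from the label-mapping table of this card.
DIFFICULTY_VALUES = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0]


def gaussian_soft_labels(target_value, values=DIFFICULTY_VALUES, sigma=0.5):
    """Return a distribution over classes that peaks at target_value.

    sigma is a hypothetical smoothing width, not a value taken from this card.
    """
    weights = [math.exp(-((v - target_value) ** 2) / (2 * sigma ** 2)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


# Soft target for a text labelled 3.0, compared against uniform predictions.
soft = gaussian_soft_labels(3.0)
loss = kl_divergence(soft, softmax([0.0] * len(DIFFICULTY_VALUES)))
```

In a PyTorch training loop the same quantity would typically be computed with `torch.nn.functional.kl_div` on log-probabilities; the plain-Python version is used here only to keep the sketch self-contained.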
Usage
```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = BertTokenizer.from_pretrained("chiunhau/finnish-difficulty-assessor")
model = BertForSequenceClassification.from_pretrained("chiunhau/finnish-difficulty-assessor")
model.eval()  # inference mode (disables dropout)

text = "Hän käy koulussa joka päivä."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the most likely class and look up its difficulty value.
pred_idx = logits.argmax(dim=-1).item()
pred_label = model.config.id2label[pred_idx]
print(f"Difficulty level: {pred_label}")  # numeric CEFR value
```
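
The usage example reports the single most likely class. Because the classes are ordinal and the model was trained against soft labels, one possible alternative is to report the probability-weighted mean of the difficulty values. The sketch below is an assumption about how one might post-process the logits, not a documented feature of this model; `expected_difficulty` and `example_logits` are hypothetical names and values.

```python
import math

# Difficulty values copied from the label-mapping table of this card.
DIFFICULTY_VALUES = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0]


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_difficulty(logits, values=DIFFICULTY_VALUES):
    """Probability-weighted mean of the difficulty values."""
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, values))


# Made-up logits for one input; with the real model they could be obtained
# from model(**inputs).logits[0].tolist().
example_logits = [0.1, 0.3, 2.0, 2.2, 0.5, -1.0, -2.0, -3.0, -3.0, -3.0]
score = expected_difficulty(example_logits)
print(round(score, 2))
```

The result is a continuous score between 1.0 and 6.0, which can be useful for ranking texts, whereas the argmax gives one of the 10 discrete labels.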
Label mapping
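Labels are numeric representations of CEFR levels (A1 → C2). The model outputs a class index, which maps to a difficulty value as listed below. Note that the table has no entry for 4.5, which accounts for the model having 10 classes on a scale with 11 points.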
| Index | Difficulty value |
|---|---|
| 0 | 1.0 |
| 1 | 1.5 |
| 2 | 2.0 |
| 3 | 2.5 |
| 4 | 3.0 |
| 5 | 3.5 |
| 6 | 4.0 |
| 7 | 5.0 |
| 8 | 5.5 |
| 9 | 6.0 |
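
To turn a numeric value back into a CEFR label, a helper along the following lines could be used. The correspondence is an assumption, since this card only says that the values run from A1 to C2: whole numbers 1.0 to 6.0 are taken to mean A1, A2, B1, B2, C1, C2, and a half step is taken to lie between two neighbouring levels. The function name `to_cefr` and the `"A2/B1"` notation are hypothetical.

```python
# Assumed correspondence: 1.0 -> A1, 2.0 -> A2, ..., 6.0 -> C2.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def to_cefr(value):
    """Map a numeric difficulty value (float or numeric string) to a CEFR label.

    Whole numbers map to a single level; anything in between is reported as
    lying between the two neighbouring levels, e.g. 2.5 -> "A2/B1".
    """
    value = float(value)
    if not 1.0 <= value <= 6.0:
        raise ValueError(f"difficulty value out of range: {value}")
    lower = int(value)  # floor, since value is positive
    if value == lower:
        return CEFR_LEVELS[lower - 1]
    return f"{CEFR_LEVELS[lower - 1]}/{CEFR_LEVELS[lower]}"


print(to_cefr(1.0))
print(to_cefr("2.5"))
```

Accepting strings as well as floats means the helper can be applied directly to a label looked up in `model.config.id2label`, assuming those labels are stored as numeric strings.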