Finnish Text Difficulty Assessor

Fine-tuned TurkuNLP/bert-base-finnish-cased-v1 for ordinal classification of Finnish text difficulty on an 11-point CEFR-aligned scale.

Model details


Base model	`TurkuNLP/bert-base-finnish-cased-v1`
Task	Single-label ordinal classification
Labels	10 ordinal difficulty levels
Loss	KL-divergence with Gaussian soft labels (SORD)
Augmentation	Back-translation (Estonian ↔ Finnish) + paraphrasing
Max length	512 tokens
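
The loss row above can be illustrated with a minimal sketch in plain Python. It assumes the soft target is a Gaussian-shaped distribution centred on the true difficulty value and that the loss is KL(target || prediction). The width `sigma`, the function names, and the use of the difficulty values as class positions are assumptions made for illustration; this card does not state the settings used in training.

```python
import math

# Class positions: the difficulty values listed in this card's label mapping.
LEVELS = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 6.0]

def gaussian_soft_label(true_value, levels=LEVELS, sigma=0.5):
    """Return a Gaussian-shaped target distribution centred on true_value.

    sigma is a hypothetical width, not a value taken from the model card.
    """
    scores = [-((v - true_value) ** 2) / (2 * sigma ** 2) for v in levels]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# Soft target for a text labelled 3.0, scored against a uniform prediction.
target = gaussian_soft_label(3.0)
loss = kl_divergence(target, softmax([0.0] * len(LEVELS)))
```

Neighbouring levels receive part of the probability mass, so a prediction that is off by one level is penalised less than one that is off by several. In a PyTorch training loop the same quantity is usually computed with `torch.nn.functional.kl_div` on log-probabilities.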

Usage

from transformers import BertForSequenceClassification, BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained("chiunhau/finnish-difficulty-assessor")
model     = BertForSequenceClassification.from_pretrained("chiunhau/finnish-difficulty-assessor")
model.eval()

text   = "Hän käy koulussa joka päivä."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
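with torch.no_grad():                       # inference only, no gradients needed
    logits = model(**inputs).logits         # shape: (1, num_labels)
pred_idx   = logits.argmax(dim=-1).item()   # index of the highest-scoring class
pred_label = model.config.id2label[pred_idx]
print(f"Difficulty level: {pred_label}")   # numeric CEFR value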
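
Because the classes are ordered, the full distribution can be more informative than the top class alone. The sketch below turns a list of logits into probabilities and a probability-weighted mean difficulty. It assumes the values in `id2label` parse as floats, as in this card's label mapping; `expected_difficulty` is a hypothetical helper, not part of the model or of `transformers`.

```python
import math

def expected_difficulty(logits, id2label):
    """Return (probabilities, probability-weighted mean difficulty)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    mean = sum(p * float(id2label[i]) for i, p in enumerate(probs))
    return probs, mean

# Example with made-up logits and a shortened mapping. With the real model,
# pass logits[0].tolist() and model.config.id2label instead.
id2label = {0: "1.0", 1: "1.5", 2: "2.0"}
probs, mean = expected_difficulty([2.0, 1.0, 0.0], id2label)
print(f"Expected difficulty: {mean:.2f}")
```

The mean can fall between two labels, which gives a finer-grained estimate than the argmax when the model is split between adjacent levels.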

Label mapping

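Labels are numeric representations of CEFR levels, from 1.0 (A1) to 6.0 (C2) in steps of 0.5. The value 4.5 is not among the model's labels, which leaves 10 classes.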

Index	Difficulty value
0	1.0
1	1.5
2	2.0
3	2.5
4	3.0
5	3.5
6	4.0
7	5.0
8	5.5
9	6.0
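
For display it can help to convert a numeric label back to a CEFR name. The helper below is a sketch under two assumptions that this card does not spell out: whole numbers 1 to 6 correspond to A1, A2, B1, B2, C1, C2 in order, and a half step lies between the two adjacent levels. `CEFR_NAMES` and `describe_level` are hypothetical names.

```python
# Assumed correspondence between whole-number values and CEFR levels.
CEFR_NAMES = {1: "A1", 2: "A2", 3: "B1", 4: "B2", 5: "C1", 6: "C2"}

def describe_level(value):
    """Map a numeric difficulty value (string or float) to a CEFR name.

    Whole numbers map to a single level; any fractional value is reported
    as lying between the two neighbouring levels.
    """
    value = float(value)
    lower = int(value)  # floor, since all values are positive
    if value == lower:
        return CEFR_NAMES[lower]
    return f"{CEFR_NAMES[lower]}/{CEFR_NAMES[lower + 1]}"

print(describe_level("1.5"))  # A1/A2
print(describe_level("3.0"))  # B1
```

Values outside the 1.0 to 6.0 range raise a `KeyError`, which makes an unexpected label easy to spot.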
