DeBERTa-v3 CEFR Vocabulary Classifier

A fine-tuned DeBERTa-v3-base model for predicting the CEFR proficiency level of English vocabulary items.

The model classifies words into the six Common European Framework of Reference (CEFR) levels: A1, A2, B1, B2, C1, and C2.

This model is intended for:

Vocabulary difficulty estimation
Language learning applications
CEFR-aware educational tools
Vocabulary profiling
Adaptive learning systems
Linguistic research

Model Details

Item	Value
Base Model	microsoft/deberta-v3-base
Task	CEFR Classification
Labels	A1, A2, B1, B2, C1, C2
Architecture	DeBERTa-v3
Framework	Hugging Face Transformers
Language	English

Dataset

This model was trained using CEFR annotations derived from the following dataset:

Dataset: star092304/CEFR-Annotated-WordNet
Dataset card: https://huggingface.co/datasets/star092304/CEFR-Annotated-WordNet

The dataset provides CEFR proficiency annotations for WordNet lexical entries and was created based on the following work:

Reference Paper

CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Authors:

Masato Kikuchi
Masatsugu Ono
Toshioki Soga
Tetsu Tanabe
Tadachika Ozono

Paper:

https://arxiv.org/html/2510.18466v2

Citation

@article{kikuchi2025cefrannotatedwordnet,
  title={CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning},
  author={Kikuchi, Masato and Ono, Masatsugu and Soga, Toshioki and Tanabe, Tetsu and Ozono, Tadachika},
  journal={arXiv preprint arXiv:2510.18466},
  year={2025}
}

Training Data Construction

Training examples were generated by aligning:

SemCor sense annotations
WordNet lexical entries
CEFR labels from CEFR-Annotated WordNet

Additional preprocessing included:

Lemmatization
Sense matching
Multi-word expression removal
Proper noun filtering

Only single-word lexical items were retained for training.
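
The alignment and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the record layout, the `build_examples` and `is_multiword` helpers, the `cefr_by_sense` mapping, the sense keys, and the rules used to detect proper nouns (a Penn Treebank `NNP`/`NNPS` tag) and multi-word expressions (an underscore or space in the lemma) are all assumptions made for the example.

```python
# Hypothetical sketch of the alignment and filtering steps; the data layout
# and helper names are assumptions, not the authors' actual pipeline.


def is_multiword(lemma: str) -> bool:
    """WordNet-style lemmas join multi-word expressions with underscores."""
    return "_" in lemma or " " in lemma


def build_examples(tagged_tokens, cefr_by_sense):
    """Keep single-word, non-proper-noun tokens whose sense has a CEFR label.

    tagged_tokens: iterable of (lemma, pos_tag, sense_key) tuples, e.g. taken
                   from a sense-annotated corpus such as SemCor.
    cefr_by_sense: dict mapping a sense key to a CEFR label.
    """
    examples = []
    for lemma, pos_tag, sense_key in tagged_tokens:
        if is_multiword(lemma):
            continue  # multi-word expression removal
        if pos_tag in {"NNP", "NNPS"}:
            continue  # proper noun filtering
        label = cefr_by_sense.get(sense_key)
        if label is None:
            continue  # no CEFR annotation matched this sense
        examples.append((lemma.lower(), label))
    return examples


# Toy inputs, invented purely for illustration (sense keys are made up).
tokens = [
    ("book", "NN", "book%1:10:00::"),
    ("look_up", "VB", "look_up%2:32:00::"),
    ("London", "NNP", "london%1:15:00::"),
    ("journey", "NN", "journey%1:04:00::"),
]
labels = {
    "book%1:10:00::": "A1",
    "journey%1:04:00::": "A2",
    "london%1:15:00::": "A1",
}

examples = build_examples(tokens, labels)
print(examples)
```

In this sketch the multi-word entry and the proper noun are dropped before label lookup, so only single-word items with a matched sense reach the training set.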

Training Performance

[Figure: Training Curves]

[Figure: Test Confusion Matrix]

[Figure: Evaluation Preview]

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "star092304/cefr-level-deberta-v3-base"

# Load the tokenizer and the fine-tuned classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Class indices map to CEFR levels in this order
label_names = ["A1", "A2", "B1", "B2", "C1", "C2"]

word = "investigation"

# Tokenize the single word and return PyTorch tensors
inputs = tokenizer(word, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and print its CEFR label
pred_id = outputs.logits.argmax(dim=-1).item()
print(label_names[pred_id])
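
Because a single predicted label hides how close the neighbouring levels were, it can help to look at the full probability distribution. The sketch below is an illustrative helper, not part of the released model: `rank_levels` is a hypothetical function, the logits are made-up numbers standing in for `outputs.logits[0].tolist()`, and the label order is assumed to match `label_names` above.

```python
import math

LABELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def rank_levels(logits, labels=LABELS):
    """Convert raw logits into (label, probability) pairs, most likely first."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits standing in for outputs.logits[0].tolist()
example_logits = [-1.2, 0.3, 1.1, 2.4, 0.9, -0.5]
ranking = rank_levels(example_logits)

# Show the two most likely levels with their probabilities
for label, prob in ranking[:2]:
    print(f"{label}: {prob:.2f}")
```

Reporting the runner-up level alongside the top prediction is one way to flag words that sit near a boundary between two levels.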

Example Predictions

Word	Predicted CEFR
book	A1
happy	A1
journey	A2
improve	B1
investigation	B2
sophisticated	C1
quintessential	C2
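
Word-level predictions like these can be aggregated into a simple vocabulary profile for a text. The following is a minimal sketch under stated assumptions: `vocabulary_profile` is a hypothetical helper, the word list is invented, and the per-word levels are copied from the table above rather than produced by running the model.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Predictions copied from the example table above.
predictions = {
    "book": "A1",
    "happy": "A1",
    "journey": "A2",
    "improve": "B1",
    "investigation": "B2",
    "sophisticated": "C1",
    "quintessential": "C2",
}


def vocabulary_profile(words, level_of):
    """Return the share of recognised words that falls at each CEFR level."""
    counts = Counter(level_of[w] for w in words if w in level_of)
    total = sum(counts.values())
    if total == 0:
        return {level: 0.0 for level in LEVELS}
    return {level: counts[level] / total for level in LEVELS}


# Invented word list standing in for the tokens of a short text.
text_words = ["happy", "journey", "book", "investigation"]
profile = vocabulary_profile(text_words, predictions)

for level in LEVELS:
    print(f"{level}: {profile[level]:.0%}")
```

Words without a prediction are skipped here; a real profiler would need to decide how to handle them.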

Limitations

The model predicts vocabulary difficulty at the word level.
CEFR levels can vary depending on context and meaning.
Polysemous words may belong to multiple CEFR levels depending on usage.
Predictions should be interpreted as estimated proficiency levels rather than absolute ground truth.

Intended Use

This model is intended for:

Educational research
Language learning systems
Vocabulary recommendation engines
CEFR-aware NLP pipelines

The model is not intended for high-stakes educational assessment or certification decisions.

Acknowledgements

This work builds upon:

WordNet
SemCor
CEFR-Annotated WordNet
Hugging Face Transformers
Microsoft DeBERTa-v3

Special thanks to the authors of the CEFR-Annotated WordNet dataset and paper.
