DeBERTa-v3 CEFR Vocabulary Classifier

A fine-tuned DeBERTa-v3-base model for predicting the CEFR proficiency level of English vocabulary items.

The model classifies words into six Common European Framework of Reference (CEFR) levels:

  • A1
  • A2
  • B1
  • B2
  • C1
  • C2

Typical applications include:

  • Vocabulary difficulty estimation
  • Language learning applications
  • CEFR-aware educational tools
  • Vocabulary profiling
  • Adaptive learning systems
  • Linguistic research

Model Details

| Item | Value |
|---|---|
| Base Model | microsoft/deberta-v3-base |
| Task | CEFR Classification |
| Labels | A1, A2, B1, B2, C1, C2 |
| Architecture | DeBERTa-v3 |
| Framework | Hugging Face Transformers |
| Language | English |

Dataset

The model was trained on CEFR annotations from the star092304/CEFR-Annotated-WordNet dataset.

Dataset card:

https://huggingface.co/datasets/star092304/CEFR-Annotated-WordNet

The dataset provides CEFR proficiency annotations for WordNet lexical entries and is based on the following work:

Reference Paper

CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Authors:

  • Masato Kikuchi
  • Masatsugu Ono
  • Toshioki Soga
  • Tetsu Tanabe
  • Tadachika Ozono

Paper:

https://arxiv.org/html/2510.18466v2

Citation

@article{kikuchi2025cefrannotatedwordnet,
  title={CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning},
  author={Kikuchi, Masato and Ono, Masatsugu and Soga, Toshioki and Tanabe, Tetsu and Ozono, Tadachika},
  journal={arXiv preprint arXiv:2510.18466},
  year={2025}
}

Training Data Construction

Training examples were generated by aligning:

  • SemCor sense annotations
  • WordNet lexical entries
  • CEFR labels from CEFR-Annotated WordNet

Additional preprocessing included:

  • Lemmatization
  • Sense matching
  • Multi-word expression removal
  • Proper noun filtering

Only single-word lexical items were retained for training.
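
The filtering steps above can be sketched roughly as follows. This is a minimal illustration and not the author's actual pipeline: the `SenseToken` shape, the `build_examples` helper, the sense keys, the toy CEFR labels, and the deduplication step are all hypothetical, chosen only to show how sense matching, proper noun filtering, and multi-word expression removal might fit together.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SenseToken:
    """One sense-tagged token (hypothetical shape, loosely modelled on SemCor)."""
    surface: str               # token as it appears in the text
    lemma: str                 # lemmatized form
    pos: str                   # part-of-speech tag, e.g. "NNS", "NNP", "VB"
    sense_key: Optional[str]   # WordNet sense identifier, or None if untagged


def build_examples(tokens, cefr_by_sense):
    """Return unique (lemma, CEFR level) pairs for single-word, non-proper items."""
    examples = []
    seen = set()
    for tok in tokens:
        if tok.sense_key is None:
            continue  # no sense annotation to match against
        if tok.pos.startswith("NNP"):
            continue  # proper noun filtering
        if " " in tok.lemma or "_" in tok.lemma:
            continue  # multi-word expression removal
        level = cefr_by_sense.get(tok.sense_key)
        if level is None:
            continue  # sense has no CEFR annotation
        pair = (tok.lemma.lower(), level)
        if pair not in seen:  # deduplication is an assumption of this sketch
            seen.add(pair)
            examples.append(pair)
    return examples


# Made-up tokens and labels, for illustration only.
tokens = [
    SenseToken("books", "book", "NNS", "book%1:10:00::"),
    SenseToken("London", "london", "NNP", "london%1:15:00::"),
    SenseToken("gave up", "give_up", "VB", "give_up%2:40:00::"),
    SenseToken("the", "the", "DT", None),
]
cefr_by_sense = {"book%1:10:00::": "A1", "give_up%2:40:00::": "B1"}

examples = build_examples(tokens, cefr_by_sense)
print(examples)
```

In this toy run only `book` survives: `London` is dropped as a proper noun, `give_up` as a multi-word expression, and `the` because it carries no sense annotation.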


Training Performance

Training Curves

[Figure: training curves]


Test Confusion Matrix

[Figure: confusion matrix on the test set]


Evaluation Preview

[Figure: classification report]
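
For readers who want to reproduce this kind of evaluation on their own data, a confusion matrix and per-level report can be computed along these lines. This is a plain-Python sketch under stated assumptions: the function names are hypothetical, and the `gold` and `pred` lists are toy values, not results from this model.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def confusion_matrix(gold, pred, labels=LEVELS):
    """Rows are gold levels, columns are predicted levels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix


def per_class_report(matrix, labels=LEVELS):
    """Precision, recall, F1 and support for each level."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                   # gold items of this level
        predicted = sum(row[i] for row in matrix)  # items predicted as this level
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return report


# Toy labels for illustration only; these are NOT results from this model.
gold = ["A1", "A1", "B1", "B2", "C1"]
pred = ["A1", "A2", "B1", "B1", "C1"]

matrix = confusion_matrix(gold, pred)
report = per_class_report(matrix)
accuracy = sum(matrix[i][i] for i in range(len(LEVELS))) / len(gold)

for level in LEVELS:
    print(level, report[level])
print("accuracy:", accuracy)
```

Because CEFR levels are ordered, off-diagonal cells next to the diagonal (for example B2 predicted as B1) are usually less serious errors than cells far from it.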


Usage

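import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "star092304/cefr-level-deberta-v3-base"

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Class indices are assumed to follow this order.
label_names = ["A1", "A2", "B1", "B2", "C1", "C2"]

word = "investigation"

# Tokenize the single word as a one-item batch of PyTorch tensors.
inputs = tokenizer(word, return_tensors="pt", truncation=True)

# Run a forward pass without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring class and map it to its CEFR label.
pred_id = outputs.logits.argmax(dim=-1).item()

print(label_names[pred_id])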
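
Since the argmax hides how close the runner-up levels were, it can help to turn the logits into probabilities and rank all six levels. The sketch below uses only the standard library; `softmax` and `rank_levels` are hypothetical helpers, and `example_logits` is a made-up vector rather than real model output.

```python
import math

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_levels(logits, labels=LEVELS):
    """Return (label, probability) pairs, most probable first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration; not produced by the model.
example_logits = [-1.2, 0.3, 1.1, 2.4, 0.9, -0.5]

ranked = rank_levels(example_logits)
for label, prob in ranked[:3]:
    print(f"{label}: {prob:.3f}")
```

With the usage code above, `outputs.logits[0].tolist()` would supply the list of six logits to pass to `rank_levels`.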

Example Predictions

| Word | Predicted CEFR |
|---|---|
| book | A1 |
| happy | A1 |
| journey | A2 |
| improve | B1 |
| investigation | B2 |
| sophisticated | C1 |
| quintessential | C2 |

Limitations

  • The model predicts vocabulary difficulty at the word level; as the usage example shows, the input is a single word with no surrounding sentence.
  • CEFR levels can vary depending on context and meaning.
  • Polysemous words may belong to multiple CEFR levels depending on usage.
  • Predictions should be interpreted as estimated proficiency levels rather than absolute ground truth.

Intended Use

This model is intended for:

  • Educational research
  • Language learning systems
  • Vocabulary recommendation engines
  • CEFR-aware NLP pipelines

The model is not intended for high-stakes educational assessment or certification decisions.
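
As one illustration of a CEFR-aware pipeline, a simple vocabulary profile can be built by predicting a level for each distinct word in a text and reporting the share of words per level. This is a sketch under stated assumptions: `profile_text` and the `predict_level` callable are hypothetical, and the stub below just looks words up in a small dictionary taken from the example predictions table. In practice `predict_level` would wrap the tokenizer and model call from the Usage section.

```python
import re
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def profile_text(text, predict_level):
    """Return the fraction of distinct words at each CEFR level."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer: ASCII letters only
    unique_words = sorted(set(words))
    counts = Counter(predict_level(word) for word in unique_words)
    total = len(unique_words)
    return {
        level: (counts.get(level, 0) / total if total else 0.0)
        for level in LEVELS
    }


# Stub predictor standing in for the model; values copied from the
# example predictions table, so it only knows these words.
toy_levels = {
    "book": "A1",
    "happy": "A1",
    "journey": "A2",
    "improve": "B1",
    "investigation": "B2",
}

profile = profile_text(
    "Book, journey, investigation; happy book!",
    toy_levels.__getitem__,
)
print(profile)
```

Counting distinct words rather than tokens is a design choice of this sketch; a token-weighted profile would count repeated words each time they occur.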


Acknowledgements

This work builds upon:

  • WordNet
  • SemCor
  • CEFR-Annotated WordNet
  • Hugging Face Transformers
  • Microsoft DeBERTa-v3

Special thanks to the authors of the CEFR-Annotated WordNet dataset and paper.
