MAP - DeBERTa-v3-large 5-Fold Classifier
Kaggle Competition: MAP - Charting Student Math Misunderstandings
Final Score: Public LB 0.91924 / Private LB 0.91107 (DeBERTa-v3-large, 5-fold ensemble)
Model Description
This repository contains 5-fold checkpoints of a DeBERTa-v3-large classifier trained for the MAP Kaggle competition. The task is to predict the Category:Misconception label for each student response, given the question text, the student's selected answer, and the student's written explanation.
The label space has 65 classes combining:
- Category (6 types): True_Correct, True_Neither, True_Misconception, False_Correct, False_Neither, False_Misconception
- Misconception name (or NA if no misconception)
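As an illustration of how the two parts combine, the sketch below builds and splits a label string. The `Category:Misconception` separator follows the format stated above; the helper names `make_label` and `split_label` are hypothetical and not part of this repository.

```python
# Sketch only: helper names are hypothetical, not part of this repository.
CATEGORIES = [
    "True_Correct", "True_Neither", "True_Misconception",
    "False_Correct", "False_Neither", "False_Misconception",
]


def make_label(category: str, misconception: str = "NA") -> str:
    """Join a category and a misconception name into one label string."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return f"{category}:{misconception}"


def split_label(label: str) -> tuple:
    """Split a label back into (category, misconception)."""
    category, misconception = label.split(":", 1)
    return category, misconception


print(make_label("True_Correct"))       # True_Correct:NA
print(split_label("False_Neither:NA"))  # ('False_Neither', 'NA')
```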
Repository Structure
├── deberta_fold0/
│   ├── config.json
│   ├── model.safetensors        # DeBERTa-v3-large backbone weights
│   ├── head_weights.pt          # Custom pooler + classifier head weights
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deberta_fold1/ ... (same structure)
├── deberta_fold2/ ... (same structure)
├── deberta_fold3/ ... (same structure)
├── deberta_fold4/ ... (same structure)
└── deberta_label_list.txt       # All 65 label strings, one per line
Model Architecture
A custom DebertaClassifier wrapping microsoft/deberta-v3-large:
class DebertaClassifier(nn.Module):
    def __init__(self, backbone, num_labels):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size  # 1024 for large
        self.pooler = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None, **kwargs):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :].float()
        pooled = torch.tanh(self.pooler(self.dropout(cls)))
        logits = self.classifier(self.dropout(pooled))
        loss = None
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits, labels)
        return SequenceClassifierOutput(loss=loss, logits=logits)
The backbone weights are stored as model.safetensors (HuggingFace standard format). The custom pooler and classifier head weights are stored separately in head_weights.pt.
Training Details
| Hyperparameter | Value |
|---|---|
| Base model | microsoft/deberta-v3-large |
| Max length | 256 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 3 (with early stopping, patience=1) |
| LR scheduler | Cosine |
| Optimizer | AdamW |
| Mixed precision | BF16 |
| Cross-validation | GroupKFold (n=5, grouped by QuestionId) |
| Loss function | CrossEntropyLoss with inverse-frequency class weights, clipped to [0.5, 10.0] |
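The weighted loss in the last row can be sketched as follows. The exact weighting formula is not stated here, so the common `N / (K * count)` form is an assumption; only the clipping range [0.5, 10.0] comes from the table. The function name and toy labels are illustrative.

```python
from collections import Counter


def inverse_frequency_weights(labels, num_classes, lo=0.5, hi=10.0):
    """Per-class weights proportional to 1/frequency, clipped to [lo, hi].

    Assumption: weight_c = N / (K * count_c), a common convention.
    Classes absent from `labels` receive the upper bound `hi`.
    """
    counts = Counter(labels)
    n = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            weights.append(hi)
        else:
            raw = n / (num_classes * counts[c])
            weights.append(min(max(raw, lo), hi))
    return weights


# Toy, heavily imbalanced label ids (not the competition data).
toy_labels = [0] * 90 + [1] * 9 + [2] * 1
print(inverse_frequency_weights(toy_labels, num_classes=3))
```

With these toy counts the frequent class is clipped up to 0.5 and the rare class is clipped down to 10.0. A list like this could then be passed as `weight=torch.tensor(weights)` to `nn.CrossEntropyLoss`.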
Input format:
Question: {QuestionText}
Student selected: {MC_Answer}
Student explanation: {StudentExplanation}
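A minimal sketch of building this input string from one row. The function name `build_input` is hypothetical; joining the three lines with `\n` matches the example text used in the Inference section.

```python
def build_input(question_text: str, mc_answer: str, explanation: str) -> str:
    """Format one student response into the three-line model input."""
    return (
        f"Question: {question_text}\n"
        f"Student selected: {mc_answer}\n"
        f"Student explanation: {explanation}"
    )


print(build_input(
    "Which fraction is equivalent to 0.5?",
    "1/2",
    "Because 1 divided by 2 equals 0.5",
))
```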
Inference
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from torch import nn
from transformers.modeling_outputs import SequenceClassifierOutput


class DebertaClassifier(nn.Module):
    def __init__(self, backbone, num_labels):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size
        self.pooler = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)
        self.dropout = nn.Dropout(0.1)

    def forward(self, input_ids, attention_mask, **kwargs):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :].float()
        pooled = torch.tanh(self.pooler(self.dropout(cls)))
        logits = self.classifier(self.dropout(pooled))
        return SequenceClassifierOutput(logits=logits)


# Load label list
with open("deberta_label_list.txt") as f:
    LABEL_LIST = [l.strip() for l in f if l.strip()]

device = "cuda" if torch.cuda.is_available() else "cpu"
fold_logits = []

for fold in range(5):
    ckpt_path = f"deberta_fold{fold}"
    tok = AutoTokenizer.from_pretrained(ckpt_path)
    backbone = AutoModel.from_pretrained(ckpt_path)
    model = DebertaClassifier(backbone, num_labels=len(LABEL_LIST))
    head = torch.load(f"{ckpt_path}/head_weights.pt", map_location="cpu", weights_only=True)
    model.pooler.load_state_dict(head["pooler"])
    model.classifier.load_state_dict(head["classifier"])
    model.eval().to(device)

    texts = [
        "Question: Which fraction is equivalent to 0.5?\nStudent selected: 1/2\nStudent explanation: Because 1 divided by 2 equals 0.5"
    ]
    with torch.no_grad():
        enc = tok(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
        enc = {k: v.to(device) for k, v in enc.items()}
        logits = model(**enc).logits.float().cpu().numpy()
    fold_logits.append(logits)

# Logit ensemble (average logits, then softmax)
mean_logits = np.mean(fold_logits, axis=0)
probs = torch.softmax(torch.tensor(mean_logits), dim=-1).numpy()
top3_idx = np.argsort(-probs, axis=1)[:, :3]
top3_labels = [[LABEL_LIST[j] for j in row] for row in top3_idx]
print(top3_labels)
# e.g. [['True_Correct:NA', 'True_Neither:NA', 'False_Correct:NA']]
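Top-3 lists like the one printed above can be scored with MAP@3, the metric reported in the Results table. The sketch below assumes exactly one true label per response, so each row scores 1, 1/2, or 1/3 depending on the rank of the true label, or 0 if it is missing from the top 3. The function name and the toy data are illustrative.

```python
def map_at_3(true_labels, top3_predictions):
    """Mean average precision at 3, assuming one true label per row."""
    total = 0.0
    for truth, preds in zip(true_labels, top3_predictions):
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == truth:
                total += 1.0 / rank  # reciprocal rank of the first (only) hit
                break
    return total / len(true_labels)


# Toy data, not competition results.
truth = ["True_Correct:NA", "False_Neither:NA"]
preds = [
    ["True_Correct:NA", "True_Neither:NA", "False_Correct:NA"],   # hit at rank 1
    ["True_Neither:NA", "False_Neither:NA", "False_Correct:NA"],  # hit at rank 2
]
print(map_at_3(truth, preds))  # (1 + 1/2) / 2 = 0.75
```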
Results
| Version | CV MAP@3 | Simulated LB | Public LB | Private LB |
|---|---|---|---|---|
| deberta-v3-base (GroupKFold) | 0.3213 | 0.9051 | 0.89397 | 0.89433 |
| deberta-v3-large (this repo) | 0.2925 | 0.9351 | 0.91924 | 0.91107 |
| deberta-v3-large + Logit Ensemble | 0.2925 | 0.9351 | 0.93081 | 0.92442 |
Note on CV vs LB: GroupKFold CV is low (about 0.29) because each fold validates on questions that were entirely held out of its training split, and the training data contains only 15 unique questions. The Kaggle test set uses the same question IDs as the training data, so for any given test question, four of the five fold models saw that question during training. This makes the LB much higher than CV suggests.
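To make the grouping effect concrete, the sketch below assigns whole questions to folds and checks that no question appears on both the training and validation side of a fold. It is a plain-Python stand-in for scikit-learn's `GroupKFold`, using a simple round-robin assignment rather than that class's own balancing strategy. The question IDs and response counts are made up.

```python
def grouped_folds(question_ids, n_splits=5):
    """Yield (train_idx, val_idx) so that each question lands in one fold only."""
    unique_ids = sorted(set(question_ids))
    fold_of = {qid: i % n_splits for i, qid in enumerate(unique_ids)}
    for fold in range(n_splits):
        val_idx = [i for i, q in enumerate(question_ids) if fold_of[q] == fold]
        train_idx = [i for i, q in enumerate(question_ids) if fold_of[q] != fold]
        yield train_idx, val_idx


# 15 hypothetical question IDs with 4 responses each.
question_ids = [qid for qid in range(100, 115) for _ in range(4)]

for fold, (train_idx, val_idx) in enumerate(grouped_folds(question_ids)):
    train_q = {question_ids[i] for i in train_idx}
    val_q = {question_ids[i] for i in val_idx}
    assert train_q.isdisjoint(val_q)  # validation questions never appear in training
    print(f"fold {fold}: {len(train_q)} train questions, {len(val_q)} validation questions")
```

With 15 questions and 5 folds, each fold model trains on 12 questions and validates on the remaining 3, so every validation example comes from a question the model has never seen.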
Citation
@misc{lyixuan2026map,
  author    = {Li, Yi-Shiuan},
  title     = {MAP Charting Student Math Misunderstandings - DeBERTa-v3-large 5-Fold},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/lyixuan0718/map-deberta-v3-large-5fold}
}