ShenavaSanj v1.0 (شنواسنج)
A Persian word-importance model: it scores how important each word is for understanding an utterance (0 = filler/function word, 1 = essential content). It was built to power a DHH-oriented (Deaf and hard of hearing) semantic error metric (ACE-style, importance-weighted WER) for Persian ASR, so that a missed keyword is penalized far more than a missed filler.
- Student: ParsBERT (HooshvareLab/bert-base-parsbert-uncased, 110M) + token-regression head.
- Teacher: google/gemma-4-31b-it (fp4) with DHH-framed prompt + 5 Persian few-shot anchors.
- Distillation: soft-label regression (HuberLoss, δ=0.1) on the teacher's continuous [0,1] scores; first-subword alignment (NER-style). No human annotation.
- Data: 26,490 unique conversational Persian utterances from shekar-ai/neyshekar-v4-persian-asr-fa. Teacher labels: Reza2kn/neyshekar-fa-wimp-teacher-labels.
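The distillation recipe above (first-subword alignment plus a Huber loss with δ=0.1) can be illustrated with a minimal, dependency-free sketch. It is not the training code for this model: the helper names (`first_subword_targets`, `huber`, `masked_huber`) and the example numbers are hypothetical, and it assumes the loss is averaged over first-subword positions only. The Huber formula itself follows the standard definition (quadratic within δ of the target, linear beyond it).

```python
from typing import List, Optional, Tuple


def first_subword_targets(
    word_ids: List[Optional[int]], word_scores: List[float]
) -> Tuple[List[float], List[bool]]:
    """Spread word-level scores onto subword positions.

    Only the first subword of each word gets a target; special tokens
    (word id None) and continuation subwords are masked out.
    """
    targets, mask, seen = [], [], set()
    for wid in word_ids:
        is_first = wid is not None and wid not in seen
        if is_first:
            seen.add(wid)
        targets.append(word_scores[wid] if is_first else 0.0)
        mask.append(is_first)
    return targets, mask


def huber(pred: float, target: float, delta: float = 0.1) -> float:
    """Huber loss: quadratic within delta of the target, linear beyond it."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def masked_huber(preds, targets, mask, delta=0.1):
    """Mean Huber loss over supervised (first-subword) positions only."""
    losses = [huber(p, t, delta) for p, t, m in zip(preds, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical example: [CLS], a two-piece word, a one-piece word, [SEP].
word_ids = [None, 0, 0, 1, None]
teacher = [0.8, 0.1]                   # made-up teacher scores, one per word
student = [0.5, 0.75, 0.3, 0.4, 0.5]   # made-up per-token predictions

targets, mask = first_subword_targets(word_ids, teacher)
loss = masked_huber(student, targets, mask)
print(targets, mask, loss)
```

With a small δ such as 0.1, most of the [0,1] score range falls in the linear regime, so large disagreements with the teacher are penalized roughly like absolute error rather than squared error.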
Validation
- Student vs teacher (held-out 1,324): token-ρ 0.934, per-utterance-ρ 0.916, MSE 0.0114.
- Provenance: the teacher was validated on English DHH gold annotations (Kafle & Huenerfauth, LREC 2018): pooled token-ρ ≈ 0.80, against a human inter-annotator ceiling of ~0.84. Persian cross-model agreement (teacher vs gemini-3.5-flash): per-utterance ρ ≈ 0.89.
Usage
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

REPO = "Reza2kn/ShenavaSanj-v1.0"
tok = AutoTokenizer.from_pretrained(REPO)
enc = AutoModel.from_pretrained(REPO).eval()  # ParsBERT encoder

# The regression head is stored separately from the encoder, in head.pt.
head = nn.Linear(enc.config.hidden_size, 1)
head.load_state_dict(torch.load(hf_hub_download(REPO, "head.pt"), map_location="cpu"))
head.eval()

@torch.no_grad()
def importance(text):
    words = text.split()
    e = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=64)
    # Sigmoid maps the head's output to a [0, 1] score per subword token.
    p = torch.sigmoid(head(enc(**e).last_hidden_state)).squeeze(-1)[0].tolist()
    out, seen = [], set()
    # Keep only the score of each word's first subword.
    for ti, wid in enumerate(e.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            out.append(round(p[ti], 3))
    # Words cut off by truncation (max_length=64 tokens) get no score; zip drops them.
    return list(zip(words, out))

print(importance("خب یعنی چی الان؟"))
# [('خب', 0.02), ('یعنی', 0.28), ('چی', 0.83), ('الان؟', 0.71)]
Scores are per whitespace token. For the ACE-style weighted-WER metric, weight each reference-word error by its ShenavaSanj importance and normalize by total reference importance.
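The weighting described above can be sketched as follows. This is one possible reading, not a reference implementation of the metric: the function names `align` and `weighted_wer` are hypothetical, the alignment is a plain word-level Levenshtein alignment, and insertions are ignored on the assumption that they have no reference word to take a weight from (the text above does not say how insertions should be handled). The scores reuse the example output from the Usage section.

```python
from typing import List, Sequence, Tuple


def align(ref: Sequence[str], hyp: Sequence[str]) -> List[Tuple[str, int]]:
    """Levenshtein-align hyp to ref; return (op, ref_index) per reference word.

    op is "ok", "sub", or "del". Insertions are not returned because they
    have no reference word to take a weight from (an assumption of this sketch).
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner until every ref word is labeled.
    ops, i, j = [], n, m
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:  # insertion: consumes a hypothesis word only
            j -= 1
    return ops[::-1]


def weighted_wer(ref_scored: Sequence[Tuple[str, float]], hyp: Sequence[str]) -> float:
    """Importance lost on reference-word errors / total reference importance."""
    ref = [w for w, _ in ref_scored]
    total = sum(s for _, s in ref_scored)
    if total == 0:
        return 0.0
    lost = sum(ref_scored[i][1] for op, i in align(ref, hyp) if op != "ok")
    return lost / total


# (word, importance) pairs taken from the Usage example output.
scored = [("خب", 0.02), ("یعنی", 0.28), ("چی", 0.83), ("الان؟", 0.71)]
drop_filler = ["یعنی", "چی", "الان؟"]   # ASR missed the filler word
drop_keyword = ["خب", "یعنی", "الان؟"]  # ASR missed the keyword

print(round(weighted_wer(scored, drop_filler), 3))
print(round(weighted_wer(scored, drop_keyword), 3))
```

Both hypotheses have the same plain WER (one deletion out of four reference words), but the weighted score separates them: losing the filler costs 0.02 of the 1.84 total importance, while losing the keyword costs 0.83 of it.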