ShenavaSanj v1.0 (شنواسنج)

Persian word-importance model: scores how important each word is for understanding an utterance (0 = filler/function word, 1 = essential content). Built to power a semantic error metric for Persian ASR aimed at deaf and hard-of-hearing (DHH) users (ACE-style, importance-weighted WER), so that a missed keyword is penalized far more than a missed filler.

Student: ParsBERT (HooshvareLab/bert-base-parsbert-uncased, 110M) + token-regression head.
Teacher: google/gemma-4-31b-it (fp4) with DHH-framed prompt + 5 Persian few-shot anchors.
Distillation: soft-label regression (HuberLoss, δ=0.1) on the teacher's continuous [0,1] scores; first-subword alignment (NER-style). No human annotation.
Data: 26,490 unique conversational Persian utterances from shekar-ai/neyshekar-v4-persian-asr-fa. Teacher labels: Reza2kn/neyshekar-fa-wimp-teacher-labels.
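
The distillation objective above can be illustrated with a minimal plain-Python sketch. It assumes the standard Huber definition (quadratic within δ, linear beyond) and that only each word's first subword carries a target, with special tokens and continuation subwords ignored. The function names and the toy values are hypothetical, not taken from the training code.

```python
def huber(pred, target, delta=0.1):
    """Huber loss for one value: quadratic within delta, linear beyond it."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def first_subword_targets(word_ids, word_scores):
    """Per-token targets: teacher score on each word's first subword, None elsewhere."""
    targets, seen = [], set()
    for wid in word_ids:
        if wid is None or wid in seen:
            targets.append(None)  # special token or continuation subword: ignored
        else:
            seen.add(wid)
            targets.append(word_scores[wid])
    return targets


def masked_huber(preds, targets, delta=0.1):
    """Mean Huber loss over the positions that carry a target."""
    losses = [huber(p, t, delta) for p, t in zip(preds, targets) if t is not None]
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical toy input: [CLS], word 0 split into two subwords, word 1, [SEP]
word_ids = [None, 0, 0, 1, None]
teacher_scores = [0.9, 0.1]           # one teacher score per word
targets = first_subword_targets(word_ids, teacher_scores)
preds = [0.5, 0.85, 0.3, 0.4, 0.5]    # student outputs, one per token
loss = masked_huber(preds, targets)
```

With δ=0.1 on a [0,1] target range, only small errors fall in the quadratic regime; larger disagreements with the teacher contribute linearly.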

Validation

Student vs teacher (held-out 1,324): token-ρ 0.934, per-utterance-ρ 0.916, MSE 0.0114.
Provenance: the teacher was validated on English DHH gold annotations (Kafle & Huenerfauth, LREC 2018): pooled token-ρ ≈ 0.80 against a human inter-annotator ceiling of ~0.84. Persian cross-model agreement (teacher vs gemini-3.5-flash): per-utterance ρ ≈ 0.89.
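
For readers who want to compute comparable agreement numbers, here is a rough plain-Python sketch. It rests on two assumptions the card does not confirm: that ρ is Spearman rank correlation, and that the per-utterance figure is the mean of correlations computed within each utterance. All function names are hypothetical.

```python
def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return None  # undefined when either side is constant
    return cov / (va * vb) ** 0.5


def spearman(a, b):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def evaluate(student, teacher):
    """student, teacher: lists of per-utterance score lists, aligned word by word."""
    flat_s = [x for utt in student for x in utt]
    flat_t = [x for utt in teacher for x in utt]
    token_rho = spearman(flat_s, flat_t)  # pooled over all tokens
    per_utt = [spearman(s, t) for s, t in zip(student, teacher) if len(s) > 1]
    per_utt = [r for r in per_utt if r is not None]
    utt_rho = sum(per_utt) / len(per_utt) if per_utt else None
    mse = sum((x - y) ** 2 for x, y in zip(flat_s, flat_t)) / len(flat_s)
    return token_rho, utt_rho, mse


# Hypothetical toy scores for two utterances
student = [[0.1, 0.5, 0.9], [0.2, 0.8]]
teacher = [[0.0, 0.6, 1.0], [0.3, 0.7]]
token_rho, utt_rho, mse = evaluate(student, teacher)
```

Single-word utterances and utterances with constant scores are skipped in the per-utterance average because rank correlation is undefined for them; a different convention would shift that number slightly.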

Usage

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from huggingface_hub import hf_hub_download

REPO = "Reza2kn/ShenavaSanj-v1.0"
tok = AutoTokenizer.from_pretrained(REPO)
enc = AutoModel.from_pretrained(REPO).eval()

# The regression head is stored separately from the encoder weights.
head = nn.Linear(enc.config.hidden_size, 1)
head.load_state_dict(torch.load(hf_hub_download(REPO, "head.pt"), map_location="cpu"))
head.eval()

@torch.no_grad()
def importance(text):
    words = text.split()
    # Inputs are truncated to 64 subword tokens; words past that limit get no score.
    e = tok(words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=64)
    p = torch.sigmoid(head(enc(**e).last_hidden_state)).squeeze(-1)[0].tolist()
    # Keep the prediction at each word's first subword (None marks special tokens).
    out, seen = [], set()
    for ti, wid in enumerate(e.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            out.append(round(p[ti], 3))
    return list(zip(words, out))  # zip drops any words lost to truncation

print(importance("خب یعنی چی الان؟"))
# [('خب', 0.02), ('یعنی', 0.28), ('چی', 0.83), ('الان؟', 0.71)]

Scores are per whitespace token. For the ACE-style weighted-WER metric, weight each reference-word error by its ShenavaSanj importance and normalize by total reference importance.
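
The weighting rule above can be sketched as follows. This is a minimal illustration, not the reference ACE implementation: it assumes a unit-cost Levenshtein alignment over whitespace words, charges each substituted or deleted reference word its importance, and ignores insertions because they have no reference word to weight. The function name `weighted_wer` is hypothetical; the scores are the ones from the usage example.

```python
def weighted_wer(ref_words, ref_weights, hyp_words):
    """Importance-weighted WER: weighted reference errors / total reference weight."""
    n, m = len(ref_words), len(hyp_words)
    # Unit-cost Levenshtein table over words, used only to find the alignment.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace, accumulating the importance of substituted or deleted reference words.
    i, j, err = n, m, 0.0
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref_words[i - 1] != hyp_words[j - 1]
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + mismatch:
            if mismatch:
                err += ref_weights[i - 1]  # substitution
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            err += ref_weights[i - 1]      # deletion
            i -= 1
        else:
            j -= 1                         # insertion: ignored in this sketch
    total = sum(ref_weights)
    return err / total if total > 0 else 0.0


ref = ["خب", "یعنی", "چی", "الان؟"]
w = [0.02, 0.28, 0.83, 0.71]  # importance scores from the usage example

drop_filler = weighted_wer(ref, w, ["یعنی", "چی", "الان؟"])   # misses "خب"
drop_keyword = weighted_wer(ref, w, ["خب", "یعنی", "الان؟"])  # misses "چی"
```

Both hypotheses have a plain WER of 0.25 (one deletion in four words), but the weighted score is about 0.011 when the filler is dropped and about 0.451 when the keyword is dropped.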

Model size: 0.2B params (Safetensors, F32). Base model: HooshvareLab/bert-base-parsbert-uncased (this model is a fine-tune of it).