IdiomBERT: Joint mBERT for Multilingual Idiom Detection

Fine-tuned google-bert/bert-base-multilingual-cased for joint idiom detection across English, Spanish, Hindi, and Telugu. One forward pass produces three outputs:

  • Classification: literal (0) vs idiomatic (1)
  • Span start: token index where the idiomatic span begins
  • Span end: token index where the idiomatic span ends

Trained on the MultiIdiom dataset (EN+ES+HI+TE split).

Files

  • pytorch_model.bin / model.safetensors: fine-tuned mBERT backbone
  • task_heads.pt: three linear heads (cls_head, start_head, end_head)
  • tokenizer.* β€” standard mBERT tokenizer
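
Because task_heads.pt is loaded separately from the backbone, it can help to check its layout before calling load_state_dict. The sketch below is a hypothetical helper (check_heads is not part of this repo). It assumes each entry is a plain nn.Linear state dict with "weight" and "bias" keys, matching the heads built in the Usage section, and that the hidden size is 768. It only reads .shape, so the demo uses stand-in objects instead of real tensors.

```python
from types import SimpleNamespace

# Output width of each head, taken from the Linear layers in the Usage section.
EXPECTED_OUT = {"cls_head": 2, "start_head": 1, "end_head": 1}


def check_heads(heads, hidden=768):
    """Raise ValueError unless `heads` looks like three nn.Linear state dicts."""
    for name, out in EXPECTED_OUT.items():
        if name not in heads:
            raise ValueError(f"missing head: {name}")
        state = heads[name]
        if "weight" not in state or "bias" not in state:
            raise ValueError(f"{name}: expected 'weight' and 'bias' entries")
        # nn.Linear stores its weight as (out_features, in_features).
        weight = tuple(state["weight"].shape)
        bias = tuple(state["bias"].shape)
        if weight != (out, hidden) or bias != (out,):
            raise ValueError(f"{name}: unexpected shapes {weight} and {bias}")
    return True


# Stand-in objects that only carry a .shape, used here in place of tensors.
fake_heads = {
    name: {"weight": SimpleNamespace(shape=(out, 768)),
           "bias": SimpleNamespace(shape=(out,))}
    for name, out in EXPECTED_OUT.items()
}
print(check_heads(fake_heads))
```

With the real file, the same call would take the dict returned by torch.load and backbone.config.hidden_size as the hidden argument.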

Usage

import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

REPO = "Justarandomperson/IdiomBERT-system-e"

backbone  = AutoModel.from_pretrained(REPO)
tokenizer = AutoTokenizer.from_pretrained(REPO)
heads     = torch.load(hf_hub_download(REPO, 'task_heads.pt'), map_location='cpu', weights_only=True)

# Attach heads
hidden = backbone.config.hidden_size  # 768
cls_head   = torch.nn.Linear(hidden, 2)
start_head = torch.nn.Linear(hidden, 1)
end_head   = torch.nn.Linear(hidden, 1)
cls_head.load_state_dict(heads['cls_head'])
start_head.load_state_dict(heads['start_head'])
end_head.load_state_dict(heads['end_head'])

backbone.eval(); cls_head.eval(); start_head.eval(); end_head.eval()

# Inference
enc = tokenizer("He kicked the bucket last night .", return_tensors='pt')
with torch.no_grad():
    seq     = backbone(**enc).last_hidden_state
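    # The [CLS] vector (position 0) feeds the sentence-level classifier
    label   = cls_head(seq[:, 0, :]).argmax(-1).item()   # 0=literal, 1=idiomatic
    # start/end index the WordPiece sequence, so position 0 is [CLS]
    start   = start_head(seq).squeeze(-1).argmax(-1).item()
    end     = end_head(seq).squeeze(-1).argmax(-1).item()
print(label, start, end)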
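
The snippet above takes an independent argmax for start and end, so nothing prevents end from landing before start or either index from pointing at [CLS] or [SEP]. One common workaround, borrowed from extractive QA decoding and not confirmed as this model's intended post-processing, is to score start/end pairs jointly and then merge WordPiece pieces back into words. The sketch below uses plain Python lists. The helper names (best_span, span_text), the max_len cap, the token list, and the logits are all illustrative assumptions, not real model output.

```python
from typing import List, Optional, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_len: int = 8) -> Optional[Tuple[int, int]]:
    """Pick (start, end) maximizing start + end logits with start <= end.

    Positions 0 and n-1 are skipped on the assumption that they hold
    [CLS] and [SEP] for a single unpadded sentence.
    """
    n = len(start_logits)
    best, best_score = None, float("-inf")
    for s in range(1, n - 1):
        # Inclusive end index, capped by max_len and by the last real token.
        for e in range(s, min(s + max_len, n - 1)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


def span_text(tokens: List[str], start: int, end: int) -> str:
    """Join tokens[start..end] (inclusive), gluing '##' continuation pieces."""
    words: List[str] = []
    for tok in tokens[start:end + 1]:
        is_piece = tok.startswith("##")
        piece = tok[2:] if is_piece else tok
        if is_piece and words:
            words[-1] += piece
        else:
            words.append(piece)
    return " ".join(words)


# Hypothetical tokenization and logits, chosen only to exercise the helpers.
tokens = ["[CLS]", "He", "kick", "##ed", "the", "bucket",
          "last", "night", ".", "[SEP]"]
start_logits = [5.0, 0.1, 2.0, 0.0, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0]
end_logits = [0.0, 0.0, 0.1, 0.2, 0.3, 3.0, 0.1, 0.0, 0.0, 4.0]

span = best_span(start_logits, end_logits)
print(span, span_text(tokens, *span))  # (2, 5) kicked the bucket
```

With the real model, the logit lists could come from start_head(seq).squeeze(-1)[0].tolist() and end_head(seq).squeeze(-1)[0].tolist(), and the tokens from tokenizer.convert_ids_to_tokens(enc["input_ids"][0]). Whether the end index is inclusive and how literal sentences were labeled are not stated here, so treat the decoded span as meaningful only when label is 1 and verify against a few known examples.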