IdiomBERT — Joint mBERT for Multilingual Idiom Detection

Fine-tuned google-bert/bert-base-multilingual-cased for joint idiom detection across English, Spanish, Hindi, and Telugu. One forward pass produces three outputs:

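Classification: literal (0) vs idiomatic (1), predicted from the [CLS] position
Span start: wordpiece-token index where the idiomatic span starts
Span end: wordpiece-token index where the idiomatic span ends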
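
Because the span outputs are indices into the tokenized input rather than character offsets, turning a prediction back into readable text requires re-joining wordpieces. The following is a minimal sketch of that step, not part of the released code. The helper name span_to_text is hypothetical, the token list is illustrative rather than actual tokenizer output, and the end index is assumed to be inclusive (the model card does not say). In practice the tokens could come from the tokenizer's convert_ids_to_tokens method.

```python
def span_to_text(tokens, start, end):
    """Join wordpieces tokens[start..end] into a plain string.

    Assumption: `end` is inclusive. Pieces prefixed with "##" are
    continuations of the previous word, as in BERT-style WordPiece.
    """
    if not 0 <= start <= end < len(tokens):
        raise ValueError("span indices out of range or reversed")
    words = []
    for piece in tokens[start:end + 1]:
        if piece.startswith("##"):
            if words:
                words[-1] += piece[2:]    # glue continuation onto previous word
            else:
                words.append(piece[2:])   # span began mid-word; keep the fragment
        else:
            words.append(piece)
    return " ".join(words)


# Illustrative tokens only; real mBERT output may split words differently.
tokens = ["[CLS]", "He", "kick", "##ed", "the", "bucket", "last", "night", ".", "[SEP]"]
print(span_to_text(tokens, 2, 5))
```

Index 0 is the [CLS] token, so a predicted span that starts at 0 most likely signals "no span" or a low-confidence prediction rather than real text.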

Trained on the MultiIdiom dataset (EN+ES+HI+TE split).

Files

pytorch_model.bin / model.safetensors — fine-tuned mBERT backbone
task_heads.pt — three linear heads (cls_head, start_head, end_head)
tokenizer.* — standard mBERT tokenizer

Usage

import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

REPO = "Justarandomperson/IdiomBERT-system-e"

backbone  = AutoModel.from_pretrained(REPO)
tokenizer = AutoTokenizer.from_pretrained(REPO)
heads     = torch.load(hf_hub_download(REPO, 'task_heads.pt'), map_location='cpu', weights_only=True)

# Attach heads
hidden = backbone.config.hidden_size  # 768
cls_head   = torch.nn.Linear(hidden, 2)
start_head = torch.nn.Linear(hidden, 1)
end_head   = torch.nn.Linear(hidden, 1)
cls_head.load_state_dict(heads['cls_head'])
start_head.load_state_dict(heads['start_head'])
end_head.load_state_dict(heads['end_head'])

# Switch every module to eval mode (disables dropout)
for module in (backbone, cls_head, start_head, end_head):
    module.eval()

# Inference
enc = tokenizer("He kicked the bucket last night .", return_tensors='pt')
with torch.no_grad():
    seq     = backbone(**enc).last_hidden_state
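    label   = cls_head(seq[:, 0, :]).argmax(-1).item()   # [CLS] vector -> 0=literal, 1=idiomatic
    # start/end are positions in enc['input_ids'] (index 0 is [CLS]); each is picked independently
    start   = start_head(seq).squeeze(-1).argmax(-1).item()
    end     = end_head(seq).squeeze(-1).argmax(-1).item()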
print(label, start, end)
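
The snippet above takes the argmax of the start and end scores independently, so nothing prevents the predicted end from landing before the predicted start. One common remedy, borrowed from extractive question answering and not necessarily what this model's author used, is to score start/end pairs jointly and keep only pairs with start <= end. The sketch below works on plain Python lists so it runs without the model. The names best_span, max_span_len, and skip_first are hypothetical, and the length cap of 10 is an arbitrary assumption.

```python
def best_span(start_scores, end_scores, max_span_len=10, skip_first=True):
    """Return (start, end) maximizing start_scores[s] + end_scores[e] with s <= e.

    skip_first=True ignores position 0, which is [CLS] in BERT-style inputs.
    The trailing [SEP] position is not excluded here. Returns None if no
    candidate position exists.
    """
    if len(start_scores) != len(end_scores):
        raise ValueError("score lists must have equal length")
    first = 1 if skip_first else 0
    best, best_score = None, float("-inf")
    for s in range(first, len(start_scores)):
        # Only consider ends at or after the start, within the length cap.
        last = min(len(end_scores), s + max_span_len)
        for e in range(s, last):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy scores where independent argmax would give start=2, end=1 (reversed).
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores   = [0.0, 4.0, 0.5, 3.0, 0.2]
print(best_span(start_scores, end_scores))
```

With the model loaded as in the usage code, the score lists could be obtained as start_head(seq).squeeze(-1)[0].tolist() and end_head(seq).squeeze(-1)[0].tolist(). The span is only meaningful when the classification label is 1 (idiomatic).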

Model size: 0.2B params (Safetensors, tensor type F32)
