URBERT - Hugging Face Model Card

This repository provides the URBERT backbone as a Hugging Face transformers model. The uploaded checkpoint is a BERT encoder trained in the URBERT pipeline with character-level uroman tokenization.

Model Details

  • Model type: BERT backbone (AutoModel)
  • Base architecture: bert-base-uncased config family
  • Tokenizer: custom character-level tokenizer packaged to load with AutoTokenizer
  • Training objectives in the URBERT pipeline:
    • Text MLM (masked language modeling)
    • Audio distillation (in multitask training)
  • Export contents: this HF export contains the backbone encoder only.

Quick Start

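import torch
from transformers import AutoModel, AutoTokenizer

REPO_ID = "Sanghyang00/urbert-256"
device = "cuda" if torch.cuda.is_available() else "cpu"

# force_download=True re-fetches the tokenizer files on every call;
# drop it once a current copy is cached locally.
tokenizer = AutoTokenizer.from_pretrained(REPO_ID, force_download=True)
# Load the backbone encoder, move it to the device, and switch to inference mode.
model = AutoModel.from_pretrained(REPO_ID).to(device).eval()

text = "hello urbert"
# add_special_tokens=False asks the tokenizer not to add special tokens around the text.
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Forward pass without gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)
    last_hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)

print("input_ids:", inputs["input_ids"].tolist())
print("input shape:", tuple(inputs["input_ids"].shape))
print("last_hidden shape:", tuple(last_hidden.shape))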

Notes on Tokenization

  • This tokenizer is character-level and intended to be HF-compatible through AutoTokenizer.
  • Special-token behavior follows HF conventions:
    • Example: "[MASK]" is treated as a single special token by the HF tokenizer, not as a sequence of individual characters.
  • If you compare against a legacy local tokenizer implementation, special-token string handling may differ even when normal text encoding matches.
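
The difference described above can be illustrated with a toy sketch. Neither function here is the URBERT tokenizer or the legacy implementation. Both are hypothetical stand-ins, and the special-token list is an assumption made for illustration. The sketch only shows one way two character-level tokenizers could agree on normal text yet diverge on a string such as "[MASK]".

```python
import re

# Hypothetical special-token inventory; the real tokenizer's list may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def tokenize_special_aware(text):
    """Split into characters, but keep each special-token string as one token."""
    tokens = []
    for piece in _SPECIAL_RE.split(text):
        if not piece:
            continue  # re.split yields empty strings around adjacent matches
        if piece in SPECIAL_TOKENS:
            tokens.append(piece)
        else:
            tokens.extend(piece)  # one token per character
    return tokens


def tokenize_naive(text):
    """Treat every character literally, including those spelling a special token."""
    return list(text)


sample = "ab[MASK]c"
print(tokenize_special_aware(sample))  # ['a', 'b', '[MASK]', 'c']
print(tokenize_naive(sample))          # ['a', 'b', '[', 'M', 'A', 'S', 'K', ']', 'c']
print(tokenize_special_aware("abc") == tokenize_naive("abc"))  # True
```

When comparing outputs between implementations, it may help to test plain text and special-token strings separately, so a mismatch can be attributed to the right cause.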

Intended Use

  • Feature extraction from URBERT backbone hidden states
  • Initialization for downstream tasks that use uroman character-level representations
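
For feature extraction, per-character hidden states are often reduced to one vector per input. The model card does not prescribe a pooling strategy, so the following is only a minimal sketch of one common choice, masked mean pooling. It assumes the tokenizer returns an attention_mask alongside input_ids. The helper name masked_mean_pool is hypothetical, and the tensors are small stand-ins so the example runs without downloading the checkpoint.

```python
import torch


def masked_mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average hidden states over real (non-padding) positions.

    last_hidden:    (batch, seq_len, hidden)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                    # avoid division by zero
    return summed / counts


# Stand-in tensors; with the real model, pass outputs.last_hidden_state
# and inputs["attention_mask"] instead (assuming the tokenizer provides the mask).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # third position is padding
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.tolist())  # [[2.0, 3.0]]
```

The mask matters mainly for batched inputs with padding. For a single unpadded sequence, as in the Quick Start, a plain mean over the sequence dimension gives the same result.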

Limitations

  • This export contains only the backbone encoder; the heads used in multitask training are not included.
  • Domain and language coverage are constrained by the training data used in URBERT experiments.
  • Additional task-specific fine-tuning may be required for production use.

Training Reference

For training code, data processing, and experiment setup, please refer to:

Citation

If you use this model in your research or applications, please cite:

@article{lee2026urbert,
  title   = {UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction},
  author  = {Lee, Sangmin and Ahn, Eekgyun and Choi, Woongjib and Kang, Hong-Goo},
  journal = {arXiv preprint arXiv:2606.11681},
  year    = {2026}
}

Model size: 86.1M params (Safetensors, tensor type F32)