Infon multilingual coreference pointer
Multilingual coreference resolution: detects mentions and links them into clusters across English, Japanese, Korean, Thai, and Chinese. Designed for browser inference via ONNX, replacing the English-only fastcoref baseline for multilingual workloads.
Quick start (JavaScript)
npm install @cp500/infon-coref onnxruntime-web
import { InfonCorefModel } from '@cp500/infon-coref';
const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16', // 235 MB (default), vs 470 MB for fp32
  device: 'auto',    // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}
The JS client source is mirrored under js/ in this
repo for self-contained installs:
npm install ./js
Quick start (Python / PyTorch)
import torch
from transformers import AutoModel, AutoTokenizer
# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads is a 4-line check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")
Architecture
text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐
                                           └─▶ bio_logits (T,3)   │
                                                   │              │
                                                   ▼              │
                                           decode BIO spans       │
                                                   │              │
                                                   ▼              │
                                            mention_scorer ◀──────┘
                                                   │
                                                   ▼
                                           pair_scores (P,)
                                                   │
                                                   ▼
                                          per-mention argmax
                                                   │
                                                   ▼
                                         coreference clusters
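The "decode BIO spans" step in the diagram can be sketched in plain Python as follows. This is a minimal illustration, not the repo's actual decoder: the label ids (O=0, B=1, I=2), the helper names, and the lenient handling of a stray I tag are all assumptions, since the card only states that the head has 3 classes.

```python
from typing import List, Sequence, Tuple

# Assumed label ids; the model card does not specify the class order.
O, B, I = 0, 1, 2


def argmax_labels(bio_logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring class for each token (input shape: T x 3)."""
    return [max(range(len(row)), key=row.__getitem__) for row in bio_logits]


def decode_bio_spans(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn a BIO label sequence into inclusive (start, end) token spans."""
    spans: List[Tuple[int, int]] = []
    start = None  # start index of the currently open span, if any
    for i, label in enumerate(labels):
        if label == B:
            # A new B closes any open span and opens a new one.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif label == I:
            # Assumption: an I with no open span starts a span (lenient decoding).
            if start is None:
                start = i
        else:
            # O closes any open span.
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans


example_spans = decode_bio_spans([B, I, O, B, O, I, I])
```

The resulting token spans would then be mapped back to character offsets with the tokenizer before being handed to the mention scorer.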
Two ONNX graphs:
- onnx/coref_backbone_bio.onnx: XLM-R-distilled MiniLM-L12 (H=384, 12 layers, 117M params) plus a 3-class BIO mention-detection head.
- onnx/coref_mention_scorer.onnx: vectorised mention pooling (boundary tokens + segment-mean) and a pairwise antecedent scorer. DUMMY antecedent is concatenated at index 0 so pair_j == 0 means "no antecedent."
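To illustrate the last two steps (per-mention argmax, then clustering), here is a minimal sketch of how antecedent pointers with a DUMMY slot at index 0 could be turned into clusters. The input layout is an assumption made for readability: each mention gets its own score list with the DUMMY first and earlier mentions after it, whereas the real graph emits a flat pair_scores (P,) tensor. The function name link_clusters is hypothetical.

```python
from typing import Dict, List, Sequence


def link_clusters(scores_per_mention: Sequence[Sequence[float]]) -> List[List[int]]:
    """Group mentions into clusters from per-mention antecedent scores.

    Assumed layout: scores_per_mention[i] has i + 1 entries, where entry 0 is
    the DUMMY ("no antecedent") and entry j > 0 scores mention j - 1.
    """
    n = len(scores_per_mention)
    parent = list(range(n))  # union-find forest over mention indices

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, scores in enumerate(scores_per_mention):
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == 0:
            continue  # DUMMY wins: mention i has no antecedent
        parent[find(i)] = find(best - 1)  # link mention i to its antecedent

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Singletons are dropped; only linked mentions form a cluster.
    return [members for members in groups.values() if len(members) > 1]


clusters = link_clusters([
    [0.0],                 # mention 0: only the DUMMY is available
    [0.2, 0.9],            # mention 1 points at mention 0
    [0.8, 0.1, 0.3],       # mention 2 picks the DUMMY
    [0.1, 0.2, 0.1, 0.7],  # mention 3 points at mention 2
])
```

Each returned cluster is a list of mention indices, which matches how result.clusters is indexed into result.mentions in the JavaScript quick start.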
Evaluation
Best checkpoint (selected on combined (ptr_acc + bio_f1) / 2):
| Language | Pointer acc | BIO F1 | Val mentions |
|---|---|---|---|
| en | 0.805 | 0.809 | 1827 |
| ja | 0.823 | 0.794 | 1601 |
| ko | 0.824 | 0.814 | 1702 |
| th | 0.820 | 0.906 | 1495 |
| zh | 0.829 | 0.872 | 1589 |
Aggregate: pointer accuracy 0.820, BIO F1 0.815, combined score 0.817.
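The checkpoint-selection metric is a plain average of the two aggregate scores. A minimal sketch follows; the helper name combined_score is an assumption, not a function from the training repo.

```python
def combined_score(ptr_acc: float, bio_f1: float) -> float:
    """Checkpoint-selection metric from the card: (ptr_acc + bio_f1) / 2."""
    return (ptr_acc + bio_f1) / 2


# Aggregate values as reported above.
aggregate = combined_score(0.820, 0.815)
print(f"{aggregate:.4f}")
```

With the rounded aggregates this gives about 0.8175, consistent with the reported 0.817 once rounding of the inputs is taken into account.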
Trained on cp500/infon-coref-multilingual.
Known limits
- BIO precision degrades after epoch 0 if training continues with the default joint-loss schedule (pointer head saturates and the optimizer pushes BIO toward recall). The deployed checkpoint is from epoch 0 to keep BIO precision and pointer accuracy balanced. A fix using separate optimizers per head is on the roadmap.
- Trained only on the 5 listed languages. Other XLM-R-supported languages may work via zero-shot transfer; verify on your domain.
- Synthetic training data follows news-article register; out-of-domain text (chat, code comments, formal contracts) may underperform.
Backbone
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, a public Apache-2.0 distillation of XLM-R-base.
The tokenizer is copied into this repo so the model can be installed and run offline.
License
Apache 2.0 for both weights and code.