Infon multilingual coreference pointer

A multilingual coreference resolution model: it detects mentions and links them into coreference clusters across English, Japanese, Korean, Thai, and Chinese. It is designed for in-browser inference via ONNX and replaces the English-only fastcoref baseline for multilingual workloads.

Quick start (JavaScript)

npm install @cp500/infon-coref onnxruntime-web

import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',     // 235 MB (default) — vs 470 MB for fp32
  device: 'auto',        // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}

The JS client source is mirrored under js/ in this repo for self-contained installs:

npm install ./js

Quick start (Python / PyTorch)

import torch
from transformers import AutoModel, AutoTokenizer
# The architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads and the backbone is a quick sanity check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")

Architecture

text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐
                                            └─▶ bio_logits (T,3)    │
                                                       │             │
                                                       ▼             │
                                              decode BIO spans       │
                                                       │             │
                                                       ▼             │
                                          mention_scorer ◀───────────┘
                                                  │
                                                  ▼
                                            pair_scores (P,)
                                                  │
                                                  ▼
                                          per-mention argmax
                                                  │
                                                  ▼
                                          coreference clusters
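
The "decode BIO spans" step in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration, not the shipped decoder: the label order (O=0, B=1, I=2), the greedy per-token argmax, and the lenient handling of an I tag with no preceding B are all assumptions, and the function name decode_bio_spans is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed label order; the real index-to-tag mapping of the BIO head may differ.
O, B, I = 0, 1, 2


def decode_bio_spans(bio_logits: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Turn per-token (T, 3) logits into inclusive (start, end) token spans."""
    # Greedy per-token argmax over the three classes.
    tags = [max(range(3), key=lambda c: row[c]) for row in bio_logits]

    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start index of the span being built, if any
    for t, tag in enumerate(tags):
        if tag == B:
            # A new B closes any open span and opens a fresh one.
            if start is not None:
                spans.append((start, t - 1))
            start = t
        elif tag == I:
            # An I with no open span is treated as a span start
            # (a lenient choice; a strict decoder would drop it).
            if start is None:
                start = t
        else:
            # O closes any open span.
            if start is not None:
                spans.append((start, t - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


# Made-up logits for five tokens, tagged B I O B O under the assumed order.
logits = [
    [0.1, 2.0, 0.0],
    [0.0, 0.2, 1.5],
    [3.0, 0.1, 0.1],
    [0.2, 1.1, 0.3],
    [2.2, 0.0, 0.4],
]
print(decode_bio_spans(logits))  # [(0, 1), (3, 3)]
```

The resulting token spans are what the mention scorer would consume alongside last_hidden_state.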

Two ONNX graphs:

onnx/coref_backbone_bio.onnx — XLM-R-distilled MiniLM-L12 (H=384, 12 layers, 117M params) plus a 3-class BIO mention-detection head.
onnx/coref_mention_scorer.onnx — vectorised mention pooling (boundary tokens + segment-mean) and a pairwise antecedent scorer. A DUMMY antecedent is concatenated at index 0, so pair_j == 0 means "no antecedent."
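
To make the index-0 convention concrete, here is a hedged sketch of how flat pair scores could be turned into clusters on the host side. It assumes the scorer returns parallel arrays pair_i, pair_j, and pair_scores, and that candidate index k >= 1 refers to mention k - 1 (because the DUMMY occupies slot 0). The function name, the union-find merge, and the choice to drop singleton clusters are illustrative assumptions, not the client's actual code.

```python
from typing import Dict, List, Sequence, Tuple


def clusters_from_pairs(
    num_mentions: int,
    pair_i: Sequence[int],        # mention index of each scored pair
    pair_j: Sequence[int],        # candidate index: 0 = DUMMY, k >= 1 = mention k - 1 (assumed)
    pair_scores: Sequence[float],
) -> List[List[int]]:
    # Per-mention argmax: keep the best-scoring candidate for each mention.
    best: Dict[int, Tuple[float, int]] = {}
    for i, j, s in zip(pair_i, pair_j, pair_scores):
        if i not in best or s > best[i][0]:
            best[i] = (s, j)

    # Union-find over mentions, so chains of antecedent links merge transitively.
    parent = list(range(num_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, (_, j) in best.items():
        if j == 0:
            continue  # DUMMY won: mention i has no antecedent
        parent[find(i)] = find(j - 1)

    groups: Dict[int, List[int]] = {}
    for m in range(num_mentions):
        groups.setdefault(find(m), []).append(m)
    # Report only clusters with at least two mentions (a presentation choice).
    return [g for g in groups.values() if len(g) > 1]


# Mentions: 0 = "Toyota", 1 = "Panasonic", 2 = "The Japanese automaker".
# Scores are made up for illustration.
pair_i = [0, 1, 1, 2, 2, 2]
pair_j = [0, 0, 1, 0, 1, 2]
pair_scores = [0.0, 1.2, -0.5, 0.1, 2.3, -1.0]
clusters = clusters_from_pairs(3, pair_i, pair_j, pair_scores)
print(clusters)  # [[0, 2]]
```

With these illustrative scores, mention 2 picks candidate 1 (mention 0), which reproduces the "Toyota = The Japanese automaker" cluster from the JavaScript quick start.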

Evaluation

Best checkpoint (selected on combined (ptr_acc + bio_f1) / 2):

Language	Pointer accuracy	BIO F1	Validation mentions
en	0.805	0.809	1827
ja	0.823	0.794	1601
ko	0.824	0.814	1702
th	0.820	0.906	1495
zh	0.829	0.872	1589

Aggregate: pointer accuracy 0.820, BIO F1 0.815, combined score 0.817.
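
The selection metric is just the unweighted mean of the two headline numbers. A small sketch, using the aggregate values reported above (the function name combined_score is hypothetical):

```python
def combined_score(ptr_acc: float, bio_f1: float) -> float:
    """Unweighted mean of pointer accuracy and BIO F1."""
    return (ptr_acc + bio_f1) / 2


# Aggregate values from the evaluation summary.
score = combined_score(0.820, 0.815)
print(f"{score:.4f}")  # 0.8175
```

Because the inputs here are already rounded to three decimals, the result matches the reported 0.817 only up to rounding.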

Trained on cp500/infon-coref-multilingual.

Known limits

BIO precision degrades after epoch 0 if training continues with the default joint-loss schedule (pointer head saturates and the optimizer pushes BIO toward recall). The deployed checkpoint is from epoch 0 to keep BIO precision and pointer accuracy balanced. A fix using separate optimizers per head is on the roadmap.
Trained only on the 5 listed languages. Other XLM-R-supported languages may work via zero-shot transfer; verify on your domain.
Synthetic training data follows news-article register; out-of-domain text (chat, code comments, formal contracts) may underperform.

Backbone

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2: a public, Apache-2.0 distillation of XLM-R-base. The tokenizer is copied into this repo so the model can be installed and run offline.

License

Apache 2.0 for both weights and code.
