bibr reference-field parser (GIANT v4)

Token-classification model that labels the fields of a bibliographic reference string (AUTHOR, TITLE, CONTAINER, YEAR, …) with a BIO tag scheme. Consumed at runtime by bibr; trained by bibr-training.

Architecture

FeatureGatedEncoderCRF: answerdotai/ModernBERT-base encoder → linear projection → CRF decoder, with a per-token numeric feature gate (layout/font signals). GIANT has no layout signal, so the features are all empty here (mask 0) and the gate down-weights them to zero. The gate and projection still ship with the checkpoint so that the same artifact can later be gold-fine-tuned with real layout features.

  • Tag scheme: 39 tags (O + B-/I- for 19 field types). See tag_config.json.
  • Weights: best.pt (PyTorch state_dict, loads strict=True).
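
The tag count follows directly from the BIO scheme: one O tag plus a B- and an I- tag per field type, so 19 field types give 1 + 2 × 19 = 39 tags. A minimal sketch of that construction follows. The helper name build_bio_tags and the tag ordering are assumptions for illustration; the authoritative list and order are whatever tag_config.json contains.

```python
def build_bio_tags(field_types):
    """Return ["O", "B-X", "I-X", ...] for the given field types, in order."""
    tags = ["O"]
    for field in field_types:
        tags.append(f"B-{field}")
        tags.append(f"I-{field}")
    return tags


# The ten field types this card names as emitted by GIANT. The full 19-type
# list lives in tag_config.json and is not reproduced here.
GIANT_FIELDS = ["AUTHOR", "TITLE", "CONTAINER", "YEAR", "VOLUME",
                "ISSUE", "DOI", "PUBLISHER", "URL", "EDITOR"]

tags = build_bio_tags(GIANT_FIELDS)
# Hypothetical id mapping; the real mapping should be read from tag_config.json.
tag_to_id = {tag: i for i, tag in enumerate(tags)}
```

With the ten fields above this yields 21 tags; any list of 19 field types yields 39.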

Training data

GIANT (Crossref ~1B citation metadata) converted to word-level BIO, then re-tokenized with the ModernBERT subword tokenizer. ~2.0M references, 40/60 noised/clean mix (augment_reference: per-token character noise), with a deterministic clean GIANT holdout (val_size 40,002) used for model selection.
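
To make "per-token character noise" concrete, here is a rough sketch of how such an augmentation can keep word-level BIO labels aligned: each edit happens inside a single word, so the number of words never changes. This is not the actual augment_reference from bibr-training, whose noise operations and rates are not documented here. The function names, the three edit types, and token_noise_prob are all assumptions; only the 40% noised share comes from the mix described above.

```python
import random
import string


def noise_token(token, rng):
    """Apply one random character edit inside a single token (hypothetical ops)."""
    if not token:
        return token
    i = rng.randrange(len(token))
    op = rng.choice(["substitute", "delete", "duplicate"])
    if op == "delete" and len(token) > 1:
        return token[:i] + token[i + 1:]
    if op == "duplicate":
        return token[:i] + token[i] + token[i:]
    # "substitute", and also "delete" on a one-character token (never empty it).
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i + 1:]


def augment_words(words, labels, rng, token_noise_prob=0.15):
    """Noise some words in place; the label sequence is returned unchanged."""
    noised = [noise_token(w, rng) if rng.random() < token_noise_prob else w
              for w in words]
    return noised, list(labels)


def build_mix(references, noised_fraction=0.4, seed=0):
    """Noise roughly `noised_fraction` of the references and keep the rest clean."""
    rng = random.Random(seed)
    mixed = []
    for words, labels in references:
        if rng.random() < noised_fraction:
            mixed.append(augment_words(words, labels, rng))
        else:
            mixed.append((list(words), list(labels)))
    return mixed
```

Seeding the generator makes the mix reproducible, which matters when the holdout is meant to stay deterministic and clean.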

Evaluation

Entity-level F1 (seqeval strict) on the clean GIANT holdout:

metric          value
val entity F1   0.9993

Per-field F1 (see report.txt): AUTHOR 0.9990, TITLE 0.9994, CONTAINER 0.9998, DOI 0.9997, ISSUE 0.9996, PUBLISHER 0.9996, URL 1.0000, VOLUME 0.9994, YEAR 0.9985, EDITOR 0.9968.
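
Entity-level scoring is stricter than token accuracy: a predicted field counts as correct only if its type and both boundaries match the gold span exactly. The sketch below is a simplified pure-Python approximation of that idea, not seqeval itself, and its handling of malformed sequences (for example a stray I- tag, which opens no entity here) may differ from seqeval's strict mode. The function names are illustrative.

```python
def extract_entities(tags):
    """Return the set of (field, start, end) spans in one BIO sequence (end exclusive)."""
    entities = set()
    start, field = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        if tag.startswith("I-") and field is not None and tag[2:] == field:
            continue  # current entity keeps going
        if field is not None:
            entities.add((field, start, i))
            start, field = None, None
        if tag.startswith("B-"):
            start, field = i, tag[2:]
        # A stray I- tag with no matching B- opens no entity in this sketch.
    return entities


def entity_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 over exact (field, start, end) matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE"]]
pred = [["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "O"]]
score = entity_f1(gold, pred)
```

In this toy example four of five tokens are tagged correctly, but the truncated TITLE span counts as a miss, so only two of three entities match and the F1 is 2/3.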

⚠️ Read this number in context. It is measured on in-distribution, synthetic GIANT data (clean Crossref metadata + the same per-token noise seen in training). It is not a real-world-reference benchmark; expect lower performance on OCR'd or free-text bibliographies. The metric is meaningful for model selection but should not be quoted as real-world accuracy.

Fields

GIANT emits 10 of the 19 field types: AUTHOR, TITLE, CONTAINER, YEAR, VOLUME, ISSUE, DOI, PUBLISHER, URL, EDITOR. The other fields (PAGES, ARXIV, PMID, etc.) are never predicted by this GIANT-only checkpoint.

Inference convention

The model is trained with add_special_tokens=False. Any inference MUST tokenize the same way (no CLS/SEP), or predictions shift by one token. Token labels are derived from the majority field over each token's whole character span (ModernBERT byte-level BPE folds the leading space into a token's offset, so a first-character-only rule mislabels every space-led subword).
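
The majority rule can be illustrated without loading the tokenizer. In the sketch below the token offsets are hand-written to mimic a byte-level BPE that folds the leading space into the token; they are not real ModernBERT output. The function names are hypothetical, and ignoring unlabeled characters when voting is an assumption about how ties with whitespace are resolved.

```python
from collections import Counter


def majority_fields(offsets, char_fields):
    """Label each token with the most common field over its whole character span."""
    labels = []
    for start, end in offsets:
        # Assumption: unlabeled characters (None) do not vote.
        counts = Counter(f for f in char_fields[start:end] if f is not None)
        labels.append(counts.most_common(1)[0][0] if counts else None)
    return labels


def first_char_fields(offsets, char_fields):
    """The rule the card warns against: look only at the first character."""
    return [char_fields[start] for start, _ in offsets]


text = "Smith (2020) Deep parsing"
field_spans = [(0, 5, "AUTHOR"), (7, 11, "YEAR"), (13, 25, "TITLE")]

# Expand the field spans into one label per character.
char_fields = [None] * len(text)
for start, end, field in field_spans:
    for i in range(start, end):
        char_fields[i] = field

# Hypothetical offsets for: "Smith", " (", "2020", ")", " Deep", " parsing"
offsets = [(0, 5), (5, 7), (7, 11), (11, 12), (12, 17), (17, 25)]

by_majority = majority_fields(offsets, char_fields)
by_first_char = first_char_fields(offsets, char_fields)
```

Here the token " Deep" starts on an unlabeled space, so the first-character rule leaves it without a field, while the majority rule assigns it TITLE. Tokens with no labeled characters stay unlabeled and would map to O.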

