metadata
title: MARC Match AI
emoji: 📚
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: 4.27.0
app_file: demo/app.py
pinned: false
language: en
tags:
  - entity-matching
  - MARC
  - pytorch
library_name: pytorch
inference: false

MARC Record Matching with Bibliographic Metadata

Matching MARC (Machine-Readable Cataloging) records has traditionally relied on cataloger-assigned identifiers such as OCLC numbers, ISBNs, and LCCNs. This approach breaks down when identifiers are incorrect or missing. This model matches MARC records using only their bibliographic metadata (title, author, publisher, etc.), so it can find matches even when identifiers are missing or inaccurate.

Key Features

  • Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
  • Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
  • Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives based on specific use cases.
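
The trade-off behind the adjustable threshold can be illustrated with a small sketch. The confidence scores and labels below are made-up values for illustration, not output from the model, and `count_errors` is a hypothetical helper rather than part of the marcai package.

```python
def count_errors(scores, labels, threshold):
    """Count false positives and false negatives at a given threshold.

    A pair is predicted to be a match when its score is >= threshold.
    """
    false_positives = sum(
        1 for s, is_match in zip(scores, labels) if s >= threshold and not is_match
    )
    false_negatives = sum(
        1 for s, is_match in zip(scores, labels) if s < threshold and is_match
    )
    return false_positives, false_negatives


# Made-up confidence scores and true labels (True = same work), illustration only.
scores = [0.999, 0.97, 0.995, 0.60, 0.992, 0.10]
labels = [True, True, True, False, False, False]

for threshold in (0.5, 0.99, 0.998):
    fp, fn = count_errors(scores, labels, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, raising the threshold removes false positives at the cost of more false negatives, which is the balance the threshold lets you tune.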

Check out our interactive demo to see the model in action!

Performance

Our model achieves 98.46% accuracy on our validation set (see our dataset) and performs comparably to SCSB, Goldrush, and OCLC matching (with and without merging with the WorldCat API). For the comparison, each matching algorithm was run on a common set of English monographs, and the results were combined into a union set of all of the algorithms' matches. A matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, yielding the following false positive and false negative rates:

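| Algorithm       | % False Positives | % False Negatives |
|-----------------|-------------------|-------------------|
| Goldrush        | 0.30%             | 4.79%             |
| SCSB            | 0.52%             | 0.40%             |
| Our Model       | 0.23%             | 1.95%             |
| OCLC            | 0.05%             | 2.73%             |
| OCLC Reconciled | 0.10%             | 1.23%             |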

Installation

Install the marcai package directly from HuggingFace:

pip install git+https://huggingface.co/cdlib/marc-match-ai

Alternatively, you can clone the repository and install it locally:

git clone https://huggingface.co/cdlib/marc-match-ai
pip install ./marc-match-ai

Usage

The marcai package includes a command-line interface with commands for processing data, training models, and making predictions. Each command has its own help text, which you can view by running marc-ai <command> --help.

Processing data

marc-ai process takes a file containing MARC records and a CSV containing indices of record comparisons, and calculates similarity scores for several fields in the MARC records. These similarity values serve as the input features to the machine learning model.
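
As a rough illustration of per-field similarity features, the sketch below compares pairs of records listed in a CSV of index pairs. It rests on several assumptions: the records are plain dictionaries rather than parsed MARC, the field names and CSV column names are invented, and `difflib.SequenceMatcher` is a stand-in, since the similarity measures marcai actually uses are not described here.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical, already-parsed records; real input would be a MARC file.
records = [
    {"title": "Moby Dick, or, The whale", "author": "Melville, Herman",
     "publisher": "Harper & Brothers"},
    {"title": "Moby-Dick; or, The Whale", "author": "Melville, Herman, 1819-1891",
     "publisher": "Harper"},
    {"title": "Walden", "author": "Thoreau, Henry David",
     "publisher": "Ticknor and Fields"},
]

# Hypothetical CSV of record-index pairs to compare (column names are assumptions).
pairs_csv = "left,right\n0,1\n0,2\n"

FIELDS = ("title", "author", "publisher")


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_features(left: dict, right: dict) -> dict:
    """Compute one similarity score per bibliographic field."""
    return {f: field_similarity(left.get(f, ""), right.get(f, "")) for f in FIELDS}


features = []
for row in csv.DictReader(io.StringIO(pairs_csv)):
    i, j = int(row["left"]), int(row["right"])
    features.append({"left": i, "right": j, **similarity_features(records[i], records[j])})

for row in features:
    print(row)
```

Each resulting row holds one score per field for a record pair, which is the kind of feature vector a matching model can consume.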

Training a model

marc-ai train trains a model with the hyperparameters defined in config.yaml, including the paths to dataset splits. The model is saved in a tar.gz file, containing a PyTorch Lightning checkpoint, an ONNX conversion, and a copy of the config.yaml used.
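
To check what a saved model archive contains, Python's standard `tarfile` module is enough. The sketch below first builds a stand-in archive with empty placeholder files so that it runs on its own. The member names `model.ckpt` and `model.onnx` are assumptions; the actual file names inside the archive may differ.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_archive(path):
    """Return the sorted member names of a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return sorted(archive.getnames())


# Build a stand-in archive with empty placeholder files so the sketch is
# self-contained. The names "model.ckpt" and "model.onnx" are assumptions.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "model.tar.gz"
with tarfile.open(archive_path, "w:gz") as archive:
    for name in ("model.ckpt", "model.onnx", "config.yaml"):
        info = tarfile.TarInfo(name)
        info.size = 0
        archive.addfile(info, io.BytesIO(b""))

members = list_archive(archive_path)
print(members)
```

Pointing `list_archive` at a real archive produced by training would show the checkpoint, the ONNX file, and the saved config.yaml.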

Our model was trained on pairs of records from our database, deliberately skewed toward difficult comparisons (matches with variation and mismatches that are very similar). This dataset can be found at the marc-ai GitHub repository.

Making predictions

marc-ai predict takes the output from marc-ai process and a trained model, and runs the similarity scores through the model to produce match confidence scores.

Finding matches without I/O

marc-ai find_matches combines the commands for processing and predicting to cut out the unnecessary step of saving similarity values to disk. This is substantially faster when working with large amounts of data.
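
The idea of skipping the intermediate file can be sketched as a generator that hands each pair's features directly to the scorer. This is an illustration of the pattern only, not the package's implementation: `find_matches`, `toy_featurize`, and `toy_score` below are hypothetical, and the toy scorer is a placeholder for a trained model.

```python
from typing import Callable, Iterable, Iterator, Sequence


def find_matches(
    pairs: Iterable[tuple[int, int]],
    featurize: Callable[[int, int], Sequence[float]],
    score: Callable[[Sequence[float]], float],
    threshold: float,
) -> Iterator[tuple[int, int, float]]:
    """Yield (left, right, confidence) for pairs scoring at or above threshold.

    Features go straight from featurize to score, so nothing is written to disk.
    """
    for left, right in pairs:
        confidence = score(featurize(left, right))
        if confidence >= threshold:
            yield left, right, confidence


# Placeholder data and functions, for illustration only.
titles = ["walden", "walden", "moby dick"]


def toy_featurize(i, j):
    # A single feature: 1.0 if the titles are identical, else 0.0.
    return [1.0 if titles[i] == titles[j] else 0.0]


def toy_score(features):
    # Placeholder scorer: the mean of the features.
    return sum(features) / len(features)


matches = list(
    find_matches([(0, 1), (0, 2), (1, 2)], toy_featurize, toy_score, threshold=0.99)
)
print(matches)
```

Because the generator processes one pair at a time, memory use stays flat regardless of how many pairs are compared.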