---
tags:
- pytorch_model_hub_mixin
- model_hub_mixin
- entity-matching
- MARC
- pytorch
inference: false
library_name: pytorch
---

# MARC Record Matching with Bibliographic Metadata
Traditional matching of MARC (Machine-Readable Cataloging) records has relied heavily on cataloger-assigned identifiers such as OCLC numbers, ISBNs, and LCCNs. This approach struggles when records carry incorrect identifiers or lack them altogether. This model matches MARC records based solely on their bibliographic metadata (title, author, publisher, etc.), enabling successful matches even when identifiers are missing or inaccurate.

Check out the code and dataset at our [GitHub repository](https://github.com/cdlib/marcai).

Try out our [interactive demo](https://huggingface.co/spaces/cdlib/marc-match-ai-demo) to see the model in action!

## Key Features
- Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
- Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
- Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives based on specific use cases.
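
To illustrate the adjustable threshold, here is a minimal sketch of how a cutoff on match scores trades false positives against false negatives. The function name `error_counts`, the scores, and the labels are all made up for illustration; this is not the model's actual API or data.

```python
from typing import List, Tuple


def error_counts(
    scores: List[float], labels: List[bool], threshold: float
) -> Tuple[int, int]:
    """Count (false positives, false negatives) at a given threshold.

    A pair is predicted to match when its score is >= threshold.
    `labels` holds the true answer for each pair (True = real match).
    """
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return false_pos, false_neg


# Hypothetical match scores and ground-truth labels for six record pairs.
scores = [0.999, 0.995, 0.97, 0.60, 0.985, 0.10]
labels = [True, True, True, False, False, False]

strict = error_counts(scores, labels, threshold=0.99)   # favors precision
lenient = error_counts(scores, labels, threshold=0.50)  # favors recall

print("threshold 0.99 -> (FP, FN):", strict)
print("threshold 0.50 -> (FP, FN):", lenient)
```

On this toy data, the strict threshold produces no false positives but misses one true match, while the lenient threshold finds every true match at the cost of two false positives. Which side to favor depends on how costly a wrong merge is compared with a missed one.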

## Performance
This model achieves 98.46% accuracy on our validation set (see our [dataset](https://github.com/cdlib/marc-ai)) and accuracy comparable to SCSB, Goldrush, and OCLC matching (with and without merging with the WorldCat API). Each matching algorithm was run on a common set of English monographs to produce a union set of all of the algorithms' matches, and a matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, yielding the following false positive and false negative rates:

| Algorithm       | % False Positives | % False Negatives |
|-----------------|-------------------|-------------------|
| Goldrush        | 0.30%             | 4.79%             |
| SCSB            | 0.52%             | 0.40%             |
| __Our Model__   | __0.23%__         | __1.95%__         |
| OCLC            | 0.05%             | 2.73%             |
| OCLC Reconciled | 0.10%             | 1.23%             |
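
The comparison procedure described above (take the union of every algorithm's matches, then review the pairs the algorithms disagree on) can be sketched with plain Python sets. This is an illustrative reconstruction under assumptions, not the evaluation code used for the table: the algorithm names, record IDs, and the helpers `pair` and `disagreements` are hypothetical.

```python
from typing import Dict, FrozenSet, Set

# A match is an unordered pair of record IDs, so (r1, r2) equals (r2, r1).
Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    return frozenset((a, b))


def disagreements(matches: Dict[str, Set[Pair]]) -> Set[Pair]:
    """Return pairs found by at least one algorithm but not by all of them."""
    if not matches:
        return set()
    union = set().union(*matches.values())
    agreed = set.intersection(*matches.values())
    return union - agreed


# Hypothetical output of two matching algorithms on the same records.
matches = {
    "algo_a": {pair("r1", "r2"), pair("r3", "r4")},
    "algo_b": {pair("r2", "r1"), pair("r5", "r6")},
}

to_review = disagreements(matches)
print(sorted(sorted(p) for p in to_review))
```

Here both algorithms agree on the `r1`/`r2` pair, so only the other two pairs would go to manual review. A reviewer's verdict on each disputed pair then marks it as a false positive for the algorithms that reported it or a false negative for those that missed it.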