RvanB committed
Commit 27214b3
1 Parent(s): 44fb10e

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +5 -63
  2. config.json +6 -0
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,67 +1,9 @@
---
- title: MARC Match AI
- emoji: 📚
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 4.27.0
- app_file: demo/app.py
- pinned: false
- language: en
tags:
- - entity-matching
- - MARC
- - pytorch
- library_name: pytorch
- inference: false
---

- # MARC Record Matching with Bibliographic Metadata
- Traditional matching of MARC (Machine-Readable Cataloging) records has relied heavily on cataloger-assigned identifiers such as OCLC numbers, ISBNs, and LCCNs. However, this approach struggles with records that have incorrect identifiers or lack them altogether. This model matches MARC records based solely on their bibliographic metadata (title, author, publisher, etc.), enabling successful matches even when identifiers are missing or inaccurate.
-
- ## Key Features
- - Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
- - Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
- - Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives for specific use cases.
-
- Check out our [interactive demo](https://huggingface.co/spaces/cdlib/marc-match-ai-demo) to see the model in action!
-
- ## Performance
- Our model achieves 98.46% accuracy on our validation set (see our [dataset](https://github.com/cdlib/marc-ai)) and performs comparably to SCSB, Goldrush, and OCLC matching (with and without reconciliation through the WorldCat API). Each matching algorithm was run on a common set of English monographs to produce a union set of all the algorithms' matches, and a matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, yielding the following false positive and false negative rates over those disagreements:
-
- | Algorithm | % False Positives | % False Negatives |
- |-----------------|-------------------|-------------------|
- | Goldrush | 0.30% | 4.79% |
- | SCSB | 0.52% | 0.40% |
- | __Our Model__ | __0.23%__ | __1.95%__ |
- | OCLC | 0.05% | 2.73% |
- | OCLC Reconciled | 0.10% | 1.23% |
-
- ## Installation
- Install the `marcai` package directly from Hugging Face:
- ```
- pip install git+https://huggingface.co/cdlib/marc-match-ai
- ```
- Alternatively, clone the repository and install it locally:
- ```
- git clone https://huggingface.co/cdlib/marc-match-ai
- pip install ./marc-match-ai
- ```
-
- ## Usage
- The `marcai` package ships with a command-line interface offering a suite of commands for processing data, training models, and making predictions. Each command has its own help text, accessible by running `marc-ai <command> --help`.
-
- ### Processing data
- `marc-ai process` takes a file of MARC records and a CSV of indices identifying which record pairs to compare, and calculates similarity scores for several fields in the MARC records. These similarity values serve as the input features to the machine learning model.
-
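- A minimal sketch of this step; the file names and flags are illustrative assumptions, so confirm the real interface with `marc-ai process --help`:
- ```
- # Compute per-field similarity features for the record pairs listed in pairs.csv
- # (hypothetical arguments, not verified against the actual CLI)
- marc-ai process --records records.mrc --pairs pairs.csv --output features.csv
- ```
-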
- ### Training a model
- `marc-ai train` trains a model with the hyperparameters defined in `config.yaml`, including the paths to the dataset splits. The trained model is saved as a tar.gz archive containing a PyTorch Lightning checkpoint, an ONNX conversion, and a copy of the `config.yaml` used.
-
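- A sketch of a training run; since hyperparameters and dataset paths come from `config.yaml`, a bare invocation is assumed here (see `marc-ai train --help` for the real options):
- ```
- # Train with the hyperparameters and dataset splits defined in config.yaml;
- # produces a tar.gz with a Lightning checkpoint, an ONNX export, and the config
- marc-ai train
- ```
-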
- Our model was trained on pairs of records from our database, skewed toward more difficult comparisons (matches with variation, and mismatches that are very similar). This dataset is available in the [marc-ai](https://github.com/cdlib/marc-ai) GitHub repository.
-
- ### Making predictions
- `marc-ai predict` takes the output of `marc-ai process` and a trained model, and runs the similarity scores through the model to produce match confidence scores.
-
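- A hypothetical invocation; again, file names and flags are assumptions (see `marc-ai predict --help`):
- ```
- # Run the similarity features from marc-ai process through a trained model
- # to get a match confidence score per record pair (hypothetical arguments)
- marc-ai predict --input features.csv --model model.tar.gz --output predictions.csv
- ```
-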
- ### Finding matches without I/O
- `marc-ai find_matches` combines processing and prediction, cutting out the unnecessary step of saving similarity values to disk. This is substantially faster when working with large amounts of data.
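-
- An end-to-end sketch of the combined step; as above, the arguments are illustrative assumptions (see `marc-ai find_matches --help`):
- ```
- # Process and predict in one pass, skipping the intermediate features file
- marc-ai find_matches --records records.mrc --pairs pairs.csv --output matches.csv
- ```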
 
---
tags:
+ - pytorch_model_hub_mixin
+ - model_hub_mixin
---

+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: [More Information Needed]
+ - Docs: [More Information Needed]
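+
+ A minimal sketch of the mixin pattern this enables; the class below is a hypothetical stand-in, since loading this repo's actual weights requires the real marc-match-ai model class with a matching architecture:
+ ```
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin
+
+ # Hypothetical model: inheriting from PyTorchModelHubMixin adds
+ # save_pretrained / from_pretrained / push_to_hub to any nn.Module
+ class TinyMatcher(nn.Module, PyTorchModelHubMixin):
+     def __init__(self, batch_size=512, lr=0.006, optimizer="Adam", weight_decay=0.0):
+         super().__init__()
+         self.net = nn.Linear(8, 1)
+
+     def forward(self, x):
+         return torch.sigmoid(self.net(x))
+
+ # save_pretrained writes config.json (the JSON-serializable __init__ kwargs,
+ # as in this commit's config.json) plus model.safetensors;
+ # from_pretrained accepts a local directory or a Hub repo id
+ TinyMatcher().save_pretrained("tiny-matcher")
+ model = TinyMatcher.from_pretrained("tiny-matcher")
+ ```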
config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "batch_size": 512,
+ "lr": 0.006,
+ "optimizer": "Adam",
+ "weight_decay": 0.0
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b43ca06da87e7d9c66401c1205b615a12edf2b14e75614d2b502ed983b2bc3d
+ size 10180