Upload folder using huggingface_hub

Changed files:
- README.md +5 -63
- config.json +6 -0
- model.safetensors +3 -0

README.md
CHANGED
@@ -1,67 +1,9 @@
 ---
-title: MARC Match AI
-emoji: 📚
-colorFrom: gray
-colorTo: gray
-sdk: gradio
-sdk_version: 4.27.0
-app_file: demo/app.py
-pinned: false
-language: en
 tags:
--
--
-- pytorch
-library_name: pytorch
-inference: false
 ---
 
-
-
-
-## Key Features
-- Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
-- Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
-- Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives based on specific use cases.
-
-Check out our [interactive demo](https://huggingface.co/spaces/cdlib/marc-match-ai-demo) to see the model in action!
-
-## Performance
-Our model achieves 98.46% accuracy on our validation set (see our [dataset](https://github.com/cdlib/marc-ai)) and has accuracy comparable to SCSB, Goldrush, and OCLC matching (with and without merging with the WorldCat API). Each matching algorithm was run on a common set of English monographs to produce a union set of all of the algorithms' matches, and a matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, yielding the following false positive and false negative rates:
-
-| Algorithm       | % False Positives | % False Negatives |
-|-----------------|-------------------|-------------------|
-| Goldrush        | 0.30%             | 4.79%             |
-| SCSB            | 0.52%             | 0.40%             |
-| __Our Model__   | __0.23%__         | __1.95%__         |
-| OCLC            | 0.05%             | 2.73%             |
-| OCLC Reconciled | 0.10%             | 1.23%             |
-
-
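The 0.99 decision threshold described above can be sketched as follows; the record-pair identifiers and confidence scores here are invented for illustration and are not project data:

```python
# Illustrative sketch: applying the 0.99 match threshold to model
# confidence scores (pairs and scores below are hypothetical).
THRESHOLD = 0.99

scores = {
    ("record_1", "record_2"): 0.997,  # near-duplicate bibliographic records
    ("record_1", "record_3"): 0.412,  # clearly different records
    ("record_4", "record_5"): 0.985,  # similar, but below the cutoff
}

# A pair is declared a match only when its confidence meets the threshold,
# trading a higher false-negative rate for fewer false positives.
matches = [pair for pair, score in scores.items() if score >= THRESHOLD]
print(matches)
```

Raising the threshold this high is what keeps the model's false-positive rate at 0.23% in the table above, at the cost of more false negatives.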
-## Installation
-Install the `marcai` package directly from HuggingFace:
-```
-pip install git+https://huggingface.co/cdlib/marc-match-ai
-```
-Alternatively, you can clone the repository and install it locally:
-```
-git clone https://huggingface.co/cdlib/marc-match-ai
-pip install ./marc-match-ai
-```
-
-## Usage
-The `marcai` package comes with a command-line interface offering a suite of commands for processing data, training models, and making predictions. Each command has its own help text, accessible by running `marc-ai <command> --help`.
-
-### Processing data
-`marc-ai process` takes a file containing MARC records and a CSV containing indices of record comparisons, and calculates similarity scores for several fields in the MARC records. These similarity values serve as the input features to the machine learning model.
-
-### Training a model
-`marc-ai train` trains a model with the hyperparameters defined in `config.yaml`, including the paths to dataset splits. The model is saved in a tar.gz file containing a PyTorch Lightning checkpoint, an ONNX conversion, and a copy of the `config.yaml` used.
-
-Our model was trained on pairs of records from our database, skewed toward more difficult comparisons (matches with variation, and mismatches that are very similar). This dataset can be found in the [marc-ai](https://github.com/cdlib/marc-ai) GitHub repository.
-
-### Making predictions
-`marc-ai predict` takes the output of `marc-ai process` and a trained model, and runs the similarity scores through the model to produce match confidence scores.
-
-### Finding matches without I/O
-`marc-ai find_matches` combines the processing and prediction commands, cutting out the unnecessary step of saving similarity values to disk. This is substantially faster when working with large amounts of data.
 ---
 tags:
+- pytorch_model_hub_mixin
+- model_hub_mixin
 ---
 
+This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+- Library: [More Information Needed]
+- Docs: [More Information Needed]
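For context, a minimal sketch of what the PyTorchModelHubMixin integration adds to a model class. The `MatchModel` class and its layers below are hypothetical stand-ins, not the project's actual architecture:

```python
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

# Hypothetical model class for illustration; the real architecture lives
# in the marcai package and is not reproduced here.
class MatchModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(8, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

# The mixin adds save_pretrained / from_pretrained / push_to_hub, which is
# what produced the config.json and model.safetensors files in this commit.
# With the matching class definition available, loading would look like:
#   model = MatchModel.from_pretrained("cdlib/marc-match-ai")
model = MatchModel()
print(hasattr(model, "save_pretrained"), hasattr(MatchModel, "from_pretrained"))
```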
config.json
ADDED
@@ -0,0 +1,6 @@
+{
+    "batch_size": 512,
+    "lr": 0.006,
+    "optimizer": "Adam",
+    "weight_decay": 0.0
+}
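These hyperparameters are read back as a plain dictionary when the model is reloaded; a minimal sketch follows, where the `torch.optim.Adam` call shown in the comment is an assumption about how they would be consumed, not code from the project:

```python
import json

# The contents of config.json from this commit.
raw = """
{
    "batch_size": 512,
    "lr": 0.006,
    "optimizer": "Adam",
    "weight_decay": 0.0
}
"""
config = json.loads(raw)

# These values would typically feed an optimizer constructor, e.g. (assumption):
#   torch.optim.Adam(model.parameters(), lr=config["lr"],
#                    weight_decay=config["weight_decay"])
print(config["optimizer"], config["lr"], config["batch_size"])
```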
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1b43ca06da87e7d9c66401c1205b615a12edf2b14e75614d2b502ed983b2bc3d
+size 10180