RvanB committed
Commit 27214b3
1 Parent(s): 44fb10e

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +5 -63
  2. config.json +6 -0
  3. model.safetensors +3 -0
README.md CHANGED
@@ -1,67 +1,9 @@
---
- title: MARC Match AI
- emoji: 📚
- colorFrom: gray
- colorTo: gray
- sdk: gradio
- sdk_version: 4.27.0
- app_file: demo/app.py
- pinned: false
- language: en
tags:
- - entity-matching
- - MARC
- - pytorch
- library_name: pytorch
- inference: false
---

- # MARC Record Matching with Bibliographic Metadata
- Traditional matching of MARC (Machine-Readable Cataloging) records has relied heavily on cataloger-assigned identifiers such as OCLC numbers, ISBNs, and LCCNs. However, this approach struggles with records that have incorrect identifiers or lack them altogether. This model matches MARC records based solely on their bibliographic metadata (title, author, publisher, etc.), enabling successful matches even when identifiers are missing or inaccurate.
-
- ## Key Features
- - Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
- - Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
- - Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives for specific use cases.
-
- Check out our [interactive demo](https://huggingface.co/spaces/cdlib/marc-match-ai-demo) to see the model in action!
-
- ## Performance
- Our model achieves 98.46% accuracy on our validation set (see our [dataset](https://github.com/cdlib/marc-ai)) and performs comparably to SCSB, Goldrush, and OCLC matching (with and without reconciliation through the WorldCat API). Each matching algorithm was run on a common set of English monographs to produce a union set of all the algorithms' matches, and a matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, yielding the following false positive and false negative rates over those disagreements:
-
- | Algorithm | % False Positives | % False Negatives |
- |-----------------|-------------------|-------------------|
- | Goldrush | 0.30% | 4.79% |
- | SCSB | 0.52% | 0.40% |
- | __Our Model__ | __0.23%__ | __1.95%__ |
- | OCLC | 0.05% | 2.73% |
- | OCLC Reconciled | 0.10% | 1.23% |
-
- ## Installation
- Install the `marcai` package directly from Hugging Face:
- ```
- pip install git+https://huggingface.co/cdlib/marc-match-ai
- ```
- Alternatively, clone the repository and install it locally:
- ```
- git clone https://huggingface.co/cdlib/marc-match-ai
- pip install ./marc-match-ai
- ```
-
- ## Usage
- The `marcai` package ships with a command-line interface offering a suite of commands for processing data, training models, and making predictions. Each command has its own help text, accessible by running `marc-ai <command> --help`.
-
- ### Processing data
- `marc-ai process` takes a file of MARC records and a CSV of indices identifying which record pairs to compare, and calculates similarity scores for several fields in the MARC records. These similarity values serve as the input features to the machine learning model.
-
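- A minimal sketch of this step; the file names and flags are illustrative assumptions, so confirm the real interface with `marc-ai process --help`:
- ```
- # Compute per-field similarity features for the record pairs listed in pairs.csv
- # (hypothetical arguments, not verified against the actual CLI)
- marc-ai process --records records.mrc --pairs pairs.csv --output features.csv
- ```
-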
- ### Training a model
- `marc-ai train` trains a model with the hyperparameters defined in `config.yaml`, including the paths to the dataset splits. The trained model is saved as a tar.gz archive containing a PyTorch Lightning checkpoint, an ONNX conversion, and a copy of the `config.yaml` used.
-
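- A sketch of a training run; since hyperparameters and dataset paths come from `config.yaml`, a bare invocation is assumed here (see `marc-ai train --help` for the real options):
- ```
- # Train with the hyperparameters and dataset splits defined in config.yaml;
- # produces a tar.gz with a Lightning checkpoint, an ONNX export, and the config
- marc-ai train
- ```
-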
- Our model was trained on pairs of records from our database, skewed toward more difficult comparisons (matches with variation, and mismatches that are very similar). This dataset is available in the [marc-ai](https://github.com/cdlib/marc-ai) GitHub repository.
-
- ### Making predictions
- `marc-ai predict` takes the output of `marc-ai process` and a trained model, and runs the similarity scores through the model to produce match confidence scores.
-
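- A hypothetical invocation; again, file names and flags are assumptions (see `marc-ai predict --help`):
- ```
- # Run the similarity features from marc-ai process through a trained model
- # to get a match confidence score per record pair (hypothetical arguments)
- marc-ai predict --input features.csv --model model.tar.gz --output predictions.csv
- ```
-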
- ### Finding matches without I/O
- `marc-ai find_matches` combines processing and prediction, cutting out the unnecessary step of saving similarity values to disk. This is substantially faster when working with large amounts of data.
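-
- An end-to-end sketch of the combined step; as above, the arguments are illustrative assumptions (see `marc-ai find_matches --help`):
- ```
- # Process and predict in one pass, skipping the intermediate features file
- marc-ai find_matches --records records.mrc --pairs pairs.csv --output matches.csv
- ```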
 
---
tags:
+ - pytorch_model_hub_mixin
+ - model_hub_mixin
---

+ This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Library: [More Information Needed]
+ - Docs: [More Information Needed]
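+
+ A minimal sketch of the mixin pattern this enables; the class below is a hypothetical stand-in, since loading this repo's actual weights requires the real marc-match-ai model class with a matching architecture:
+ ```
+ import torch
+ import torch.nn as nn
+ from huggingface_hub import PyTorchModelHubMixin
+
+ # Hypothetical model: inheriting from PyTorchModelHubMixin adds
+ # save_pretrained / from_pretrained / push_to_hub to any nn.Module
+ class TinyMatcher(nn.Module, PyTorchModelHubMixin):
+     def __init__(self, batch_size=512, lr=0.006, optimizer="Adam", weight_decay=0.0):
+         super().__init__()
+         self.net = nn.Linear(8, 1)
+
+     def forward(self, x):
+         return torch.sigmoid(self.net(x))
+
+ # save_pretrained writes config.json (the JSON-serializable __init__ kwargs,
+ # as in this commit's config.json) plus model.safetensors;
+ # from_pretrained accepts a local directory or a Hub repo id
+ TinyMatcher().save_pretrained("tiny-matcher")
+ model = TinyMatcher.from_pretrained("tiny-matcher")
+ ```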
config.json ADDED
@@ -0,0 +1,6 @@
+ {
+ "batch_size": 512,
+ "lr": 0.006,
+ "optimizer": "Adam",
+ "weight_decay": 0.0
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1b43ca06da87e7d9c66401c1205b615a12edf2b14e75614d2b502ed983b2bc3d
+ size 10180