RvanB committed on
Commit
5334ec8
1 Parent(s): 44fb10e

Slim down space repo

README.md CHANGED
@@ -5,63 +5,7 @@ colorFrom: gray
5
  colorTo: gray
6
  sdk: gradio
7
  sdk_version: 4.27.0
8
- app_file: demo/app.py
9
- pinned: false
10
  language: en
11
- tags:
12
- - entity-matching
13
- - MARC
14
- - pytorch
15
- library_name: pytorch
16
- inference: false
17
  ---
18
-
19
- # MARC Record Matching with Bibliographic Metadata
20
- Traditional matching of MARC (Machine-Readable Cataloging) records has relied heavily on identifiers like OCLC numbers, ISBNs, LCCNs, etc. assigned by catalogers. However, this approach struggles with records having incorrect identifiers or lacking them altogether. This model has been developed to match MARC records based solely on their bibliographic metadata (title, author, publisher, etc.), enabling successful matches even when identifiers are missing or inaccurate.
21
-
22
- ## Key Features
23
- - Bibliographic Metadata Matching: Performs matching based solely on bibliographic data, eliminating the need for identifiers.
24
- - Text Field Flexibility: Accommodates minor variations in bibliographic metadata fields for accurate matching.
25
- - Adjustable Matching Threshold: Allows tuning the balance between false positives and false negatives based on specific use cases.
26
-
27
- Check out our [interactive demo](https://huggingface.co/spaces/cdlib/marc-match-ai-demo) to see the model in action!
28
-
29
- ## Performance
30
- Our model achieves 98.46% accuracy on our validation set (see our [dataset](https://github.com/cdlib/marc-ai)), and had comparable accuracy with SCSB, Goldrush, and OCLC matching (with and without merging with the WorldCat API). Each matching algorithm was run on a common set of English monographs to produce a union set of all of the algorithms' matches, and a matching threshold of 0.99 was chosen for our model to minimize false positives. Disagreements between the algorithms were manually reviewed, resulting in false positives and false negatives for those disagreements:
31
-
32
- | Algorithm | % False Positives | % False Negatives |
33
- |-----------------|-------------------|-------------------|
34
- | Goldrush | 0.30% | 4.79% |
35
- | SCSB | 0.52% | 0.40% |
36
- | __Our Model__ | __0.23%__ | __1.95%__ |
37
- | OCLC | 0.05% | 2.73% |
38
- | OCLC Reconciled | 0.10% | 1.23% |
39
-
40
-
41
- ## Installation
42
- Install the marcai package directly from HuggingFace:
43
- ```
44
- pip install git+https://huggingface.co/cdlib/marc-match-ai
45
- ```
46
- Alternatively, you can clone the repository and install it locally:
47
- ```
48
- git clone https://huggingface.co/cdlib/marc-match-ai
49
- pip install ./marc-match-ai
50
- ```
51
-
52
- ## Usage
53
- The `marcai` package comes with a command-line interface offering a suite of commands for processing data, training models, and making predictions. All commands have their own help functions, which can be accessed by running `marc-ai <command> --help`.
54
-
55
- ### Processing data
56
- `marc-ai process` takes a file containing MARC records and a CSV containing indices of record comparisons, and calculates similarity scores for several fields in the MARC records. These similarity values serve as the input features to the machine learning model.
57
-
58
- ### Training a model
59
- `marc-ai train` trains a model with the hyperparameters defined in `config.yaml`, including the paths to dataset splits. The model is saved in a tar.gz file, containing a PyTorch Lightning checkpoint, an ONNX conversion, and a copy of the `config.yaml` used.
60
-
61
- Our model was trained on pairs of records from our database, skewed for more difficult comparisons (matches with variation, mismatches that are very similar). This dataset can be found at the [marc-ai](https://github.com/cdlib/marc-ai) GitHub repository.
62
-
63
- ### Making predictions
64
- `marc-ai predict` takes the output from `marc-ai process` and a trained model, and runs the similarity scores through the model to produce match confidence scores.
65
-
66
- ### Finding matches without I/O
67
- `marc-ai find_matches` combines the commands for processing and predicting to cut out the unnecessary step of saving similarity values to disk. This is substantially faster when working with large amounts of data.
 
5
  colorTo: gray
6
  sdk: gradio
7
  sdk_version: 4.27.0
8
+ app_file: app.py
9
+ pinned: true
10
  language: en
11
  ---
 
demo/app.py → app.py RENAMED
@@ -4,13 +4,13 @@ import gradio as gr
4
  import pandas as pd
5
  import pymarc
6
 
7
- from marcai.predict import predict_onnx
8
  from marcai.process import process
9
  from marcai.utils import load_config
10
  from marcai.utils.parsing import record_dict
 
11
 
12
- demo_dir = os.path.dirname(os.path.realpath(__file__))
13
-
14
 
15
  def compare(file1, file2):
16
  # Load records
@@ -24,12 +24,14 @@ def compare(file1, file2):
24
  df = process(df1, df2)
25
 
26
  # Load config
27
- config = load_config(os.path.join(demo_dir, "config.yaml"))
 
 
28
 
29
- # Run ONNX model
30
- model_onnx = os.path.join(demo_dir, "model.onnx")
31
  input_df = df[config["model"]["features"]]
32
- prediction = predict_onnx(model_onnx, input_df).item()
 
 
33
 
34
  return {"match": prediction, "not match": 1 - prediction}
35
 
 
4
  import pandas as pd
5
  import pymarc
6
 
7
+ from marcai.predict import predict
8
  from marcai.process import process
9
  from marcai.utils import load_config
10
  from marcai.utils.parsing import record_dict
11
+ from marcai.pl import SimilarityVectorModel
12
 
13
+ root = os.path.dirname(os.path.abspath(__file__))
 
14
 
15
  def compare(file1, file2):
16
  # Load records
 
24
  df = process(df1, df2)
25
 
26
  # Load config
27
+ config = load_config(os.path.join(root, "config.yaml"))
28
+
29
+ model = SimilarityVectorModel.from_pretrained("cdlib/marc-match-ai")
30
 
 
 
31
  input_df = df[config["model"]["features"]]
32
+
33
+ # Run model
34
+ prediction = predict(model, input_df).item()
35
 
36
  return {"match": prediction, "not match": 1 - prediction}
37
 
config.yaml ADDED
@@ -0,0 +1,27 @@
+ model:
+   # Inputs features
+   features:
+     - title_tokenset
+     - title_agg
+     - author
+     - publisher
+     - pub_date
+     - pub_place
+     - pagination
+
+   # Training
+   batch_size: 512
+   weight_decay: 0.0
+   max_epochs: -1
+
+   # Disable early stopping with -1
+   patience: 20
+
+   lr: 0.006
+   optimizer: Adam
+   saved_models_dir: saved_models
+
+   # Paths to dataset splits
+   test_processed_path: data/test_processed.csv
+   train_processed_path: data/train_processed.csv
+   val_processed_path: data/val_processed.csv
demo/config.yaml DELETED
@@ -1,22 +0,0 @@
- model:
-   batch_size: 512
-   features:
-   - title_tokenset
-   - title_agg
-   - author
-   - publisher
-   - pub_date
-   - pub_place
-   - pagination
-   hidden_sizes:
-   - 32
-   - 64
-   lr: 0.006
-   max_epochs: -1
-   optimizer: Adam
-   patience: 20
-   saved_models_dir: saved_models
-   test_processed_path: data/202303_goldfinch_set_1.1/processed/test_processed.csv
-   train_processed_path: data/202303_goldfinch_set_1.1/processed/train_processed.csv
-   val_processed_path: data/202303_goldfinch_set_1.1/processed/val_processed.csv
-   weight_decay: 0.0
 
demo/model.onnx DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8a549a29ebb618819a227d9568e8c1a6555e4f6407c3b4031a9170f4746ecdde
- size 10669
 
marcai/__init__.py DELETED
File without changes
marcai/cli.py DELETED
@@ -1,40 +0,0 @@
- import argparse
- from . import train, predict, process, find_matches
-
-
- def main():
-     parser = argparse.ArgumentParser(
-         description="Command-line interface for marcai package"
-     )
-     subparsers = parser.add_subparsers(required=True)
-
-     train_parser = subparsers.add_parser(
-         "train", parents=[train.args_parser()], help="Train a model", add_help=False
-     )
-     predict_parser = subparsers.add_parser(
-         "predict",
-         parents=[predict.args_parser()],
-         help="Make predictions using a trained model",
-         add_help=False,
-     )
-     process_parser = subparsers.add_parser(
-         "process", parents=[process.args_parser()], help="Process data", add_help=False
-     )
-     find_matches_parser = subparsers.add_parser(
-         "find_matches",
-         parents=[find_matches.args_parser()],
-         help="Find matches in data",
-         add_help=False,
-     )
-
-     train_parser.set_defaults(func=train.main)
-     predict_parser.set_defaults(func=predict.main)
-     process_parser.set_defaults(func=process.main)
-     find_matches_parser.set_defaults(func=find_matches.main)
-
-     args = parser.parse_args()
-     args.func(args)
-
-
- if __name__ == "__main__":
-     main()
 
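Reviewer note: the removed cli.py dispatches sub-commands by attaching a handler to each sub-parser with set_defaults(func=...) and then calling args.func(args). The following is a minimal, self-contained sketch of that pattern. The handlers run_train and run_predict and their options are hypothetical stand-ins, not the removed marcai functions.

```python
import argparse


def run_train(args):
    # Hypothetical handler standing in for a module-level main(args).
    return f"train epochs={args.epochs}"


def run_predict(args):
    # Hypothetical handler standing in for a module-level main(args).
    return f"predict input={args.input}"


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative sub-command dispatcher")
    subparsers = parser.add_subparsers(dest="command", required=True)

    train_parser = subparsers.add_parser("train", help="Train a model")
    train_parser.add_argument("--epochs", type=int, default=1)
    # Each sub-parser records which function should handle it.
    train_parser.set_defaults(func=run_train)

    predict_parser = subparsers.add_parser("predict", help="Make predictions")
    predict_parser.add_argument("--input", required=True)
    predict_parser.set_defaults(func=run_predict)

    return parser


def dispatch(argv):
    args = build_parser().parse_args(argv)
    # The selected sub-parser has filled in args.func.
    return args.func(args)
```

Calling dispatch(["train", "--epochs", "3"]) routes to run_train without any explicit if/elif chain over command names.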
 
marcai/find_matches.py DELETED
@@ -1,78 +0,0 @@
1
- import argparse
2
- import csv
3
-
4
- import pandas as pd
5
- from tqdm import tqdm
6
-
7
- from marcai.predict import predict_onnx
8
- from marcai.process import multiprocess_pairs
9
- from marcai.utils import load_config
10
- from marcai.utils.parsing import load_records, record_dict
11
-
12
-
13
- def args_parser():
14
- parser = argparse.ArgumentParser()
15
- parser.add_argument("-i", "--inputs", nargs="+", help="MARC files", required=True)
16
- parser.add_argument(
17
- "-p",
18
- "--pair-indices",
19
- help="File containing indices of comparisons",
20
- required=True,
21
- )
22
- parser.add_argument("-C", "--chunksize", help="Chunk size", type=int, default=50000)
23
- parser.add_argument(
24
- "-P", "--processes", help="Number of processes", type=int, default=1
25
- )
26
- parser.add_argument(
27
- "-m",
28
- "--model-dir",
29
- help="Directory containing model ONNX and YAML files",
30
- required=True,
31
- )
32
- parser.add_argument("-o", "--output", help="Output file", required=True)
33
- parser.add_argument("-t", "--threshold", help="Threshold for matching", type=float)
34
-
35
- return parser
36
-
37
-
38
- def main(args):
39
- config_path = f"{args.model_dir}/config.yaml"
40
- model_onnx = f"{args.model_dir}/model.onnx"
41
-
42
- config = load_config(config_path)
43
-
44
- # Load records
45
- print("Loading records...")
46
- records = []
47
- for path in args.inputs:
48
- records.extend([record_dict(r) for r in load_records(path)])
49
-
50
- records_df = pd.DataFrame(records)
51
-
52
- print(f"Loaded {len(records)} records.")
53
-
54
- print("Processing and comparing records...")
55
- written = False
56
- with open(args.pair_indices, "r") as indices_file:
57
- reader = csv.reader(indices_file)
58
- # Process records
59
- for df in tqdm(
60
- multiprocess_pairs(records_df, reader, args.chunksize, args.processes)
61
- ):
62
- input_df = df[config["model"]["features"]]
63
- prediction = predict_onnx(model_onnx, input_df)
64
- df.loc[:, "prediction"] = prediction.squeeze()
65
-
66
- df = df[df["prediction"] >= args.threshold]
67
-
68
- if not df.empty:
69
- if not written:
70
- df.to_csv(args.output, index=False)
71
- written = True
72
- else:
73
- df.to_csv(args.output, index=False, mode="a", header=False)
74
-
75
-
76
- if __name__ == "__main__":
77
- args = args_parser().parse_args()
78
- main(args)
 
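Reviewer note: the removed find_matches.py keeps only rows whose prediction meets a threshold and writes the CSV header once, appending later chunks without it. Below is a rough standard-library sketch of that idea. The function name write_matches and the list-of-dicts chunk format are assumptions for illustration, and the 0.99 default is borrowed from the README text; the removed parser declares --threshold without a default.

```python
import csv
import io


def write_matches(chunks, out, threshold=0.99):
    """Write rows scoring at or above `threshold`, emitting the header once.

    `chunks` is an iterable of lists of dicts (a stand-in for the DataFrames
    yielded by the processing step); `out` is a writable text stream.
    """
    writer = None
    kept = 0
    for chunk in chunks:
        rows = [row for row in chunk if row["prediction"] >= threshold]
        if not rows:
            continue  # nothing to write for this chunk
        if writer is None:
            # First non-empty chunk: create the writer and emit the header.
            writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
            writer.writeheader()
        writer.writerows(rows)
        kept += len(rows)
    return kept
```

Raising the threshold trades false positives for false negatives, which is the tuning knob the removed README describes.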
 
marcai/pl/__init__.py DELETED
@@ -1,2 +0,0 @@
- from .similarity_vector_model import SimilarityVectorModel
- from .marc_data_module import MARCDataModule
 
marcai/pl/attribute_selector.py DELETED
@@ -1,12 +0,0 @@
- import torch.nn as nn
-
-
- class AttributeSelector(nn.Module):
-     def __init__(self, attrs):
-         super().__init__()
-
-         self.attrs = attrs
-
-     def forward(self, sim: dict) -> dict:
-         sim = {key: sim[key] for key in self.attrs if key in sim.keys()}
-         return sim
 
marcai/pl/marc_data_module.py DELETED
@@ -1,51 +0,0 @@
1
- import pytorch_lightning as pl
2
- from torch.utils.data import DataLoader
3
- import torch
4
- from .attribute_selector import AttributeSelector
5
- from .similarity_vector_dataset import SimilarityVectorDataset
6
- from typing import List
7
-
8
-
9
- class MARCDataModule(pl.LightningDataModule):
10
- def __init__(
11
- self,
12
- train_processed_path: str,
13
- val_processed_path: str,
14
- test_processed_path: str,
15
- attrs: List[str],
16
- batch_size: int,
17
- ):
18
- super().__init__()
19
-
20
- self.train_processed_path = train_processed_path
21
- self.val_processed_path = val_processed_path
22
- self.test_processed_path = test_processed_path
23
-
24
- self.batch_size = batch_size
25
- self.transform = torch.nn.Sequential(AttributeSelector(attrs))
26
-
27
- self.train_set = None
28
- self.val_set = None
29
- self.test_set = None
30
-
31
- def setup(self, stage=None):
32
- self.train_set = SimilarityVectorDataset(
33
- self.train_processed_path, transform=self.transform
34
- )
35
- self.val_set = SimilarityVectorDataset(
36
- self.val_processed_path, transform=self.transform
37
- )
38
- self.test_set = SimilarityVectorDataset(
39
- self.test_processed_path, transform=self.transform
40
- )
41
-
42
- def train_dataloader(self):
43
- return DataLoader(
44
- self.train_set, batch_size=self.batch_size, num_workers=0, shuffle=True
45
- )
46
-
47
- def val_dataloader(self):
48
- return DataLoader(self.val_set, batch_size=self.batch_size, num_workers=0)
49
-
50
- def test_dataloader(self):
51
- return DataLoader(self.test_set, batch_size=self.batch_size, num_workers=0)
 
marcai/pl/similarity_vector_dataset.py DELETED
@@ -1,26 +0,0 @@
- from torch.utils.data import Dataset
- import numpy as np
- import pandas as pd
-
-
- class SimilarityVectorDataset(Dataset):
-
-     def __init__(self, processed_path: str, transform=None):
-
-         self.transform = transform
-         self.data = pd.read_csv(processed_path)
-
-     def __len__(self):
-         return self.data.shape[0]
-
-     def __getitem__(self, idx):
-         row = self.data.iloc[idx].to_dict()
-
-         label = float(float(row['cid']) == 1.0)
-
-         if self.transform:
-             row = self.transform(row)
-
-         row = np.array(list(row.values())).astype(float)
-
-         return row, label
 
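Reviewer note: taken together, the removed AttributeSelector and SimilarityVectorDataset turn one processed row into a feature vector (the configured attributes, in configured order) and a binary label derived from the cid column. A plain-Python sketch of that combined behaviour, assuming rows are simple dicts; the helper name to_example is hypothetical.

```python
def to_example(row, attrs):
    """Turn one processed row (a dict) into (feature_vector, label)."""
    # The label is 1.0 only when the `cid` comparison is exactly 1.0.
    # Both 0.0 (different) and 0.5 (the null value) map to 0.0.
    label = float(float(row["cid"]) == 1.0)

    # Keep only the configured attributes, in the configured order,
    # silently skipping any attribute missing from the row.
    features = [float(row[key]) for key in attrs if key in row]
    return features, label
```

Because the order follows attrs rather than the CSV column order, the same feature list has to be used at training and inference time.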
 
marcai/pl/similarity_vector_model.py DELETED
@@ -1,90 +0,0 @@
- import pytorch_lightning as pl
- import torch
- import torch.nn as nn
- from torchmetrics import Accuracy
-
-
- class SimilarityVectorModel(pl.LightningModule):
-     def __init__(self, lr, weight_decay, optimizer, batch_size, attrs, hidden_sizes):
-         super().__init__()
-
-         # Hyperparameters
-         self.attrs = attrs
-         self.lr = lr
-         self.weight_decay = weight_decay
-         self.optimizer = optimizer
-         self.batch_size = batch_size
-         self.save_hyperparameters()
-
-         # Create model layers
-         layer_sizes = [len(attrs)] + hidden_sizes + [1]
-         layers = []
-         for i in range(len(layer_sizes) - 1):
-             in_size, out_size = layer_sizes[i], layer_sizes[i + 1]
-             layers.append(nn.Linear(in_size, out_size))
-
-             if i < len(layer_sizes) - 2:
-                 layers.append(nn.ReLU())
-
-         self.layers = nn.Sequential(*layers)
-
-         self.sigmoid = nn.Sigmoid()
-         self.criterion = nn.BCEWithLogitsLoss()
-         self.accuracy = Accuracy(task="binary")
-
-     def forward(self, x):
-         return self.layers(x)
-
-     def predict(self, x):
-         return self.sigmoid(self(x))
-
-     def training_step(self, batch, batch_idx):
-         sim, label = batch
-         pred = self(sim.float())
-         label = label.unsqueeze(1)
-
-         loss = self.criterion(pred, label)
-         acc = self.accuracy(pred, label.long())
-
-         self.log("train_loss", loss, on_step=False, on_epoch=True)
-         self.log("train_acc", acc, on_step=False, on_epoch=True)
-
-         return loss
-
-     def validation_step(self, batch, batch_idx):
-         sim, label = batch
-         pred = self(sim.float())
-         label = label.unsqueeze(1)
-
-         loss = self.criterion(pred, label)
-         acc = self.accuracy(pred, label.long())
-
-         self.log("val_loss", loss, on_step=False, on_epoch=True)
-         self.log("val_acc", acc, on_step=False, on_epoch=True, prog_bar=True)
-
-         return loss
-
-     def test_step(self, batch, batch_idx):
-         sim, label = batch
-         pred = self(sim.float())
-         label = label.unsqueeze(1)
-
-         loss = self.criterion(pred, label)
-         acc = self.accuracy(pred, label.long())
-
-         self.log("test_loss", loss, on_step=False, on_epoch=True)
-         self.log("test_acc", acc, on_step=False, on_epoch=True, prog_bar=True)
-
-         return loss
-
-     def configure_optimizers(self):
-         optimizers = {
-             "Adadelta": torch.optim.Adadelta,
-             "Adagrad": torch.optim.Adagrad,
-             "Adam": torch.optim.Adam,
-             "RMSprop": torch.optim.RMSprop,
-             "SGD": torch.optim.SGD,
-         }
-         return optimizers[self.optimizer](
-             self.parameters(), lr=self.lr, weight_decay=self.weight_decay
-         )
 
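Reviewer note: the removed model builds its network from layer_sizes = [len(attrs)] + hidden_sizes + [1], placing a ReLU after every linear layer except the last. The sketch below reproduces only that sizing logic in plain Python, without torch, so the resulting shape is easy to inspect. The function name layer_plan and the tuple representation are illustrative assumptions.

```python
def layer_plan(n_features, hidden_sizes):
    """List the layers the sizing loop would create, as simple tuples."""
    sizes = [n_features] + list(hidden_sizes) + [1]
    plan = []
    for i in range(len(sizes) - 1):
        plan.append(("Linear", sizes[i], sizes[i + 1]))
        # No activation after the final layer: the output stays a raw logit.
        if i < len(sizes) - 2:
            plan.append(("ReLU",))
    return plan


if __name__ == "__main__":
    # 7 features and hidden sizes [32, 64], as in the removed demo/config.yaml.
    for layer in layer_plan(7, [32, 64]):
        print(layer)
```

With the seven features and hidden sizes 32 and 64 from the removed demo/config.yaml, this gives three linear layers (7 to 32, 32 to 64, 64 to 1) with two ReLUs between them.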
 
marcai/predict.py DELETED
@@ -1,76 +0,0 @@
1
- import argparse
2
-
3
- import numpy as np
4
- import onnxruntime
5
- import pandas as pd
6
-
7
- from marcai.utils import load_config
8
-
9
-
10
- def sigmoid(x):
11
- return 1 / (1 + np.exp(-1 * x))
12
-
13
-
14
- def predict_onnx(model_onnx_path, data):
15
- ort_session = onnxruntime.InferenceSession(model_onnx_path)
16
-
17
- x = data.to_numpy(dtype=np.float32)
18
-
19
- input_name = ort_session.get_inputs()[0].name
20
- ort_inputs = {input_name: x}
21
- ort_outs = np.array(ort_session.run(None, ort_inputs))
22
- ort_outs = sigmoid(ort_outs)
23
-
24
- return ort_outs
25
-
26
- def args_parser():
27
- parser = argparse.ArgumentParser()
28
- parser.add_argument(
29
- "-i", "--input", help="Path to preprocessed data file", required=True
30
- )
31
- parser.add_argument("-o", "--output", help="Output path", required=True)
32
- parser.add_argument(
33
- "-m",
34
- "--model-dir",
35
- help="Directory containing model ONNX and YAML files",
36
- required=True,
37
- )
38
- parser.add_argument(
39
- "--chunksize",
40
- help="Chunk size for reading and predicting",
41
- default=1024,
42
- type=int,
43
- )
44
- return parser
45
-
46
-
47
- def main(args):
48
- config_path = f"{args.model_dir}/config.yaml"
49
- model_onnx = f"{args.model_dir}/model.onnx"
50
-
51
- config = load_config(config_path)
52
-
53
- # Load data
54
- data = pd.read_csv(args.input, chunksize=args.chunksize)
55
-
56
- written = False
57
- for chunk in data:
58
- # Limit columns to model input features
59
- input_df = chunk[config["model"]["features"]]
60
-
61
- prediction = predict_onnx(model_onnx, input_df)
62
-
63
- # Add prediction to chunk
64
- chunk["prediction"] = prediction.squeeze()
65
-
66
- # Append to CSV
67
- if not written:
68
- chunk.to_csv(args.output, index=False)
69
- written = True
70
- else:
71
- chunk.to_csv(args.output, mode="a", header=False, index=False)
72
-
73
-
74
- if __name__ == "__main__":
75
- args = args_parser().parse_args()
76
- main(args)
 
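Reviewer note: the model is trained with BCEWithLogitsLoss, so its raw output is a logit. The removed predict_onnx applies a sigmoid to obtain a match confidence, and app.py reports the complement as "not match". Here is a small sketch of that conversion using only the math module. The branch on the sign of x is an added numerical-stability choice, not something taken from the removed code, and match_scores is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def match_scores(logit):
    """Map a raw model output to the two labels shown by the demo."""
    p = sigmoid(logit)
    return {"match": p, "not match": 1 - p}
```

A logit of 0 corresponds to a confidence of exactly 0.5, so a threshold such as 0.99 only accepts pairs with a strongly positive logit.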
 
marcai/process.py DELETED
@@ -1,254 +0,0 @@
1
- import argparse
2
- import concurrent.futures
3
- import csv
4
- import time
5
- from multiprocessing import get_context
6
-
7
- import numpy as np
8
- import pandas as pd
9
- from more_itertools import chunked
10
-
11
- import marcai.processing.comparisons as comps
12
- import marcai.processing.normalizations as norms
13
- from marcai.utils.parsing import load_records, record_dict
14
-
15
-
16
- def multiprocess_pairs(
17
- records_df,
18
- pair_indices,
19
- chunksize=50000,
20
- processes=1,
21
- ):
22
- # Create chunked iterator
23
- pairs_chunked = chunked(pair_indices, chunksize)
24
-
25
- # Create processing jobs
26
- max_jobs = processes * 2
27
-
28
- context = get_context("fork")
29
-
30
- with concurrent.futures.ProcessPoolExecutor(
31
- max_workers=processes, mp_context=context
32
- ) as executor:
33
- futures = set()
34
- done = set()
35
- first_spawn = True
36
-
37
- while futures or first_spawn:
38
- if first_spawn:
39
- spawn_count = max_jobs
40
- first_spawn = False
41
- else:
42
- # Wait for a job to complete
43
- done, futures = concurrent.futures.wait(
44
- futures, return_when=concurrent.futures.FIRST_COMPLETED
45
- )
46
- spawn_count = max_jobs - len(futures)
47
-
48
- for future in done:
49
- # Get job's output
50
- df = future.result()
51
-
52
- # Yield output
53
- yield df
54
-
55
- # Spawn jobs
56
- for _ in range(spawn_count):
57
- pairs_chunk = next(pairs_chunked, None)
58
-
59
- if pairs_chunk is None:
60
- break
61
-
62
- indices = np.array(pairs_chunk).astype(int)
63
-
64
- left_indices = indices[:, 0]
65
- right_indices = indices[:, 1]
66
-
67
- left_records = records_df.iloc[left_indices].reset_index(drop=True)
68
- right_records = records_df.iloc[right_indices].reset_index(drop=True)
69
-
70
- futures.add(executor.submit(process, left_records, right_records))
71
-
72
-
73
- def process(df0, df1):
74
- normalize_fields = [
75
- "author_names",
76
- "corporate_names",
77
- "meeting_names",
78
- "publisher",
79
- "title",
80
- "title_a",
81
- "title_b",
82
- "title_c",
83
- "title_p",
84
- ]
85
-
86
- # Normalize text fields
87
- for field in normalize_fields:
88
- df0[field] = norms.lowercase(df0[field])
89
- df1[field] = norms.lowercase(df1[field])
90
-
91
- df0[field] = norms.remove_punctuation(df0[field])
92
- df1[field] = norms.remove_punctuation(df1[field])
93
-
94
- df0[field] = norms.remove_diacritics(df0[field])
95
- df1[field] = norms.remove_diacritics(df1[field])
96
-
97
- df0[field] = norms.normalize_whitespace(df0[field])
98
- df1[field] = norms.normalize_whitespace(df1[field])
99
-
100
- # Compare fields
101
- result_df = pd.DataFrame()
102
-
103
- result_df["id_0"] = df0["id"]
104
- result_df["id_1"] = df1["id"]
105
-
106
- result_df["raw_tokenset"] = comps.token_set_similarity(
107
- df0["raw"], df1["raw"], null_value=0.5
108
- )
109
-
110
- # Token sort ratio
111
- result_df["publisher"] = comps.token_sort_similarity(
112
- df0["publisher"], df1["publisher"], null_value=0.5
113
- )
114
-
115
- author_names = comps.token_sort_similarity(
116
- df0["author_names"], df1["author_names"], null_value=np.nan
117
- )
118
- corporate_names = comps.token_sort_similarity(
119
- df0["corporate_names"], df1["corporate_names"], null_value=np.nan
120
- )
121
- meeting_names = comps.token_sort_similarity(
122
- df0["meeting_names"], df1["meeting_names"], null_value=np.nan
123
- )
124
- authors = pd.concat([author_names, corporate_names, meeting_names], axis=1)
125
-
126
- # Take max of author comparisons
127
- result_df["author"] = comps.maximum(authors, null_value=0.5)
128
-
129
- # Weighted title comparison
130
- weights = {"title_a": 1, "raw": 0, "title_p": 1}
131
-
132
- result_df["title_agg"] = comps.column_aggregate_similarity(
133
- df0[weights.keys()], df1[weights.keys()], weights.values(), null_value=0
134
- )
135
-
136
- # Length difference
137
- result_df["title_length"] = comps.length_similarity(
138
- df0["title"], df1["title"], null_value=0.5
139
- )
140
-
141
- # Token set similarity
142
- result_df["title_tokenset"] = comps.token_set_similarity(
143
- df0["title"], df1["title"], null_value=0
144
- )
145
-
146
- # Token sort ratio
147
- result_df["title_tokensort"] = comps.token_sort_similarity(
148
- df0["title"], df1["title"], null_value=0
149
- )
150
-
151
- # Levenshtein
152
- result_df["title_levenshtein"] = comps.levenshtein_similarity(
153
- df0["title"], df1["title"], null_value=0
154
- )
155
-
156
- # Jaro
157
- result_df["title_jaro"] = comps.jaro_similarity(
158
- df0["title"], df1["title"], null_value=0
159
- )
160
-
161
- # Jaro Winkler
162
- result_df["title_jaro_winkler"] = comps.jaro_winkler_similarity(
163
- df0["title"], df1["title"], null_value=0
164
- )
165
-
166
- # Pagination
167
- result_df["pagination"] = comps.pagination_match(
168
- df0["pagination"], df1["pagination"], null_value=0.5
169
- )
170
-
171
- # Dates
172
- result_df["pub_date"] = comps.year_similarity(
173
- df0["pub_date"], df1["pub_date"], null_value=0.5, exp_coeff=0.15
174
- )
175
-
176
- # Pub place
177
- result_df["pub_place"] = comps.equal(
178
- df0["pub_place"], df1["pub_place"], null_value=0.5
179
- )
180
-
181
- # CID/Label
182
- result_df["cid"] = comps.equal(df0["cid"], df1["cid"], null_value=0.5)
183
-
184
- return result_df
185
-
186
-
187
- def args_parser():
188
- parser = argparse.ArgumentParser(
189
- formatter_class=argparse.ArgumentDefaultsHelpFormatter
190
- )
191
-
192
- required = parser.add_argument_group("required arguments")
193
- required.add_argument("-i", "--inputs", nargs="+", help="MARC files", required=True)
194
- required.add_argument("-o", "--output", help="Output file", required=True)
195
-
196
- parser.add_argument(
197
- "-C",
198
- "--chunksize",
199
- type=int,
200
- help="Number of comparisons per job",
201
- default=50000,
202
- )
203
- parser.add_argument(
204
- "-p", "--pair-indices", help="File containing indices of comparisons"
205
- )
206
- parser.add_argument(
207
- "-P",
208
- "--processes",
209
- type=int,
210
- help="Number of processes to run in parallel.",
211
- default=1,
212
- )
213
-
214
- return parser
215
-
216
-
217
- def main(args):
218
- start = time.time()
219
-
220
- # Load records
221
- print("Loading records...")
222
- records = []
223
- for path in args.inputs:
224
- records.extend([record_dict(r) for r in load_records(path)])
225
-
226
- records_df = pd.DataFrame(records)
227
-
228
- print(f"Loaded {len(records)} records.")
229
-
230
- print("Processing records...")
231
- # Process records
232
- written = False
233
- with open(args.pair_indices, "r") as indices_file:
234
- reader = csv.reader(indices_file)
235
-
236
- for df in multiprocess_pairs(
237
- records_df, reader, args.chunksize, args.processes
238
- ):
239
- if not written:
240
- # Write header
241
- df.to_csv(args.output, mode="w", header=True, index=False)
242
- written = True
243
- else:
244
- # Write rows of df to output CSV
245
- df.to_csv(args.output, mode="a", header=False, index=False)
246
-
247
- end = time.time()
248
- print(f"Processed {len(records)} records.")
249
- print(f"Time elapsed: {end - start:.2f} seconds.")
250
-
251
-
252
- if __name__ == "__main__":
253
- args = args_parser().parse_args()
254
- main(args)
 
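Reviewer note: multiprocess_pairs in the removed process.py keeps at most processes * 2 jobs in flight, yields each result as it completes, and tops the queue back up from a chunked iterator. The sketch below shows the same scheduling idea in a self-contained form. It swaps in a thread pool for the fork-based process pool purely to keep the example portable, and the names chunked and bounded_map are illustrative.

```python
import concurrent.futures
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bounded_map(fn, items, chunksize=2, workers=2):
    """Yield fn(chunk) results while keeping at most 2 * workers jobs queued."""
    max_jobs = workers * 2
    chunks = chunked(items, chunksize)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # Prime the pool with the first batch of jobs.
        futures = {executor.submit(fn, chunk) for chunk in islice(chunks, max_jobs)}
        while futures:
            done, futures = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED
            )
            for future in done:
                yield future.result()
            # Refill so the number of pending jobs returns to max_jobs.
            for chunk in islice(chunks, max_jobs - len(futures)):
                futures.add(executor.submit(fn, chunk))
```

Results arrive in completion order rather than submission order, so callers that need a stable order would have to sort or carry an index with each chunk.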
 
marcai/processing/__init__.py DELETED
@@ -1 +0,0 @@
1
-
 
marcai/processing/comparisons.py DELETED
@@ -1,227 +0,0 @@
1
- import numpy as np
2
- import re
3
- import pandas as pd
4
- from thefuzz import fuzz
5
- import textdistance
6
-
7
-
8
-
9
- HAND_COUNT_PAGE_PATTERN = re.compile(r"\[(?P<hand_count>\d+)\]\s*p(ages)?[^\w]")
10
- PAGE_PATTERN = re.compile(r"(?P<pages>\d+)\s*p(ages)?[^\w]")
11
-
12
-
13
- def equal(se0, se1, null_value):
14
- se0_np = se0.to_numpy(dtype=str)
15
- se1_np = se1.to_numpy(dtype=str)
16
-
17
- col = (se0_np == se1_np).astype(float)
18
-
19
- se0_nulls = np.argwhere(np.char.strip(se0_np, " ") == "")
20
- se1_nulls = np.argwhere(np.char.strip(se1_np, " ") == "")
21
-
22
- col[se0_nulls] = null_value
23
- col[se1_nulls] = null_value
24
-
25
- return pd.Series(col)
26
-
27
-
28
- def maximum(df, null_value, ignore_value=np.nan):
29
- df_np = df.to_numpy(dtype=float)
30
-
31
- df_np[df_np == ignore_value] = np.nan
32
-
33
- # Mask ignore_value
34
- masked = np.ma.masked_invalid(df_np)
35
-
36
- # Get the max, ignoring NaNs
37
- col = np.max(masked, axis=1)
38
-
39
- # Replace NaNs with null_value
40
- col = col.filled(fill_value=null_value)
41
-
42
- return pd.Series(col)
43
-
44
-
45
- def minimum(se0, se1, null_value, ignore_value=np.nan):
46
- se0_np = se0.to_numpy(dtype=float)
47
- se1_np = se1.to_numpy(dtype=float)
48
-
49
- # Replace ignore_value with np.nans
50
- se0_np[se0_np == ignore_value] = np.nan
51
- se1_np[se1_np == ignore_value] = np.nan
52
-
53
- # Get the min, ignoring NaNs
54
- col = np.nanmin(np.stack([se0_np, se1_np], axis=1), axis=1)
55
-
56
- # Replace NaNs with null_value
57
- col[np.isnan(col)] = null_value
58
-
59
- return pd.Series(col)
60
-
61
-
62
- def pagination_match(se0, se1, null_value):
63
- def group_values(pat, group, s):
64
- return {m.groupdict()[group] for m in pat.finditer(s)}
65
-
66
- def compare(pag0, pag1):
67
- hand_counts0 = group_values(HAND_COUNT_PAGE_PATTERN, "hand_count", pag0)
68
- hand_counts1 = group_values(HAND_COUNT_PAGE_PATTERN, "hand_count", pag1)
69
-
70
- # Remove bracketed digits
71
- pag0 = re.sub(r"\[\d+\]", "", pag0)
72
- pag1 = re.sub(r"\[\d+\]", " ", pag1)
73
-
74
- # Remove punctuation
75
- pag0 = re.sub(r"[^\w\s]", " ", pag0)
76
- pag1 = re.sub(r"[^\w\s]", " ", pag1)
77
-
78
- # Extract page counts
79
- counts0 = group_values(PAGE_PATTERN, "pages", pag0 + " ")
80
- counts1 = group_values(PAGE_PATTERN, "pages", pag1 + " ")
81
-
82
- page_counts0 = counts0 | hand_counts0
83
- page_counts1 = counts1 | hand_counts1
84
-
85
- # Check if any pages are in common.
86
- if page_counts0 and page_counts1:
87
- for pg0 in page_counts0:
88
- for pg1 in page_counts1:
89
- pg0 = int(pg0)
90
- pg1 = int(pg1)
91
-
92
- if pg0 == pg1:
93
- return 1.0
94
- return 0.0
95
-
96
- return null_value
97
-
98
- se0_np = se0.to_numpy(dtype=str)
99
- se1_np = se1.to_numpy(dtype=str)
100
-
101
- col = np.vectorize(compare)(se0_np, se1_np)
102
- return pd.Series(col)
103
-
104
-
105
- def year_similarity(se0, se1, null_value, exp_coeff):
106
- def compare(yr0, yr1):
107
- if yr0.isnumeric() and yr1.isnumeric():
108
- x = abs(int(yr0) - int(yr1))
109
-
110
- # Sigmoid where x = 0, y = 1, tail to the right
111
- return 2 / (1 + np.exp(exp_coeff * x))
112
-
113
- return null_value
114
-
115
- se0_np = se0.to_numpy(dtype=str)
116
- se1_np = se1.to_numpy(dtype=str)
117
-
118
- return np.vectorize(compare)(se0_np, se1_np)
-
-
-def column_aggregate_similarity(df0, df1, column_weights, null_value):
-    weights_dict = {k: v for k, v in zip(df0.columns, column_weights)}
-
-    def get_word_weights(row):
-        word_weights = {}
-        for i, value in enumerate(row):
-            column = df0.columns[i]
-            if column in weights_dict:
-                current_weight = weights_dict[column]
-            else:
-                current_weight = 0
-
-            for w in value.split():
-                if w not in word_weights:
-                    word_weights[w] = current_weight
-                else:
-                    word_weights[w] = max(current_weight, word_weights[w])
-        return word_weights
-
-    def compare(row0, row1):
-        weights0 = get_word_weights(row0)
-        weights1 = get_word_weights(row1)
-
-        total_weight = 0
-        missing_weight = 0
-
-        for w in weights0:
-            weight = weights0[w]
-            if w not in weights1:
-                missing_weight += weights0[w]
-            else:
-                weight = max(weight, weights1[w])
-            total_weight += weight
-
-        for w in weights1:
-            weight = weights1[w]
-            if w not in weights0:
-                missing_weight += weights1[w]
-            else:
-                weight = max(weight, weights0[w])
-            total_weight += weight
-
-        if total_weight == 0:
-            return null_value
-
-        return float((total_weight - missing_weight) / total_weight)
-
-    if df0.columns.to_list() != df1.columns.to_list():
-        raise ValueError("DataFrames must have the same columns")
-
-    # Run compare on rows of each df
-    col = np.array(
-        [compare(row0, row1) for row0, row1 in zip(df0.to_numpy(), df1.to_numpy())]
-    )
-
-    return pd.Series(col)
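
A worked single-row example may make the weighting in `column_aggregate_similarity` easier to follow. This is a sketch over plain lists rather than DataFrames, assuming the running total is updated for every word (shared or not); `word_weights` and `aggregate_score` are hypothetical names, and the column weights `[1, 2]` are made up:

```python
def word_weights(row, weights):
    """Map each word to the largest weight among the columns it appears in."""
    result = {}
    for value, weight in zip(row, weights):
        for word in value.split():
            result[word] = max(weight, result.get(word, 0))
    return result


def aggregate_score(row0, row1, weights, null_value):
    w0 = word_weights(row0, weights)
    w1 = word_weights(row1, weights)
    total = 0
    missing = 0
    # Each side is walked once, so a shared word counts twice toward the total.
    for own, other in ((w0, w1), (w1, w0)):
        for word, weight in own.items():
            if word in other:
                weight = max(weight, other[word])
            else:
                missing += weight
            total += weight
    if total == 0:
        return null_value
    return (total - missing) / total


# Columns: title (weight 1), author (weight 2).
score = aggregate_score(["the hobbit", "tolkien"], ["hobbit", "tolkien"], [1, 2], 0.5)
empty = aggregate_score(["", ""], ["", ""], [1, 2], 0.5)
```

Here the only unmatched word is "the" (weight 1) against a total weight of 7, so the score is 6/7; two rows with no words at all fall back to the null value.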
-
-
-def length_similarity(se0, se1, null_value):
-    se0_np = se0.to_numpy(dtype=str)
-    se1_np = se1.to_numpy(dtype=str)
-
-    col = np.array([1 - abs(len(s0) - len(s1)) / max(len(s0), len(s1)) for s0, s1 in zip(se0_np, se1_np)])
-
-    # If either string is empty, set similarity to null_value
-    col[(se0_np == "") | (se1_np == "")] = null_value
-
-    return pd.Series(col)
-
-
-def jaccard_similarity(se0, se1, null_value):
-    se0_np = se0.to_numpy(dtype=str)
-    se1_np = se1.to_numpy(dtype=str)
-
-    col = np.array([textdistance.jaccard.normalized_similarity(set(s0.split()), set(s1.split())) for s0, s1 in zip(se0_np, se1_np)])
-
-    # If either string is empty, set similarity to null_value
-    col[(se0_np == "") | (se1_np == "")] = null_value
-
-    return pd.Series(col)
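
In `length_similarity` the division runs before the empty-string mask is applied, so a pair of two empty strings would divide by `max(0, 0)` and raise `ZeroDivisionError`. A minimal scalar sketch of one way to order the checks, using a hypothetical `length_score` helper rather than the project's vectorized function:

```python
def length_score(s0: str, s1: str, null_value: float) -> float:
    """Relative length agreement of two strings, or null_value if either is empty."""
    # Check for empty input first so max(len(s0), len(s1)) is never 0.
    if s0 == "" or s1 == "":
        return null_value
    return 1 - abs(len(s0) - len(s1)) / max(len(s0), len(s1))


both_empty = length_score("", "", null_value=0.5)
half = length_score("abcd", "ab", null_value=0.5)
equal = length_score("abc", "xyz", null_value=0.5)
```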
-
-
-def similarity_factory(similarity_function):
-    def similarity(se0, se1, null_value):
-        se0_np = se0.to_numpy(dtype=str)
-        se1_np = se1.to_numpy(dtype=str)
-
-        col = np.vectorize(similarity_function)(se0_np, se1_np)
-
-        # Replace original null values with null_value
-        col[se0_np == ""] = null_value
-        col[se0_np == ""] = null_value
-
-        return pd.Series(col)
-
-    return similarity
-
-
-token_set_similarity = similarity_factory(
-    lambda s0, s1: fuzz.token_set_ratio(s0, s1) / 100
-)
-token_sort_similarity = similarity_factory(
-    lambda s0, s1: fuzz.token_sort_ratio(s0, s1) / 100
-)
-levenshtein_similarity = similarity_factory(lambda s0, s1: (fuzz.ratio(s0, s1) / 100))
-jaro_winkler_similarity = similarity_factory(lambda s0, s1: textdistance.jaro_winkler.similarity(s0, s1))
-jaro_similarity = similarity_factory(lambda s0, s1: textdistance.jaro.similarity(s0, s1))
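
Both mask lines in `similarity_factory` test `se0_np`, so a pair whose second value is empty keeps its computed score instead of `null_value`. Below is a sketch of a variant that treats either side as missing. It is written over plain lists, with `difflib.SequenceMatcher` standing in for `fuzz.ratio`; both substitutions are assumptions made to keep the example dependency-free, and `list_similarity_factory` is a hypothetical name:

```python
from difflib import SequenceMatcher


def list_similarity_factory(similarity_function):
    """Hypothetical list-based variant of similarity_factory."""

    def similarity(values0, values1, null_value):
        scores = []
        for s0, s1 in zip(values0, values1):
            # Treat the pair as missing if either side is empty.
            if s0 == "" or s1 == "":
                scores.append(null_value)
            else:
                scores.append(similarity_function(s0, s1))
        return scores

    return similarity


# SequenceMatcher.ratio() is a stand-in for fuzz.ratio(s0, s1) / 100.
ratio_similarity = list_similarity_factory(
    lambda s0, s1: SequenceMatcher(None, s0, s1).ratio()
)

scores = ratio_similarity(["dune", "", "emma"], ["dune", "emma", ""], null_value=0.5)
```

In the vectorized original, the equivalent change would be to mask on `se1_np == ""` in the second assignment.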
 
marcai/processing/normalizations.py DELETED
@@ -1,36 +0,0 @@
-from unidecode import unidecode
-import numpy as np
-import pandas as pd
-
-
-def remove_diacritics(series):
-    se_np = series.to_numpy()
-    se_np = np.vectorize(unidecode)(se_np)
-    return pd.Series(se_np)
-
-
-def lowercase(series):
-    return series.str.lower()
-
-
-def remove_punctuation(series):
-    return series.str.replace(r"[^\w\s]", "")
-
-
-def normalize_whitespace(series):
-    # Replace all whitespace with a single space
-    s = series.str.replace(r"\s", " ")
-    # Remove leading and trailing whitespace
-    s = s.str.strip()
-    # Remove double spaces
-    return s.str.replace(r"\s+", " ")
-
-
-def substring(series, start, end):
-    return series.str[start:end]
-
-
-def apply_normalizers(series, transforms):
-    for transform in transforms:
-        series = transform(series)
-    return series
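
The normalizers are plain functions chained in order by `apply_normalizers`. The sketch below applies the same chain to a single string using only the standard library `re` module; the functions are string-level stand-ins, not the project's pandas versions. Note that since pandas 2.0, `Series.str.replace` treats its pattern literally unless `regex=True` is passed, so the regex patterns in the deleted module rely on the older default.

```python
import re


def lowercase(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    # Drop everything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)


def normalize_whitespace(text: str) -> str:
    # Turn every whitespace character into a space, trim, then collapse runs.
    text = re.sub(r"\s", " ", text).strip()
    return re.sub(r"\s+", " ", text)


def apply_normalizers(text: str, transforms) -> str:
    for transform in transforms:
        text = transform(text)
    return text


cleaned = apply_normalizers(
    "  The Hobbit;\tor, There and  Back Again. ",
    [lowercase, remove_punctuation, normalize_whitespace],
)
```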
 
marcai/train.py DELETED
@@ -1,103 +0,0 @@
-import argparse
-import os
-import tarfile
-import warnings
-
-import pytorch_lightning as lightning
-import torch
-import yaml
-from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
-
-from marcai.pl import MARCDataModule, SimilarityVectorModel
-from marcai.utils import load_config
-
-
-def train(name=None):
-    config_path = "config.yaml"
-    config = load_config(config_path)
-    model_config = load_config(config_path)["model"]
-
-    # Create data module from processed data
-    warnings.filterwarnings("ignore", ".*does not have many workers.*")
-    data = MARCDataModule(
-        model_config["train_processed_path"],
-        model_config["val_processed_path"],
-        model_config["test_processed_path"],
-        model_config["features"],
-        model_config["batch_size"],
-    )
-
-    # Create model
-    model = SimilarityVectorModel(
-        model_config["lr"],
-        model_config["weight_decay"],
-        model_config["optimizer"],
-        model_config["batch_size"],
-        model_config["features"],
-        model_config["hidden_sizes"],
-    )
-
-    save_dir = os.path.join(model_config["saved_models_dir"], name)
-    os.makedirs(save_dir, exist_ok=True)
-
-    # Save best models
-    checkpoint_callback = ModelCheckpoint(
-        monitor="val_acc", mode="max", dirpath=save_dir, filename="model"
-    )
-    callbacks = [checkpoint_callback]
-
-    if model_config["patience"] != -1:
-        early_stop_callback = EarlyStopping(
-            monitor="val_acc",
-            min_delta=0.00,
-            patience=model_config["patience"],
-            verbose=False,
-            mode="max",
-        )
-        callbacks.append(early_stop_callback)
-
-    trainer = lightning.Trainer(
-        max_epochs=model_config["max_epochs"], callbacks=callbacks, accelerator="cpu"
-    )
-    trainer.fit(model, data)
-
-    # Save ONNX
-    onnx_path = os.path.join(save_dir, "model.onnx")
-    input_sample = torch.randn((1, len(model.attrs)))
-    torch.onnx.export(
-        model,
-        input_sample,
-        onnx_path,
-        export_params=True,
-        do_constant_folding=True,
-        input_names=["input"],
-        output_names=["output"],
-        dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
-    )
-
-    # Save config
-    config_filename = os.path.join(save_dir, "config.yaml")
-
-    with open(config_filename, "w") as f:
-        dump = yaml.dump(config)
-        f.write(dump)
-
-    # Compress model directory files
-    tar_path = f"{save_dir}/{name}.tar.gz"
-    with tarfile.open(tar_path, mode="w:gz") as archive:
-        archive.add(save_dir, arcname=os.path.basename(save_dir))
-
-
-def args_parser():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-n", "--run-name", help="Name for training run", required=True)
-    return parser
-
-
-def main(args):
-    train(args.run_name)
-
-
-if __name__ == "__main__":
-    args = args_parser().parse_args()
-    main(args)
 
marcai/utils/__init__.py DELETED
@@ -1 +0,0 @@
-from .load_config import load_config

marcai/utils/load_config.py DELETED
@@ -1,6 +0,0 @@
-import yaml
-
-
-def load_config(filename):
-    with open(filename, 'r') as file:
-        return yaml.safe_load(file)
 
marcai/utils/parsing.py DELETED
@@ -1,93 +0,0 @@
-from collections import OrderedDict
-
-import pymarc
-
-
-def get_record_values(record, location):
-    split = location.split("$")
-
-    if len(split) == 1:
-        tag = split[0]
-        code = None
-    elif len(split) == 2:
-        tag, code = split
-    else:
-        raise ValueError("Invalid location")
-
-    # Find fields matching tag
-    fields = record.get_fields(tag)
-
-    results = []
-    for current_value in fields:
-        if current_value is not None:
-            if code is not None:
-                values = current_value.get_subfields(code)
-                results.extend(values)
-            elif isinstance(current_value, pymarc.Field):
-                results.append(current_value.value())
-
-    return " ".join(results)
-
-
-def record_dict(record):
-    d = OrderedDict()
-
-    # Dump every field value into a string
-    d["raw"] = " ".join([f.value() for f in record.fields])
-
-    d["cid"] = get_record_values(record, "CID")
-    d["id"] = get_record_values(record, "001")
-
-    fixed_data = get_record_values(record, "008")
-    d["pub_date"] = fixed_data[7:11]
-    d["pub_place"] = fixed_data[15:18]
-    d["language"] = fixed_data[35:38]
-
-    d["title_a"] = get_record_values(record, "245$a")
-    d["title_b"] = get_record_values(record, "245$b")
-    d["title_c"] = get_record_values(record, "245$c")
-    d["title_p"] = get_record_values(record, "245$p")
-
-    d["title"] = " ".join([d["title_a"], d["title_b"], d["title_p"]])
-
-    d["title_variation_a"] = get_record_values(record, "246$a")
-    d["title_variation_b"] = get_record_values(record, "246$b")
-
-    d["subject_headings"] = " ".join(
-        get_record_values(record, "650$a") + get_record_values(record, "650$x")
-    )
-
-    d["author_names"] = " ".join(
-        [get_record_values(record, "100$a"), get_record_values(record, "700$a")]
-    )
-    d["corporate_names"] = " ".join(
-        [get_record_values(record, "110$a"), get_record_values(record, "710$a")]
-    )
-    d["meeting_names"] = " ".join(
-        [get_record_values(record, "111$a"), get_record_values(record, "711$a")]
-    )
-
-    d["publisher"] = record.publisher or ""
-
-    d["pagination"] = get_record_values(record, "300$a")
-    d["dimensions"] = get_record_values(record, "300$c")
-
-    return d
-
-
-def load_records(path):
-    records = []
-    extension = path.split(".")[-1]
-    if extension == "mrc" or extension == "marc":
-        with open(path, "rb") as marcfile:
-            reader = pymarc.MARCReader(marcfile)
-            records.extend(list(reader))
-    elif extension == "json":
-        with open(path, "r") as jsonfile:
-            for line in jsonfile:
-                record = pymarc.parse_json_to_array(line)[0]
-                records.append(record)
-    else:
-        raise ValueError(f"Unsupported file extension: {extension}")
-
-    return records
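
One detail in `record_dict` is worth illustrating: `get_record_values` returns a single string, so the `subject_headings` expression concatenates two strings and then joins the result character by character. The other name fields pass a list instead. A small sketch with made-up heading values standing in for the `650$a` and `650$x` lookups; `join_values` is a hypothetical helper that also skips empty values, which the list-based joins in the deleted code do not do:

```python
def join_values(*values: str) -> str:
    """Join already-extracted field strings with single spaces, skipping empties."""
    return " ".join(v for v in values if v)


headings_a = "Libraries Cataloging"  # stand-in for get_record_values(record, "650$a")
headings_x = "History"               # stand-in for get_record_values(record, "650$x")

# str + str yields one string, and " ".join() then iterates over its characters.
per_character = " ".join(headings_a + headings_x)

# Passing the values separately keeps each heading intact.
per_field = join_values(headings_a, headings_x)
```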
 
pyproject.toml DELETED
@@ -1,3 +0,0 @@
-[build-system]
-requires = ["setuptools", "wheel"]
-build-backend = "setuptools.build_meta"
 
requirements.txt CHANGED
@@ -1,11 +1,4 @@
-pymarc
-thefuzz
+huggingface-hub
+git+https://github.com/cdlib/marc-ai.git
 pandas
-unidecode
-python-levenshtein
-onnxruntime
-textdistance
-more-itertools
-pyyaml
-onnx
-tqdm
+gradio
 
setup.cfg DELETED
@@ -1,11 +0,0 @@
-[metadata]
-name = marcai
-version = 1.0.0
-
-[options]
-packages = find:
-
-[options.entry_points]
-console_scripts =
-    marc-ai = marcai:cli.main
-