gbpatentdata/lt-patent-inventor-linking

This is a LinkTransformer model. At its core it is a sentence-transformers model; LinkTransformer simply wraps around that class. Take a look at the sentence-transformers documentation if you want to use this model for more than what we support in our applications.

This model was fine-tuned from the base model sentence-transformers/all-mpnet-base-v2, which is pretrained for English (en).
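Because the checkpoint is a standard sentence-transformers model, you can also load it directly with sentence-transformers. The sketch below is illustrative only: the way it concatenates record fields into a single string is an assumption, not necessarily the exact serialization LinkTransformer applies internally.

from sentence_transformers import SentenceTransformer

# Load the fine-tuned checkpoint directly as a sentence-transformers model.
model = SentenceTransformer('gbpatentdata/lt-patent-inventor-linking')

# Fabricated inventor records, serialized as single strings (assumed format).
records = [
    "JOHN SMITH | ENGINEER | 1885 | 12 HIGH STREET, LONDON | SMITH & CO | IMPROVEMENTS IN STEAM ENGINES",
    "J. SMITH | CIVIL ENGINEER | 1886 | HIGH STREET, LONDON | SMITH AND COMPANY | STEAM ENGINE IMPROVEMENTS",
]

# Embeddings are L2-normalized (the model ends with a Normalize() module),
# so the dot product of two embeddings equals their cosine similarity.
embeddings = model.encode(records)
print(embeddings[0] @ embeddings[1])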

Usage (LinkTransformer)

Using this model becomes easy when you have LinkTransformer installed:

pip install -U linktransformer

Then you can use the model like this:

import linktransformer as lt
import pandas as pd

# df is your pandas DataFrame; it must contain the columns listed in `on`
df_lm_matched = lt.cluster_rows(df,
                                model='gbpatentdata/lt-patent-inventor-linking',
                                on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'],
                                cluster_type='SLINK',
                                cluster_params={'threshold': 0.1, 'min cluster size': 1, 'metric': 'cosine'}
)
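For reference, here is a self-contained toy run. The input rows are fabricated, and the name of the output column holding the cluster assignments ('cluster') is an assumption; check the LinkTransformer documentation for your installed version.

import linktransformer as lt
import pandas as pd

# Fabricated inventor records; the first two rows plausibly refer to the same person.
df = pd.DataFrame({
    'name': ['JOHN SMITH', 'J. SMITH', 'MARY JONES'],
    'occupation': ['ENGINEER', 'CIVIL ENGINEER', 'CHEMIST'],
    'year': [1885, 1886, 1890],
    'address': ['12 HIGH STREET, LONDON', 'HIGH STREET, LONDON', 'OXFORD ROAD, MANCHESTER'],
    'firm': ['SMITH & CO', 'SMITH AND COMPANY', ''],
    'patent_title': ['IMPROVEMENTS IN STEAM ENGINES', 'STEAM ENGINE IMPROVEMENTS', 'DYEING PROCESS'],
})

df_clustered = lt.cluster_rows(df,
                               model='gbpatentdata/lt-patent-inventor-linking',
                               on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'],
                               cluster_type='SLINK',
                               cluster_params={'threshold': 0.1, 'min cluster size': 1, 'metric': 'cosine'})

# Inspect the cluster assignments (column name assumed to be 'cluster').
print(df_clustered[['name', 'cluster']])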

Evaluation

We evaluate using the standard LinkTransformer information retrieval metrics. Our test set evaluations are available here.
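If you want to run a comparable retrieval-style evaluation yourself, the sentence-transformers InformationRetrievalEvaluator can serve as a stand-in. The sketch below uses fabricated query/corpus/relevance data and is not our exact evaluation pipeline.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer('gbpatentdata/lt-patent-inventor-linking')

# Fabricated data: each query should retrieve the corpus entries for the same inventor.
queries = {'q1': 'JOHN SMITH | ENGINEER | 1885 | LONDON'}
corpus = {
    'c1': 'J. SMITH | CIVIL ENGINEER | 1886 | LONDON',
    'c2': 'MARY JONES | CHEMIST | 1890 | MANCHESTER',
}
relevant_docs = {'q1': {'c1'}}

# Small k values because the toy corpus only has two entries.
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs,
                                          accuracy_at_k=[1], precision_recall_at_k=[1],
                                          mrr_at_k=[1], ndcg_at_k=[1], map_at_k=[1],
                                          name='toy-eval')
print(evaluator(model))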

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 31 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

linktransformer.modified_sbert.losses.SupConLoss_wandb

Parameters of the fit()-Method:

{
    "epochs": 100,
    "evaluation_steps": 16,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3100,
    "weight_decay": 0.01
}
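For readers who want to reproduce a comparable run outside the LinkTransformer training pipeline, the sketch below maps the hyperparameters above onto a plain sentence-transformers fit() call. It is illustrative only: the training data is fabricated, and BatchAllTripletLoss stands in for linktransformer.modified_sbert.losses.SupConLoss_wandb, the label-supervised contrastive loss actually used.

from torch.optim import AdamW
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Toy labelled examples: records sharing an integer label refer to the same inventor.
train_examples = [
    InputExample(texts=['JOHN SMITH | ENGINEER | 1885 | LONDON'], label=0),
    InputExample(texts=['J. SMITH | CIVIL ENGINEER | 1886 | LONDON'], label=0),
    InputExample(texts=['MARY JONES | CHEMIST | 1890 | MANCHESTER'], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Stand-in loss (the actual run used LinkTransformer's SupConLoss_wandb).
train_loss = losses.BatchAllTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=100,
    evaluation_steps=16,
    max_grad_norm=1,
    optimizer_class=AdamW,
    optimizer_params={'lr': 2e-05},
    scheduler='WarmupLinear',
    warmup_steps=3100,
    weight_decay=0.01,
)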
Full model architecture:

LinkTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

Citation

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

@article{bct2025,
  title = {300 Years of British Patents},
  author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year = {2025},
  url = {https://arxiv.org/abs/2401.12345}
}

Please also cite the original LinkTransformer authors:

@misc{arora2023linktransformer,
  title = {LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author = {Abhishek Arora and Melissa Dell},
  year = {2023},
  eprint = {2309.00789},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}