gbpatentdata/lt-patent-inventor-linking

This is a LinkTransformer model. At its core it is a sentence-transformers model; LinkTransformer simply wraps around that class. Take a look at the sentence-transformers documentation if you want to use this model for more than what we support in our applications.

This model was fine-tuned from the base model sentence-transformers/all-mpnet-base-v2, which is pretrained for English (en).
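Because the checkpoint is a standard sentence-transformers model, you can also load it directly with sentence-transformers. The sketch below is illustrative only: the way it concatenates record fields into a single string is an assumption, not necessarily the exact serialization LinkTransformer applies internally.

from sentence_transformers import SentenceTransformer

# Load the fine-tuned checkpoint directly as a sentence-transformers model.
model = SentenceTransformer('gbpatentdata/lt-patent-inventor-linking')

# Fabricated inventor records, serialized as single strings (assumed format).
records = [
    "JOHN SMITH | ENGINEER | 1885 | 12 HIGH STREET, LONDON | SMITH & CO | IMPROVEMENTS IN STEAM ENGINES",
    "J. SMITH | CIVIL ENGINEER | 1886 | HIGH STREET, LONDON | SMITH AND COMPANY | STEAM ENGINE IMPROVEMENTS",
]

# Embeddings are L2-normalized (the model ends with a Normalize() module),
# so the dot product of two embeddings equals their cosine similarity.
embeddings = model.encode(records)
print(embeddings[0] @ embeddings[1])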

Usage (LinkTransformer)

Using this model becomes easy when you have LinkTransformer installed:

pip install -U linktransformer

Then you can use the model like this:

import linktransformer as lt
import pandas as pd

# df is your pandas DataFrame; it must contain the columns listed in `on`
df_lm_matched = lt.cluster_rows(df,
                                model='gbpatentdata/lt-patent-inventor-linking',
                                on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'],
                                cluster_type='SLINK',
                                cluster_params={'threshold': 0.1, 'min cluster size': 1, 'metric': 'cosine'}
)
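For reference, here is a self-contained toy run. The input rows are fabricated, and the name of the output column holding the cluster assignments ('cluster') is an assumption; check the LinkTransformer documentation for your installed version.

import linktransformer as lt
import pandas as pd

# Fabricated inventor records; the first two rows plausibly refer to the same person.
df = pd.DataFrame({
    'name': ['JOHN SMITH', 'J. SMITH', 'MARY JONES'],
    'occupation': ['ENGINEER', 'CIVIL ENGINEER', 'CHEMIST'],
    'year': [1885, 1886, 1890],
    'address': ['12 HIGH STREET, LONDON', 'HIGH STREET, LONDON', 'OXFORD ROAD, MANCHESTER'],
    'firm': ['SMITH & CO', 'SMITH AND COMPANY', ''],
    'patent_title': ['IMPROVEMENTS IN STEAM ENGINES', 'STEAM ENGINE IMPROVEMENTS', 'DYEING PROCESS'],
})

df_clustered = lt.cluster_rows(df,
                               model='gbpatentdata/lt-patent-inventor-linking',
                               on=['name', 'occupation', 'year', 'address', 'firm', 'patent_title'],
                               cluster_type='SLINK',
                               cluster_params={'threshold': 0.1, 'min cluster size': 1, 'metric': 'cosine'})

# Inspect the cluster assignments (column name assumed to be 'cluster').
print(df_clustered[['name', 'cluster']])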

Evaluation

We evaluate using the standard LinkTransformer information retrieval metrics. Our test set evaluations are available here.
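If you want to run a comparable retrieval-style evaluation yourself, the sentence-transformers InformationRetrievalEvaluator can serve as a stand-in. The sketch below uses fabricated query/corpus/relevance data and is not our exact evaluation pipeline.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer('gbpatentdata/lt-patent-inventor-linking')

# Fabricated data: each query should retrieve the corpus entries for the same inventor.
queries = {'q1': 'JOHN SMITH | ENGINEER | 1885 | LONDON'}
corpus = {
    'c1': 'J. SMITH | CIVIL ENGINEER | 1886 | LONDON',
    'c2': 'MARY JONES | CHEMIST | 1890 | MANCHESTER',
}
relevant_docs = {'q1': {'c1'}}

# Small k values because the toy corpus only has two entries.
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs,
                                          accuracy_at_k=[1], precision_recall_at_k=[1],
                                          mrr_at_k=[1], ndcg_at_k=[1], map_at_k=[1],
                                          name='toy-eval')
print(evaluator(model))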

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 31 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.dataloader._InfiniteConstantSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

linktransformer.modified_sbert.losses.SupConLoss_wandb

Parameters of the fit()-Method:

{
    "epochs": 100,
    "evaluation_steps": 16,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 3100,
    "weight_decay": 0.01
}
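For readers who want to reproduce a comparable run outside the LinkTransformer training pipeline, the sketch below maps the hyperparameters above onto a plain sentence-transformers fit() call. It is illustrative only: the training data is fabricated, and BatchAllTripletLoss stands in for linktransformer.modified_sbert.losses.SupConLoss_wandb, the label-supervised contrastive loss actually used.

from torch.optim import AdamW
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Toy labelled examples: records sharing an integer label refer to the same inventor.
train_examples = [
    InputExample(texts=['JOHN SMITH | ENGINEER | 1885 | LONDON'], label=0),
    InputExample(texts=['J. SMITH | CIVIL ENGINEER | 1886 | LONDON'], label=0),
    InputExample(texts=['MARY JONES | CHEMIST | 1890 | MANCHESTER'], label=1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Stand-in loss (the actual run used LinkTransformer's SupConLoss_wandb).
train_loss = losses.BatchAllTripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=100,
    evaluation_steps=16,
    max_grad_norm=1,
    optimizer_class=AdamW,
    optimizer_params={'lr': 2e-05},
    scheduler='WarmupLinear',
    warmup_steps=3100,
    weight_decay=0.01,
)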
Full model architecture:

LinkTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

Citation

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

@article{bct2025,
  title = {300 Years of British Patents},
  author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year = {2025},
  url = {https://arxiv.org/abs/2401.12345}
}

Please also cite the original LinkTransformer authors:

@misc{arora2023linktransformer,
  title = {LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models},
  author = {Abhishek Arora and Melissa Dell},
  year = {2023},
  eprint = {2309.00789},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}