# HUPD DistilRoBERTa-Base Model

This HUPD DistilRoBERTa model was fine-tuned on the Harvard USPTO Patent Dataset (HUPD) with a masked language modeling objective. It was originally introduced in the paper cited in the Citation section below.

For more information about the Harvard USPTO Patent Dataset, please visit the project website or the project's GitHub repository.
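If you want to inspect the training data directly, the dataset is also hosted on the Hugging Face Hub. Below is a minimal loading sketch, not part of this card's original documentation; it assumes the `HUPD/hupd` dataset script and its `sample` configuration:

```python
from datasets import load_dataset

# Load the small "sample" configuration of HUPD (assumption: the full
# dataset is exposed under a separate configuration).
dataset = load_dataset(
    "HUPD/hupd",
    name="sample",
    trust_remote_code=True,  # HUPD is distributed via a custom loading script
)
print(dataset)
```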

## How to Use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

model = pipeline(task="fill-mask", model="hupd/hupd-distilroberta-base")
model("Improved <mask> for playing a game of thumb wrestling.")
```

Here is the output:

```python
[{'score': 0.4274042248725891,
  'sequence': 'Improved method for playing a game of thumb wrestling.',
  'token': 5448,
  'token_str': ' method'},
 {'score': 0.06967400759458542,
  'sequence': 'Improved system for playing a game of thumb wrestling.',
  'token': 467,
  'token_str': ' system'},
 {'score': 0.06849079579114914,
  'sequence': 'Improved device for playing a game of thumb wrestling.',
  'token': 2187,
  'token_str': ' device'},
 {'score': 0.04544765502214432,
  'sequence': 'Improved apparatus for playing a game of thumb wrestling.',
  'token': 26529,
  'token_str': ' apparatus'},
 {'score': 0.025765646249055862,
  'sequence': 'Improved means for playing a game of thumb wrestling.',
  'token': 839,
  'token_str': ' means'}]
```
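By default, the fill-mask pipeline returns the five highest-scoring candidates. If you need a different number, the pipeline accepts a `top_k` argument; here is a short sketch continuing the example above:

```python
from transformers import pipeline

fill_mask = pipeline(task="fill-mask", model="hupd/hupd-distilroberta-base")
# top_k controls how many candidate completions are returned per mask.
fill_mask("Improved <mask> for playing a game of thumb wrestling.", top_k=2)
```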

Alternatively, you can load the model and use it as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Run on GPU if available, otherwise on CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("hupd/hupd-distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("hupd/hupd-distilroberta-base").to(device)

TEXT = "Improved <mask> for playing a game of thumb wrestling."

inputs = tokenizer(TEXT, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve the positions of all <mask> tokens in the input.
mask_token_indices = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

for mask_idx in mask_token_indices:
    predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
    output = tokenizer.decode(predicted_token_id)
    print(f'Prediction for the <mask> token at index {mask_idx}: "{output}"')
```

Here is the output:

```
Prediction for the <mask> token at index 2: " method"
```
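To recover a ranked list of candidates from this manual setup (similar to the pipeline output above), you can apply a softmax to the logits at each mask position and take the top entries with `torch.topk`. This is a sketch that reuses the variables from the previous block:

```python
# Continuing the example above: list the five most likely fillers per mask.
for mask_idx in mask_token_indices:
    probs = torch.softmax(logits[0, mask_idx], dim=-1)
    top = torch.topk(probs, k=5)
    for score, token_id in zip(top.values, top.indices):
        print(f'{tokenizer.decode(token_id)!r}: {score.item():.4f}')
```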

## Citation

For more information, please take a look at the original paper.

```bibtex
@article{suzgun2022hupd,
  title={The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications},
  author={Suzgun, Mirac and Melas-Kyriazi, Luke and Sarkar, Suproteem K and Kominers, Scott and Shieber, Stuart},
  year={2022}
}
```