---
language:
- en
thumbnail: "url to a thumbnail used in social sharing"
tags:
- hupd
- roberta
- distilroberta
- patents
license: cc-by-sa-4.0
datasets:
- HUPD/hupd
---

# HUPD DistilRoBERTa-Base Model

This DistilRoBERTa model was fine-tuned on the Harvard USPTO Patent Dataset (HUPD) with a masked language modeling objective. It was originally introduced in [this paper](TBD).

For more information about the Harvard USPTO Patent Dataset, please visit the [project website](https://patentdataset.org/) or the [project's GitHub repository](https://github.com/suzgunmirac/hupd).
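
If you also want to work with the underlying data, here is a minimal sketch of loading it with the `datasets` library. The `"sample"` configuration name follows the HUPD dataset card; the loader also accepts filing-date and metadata arguments documented there:

```python
from datasets import load_dataset

# Load the smaller "sample" configuration of HUPD; use name="all" for the
# full corpus. See the HUPD dataset card for the remaining loader arguments.
dataset = load_dataset("HUPD/hupd", name="sample")
print(dataset)
```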

### How to Use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

model = pipeline(task="fill-mask", model="turingmachine/hupd-distilroberta-base")
model("Improved <mask> for playing a game of thumb wrestling.")
```

Here is the output:

```python
[{'score': 0.4274042248725891,
  'sequence': 'Improved method for playing a game of thumb wrestling.',
  'token': 5448,
  'token_str': ' method'},
 {'score': 0.06967400759458542,
  'sequence': 'Improved system for playing a game of thumb wrestling.',
  'token': 467,
  'token_str': ' system'},
 {'score': 0.06849079579114914,
  'sequence': 'Improved device for playing a game of thumb wrestling.',
  'token': 2187,
  'token_str': ' device'},
 {'score': 0.04544765502214432,
  'sequence': 'Improved apparatus for playing a game of thumb wrestling.',
  'token': 26529,
  'token_str': ' apparatus'},
 {'score': 0.025765646249055862,
  'sequence': 'Improved means for playing a game of thumb wrestling.',
  'token': 839,
  'token_str': ' means'}]
```
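
By default the fill-mask pipeline returns the five highest-scoring completions; its `top_k` argument lets you request a different number:

```python
from transformers import pipeline

model = pipeline(task="fill-mask", model="turingmachine/hupd-distilroberta-base")
# Keep only the three highest-scoring completions
model("Improved <mask> for playing a game of thumb wrestling.", top_k=3)
```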

Alternatively, you can load the model and tokenizer directly and use them as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("turingmachine/hupd-distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("turingmachine/hupd-distilroberta-base").to(device)

TEXT = "Improved <mask> for playing a game of thumb wrestling."

inputs = tokenizer(TEXT, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve the indices of the <mask> tokens in the input
mask_token_indices = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

# Decode the highest-scoring prediction for each <mask> token
for mask_idx in mask_token_indices:
    predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
    output = tokenizer.decode(predicted_token_id)
    print(f'Prediction for the <mask> token at index {mask_idx}: "{output}"')
```

Here is the output:

```python
Prediction for the <mask> token at index 2: " method"
```
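
If you want the top candidates for each `<mask>` rather than only the best one, here is a minimal extension of the loop above using `torch.topk` (`logits`, `mask_idx`, and `tokenizer` come from the previous snippet; the choice of `k=5` is arbitrary). It should reproduce the five candidates shown in the pipeline output above:

```python
import torch

# Probability distribution over the vocabulary at one <mask> position
probs = logits[0, mask_idx].softmax(dim=-1)

# Five highest-probability candidate tokens and their scores
top_probs, top_ids = torch.topk(probs, k=5)
for prob, token_id in zip(top_probs, top_ids):
    print(f'{tokenizer.decode(token_id)!r}: {prob.item():.4f}')
```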

## Citation

For more information, please take a look at the original paper.

* Paper: [The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications](TBD)
* Authors: *Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber*
* BibTeX:

```bibtex
@article{suzgun2022hupd,
  title={The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications},
  author={Suzgun, Mirac and Melas-Kyriazi, Luke and Sarkar, Suproteem K and Kominers, Scott and Shieber, Stuart},
  year={2022}
}
```