---
language:
- en
thumbnail: "url to a thumbnail used in social sharing"
tags:
- hupd
- roberta
- distilroberta
- patents
license: cc-by-sa-4.0
datasets:
- HUPD/hupd
---

# HUPD DistilRoBERTa-Base Model

This HUPD DistilRoBERTa model was fine-tuned on the HUPD dataset with a masked language modeling objective. It was originally introduced in [this paper](TBD).

For more information about the Harvard USPTO Patent Dataset, please feel free to visit the [project website](https://patentdataset.org/) or the [project's GitHub repository](https://github.com/suzgunmirac/hupd).

### How to Use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import pipeline

model = pipeline(task="fill-mask", model="hupd/hupd-distilroberta-base")
model("Improved <mask> for playing a game of thumb wrestling.")
```

Here is the output:

```python
[{'score': 0.4274042248725891,
  'sequence': 'Improved method for playing a game of thumb wrestling.',
  'token': 5448,
  'token_str': ' method'},
 {'score': 0.06967400759458542,
  'sequence': 'Improved system for playing a game of thumb wrestling.',
  'token': 467,
  'token_str': ' system'},
 {'score': 0.06849079579114914,
  'sequence': 'Improved device for playing a game of thumb wrestling.',
  'token': 2187,
  'token_str': ' device'},
 {'score': 0.04544765502214432,
  'sequence': 'Improved apparatus for playing a game of thumb wrestling.',
  'token': 26529,
  'token_str': ' apparatus'},
 {'score': 0.025765646249055862,
  'sequence': 'Improved means for playing a game of thumb wrestling.',
  'token': 839,
  'token_str': ' means'}]
```

Alternatively, you can load the model and use it as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Use a GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained("hupd/hupd-distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("hupd/hupd-distilroberta-base").to(device)

TEXT = "Improved <mask> for playing a game of thumb wrestling."
inputs = tokenizer(TEXT, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Retrieve the indices of the <mask> tokens in the input
mask_token_indxs = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]

# Decode the most likely token at each masked position
for mask_idx in mask_token_indxs:
    predicted_token_id = logits[0, mask_idx].argmax(dim=-1)
    output = tokenizer.decode(predicted_token_id)
    print(f'Prediction for the token at index {mask_idx}: "{output}"')
```

Here is the output:

```python
Prediction for the token at index 2: " method"
```

## Citation

For more information, please take a look at the original paper.

* Paper: [The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications](TBD)
* Authors: *Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber*
* BibTeX:

```
@article{suzgun2022hupd,
  title={The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications},
  author={Suzgun, Mirac and Melas-Kyriazi, Luke and Sarkar, Suproteem K and Kominers, Scott and Shieber, Stuart},
  year={2022}
}
```
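
### Reproducing the Pipeline Scores

The `score` values reported by the `fill-mask` pipeline above are softmax probabilities over the vocabulary at the masked position. Here is a minimal sketch of how to recover the same ranked top-5 list from the raw logits; the variable names `probs`, `top5`, and `mask_idx` are illustrative, not part of the original card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("hupd/hupd-distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("hupd/hupd-distilroberta-base").to(device)

TEXT = "Improved <mask> for playing a game of thumb wrestling."
inputs = tokenizer(TEXT, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

# Position of the (single) <mask> token in the input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0][0]

# A softmax over the vocabulary yields the probabilities that the
# pipeline reports as 'score'
probs = logits[0, mask_idx].softmax(dim=-1)
top5 = torch.topk(probs, k=5)

for score, token_id in zip(top5.values.tolist(), top5.indices.tolist()):
    print(f"{score:.4f}\t{tokenizer.decode(token_id)!r}")
```

Run on the same input, this should print the same five candidates (' method', ' system', ' device', ' apparatus', ' means') with scores matching the pipeline output shown earlier.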