# Harvard USPTO Dataset Training

## Importing Packages

We first install the Hugging Face `datasets` library, which we use to download the USPTO data, and then import it along with `pandas` and `numpy`.

In [1]:
!pip install datasets



In [2]:
from datasets import load_dataset
import pandas as pd
import numpy as np

## Loading the Dataset

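We load the `sample` configuration of the dataset and restrict it by filing date: applications filed from January 1 to January 21, 2016 go to the training split, and those filed from January 22 to January 31, 2016 go to the validation split.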

In [3]:
dataset_dict = load_dataset('HUPD/hupd',
    name='sample',
    data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", 
    icpr_label=None,
    train_filing_start_date='2016-01-01',
    train_filing_end_date='2016-01-21',
    val_filing_start_date='2016-01-22',
    val_filing_end_date='2016-01-31',
)

Found cached dataset hupd (/home/jovyan/.cache/huggingface/datasets/HUPD___hupd/sample-a4eeba92b4229e93/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142)


  0%|          | 0/2 [00:00<?, ?it/s]
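The four date arguments describe two non-overlapping filing-date windows. As a rough illustration of that split (not the loader's actual implementation), the sketch below assigns a `YYYYMMDD` filing date to a split. The `assign_split` helper is hypothetical, and treating both ends of each window as inclusive is an assumption.

```python
from datetime import date, datetime
from typing import Optional

# Windows copied from the load_dataset call above. Treating both ends as
# inclusive is an assumption, not confirmed behaviour of the HUPD loader.
TRAIN_WINDOW = (date(2016, 1, 1), date(2016, 1, 21))
VAL_WINDOW = (date(2016, 1, 22), date(2016, 1, 31))


def assign_split(filing_date: str) -> Optional[str]:
    """Return 'train', 'validation', or None for a YYYYMMDD filing date."""
    day = datetime.strptime(filing_date, "%Y%m%d").date()
    if TRAIN_WINDOW[0] <= day <= TRAIN_WINDOW[1]:
        return "train"
    if VAL_WINDOW[0] <= day <= VAL_WINDOW[1]:
        return "validation"
    return None  # outside both windows


print(assign_split("20160120"))
print(assign_split("20160125"))
```

Under these assumptions, a date outside January 2016 falls in neither split.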

We print the dataset to see which splits and features are available.

In [4]:
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 16153
    })
    validation: Dataset({
        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],
        num_rows: 9094
    })
})


We convert the training and validation splits into separate pandas DataFrames.

In [5]:
df_train = pd.DataFrame(dataset_dict['train'])
df_val = pd.DataFrame(dataset_dict['validation'])

We can preview the training data.

In [6]:
df_train

Unnamed: 0,patent_number,decision,title,abstract,claims,background,summary,description,cpc_label,ipc_label,filing_date,patent_issue_date,date_published,examiner_id
0,13261748,ACCEPTED,MINI-OPTICAL NETWORK TERMINAL (ONT),The present invention relates to passive optic...,"1. A compact optical network terminal, compris...",<SOH> BACKGROUND OF THE INVENTION <EOH>A netwo...,<SOH> SUMMARY OF THE INVENTION <EOH>An aspect ...,FIELD OF THE INVENTION The present invention r...,H04Q110071,H04Q1100,20160120,20170606,20160526,95191.0
1,13995128,ACCEPTED,APPARATUS FOR FORMING AND READING AN IDENTIFIC...,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...,<SOH> BACKGROUND OF THE INVENTION <EOH>Identif...,<SOH> SUMMARY OF THE INVENTION <EOH>In accorda...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,G06K500,G06K500,20160112,20160322,20140102,59514.0
2,14241799,PENDING,PORTABLE DRUG DISPENSER,A portable drug dispenser includes a chamber f...,"1. A portable drug dispenser, comprising: a ch...",,,This application claims priority from U.S. app...,A61J70084,A61J700,20160104,,20171116,95928.0
3,14348792,ACCEPTED,LIQUID-COOLED HEAT EXCHANGER,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...,<SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,C30B11003,C30B1100,20160111,20180529,20160512,63013.0
4,14360978,REJECTED,SOLE MEMBER OF FOOTWEAR,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...,<SOH> BACKGROUND ART <EOH>When the heel touche...,<SOH> BRIEF DESCRIPTION OF THE DRAWINGS <EOH>F...,TECHNICAL FIELD The present invention relates ...,A43B13181,A43B1318,20160113,,20160512,94490.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16148,15002394,ACCEPTED,ROBOT HAND CONTROLLING METHOD AND ROBOTICS DEVICE,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r...",<SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel...,<SOH> SUMMARY OF THE INVENTION <EOH>An object ...,BACKGROUND OF THE INVENTION 1. Field of the In...,B25J91612,B25J916,20160120,20180710,20160804,66148.0
16149,15002396,REJECTED,IMMUNOGLOBULIN FUSION PROTEINS AND USES THEREOF,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...,<SOH> BACKGROUND OF THE INVENTION <EOH>An immu...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,The present application is a U.S. Nonprovision...,C07K14745,C07K14745,20160120,,20161215,95819.0
16150,15330955,REJECTED,PIPE EXTRACTION TOOL,A pipe extraction tool that grips the inside o...,1. A pipe extraction tool for extracting a pip...,<SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel...,<SOH> BRIEF SUMMARY OF THE INVENTION <EOH>The ...,CROSS-REFERENCES TO RELATED APPLICATIONS Not a...,B25B2714,B25B2714,20160120,,20170907,95661.0
16151,15330961,PENDING,Molded parts with thermoplastic cellulose biop...,A longitudinal extending body with oriented fi...,1. A longitudinal body of a solidified organic...,<SOH> BACKGROUND OF INVENTION <EOH>In the medi...,<SOH> BRIEF SUMMARY OF THE PRESENT INVENTION <...,CROSS REFERENCES Application claims priority o...,A61L3106,A61L3106,20160111,,20171019,96956.0


## Pre-Processing the Data

We are interested in the following columns:
- `abstract`
- `claims`
- `decision` <- our `y`

Let's extract these columns from both our training and validation data.

Also, note that the `decision` column has three types of values: "ACCEPTED", "REJECTED", and "PENDING". To remove unnecessary baggage, we will keep only the rows marked "ACCEPTED" or "REJECTED".

In [7]:
necessary_columns = ["abstract","claims","decision"]
output_values = ['ACCEPTED','REJECTED'] 

In [8]:
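# Every column other than the three we need
trainFeaturesToDrop = [col for col in df_train.columns if col not in necessary_columns]
# Drop rows with missing values, then drop the unused columns
trainDF = df_train.dropna().drop(columns=trainFeaturesToDrop)
# Keep only applications with a final decision
trainDF = trainDF[trainDF['decision'].isin(output_values)]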

In [9]:
trainDF

Unnamed: 0,decision,abstract,claims
0,ACCEPTED,The present invention relates to passive optic...,"1. A compact optical network terminal, compris..."
1,ACCEPTED,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...
3,ACCEPTED,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...
4,REJECTED,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...
5,ACCEPTED,"A ratchet tool includes a shaft member, a hand...","1. A ratchet tool, comprising a shaft member, ..."
...,...,...,...
16144,ACCEPTED,"A wavelength tunable laser device, including: ...","1. A wavelength tunable laser device, comprisi..."
16145,ACCEPTED,"In one aspect, a method for use in preparing a...","1. (canceled) 2. The method of claim 19, where..."
16148,ACCEPTED,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r..."
16149,REJECTED,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...


In [10]:
# Same steps as for the training data
valFeaturesToDrop = [col for col in df_val.columns if col not in necessary_columns]
valDF = df_val.dropna().drop(columns=valFeaturesToDrop)
valDF = valDF[valDF['decision'].isin(output_values)]

In [11]:
valDF

Unnamed: 0,decision,abstract,claims
0,REJECTED,Regimen for the treatment of rosacea include t...,1. A treatment regimen comprising: cleansing a...
1,ACCEPTED,A clamp arrangement includes a pair of bracket...,1. A clamp arrangement for supporting a fractu...
2,REJECTED,A system and method for device action and conf...,1-20. (canceled) 21. A mobile device comprisin...
4,REJECTED,Systems and methods for managing datasets prod...,"1. A method, comprising: executing, by one or ..."
9,ACCEPTED,A scan driving circuit is provided. The scan d...,1. A scan driving circuit for driving a scan l...
...,...,...,...
9085,REJECTED,The non-rigid gate device as described may be ...,1; A non-rigid blocking apparatus referred to ...
9090,REJECTED,The present invention provides an improved unc...,1. A method for rendering a plastic surface am...
9091,ACCEPTED,A method for detecting a software-race conditi...,1. A method for detecting a software-race cond...
9092,ACCEPTED,The present application relates to multi-stage...,1. A multi-stage amplitude modulation-based me...
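Since the training and validation data go through identical steps, the filtering could also be wrapped in one function. The following is a minimal sketch on a made-up toy DataFrame; the `select_labeled_rows` helper and the toy rows are hypothetical and not part of the notebook.

```python
import pandas as pd

NECESSARY_COLUMNS = ["abstract", "claims", "decision"]
OUTPUT_VALUES = ["ACCEPTED", "REJECTED"]


def select_labeled_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, keep the needed columns,
    and keep only applications with a final decision."""
    extra = [col for col in df.columns if col not in NECESSARY_COLUMNS]
    out = df.dropna().drop(columns=extra)
    return out[out["decision"].isin(OUTPUT_VALUES)].copy()


# Toy data (invented for illustration): row 1 is pending, row 3 has no abstract.
toy = pd.DataFrame({
    "patent_number": ["1", "2", "3", "4"],
    "decision": ["ACCEPTED", "PENDING", "REJECTED", "ACCEPTED"],
    "abstract": ["a1", "a2", "a3", None],
    "claims": ["c1", "c2", "c3", "c4"],
})
filtered = select_labeled_rows(toy)
print(filtered)
```

As in the tables above, the original row indices are preserved, so gaps in the index mark the rows that were filtered out.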


We need to replace the values in the `decision` column with numerical representations. We will set "ACCEPTED" as `1` and "REJECTED" as `0`.

In [12]:
yKey = {"ACCEPTED":1,"REJECTED":0}

In [13]:
trainDF2 = trainDF.replace({"decision": yKey})
valDF2 = valDF.replace({"decision": yKey})

In [14]:
trainDF2

Unnamed: 0,decision,abstract,claims
0,1,The present invention relates to passive optic...,"1. A compact optical network terminal, compris..."
1,1,Embodiments of the invention provide a method ...,1. A method comprising: using a first reader t...
3,1,A crystal growth furnace comprising a crucible...,1. A crystal growth furnace for growing a crys...
4,0,A shoe midsole is composed of a base plate (1)...,1. A sole member of footwear comprising a base...
5,1,"A ratchet tool includes a shaft member, a hand...","1. A ratchet tool, comprising a shaft member, ..."
...,...,...,...
16144,1,"A wavelength tunable laser device, including: ...","1. A wavelength tunable laser device, comprisi..."
16145,1,"In one aspect, a method for use in preparing a...","1. (canceled) 2. The method of claim 19, where..."
16148,1,A robot hand controlling method executes calcu...,"1. A controlling method of a robot hand, the r..."
16149,0,A fusion protein is disclosed. The fusion prot...,1. A fusion protein comprising an Fc fragment ...


In [15]:
valDF2

Unnamed: 0,decision,abstract,claims
0,0,Regimen for the treatment of rosacea include t...,1. A treatment regimen comprising: cleansing a...
1,1,A clamp arrangement includes a pair of bracket...,1. A clamp arrangement for supporting a fractu...
2,0,A system and method for device action and conf...,1-20. (canceled) 21. A mobile device comprisin...
4,0,Systems and methods for managing datasets prod...,"1. A method, comprising: executing, by one or ..."
9,1,A scan driving circuit is provided. The scan d...,1. A scan driving circuit for driving a scan l...
...,...,...,...
9085,0,The non-rigid gate device as described may be ...,1; A non-rigid blocking apparatus referred to ...
9090,0,The present invention provides an improved unc...,1. A method for rendering a plastic surface am...
9091,1,A method for detecting a software-race conditi...,1. A method for detecting a software-race cond...
9092,1,The present application relates to multi-stage...,1. A multi-stage amplitude modulation-based me...
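As an aside, the same encoding can be expressed with `Series.map`, which makes unmapped values easy to spot because they become `NaN`. This is an alternative sketch on a toy Series, not the approach used in this notebook.

```python
import pandas as pd

y_key = {"ACCEPTED": 1, "REJECTED": 0}

# Toy stand-in for the `decision` column
decisions = pd.Series(["ACCEPTED", "REJECTED", "ACCEPTED"], name="decision")

# Series.map looks each value up in the dict; values missing from it become NaN.
labels = decisions.map(y_key)
print(labels.tolist())

# A value outside the mapping (here "PENDING") would surface as NaN.
unmapped = pd.Series(["PENDING"]).map(y_key)
print(unmapped.isna().all())
```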


We combine the `abstract` and `claims` columns into a single `text` column. We also rename the `decision` column to `label`.

In [16]:
trainDF3 = trainDF2.rename(columns={'decision': 'label'})
trainDF3['text'] = trainDF3['abstract'] + ' ' + trainDF3['claims']
trainDF3.drop(columns=["abstract","claims"],inplace=True)
trainDF3

Unnamed: 0,label,text
0,1,The present invention relates to passive optic...
1,1,Embodiments of the invention provide a method ...
3,1,A crystal growth furnace comprising a crucible...
4,0,A shoe midsole is composed of a base plate (1)...
5,1,"A ratchet tool includes a shaft member, a hand..."
...,...,...
16144,1,"A wavelength tunable laser device, including: ..."
16145,1,"In one aspect, a method for use in preparing a..."
16148,1,A robot hand controlling method executes calcu...
16149,0,A fusion protein is disclosed. The fusion prot...


In [17]:
valDF3 = valDF2.rename(columns={'decision': 'label'})
valDF3['text'] = valDF3['abstract'] + ' ' + valDF3['claims']
valDF3.drop(columns=["abstract","claims"],inplace=True)
valDF3

Unnamed: 0,label,text
0,0,Regimen for the treatment of rosacea include t...
1,1,A clamp arrangement includes a pair of bracket...
2,0,A system and method for device action and conf...
4,0,Systems and methods for managing datasets prod...
9,1,A scan driving circuit is provided. The scan d...
...,...,...
9085,0,The non-rigid gate device as described may be ...
9090,0,The present invention provides an improved unc...
9091,1,A method for detecting a software-race conditi...
9092,1,The present application relates to multi-stage...


We then extract each column as a Python list, giving us training labels, training texts, validation labels, and validation texts.

In [18]:
trainLabels = trainDF3["label"].tolist()
trainText = trainDF3["text"].tolist()

valLabels = valDF3["label"].tolist()
valText = valDF3["text"].tolist()
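Before tokenizing, it can be worth checking that the label and text lists line up and seeing how balanced the classes are. The sketch below uses a hypothetical `summarize_split` helper on toy lists; in the notebook it would be called with `trainLabels`/`trainText` and `valLabels`/`valText`.

```python
from collections import Counter
from typing import Dict, List


def summarize_split(labels: List[int], texts: List[str]) -> Dict[int, int]:
    """Check that labels and texts line up, then count examples per class."""
    if len(labels) != len(texts):
        raise ValueError("labels and texts must have the same length")
    unexpected = set(labels) - {0, 1}
    if unexpected:
        raise ValueError(f"unexpected label values: {sorted(unexpected)}")
    return dict(Counter(labels))


# Toy stand-ins for trainLabels / trainText
toy_labels = [1, 1, 0, 1]
toy_texts = ["t0", "t1", "t2", "t3"]
print(summarize_split(toy_labels, toy_texts))
```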

## Loading the Trainer

Now we can start training! This time, we will just go with `distilbert-base-uncased` for simplicity.

In [19]:
!pip install torch
!pip install transformers



In [20]:
import torch
from torch.utils.data import Dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

In [21]:
model_name = "distilbert-base-uncased"
class USPTODataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)
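To see what a single item of this dataset looks like, here is a small sketch that feeds it hand-made encodings in place of real tokenizer output. The class is repeated so the snippet runs on its own; the token ids are arbitrary integers chosen for illustration.

```python
import torch
from torch.utils.data import Dataset


class USPTODataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One tensor per encoding field, plus the label under the key 'labels'
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Hand-made stand-in for tokenizer output (ids are arbitrary)
toy_encodings = {
    "input_ids": [[101, 2023, 102], [101, 2008, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = USPTODataset(toy_encodings, [1, 0])

item = toy_dataset[0]
print(sorted(item.keys()))
print(item["input_ids"].shape)
```

Each item is a dictionary of tensors, with the label stored under the `labels` key.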


In [22]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

In [None]:
train_encodings = tokenizer(trainText, truncation=True, padding=True)
val_encodings = tokenizer(valText, truncation=True, padding=True)

train_dataset = USPTODataset(train_encodings, trainLabels)
val_dataset = USPTODataset(val_encodings, valLabels)

train_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    learning_rate=5e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10
)