{ "cells": [
 { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvard USPTO Dataset Training" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Packages\n", "\n", "We first install the Hugging Face `datasets` library, then import it alongside pandas and NumPy." ] },
 { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!pip install datasets" ] },
 { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "import pandas as pd\n", "import numpy as np" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Dataset" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "We load the `sample` configuration of the Harvard USPTO Patent Dataset (HUPD), restricted to applications filed in January 2016. Filings from January 1 to 21 form the training split, and filings from January 22 to 31 form the validation split." ] },
 { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found cached dataset hupd (/home/jovyan/.cache/huggingface/datasets/HUPD___hupd/sample-a4eeba92b4229e93/0.0.0/6920d2def8fd7767046c0470603357f76866e5a09c97e19571896bfdca521142)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e39fd26828774c8e9d159a8b5d91c4f5", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/2 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dataset_dict = load_dataset('HUPD/hupd',\n", "    name='sample',\n", "    data_files=\"https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather\",\n", "    icpr_label=None,\n", "    train_filing_start_date='2016-01-01',\n", "    train_filing_end_date='2016-01-21',\n", "    val_filing_start_date='2016-01-22',\n", "    val_filing_end_date='2016-01-31',\n", ")" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "We print the dataset to see which splits and features are available." ] },
 { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DatasetDict({\n", "    train: Dataset({\n", "        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],\n", "        num_rows: 16153\n", "    })\n", "    validation: Dataset({\n", "        features: ['patent_number', 'decision', 'title', 'abstract', 'claims', 'background', 'summary', 'description', 'cpc_label', 'ipc_label', 'filing_date', 'patent_issue_date', 'date_published', 'examiner_id'],\n", "        num_rows: 9094\n", "    })\n", "})\n" ] } ], "source": [ "print(dataset_dict)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "We convert the training and validation splits into separate pandas DataFrames." ] },
 { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df_train = pd.DataFrame(dataset_dict['train'])\n", "df_val = pd.DataFrame(dataset_dict['validation'])" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "We can preview the training data." ] },
 { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | patent_number | decision | title | abstract | claims | background | summary | description | cpc_label | ipc_label | filing_date | patent_issue_date | date_published | examiner_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13261748 | ACCEPTED | MINI-OPTICAL NETWORK TERMINAL (ONT) | The present invention relates to passive optic... | 1. A compact optical network terminal, compris... | <SOH> BACKGROUND OF THE INVENTION <EOH>A netwo... | <SOH> SUMMARY OF THE INVENTION <EOH>An aspect ... | FIELD OF THE INVENTION The present invention r... | H04Q110071 | H04Q1100 | 20160120 | 20170606 | 20160526 | 95191.0 |
| 1 | 13995128 | ACCEPTED | APPARATUS FOR FORMING AND READING AN IDENTIFIC... | Embodiments of the invention provide a method ... | 1. A method comprising: using a first reader t... | <SOH> BACKGROUND OF THE INVENTION <EOH>Identif... | <SOH> SUMMARY OF THE INVENTION <EOH>In accorda... | CROSS-REFERENCE TO RELATED APPLICATIONS The pr... | G06K500 | G06K500 | 20160112 | 20160322 | 20140102 | 59514.0 |
| 2 | 14241799 | PENDING | PORTABLE DRUG DISPENSER | A portable drug dispenser includes a chamber f... | 1. A portable drug dispenser, comprising: a ch... |  |  | This application claims priority from U.S. app... | A61J70084 | A61J700 | 20160104 |  | 20171116 | 95928.0 |
| 3 | 14348792 | ACCEPTED | LIQUID-COOLED HEAT EXCHANGER | A crystal growth furnace comprising a crucible... | 1. A crystal growth furnace for growing a crys... | <SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | CROSS-REFERENCE TO RELATED APPLICATIONS The pr... | C30B11003 | C30B1100 | 20160111 | 20180529 | 20160512 | 63013.0 |
| 4 | 14360978 | REJECTED | SOLE MEMBER OF FOOTWEAR | A shoe midsole is composed of a base plate (1)... | 1. A sole member of footwear comprising a base... | <SOH> BACKGROUND ART <EOH>When the heel touche... | <SOH> BRIEF DESCRIPTION OF THE DRAWINGS <EOH>F... | TECHNICAL FIELD The present invention relates ... | A43B13181 | A43B1318 | 20160113 |  | 20160512 | 94490.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16148 | 15002394 | ACCEPTED | ROBOT HAND CONTROLLING METHOD AND ROBOTICS DEVICE | A robot hand controlling method executes calcu... | 1. A controlling method of a robot hand, the r... | <SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel... | <SOH> SUMMARY OF THE INVENTION <EOH>An object ... | BACKGROUND OF THE INVENTION 1. Field of the In... | B25J91612 | B25J916 | 20160120 | 20180710 | 20160804 | 66148.0 |
| 16149 | 15002396 | REJECTED | IMMUNOGLOBULIN FUSION PROTEINS AND USES THEREOF | A fusion protein is disclosed. The fusion prot... | 1. A fusion protein comprising an Fc fragment ... | <SOH> BACKGROUND OF THE INVENTION <EOH>An immu... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | The present application is a U.S. Nonprovision... | C07K14745 | C07K14745 | 20160120 |  | 20161215 | 95819.0 |
| 16150 | 15330955 | REJECTED | PIPE EXTRACTION TOOL | A pipe extraction tool that grips the inside o... | 1. A pipe extraction tool for extracting a pip... | <SOH> BACKGROUND OF THE INVENTION <EOH>1. Fiel... | <SOH> BRIEF SUMMARY OF THE INVENTION <EOH>The ... | CROSS-REFERENCES TO RELATED APPLICATIONS Not a... | B25B2714 | B25B2714 | 20160120 |  | 20170907 | 95661.0 |
| 16151 | 15330961 | PENDING | Molded parts with thermoplastic cellulose biop... | A longitudinal extending body with oriented fi... | 1. A longitudinal body of a solidified organic... | <SOH> BACKGROUND OF INVENTION <EOH>In the medi... | <SOH> BRIEF SUMMARY OF THE PRESENT INVENTION <... | CROSS REFERENCES Application claims priority o... | A61L3106 | A61L3106 | 20160111 |  | 20171019 | 96956.0 |
| 16152 | 15330968 | PENDING | Transmission method with double directivity | A transmission method using a massive MIMO (Mu... | 1. Transmission method with double directivity... | <SOH> BACKGROUND OF THE INVENTION <EOH> | <SOH> BRIEF SUMMARY OF THE INVENTION <EOH>The ... | BACKGROUND OF THE INVENTION Field of the Inven... | H04B7043 | H04B704 | 20160114 |  | 20180329 | 70883.0 |

16153 rows × 14 columns
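
The next preview keeps only the `decision`, `abstract`, and `claims` columns and has 8719 rows, with the PENDING row at index 2 no longer present. The cell that produced it is not shown, so the following is only a minimal sketch of one way to get there. It assumes rows are kept when `decision` is ACCEPTED or REJECTED; `keep_decided` and the toy frame are hypothetical names used for illustration.

```python
import pandas as pd

# Toy stand-in for df_train; the values are illustrative, not taken from HUPD.
toy = pd.DataFrame({
    "patent_number": ["1", "2", "3"],
    "decision": ["ACCEPTED", "PENDING", "REJECTED"],
    "abstract": ["a1", "a2", "a3"],
    "claims": ["c1", "c2", "c3"],
})

def keep_decided(df, keep=("ACCEPTED", "REJECTED")):
    """Hypothetical helper: keep rows with a final decision and the three columns of interest."""
    mask = df["decision"].isin(keep)  # assumption: filter on the decision column
    return df.loc[mask, ["decision", "abstract", "claims"]]

decided = keep_decided(toy)
print(decided)
```

Because the original index is preserved, the dropped rows leave gaps in the index, which matches the gaps visible in the preview.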

|  | decision | abstract | claims |
|---|---|---|---|
| 0 | ACCEPTED | The present invention relates to passive optic... | 1. A compact optical network terminal, compris... |
| 1 | ACCEPTED | Embodiments of the invention provide a method ... | 1. A method comprising: using a first reader t... |
| 3 | ACCEPTED | A crystal growth furnace comprising a crucible... | 1. A crystal growth furnace for growing a crys... |
| 4 | REJECTED | A shoe midsole is composed of a base plate (1)... | 1. A sole member of footwear comprising a base... |
| 5 | ACCEPTED | A ratchet tool includes a shaft member, a hand... | 1. A ratchet tool, comprising a shaft member, ... |
| ... | ... | ... | ... |
| 16144 | ACCEPTED | A wavelength tunable laser device, including: ... | 1. A wavelength tunable laser device, comprisi... |
| 16145 | ACCEPTED | In one aspect, a method for use in preparing a... | 1. (canceled) 2. The method of claim 19, where... |
| 16148 | ACCEPTED | A robot hand controlling method executes calcu... | 1. A controlling method of a robot hand, the r... |
| 16149 | REJECTED | A fusion protein is disclosed. The fusion prot... | 1. A fusion protein comprising an Fc fragment ... |
| 16150 | REJECTED | A pipe extraction tool that grips the inside o... | 1. A pipe extraction tool for extracting a pip... |

8719 rows × 3 columns

|  | decision | abstract | claims |
|---|---|---|---|
| 0 | REJECTED | Regimen for the treatment of rosacea include t... | 1. A treatment regimen comprising: cleansing a... |
| 1 | ACCEPTED | A clamp arrangement includes a pair of bracket... | 1. A clamp arrangement for supporting a fractu... |
| 2 | REJECTED | A system and method for device action and conf... | 1-20. (canceled) 21. A mobile device comprisin... |
| 4 | REJECTED | Systems and methods for managing datasets prod... | 1. A method, comprising: executing, by one or ... |
| 9 | ACCEPTED | A scan driving circuit is provided. The scan d... | 1. A scan driving circuit for driving a scan l... |
| ... | ... | ... | ... |
| 9085 | REJECTED | The non-rigid gate device as described may be ... | 1; A non-rigid blocking apparatus referred to ... |
| 9090 | REJECTED | The present invention provides an improved unc... | 1. A method for rendering a plastic surface am... |
| 9091 | ACCEPTED | A method for detecting a software-race conditi... | 1. A method for detecting a software-race cond... |
| 9092 | ACCEPTED | The present application relates to multi-stage... | 1. A multi-stage amplitude modulation-based me... |
| 9093 | ACCEPTED | A paper feeder includes a housing, a driving u... | 1. A paper feeder, comprising: a housing; a dr... |

4888 rows × 3 columns
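
In the previews that follow, `decision` holds 1 where it was ACCEPTED and 0 where it was REJECTED. The encoding cell itself is not shown; a minimal sketch of such a mapping, assuming a plain dictionary lookup with `Series.map` (`label_map` and the toy frame are illustrative names, not the notebook's own), could look like this:

```python
import pandas as pd

# Toy stand-in for the filtered frame; the values are illustrative only.
toy = pd.DataFrame({
    "decision": ["ACCEPTED", "REJECTED", "ACCEPTED"],
    "abstract": ["a1", "a2", "a3"],
    "claims": ["c1", "c2", "c3"],
})

# Assumed encoding, read off the previews: ACCEPTED -> 1, REJECTED -> 0.
label_map = {"ACCEPTED": 1, "REJECTED": 0}

# assign() returns a new frame, so the toy input is left unchanged.
encoded = toy.assign(decision=toy["decision"].map(label_map))
print(encoded)
```

Any decision value missing from `label_map` would become NaN under `map`, so this sketch only makes sense after the frame has been reduced to ACCEPTED and REJECTED rows.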

|  | decision | abstract | claims |
|---|---|---|---|
| 0 | 1 | The present invention relates to passive optic... | 1. A compact optical network terminal, compris... |
| 1 | 1 | Embodiments of the invention provide a method ... | 1. A method comprising: using a first reader t... |
| 3 | 1 | A crystal growth furnace comprising a crucible... | 1. A crystal growth furnace for growing a crys... |
| 4 | 0 | A shoe midsole is composed of a base plate (1)... | 1. A sole member of footwear comprising a base... |
| 5 | 1 | A ratchet tool includes a shaft member, a hand... | 1. A ratchet tool, comprising a shaft member, ... |
| ... | ... | ... | ... |
| 16144 | 1 | A wavelength tunable laser device, including: ... | 1. A wavelength tunable laser device, comprisi... |
| 16145 | 1 | In one aspect, a method for use in preparing a... | 1. (canceled) 2. The method of claim 19, where... |
| 16148 | 1 | A robot hand controlling method executes calcu... | 1. A controlling method of a robot hand, the r... |
| 16149 | 0 | A fusion protein is disclosed. The fusion prot... | 1. A fusion protein comprising an Fc fragment ... |
| 16150 | 0 | A pipe extraction tool that grips the inside o... | 1. A pipe extraction tool for extracting a pip... |

8719 rows × 3 columns

|  | decision | abstract | claims |
|---|---|---|---|
| 0 | 0 | Regimen for the treatment of rosacea include t... | 1. A treatment regimen comprising: cleansing a... |
| 1 | 1 | A clamp arrangement includes a pair of bracket... | 1. A clamp arrangement for supporting a fractu... |
| 2 | 0 | A system and method for device action and conf... | 1-20. (canceled) 21. A mobile device comprisin... |
| 4 | 0 | Systems and methods for managing datasets prod... | 1. A method, comprising: executing, by one or ... |
| 9 | 1 | A scan driving circuit is provided. The scan d... | 1. A scan driving circuit for driving a scan l... |
| ... | ... | ... | ... |
| 9085 | 0 | The non-rigid gate device as described may be ... | 1; A non-rigid blocking apparatus referred to ... |
| 9090 | 0 | The present invention provides an improved unc... | 1. A method for rendering a plastic surface am... |
| 9091 | 1 | A method for detecting a software-race conditi... | 1. A method for detecting a software-race cond... |
| 9092 | 1 | The present application relates to multi-stage... | 1. A multi-stage amplitude modulation-based me... |
| 9093 | 1 | A paper feeder includes a housing, a driving u... | 1. A paper feeder, comprising: a housing; a dr... |

4888 rows × 3 columns
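
The last two previews have only two columns, `label` and `text`, where `text` carries the abstract. The cell that reshapes the frames is not shown. One way it could be written, assuming `claims` is dropped and the remaining columns are renamed (`to_label_text` is a hypothetical helper), is sketched here:

```python
import pandas as pd

# Toy stand-in for the encoded frame; the values are illustrative only.
toy = pd.DataFrame({
    "decision": [1, 0, 1],
    "abstract": ["a1", "a2", "a3"],
    "claims": ["c1", "c2", "c3"],
})

def to_label_text(df):
    """Hypothetical helper: drop claims and rename decision/abstract to label/text."""
    renamed = df.drop(columns=["claims"]).rename(
        columns={"decision": "label", "abstract": "text"}
    )
    # Fix the column order explicitly so it matches the previews.
    return renamed[["label", "text"]]

final = to_label_text(toy)
print(final)
```

The `label` and `text` column names are the ones many text-classification pipelines expect by default, which may be why they were chosen here.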

|  | label | text |
|---|---|---|
| 0 | 1 | The present invention relates to passive optic... |
| 1 | 1 | Embodiments of the invention provide a method ... |
| 3 | 1 | A crystal growth furnace comprising a crucible... |
| 4 | 0 | A shoe midsole is composed of a base plate (1)... |
| 5 | 1 | A ratchet tool includes a shaft member, a hand... |
| ... | ... | ... |
| 16144 | 1 | A wavelength tunable laser device, including: ... |
| 16145 | 1 | In one aspect, a method for use in preparing a... |
| 16148 | 1 | A robot hand controlling method executes calcu... |
| 16149 | 0 | A fusion protein is disclosed. The fusion prot... |
| 16150 | 0 | A pipe extraction tool that grips the inside o... |

8719 rows × 2 columns

|  | label | text |
|---|---|---|
| 0 | 0 | Regimen for the treatment of rosacea include t... |
| 1 | 1 | A clamp arrangement includes a pair of bracket... |
| 2 | 0 | A system and method for device action and conf... |
| 4 | 0 | Systems and methods for managing datasets prod... |
| 9 | 1 | A scan driving circuit is provided. The scan d... |
| ... | ... | ... |
| 9085 | 0 | The non-rigid gate device as described may be ... |
| 9090 | 0 | The present invention provides an improved unc... |
| 9091 | 1 | A method for detecting a software-race conditi... |
| 9092 | 1 | The present application relates to multi-stage... |
| 9093 | 1 | A paper feeder includes a housing, a driving u... |

4888 rows × 2 columns