dbleek committed on
Commit
9a55ba3
1 Parent(s): b6121f8

Milestone 4 (#6)


* added documentation in README

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* added colab notebook

* Update README.md

README.md CHANGED
@@ -10,11 +10,33 @@ pinned: false
  # cs-gy-6613-project
  Project for CS-GY-6613 Spring 2023
 
  # Milestone 3
- USPTO Patentability Classifier:https://huggingface.co/spaces/dbleek/cs-gy-6613-project-final
 
  # Milestone 2
- Sentiment Analysis App:https://huggingface.co/spaces/dbleek/cs-gy-6613-project
 
  # Milestone 1
  For milestone 1, I used the quick start instructions from VS code to connect to a remote Ubuntu container:
@@ -24,4 +46,3 @@ https://code.visualstudio.com/docs/devcontainers/containers#_quick-start-open-an
  ![Alt text](milestone-1.jpg "Screenshot for Milestone 1")
 
 
-
  # cs-gy-6613-project
  Project for CS-GY-6613 Spring 2023
 
+ # Milestone 4
+
+ ## Training
+ For documentation of the model training, please see the Colab notebook here: https://github.com/dbleek/cs-gy-6613-project/blob/milestone-4/dmb443_csgy_6613_project_model_trainer.ipynb
+
+ ## Writing the App
+
+ First, I loaded the January 2016 HUPD data again and filtered out any applications in the validation dataset that were neither accepted nor rejected. In the absence of a test set, I used only applications from the validation dataset, since they were seen solely during the validation phase of training. I then randomly selected five accepted and five rejected applications to use as my app's sample data.
+
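The filtering and sampling step can be sketched as follows. This is a minimal illustration using a small in-memory stand-in for the HUPD validation split; in the app the records come from the `HUPD/hupd` dataset on Hugging Face, and the `patent_number` and `decision` field names follow that dataset's schema.

```python
import random

# Stand-in records for the January 2016 HUPD validation split; HUPD
# decisions include values such as ACCEPTED, REJECTED, and PENDING.
val_data = [
    {"patent_number": str(9000000 + i),
     "decision": ["ACCEPTED", "REJECTED", "PENDING"][i % 3]}
    for i in range(30)
]

# Drop applications that were neither accepted nor rejected.
decided = [ex for ex in val_data if ex["decision"] in ("ACCEPTED", "REJECTED")]

# Randomly pick five accepted and five rejected applications as sample data.
random.seed(0)
accepted = [ex for ex in decided if ex["decision"] == "ACCEPTED"]
rejected = [ex for ex in decided if ex["decision"] == "REJECTED"]
samples = random.sample(accepted, 5) + random.sample(rejected, 5)
```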
+ I then loaded the model and the DistilBERT tokenizer just as during training, except that the model fine-tuned on the HUPD data was loaded instead of the base DistilBERT model.
+
+ The patent numbers of the 10 sample applications are keys in a dictionary that maps each one to the application's index in the dataset. Whenever the selectbox changes, a helper function called `load_data` uses this index to select the corresponding application and populate the text inputs accordingly. These inputs include the application's title, decision, abstract, and claims, although only the last two are entered into the model.
+
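A sketch of this lookup, with the Streamlit wiring simplified away and hypothetical sample records; here `load_data` just returns the fields the app writes into its inputs.

```python
# Hypothetical sample records; in the app these come from the HUPD data.
samples = [
    {"patent_number": "9000001", "title": "Adjustable widget",
     "decision": "ACCEPTED", "abstract": "An improved widget...",
     "claims": "1. A widget comprising..."},
    {"patent_number": "9000002", "title": "Portable gadget",
     "decision": "REJECTED", "abstract": "A gadget...",
     "claims": "1. A gadget comprising..."},
]

# Patent numbers are keys that map to each application's index in the dataset.
index_by_patent = {ex["patent_number"]: i for i, ex in enumerate(samples)}

def load_data(patent_number):
    """Fetch the selected application and return the fields shown in the UI."""
    ex = samples[index_by_patent[patent_number]]
    # Title and decision are display-only; abstract and claims go to the model.
    return ex["title"], ex["decision"], ex["abstract"], ex["claims"]
```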
+ When the user presses the "Get Patentability Score" button, the abstract and claims are submitted as a form and run through the tokenizer, just as they were during training. The tokens are then passed to the model. The model outputs logits, which are passed through a softmax to compute the predicted probability of each label (0 for rejected, 1 for accepted) for the input.
+
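The logits-to-score step amounts to a two-way softmax. A minimal sketch with made-up logits (the real values come from the fine-tuned model):

```python
import math

def softmax(logits):
    """Map raw logits to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for (label 0 = rejected, label 1 = accepted).
probs = softmax([-0.4, 1.2])
patentability_score = probs[1]  # probability the application is accepted
```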
+ Finally, the probability of the application being accepted is displayed on the page as the application's patentability score. The user will need to scroll down to see the message.
+
+ On the selected samples, the model correctly predicts 7 out of 10 applications. In other words, for 7 of the applications the patentability score is 0.5 or above when the USPTO accepted the application, and below 0.5 when it rejected it. This tracks with the 0.73 accuracy metric from training. The model seems to do a little better at correctly predicting accepted applications than rejected ones, which is understandable given the slight skew in the data I used to train it.
+
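The 7-out-of-10 figure follows from a simple threshold check. The scores below are invented purely to illustrate the rule (a prediction counts as correct when the score is at least 0.5 for an accepted application, or below 0.5 for a rejected one); the real scores come from the app.

```python
# Invented (score, USPTO decision) pairs used only to illustrate the rule.
results = [
    (0.81, "ACCEPTED"), (0.74, "ACCEPTED"), (0.66, "ACCEPTED"),
    (0.58, "ACCEPTED"), (0.43, "ACCEPTED"),
    (0.22, "REJECTED"), (0.35, "REJECTED"), (0.49, "REJECTED"),
    (0.55, "REJECTED"), (0.61, "REJECTED"),
]

# Correct when the score agrees with the decision at the 0.5 threshold.
correct = sum(
    (score >= 0.5) == (decision == "ACCEPTED")
    for score, decision in results
)
accuracy = correct / len(results)
```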
+ ## Landing Page
+ [https://sites.google.com/nyu.edu/dmb443-cs-gy-6613-project](https://sites.google.com/nyu.edu/dmb443-cs-gy-6613-project)
+
  # Milestone 3
+ USPTO Patentability Classifier: https://huggingface.co/spaces/dbleek/cs-gy-6613-project-final
 
  # Milestone 2
+ Sentiment Analysis App: https://huggingface.co/spaces/dbleek/cs-gy-6613-project
 
  # Milestone 1
  For milestone 1, I used the quick start instructions from VS code to connect to a remote Ubuntu container:
 
  ![Alt text](milestone-1.jpg "Screenshot for Milestone 1")
 
dmb443_csgy_6613_project_model_trainer.ipynb ADDED
The diff for this file is too large to render. See raw diff