dbleek committed on
Commit
9a55ba3
1 Parent(s): b6121f8

Milestone 4 (#6)


* added documentation in README

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* added colab notebook

* Update README.md

README.md CHANGED
@@ -10,11 +10,33 @@ pinned: false
  # cs-gy-6613-project
  Project for CS-GY-6613 Spring 2023
 
  # Milestone 3
- USPTO Patentability Classifier:https://huggingface.co/spaces/dbleek/cs-gy-6613-project-final
 
  # Milestone 2
- Sentiment Analysis App:https://huggingface.co/spaces/dbleek/cs-gy-6613-project
 
  # Milestone 1
  For milestone 1, I used the quick start instructions from VS code to connect to a remote Ubuntu container:
@@ -24,4 +46,3 @@ https://code.visualstudio.com/docs/devcontainers/containers#_quick-start-open-an
  ![Alt text](milestone-1.jpg "Screenshot for Milestone 1")
 
 
-
  # cs-gy-6613-project
  Project for CS-GY-6613 Spring 2023
 
+ # Milestone 4
+
+ ## Training
+ For documentation of the model training, please see the Colab notebook here: https://github.com/dbleek/cs-gy-6613-project/blob/milestone-4/dmb443_csgy_6613_project_model_trainer.ipynb
+
+ ## Writing the App
+
+ First, I loaded the January 2016 HUPD data again and filtered out any applications in the validation dataset that were neither accepted nor rejected. In the absence of a test set, I used only applications from the validation dataset, since they were seen solely during the validation phase of training. I then randomly selected five accepted and five rejected applications to use as my app's sample data.
+
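The filtering and sampling step can be sketched as follows. This is a minimal illustration using a small in-memory stand-in for the HUPD validation split; in the app the records come from the `HUPD/hupd` dataset on Hugging Face, and the `patent_number` and `decision` field names follow that dataset's schema.

```python
import random

# Stand-in records for the January 2016 HUPD validation split; HUPD
# decisions include values such as ACCEPTED, REJECTED, and PENDING.
val_data = [
    {"patent_number": str(9000000 + i),
     "decision": ["ACCEPTED", "REJECTED", "PENDING"][i % 3]}
    for i in range(30)
]

# Drop applications that were neither accepted nor rejected.
decided = [ex for ex in val_data if ex["decision"] in ("ACCEPTED", "REJECTED")]

# Randomly pick five accepted and five rejected applications as sample data.
random.seed(0)
accepted = [ex for ex in decided if ex["decision"] == "ACCEPTED"]
rejected = [ex for ex in decided if ex["decision"] == "REJECTED"]
samples = random.sample(accepted, 5) + random.sample(rejected, 5)
```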
+ I then loaded the model and the DistilBERT tokenizer just as during training, except that the model fine-tuned on the HUPD data was loaded instead of the base DistilBERT model.
+
+ The patent numbers of the 10 sample applications are keys in a dictionary that maps each one to the application's index in the dataset. Whenever the selectbox changes, a helper function called `load_data` uses this index to select the corresponding application and populate the text inputs accordingly. These inputs include the application's title, decision, abstract, and claims, although only the last two are entered into the model.
+
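A sketch of this lookup, with the Streamlit wiring simplified away and hypothetical sample records; here `load_data` just returns the fields the app writes into its inputs.

```python
# Hypothetical sample records; in the app these come from the HUPD data.
samples = [
    {"patent_number": "9000001", "title": "Adjustable widget",
     "decision": "ACCEPTED", "abstract": "An improved widget...",
     "claims": "1. A widget comprising..."},
    {"patent_number": "9000002", "title": "Portable gadget",
     "decision": "REJECTED", "abstract": "A gadget...",
     "claims": "1. A gadget comprising..."},
]

# Patent numbers are keys that map to each application's index in the dataset.
index_by_patent = {ex["patent_number"]: i for i, ex in enumerate(samples)}

def load_data(patent_number):
    """Fetch the selected application and return the fields shown in the UI."""
    ex = samples[index_by_patent[patent_number]]
    # Title and decision are display-only; abstract and claims go to the model.
    return ex["title"], ex["decision"], ex["abstract"], ex["claims"]
```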
+ When the user presses the "Get Patentability Score" button, the abstract and claims are submitted as a form and run through the tokenizer, just as they were during training. The tokens are then passed to the model. The model outputs logits, which are passed through a softmax to compute the predicted probability of each label (0 for rejected, 1 for accepted) for the input.
+
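The logits-to-score step amounts to a two-way softmax. A minimal sketch with made-up logits (the real values come from the fine-tuned model):

```python
import math

def softmax(logits):
    """Map raw logits to probabilities that sum to 1 (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for (label 0 = rejected, label 1 = accepted).
probs = softmax([-0.4, 1.2])
patentability_score = probs[1]  # probability the application is accepted
```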
+ Finally, the probability of the application being accepted is displayed on the page as the application's patentability score. The user will need to scroll down to see the message.
+
+ On the selected samples, the model correctly predicts 7 out of 10 applications. In other words, for 7 of the applications the patentability score is 0.5 or above when the USPTO accepted the application, and below 0.5 when it rejected it. This tracks with the 0.73 accuracy metric from training. The model seems to do a little better at correctly predicting accepted applications than rejected ones, which is understandable given the slight skew in the data I used to train it.
+
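The 7-out-of-10 figure follows from a simple threshold check. The scores below are invented purely to illustrate the rule (a prediction counts as correct when the score is at least 0.5 for an accepted application, or below 0.5 for a rejected one); the real scores come from the app.

```python
# Invented (score, USPTO decision) pairs used only to illustrate the rule.
results = [
    (0.81, "ACCEPTED"), (0.74, "ACCEPTED"), (0.66, "ACCEPTED"),
    (0.58, "ACCEPTED"), (0.43, "ACCEPTED"),
    (0.22, "REJECTED"), (0.35, "REJECTED"), (0.49, "REJECTED"),
    (0.55, "REJECTED"), (0.61, "REJECTED"),
]

# Correct when the score agrees with the decision at the 0.5 threshold.
correct = sum(
    (score >= 0.5) == (decision == "ACCEPTED")
    for score, decision in results
)
accuracy = correct / len(results)
```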
+ ## Landing Page
+ [https://sites.google.com/nyu.edu/dmb443-cs-gy-6613-project](https://sites.google.com/nyu.edu/dmb443-cs-gy-6613-project)
+
  # Milestone 3
+ USPTO Patentability Classifier: https://huggingface.co/spaces/dbleek/cs-gy-6613-project-final
 
  # Milestone 2
+ Sentiment Analysis App: https://huggingface.co/spaces/dbleek/cs-gy-6613-project
 
  # Milestone 1
  For milestone 1, I used the quick start instructions from VS code to connect to a remote Ubuntu container:
 
  ![Alt text](milestone-1.jpg "Screenshot for Milestone 1")
 
dmb443_csgy_6613_project_model_trainer.ipynb ADDED
The diff for this file is too large to render. See raw diff