GPT-NYC-nontoxic
About
GPT2 (small version on HF) fine-tuned on questions and responses from https://reddit.com/r/asknyc
I kept only comments with a score >= 3 that respond directly to the original post (ignoring replies to other commenters). I also added many tokens which were common on /r/AskNYC but missing from the GPT2 vocabulary.
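The comment filter described above could be sketched roughly like this. It is an illustration only, not the code from the notebooks: it assumes each comment is a dict with Reddit-API-style `score` and `parent_id` fields, where a `parent_id` starting with `t3_` marks a direct reply to the post. The function name and the sample data are hypothetical.

```python
MIN_SCORE = 3  # score threshold stated above


def keep_comment(comment):
    """Return True for comments scoring >= 3 that reply directly to the post."""
    # Assumption: Reddit "fullname" prefixes, t3_ = post, t1_ = another comment.
    is_direct_reply = comment["parent_id"].startswith("t3_")
    return comment["score"] >= MIN_SCORE and is_direct_reply


# Hypothetical sample data, for illustration only.
sample_comments = [
    {"id": "a", "score": 5, "parent_id": "t3_post1"},  # kept
    {"id": "b", "score": 2, "parent_id": "t3_post1"},  # score too low
    {"id": "c", "score": 9, "parent_id": "t1_a"},      # reply to a commenter
]

kept = [c["id"] for c in sample_comments if keep_comment(c)]
print(kept)
```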
Additional <Toxic> and <NonToxic> tokens control the output that follows. Toxic comments (about 5.5% of the input data) are those which were flagged by Perspective API with toxicity > 0.7, or by English DeHateBERT, with tagging for all comments related to LGBTQ identity to avoid false positives / more aggressive censorship from these classifiers.
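As a rough sketch of the labeling rule above: a comment gets the toxic tag when either classifier flags it. The code assumes the classifier outputs are already computed (it does not call Perspective API or DeHateBERT). The `manual_label` override, standing in for the separate tagging of LGBTQ-identity comments, is a hypothetical parameter, since the details of that step are not given here.

```python
PERSPECTIVE_THRESHOLD = 0.7  # toxicity cutoff stated above


def toxicity_tag(perspective_score, dehatebert_flagged, manual_label=None):
    """Pick the control tag for one comment.

    manual_label is a hypothetical override for comments tagged separately
    rather than trusting the classifiers.
    """
    if manual_label is not None:
        return manual_label
    if perspective_score > PERSPECTIVE_THRESHOLD or dehatebert_flagged:
        return "<Toxic>"
    return "<NonToxic>"


print(toxicity_tag(0.85, False))
print(toxicity_tag(0.10, False))
```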
Try prompting with question? - additional info %% <Toxic>
Or question? - additional info %% <NonToxic>
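A small helper can assemble prompts in the format shown above. This is a sketch under assumptions: `build_prompt` and `generate_reply` are hypothetical names, and the default `model_id` is an assumed Hugging Face repo id that should be replaced with the actual one for this model. `generate_reply` imports `transformers` lazily, so the prompt helper works without it installed.

```python
def build_prompt(question, extra_info, toxic=False):
    """Format a prompt as: question? - additional info %% <Tag>"""
    tag = "<Toxic>" if toxic else "<NonToxic>"
    return f"{question} - {extra_info} %% {tag}"


def generate_reply(prompt, model_id="monsoon-nlp/gpt-nyc-nontoxic", max_new_tokens=40):
    """Generate a continuation; model_id is an assumption, not confirmed here."""
    from transformers import pipeline  # imported lazily; downloads the model

    generator = pipeline("text-generation", model=model_id)
    return generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]


prompt = build_prompt("What's the best bagel place?", "I live near Astoria")
print(prompt)
```

Calling `generate_reply(prompt)` would then return the prompt followed by the model's continuation.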
Other options
The gpt-nyc-small repo is based on GPT2 [small] but without the <Toxic> and <NonToxic> tags. It is the most directly comparable model to this one.
The main gpt-nyc repo is based on GPT2-Medium and comes off as more accurate. It does not have Toxic/NonToxic tagging.
Blog
Initial model: https://mapmeld.medium.com/gpt-nyc-part-1-9cb698b2e3d
Notebooks
Data processing / new tokens
https://colab.research.google.com/drive/13BOw0uekoAYB4jjQtaXTn6J_VHatiRLu
Fine-tuning GPT2 (small)
https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR
Predictive text and probabilities
Scroll to the end of https://colab.research.google.com/drive/1FnXcAh4H-k8dAzixkV5ieygV96ePh3lR to see how to install git-lfs and trick ecco into loading this model.