---
license: other
license_name: agplv3
license_link: https://www.gnu.org/licenses/agpl-3.0.en.html
---
We trained DistilBERT on this dataset: [https://huggingface.co/datasets/nothingiisreal/Human_Stories]
It works reasonably well for classification, but it needs improvement and exposure to more synthetic data and to more of the kinds of mistakes LLMs make.
Overall I'm extremely impressed with how well this 68-million-parameter model works, and just as struck that every single AI gets picked up despite BERT being trained only on the GPT-3.5 rows of the data.
Class label 0 means human, 1 means AI.
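As a minimal sketch of how the labels above would be consumed (the logit-to-label mapping and the example logit values here are assumptions for illustration, not real model output; in practice you would load the checkpoint with the `transformers` text-classification pipeline):

```python
import math

# Assumed label mapping, per the model card: index 0 = human, index 1 = AI.
ID2LABEL = {0: "human", 1: "AI"}

def classify(logits):
    """Softmax a pair of raw logits and return (label, confidence)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # shift by max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]

# Hypothetical logits, not taken from the actual model:
label, confidence = classify([0.3, 2.1])
print(label, round(confidence, 3))  # → AI 0.858
```

With `transformers` installed, the equivalent end-to-end call would be `pipeline("text-classification", model=<this repo id>)`, which applies the same softmax-and-argmax step internally.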
We tested the following models, all of which were detected:
- GPT-3.5, GPT-4, GPT-4o
- Claude Sonnet, Claude Opus
- WizardLM 2
- Gemini 1.5 Pro
It's really blatant how every single AI company ends up with the same detectable signature, whether knowingly or unknowingly (through LLM "incest", i.e. models training on each other's outputs).