---
license: other
license_name: agplv3
license_link: https://www.gnu.org/licenses/agpl-3.0.en.html
---

We trained DistilBERT on this dataset: [https://huggingface.co/datasets/nothingiisreal/Human_Stories](https://huggingface.co/datasets/nothingiisreal/Human_Stories)

It's kinda okay for sampling, but it needs improvement and exposure to more synthetic data and to more of the kinds of mistakes LLMs make.

Overall I'm extremely impressed with how well this 68-million-parameter model works, and extremely disappointed by how every single AI gets picked up even though BERT was trained only on the GPT-3.5 rows of the data. It's really blatant that every single AI company is using the same watermark, whether knowingly or unknowingly (through LLM "incest").
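For illustration, here is a minimal sketch of how a two-logit (human vs. AI) classification head like this one might be post-processed into a verdict. The label order, the `"uncertain"` fallback, and the threshold are assumptions for the sketch, not something this model card specifies:

```python
import math

# Hypothetical label mapping for a binary human/AI detector head.
# The index order {0: "human", 1: "ai"} is an assumption.
LABELS = {0: "human", 1: "ai"}

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.5):
    """Map raw logits to a label and a confidence score.

    Falls back to "uncertain" when the top probability is below
    the (assumed) confidence threshold.
    """
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[idx] if probs[idx] >= threshold else "uncertain"
    return label, probs[idx]

# Example: a logit pair strongly favouring the "ai" class.
label, score = classify([-1.2, 2.7])
print(label, round(score, 3))
```

The threshold is the knob you would tune against a held-out split of the dataset to trade false positives against false negatives.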