🆕 Releasing a new series of 8 zeroshot classifiers: better performance, fully commercially usable thanks to synthetic data, up to 8192 tokens, run on any hardware.

🤖 The zeroshot-v2.0-c series replaces commercially restrictive training data with synthetic data generated with mistralai/Mixtral-8x7B-Instruct-v0.1 (Apache 2.0). All models are released under the MIT license.
🦾 The best model performs 17 percentage points better across 28 tasks than facebook/bart-large-mnli (the most downloaded commercially-friendly baseline).
🌏 The series includes a multilingual variant fine-tuned from BAAI/bge-m3 for zeroshot classification in 100+ languages, with a context window of 8192 tokens.
🪶 The models are small, with 0.2 - 0.6 B parameters, so they run on any hardware. The base-size models are more than 2x faster than bart-large-mnli while performing significantly better.
🤏 The models are not generative LLMs; they are efficient encoder-only models specialized in zeroshot classification through the universal NLI task.
🤑 For users for whom commercially restrictive training data is not an issue, I've also trained variants with even more human data for improved performance.
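
To make the NLI framing concrete, here is a minimal sketch of how such a classifier can be called through the `zero-shot-classification` pipeline in `transformers`. The model id, the default hypothesis template, and the helper names `load_classifier` and `classify` are assumptions for illustration; check the collection for the exact model names.

```python
from typing import Callable, Sequence, Tuple

# Assumed model id, for illustration only; see the collection for exact names.
DEFAULT_MODEL = "MoritzLaurer/deberta-v3-base-zeroshot-v2.0-c"


def load_classifier(model_name: str = DEFAULT_MODEL):
    """Load a zero-shot classification pipeline (downloads the model on first call)."""
    from transformers import pipeline  # lazy import: only needed for real inference

    return pipeline("zero-shot-classification", model=model_name)


def classify(
    classifier: Callable,
    text: str,
    labels: Sequence[str],
    template: str = "This text is about {}",
) -> Tuple[str, float]:
    """Return the most likely label and its score.

    The pipeline turns each label into an NLI hypothesis via the template
    and ranks the labels by how strongly the text entails each hypothesis.
    """
    result = classifier(
        text,
        list(labels),
        hypothesis_template=template,
        multi_label=False,
    )
    # The pipeline returns labels and scores sorted from most to least likely.
    return result["labels"][0], result["scores"][0]
```

A call could look like `classify(load_classifier(), "Angela Merkel is a politician in Germany", ["politics", "economy", "entertainment"])`. The hypothesis template acts like a prompt, so it is worth choosing on held-out data rather than on your test set.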

Next steps:
✍️ I'll publish a blog post with more details soon.
🔮 There are several improvements I'm planning for v2.1. The multilingual model in particular has room for improvement.

All models are available for download in this Hugging Face collection: MoritzLaurer/zeroshot-classifiers-6548b4ff407bb19ff5c3ad6f

These models are an extension of the approach explained in this paper, but with additional synthetic data:
Prompts are hyperparameters. Every time you test a different prompt on your data, you become less sure if the LLM actually generalizes to unseen data.

Overfitting to a test set seems like a concept from the boring times when people still fine-tuned models, but it is just as relevant for "zeroshot prompting". Using a separate validation split to tune the main hyperparameter of LLMs (the prompt) is just as important as train-val-test splitting for fine-tuning. The only difference is that you don't have a training dataset anymore, and it somehow feels different because there is no training and there are no parameter updates.

It's easy to trick yourself into believing that an LLM performs well on your task, while you've actually overfit the prompt to your data. Every good "zeroshot" paper should clarify that they used a validation split for finding their prompt before final testing.
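
As a rough sketch of that workflow: split your labeled examples once, compare candidate prompts on the validation part only, and score the winning prompt a single time on the test part. Everything here is hypothetical scaffolding. In particular, `predict(prompt, text)` stands in for whatever model call you use and is assumed to return a label string.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, gold label)


def split_val_test(
    data: Sequence[Example], val_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle once with a fixed seed and split into validation and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * val_fraction)
    return items[:cut], items[cut:]


def accuracy(
    predict: Callable[[str, str], str], prompt: str, data: Sequence[Example]
) -> float:
    """Share of examples where predict(prompt, text) equals the gold label."""
    if not data:
        return 0.0
    correct = sum(predict(prompt, text) == gold for text, gold in data)
    return correct / len(data)


def select_prompt(
    predict: Callable[[str, str], str],
    prompts: Sequence[str],
    data: Sequence[Example],
    val_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[str, float, float]:
    """Tune the prompt on validation data only, then score the winner once on test."""
    val, test = split_val_test(data, val_fraction, seed)
    val_scores = {p: accuracy(predict, p, val) for p in prompts}
    best = max(val_scores, key=val_scores.get)
    # The test set is touched exactly once, after the prompt is fixed.
    return best, val_scores[best], accuracy(predict, best, test)
```

The number to report is the last value returned (test accuracy). The validation score of the selected prompt is optimistic by construction, because the prompt was chosen to maximize it.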