- Pre-training code with nanotron - Evaluation suite with lighteval - Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk) - Post-training scripts with TRL & the alignment handbook - On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
What a week! A recap for everything you missed βοΈ merve/nov-22-releases-673fbbcfc1c97c4f411def07 Multimodal β¨ > Mistral AI released Pixtral 124B, a gigantic open vision language model > Llava-CoT (formerly known as Llava-o1) was released, a multimodal reproduction of o1 model by PKU > OpenGVLab released MMPR: a new multimodal reasoning dataset > Jina has released Jina-CLIP-v2 0.98B multilingual multimodal embeddings > Apple released new SotA vision encoders AIMv2
LLMs π¦ > AllenAI dropped a huge release of models, datasets and scripts for TΓΌlu, a family of models based on Llama 3.1 aligned with SFT, DPO and a new technique they have developed called RLVR > Jina has released embeddings-v3: new multilingual embeddings with longer context > Hugging Face released SmolTalk: synthetic dataset used to align SmolLM2 using supervised fine-tuning > Microsoft released orca-agentinstruct-1M-v1: a gigantic instruction dataset of 1M synthetic instruction pairs
Image Generation πΌοΈ > Black Forest Labs released Flux 1. tools: four new models for different image modifications and two LoRAs to do image conditioning and better steer generations
Lastly Hugging Face released a new library Observers: a lightweight SDK for monitoring interactions with AI APIs and easily store and browse them π $ pip install observers