Quick update from week 1 of smol course. The community is taking the driving seat and using the material for their own projects. If you want to do the same, join in!
- We have ongoing translation projects in Korean, Vietnamese, Portuguese, and Spanish.
- 3 chapters are ready for students, on topics like instruction tuning, preference alignment, and parameter-efficient fine-tuning.
- 3 chapters are in progress, on evaluation, vision language models, and synthetic data.
- Around 780 people have forked the repo to use it for learning, teaching, and sharing.
⏭️ Next step is to support people who want to use the course for teaching, content creation, internal knowledge sharing, or anything else. If you're into this, drop an issue or PR.
There's a new timm release, v1.0.12, with a focus on optimizers. The optimizer factory has been refactored: there's now a timm.optim.list_optimizers() and a new way to register optimizers and their attributes. As always, you can use a timm optimizer like a torch one, just replace torch.optim with timm.optim.
New optimizers include:
* AdafactorBigVision - adafactorbv
* ADOPT - adopt / adoptw (decoupled decay)
* MARS - mars
* LaProp - laprop
* Cautious Optimizers - a modification to all of the above, prefix with c, as well as cadamw, cnadamw, csgdw, clamb, crmsproptf
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):
- There will be the first major public protest related to AI
- A big company will see its market cap divided by two or more because of AI
- At least 100,000 personal AI robots will be pre-ordered
- China will start to lead the AI race (as a consequence of leading the open-source AI race)
- There will be big breakthroughs in AI for biology and chemistry
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face
How my predictions for 2024 turned out:
- A hyped AI company will go bankrupt or get acquired for a ridiculously low price ✅ (Inflection, AdeptAI, ...)
- Open-source LLMs will reach the level of the best closed-source LLMs ✅ with QwQ and dozens of others
- Big breakthroughs in AI for video, time-series, biology and chemistry ✅ for video 🔴 for time-series, biology and chemistry
- We will talk much more about the cost (monetary and environmental) of AI ✅ Monetary 🔴 Environmental (😢)
- A popular piece of media will be mostly AI-generated ✅ with NotebookLM by Google
- 10 million AI builders on Hugging Face, leading to no increase in unemployment 🔜 currently 7M AI builders on Hugging Face
small but mighty 🔥 you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM 🫰🏻 also, with gradient accumulation, the simulated batch size is 16 ✨ I made a notebook that includes all the goodies: QLoRA, gradient accumulation, and gradient checkpointing, with explanations of how they work 💝 https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
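To show why a batch of 4 can stand in for a batch of 16, here is a minimal pure-Python sketch of gradient accumulation on a toy one-parameter model. It is not taken from the notebook: the model, the data, and the function names are made up for illustration, and the 4 accumulation steps are simply inferred from 16 / 4.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / number of steps."""
    chunks = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for chunk in chunks:
        # Only one micro-batch needs to be "in memory" at a time.
        total += mse_grad(w, chunk) / len(chunks)
    return total, len(chunks)


# 16 toy samples: the simulated batch from the post.
data = [(float(x), 3.0 * x + 1.0) for x in range(16)]
w = 0.5

full_batch_grad = mse_grad(w, data)
accum_grad, steps = accumulated_grad(w, data, micro_batch_size=4)

print(f"accumulation steps: {steps}")
print(f"full-batch gradient:  {full_batch_grad:.6f}")
print(f"accumulated gradient: {accum_grad:.6f}")
```

The two gradients agree because the micro-batches are equally sized, so the mean of the per-chunk means equals the mean over all 16 samples. Memory scales with the micro-batch, while the optimizer step behaves like the larger batch.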
🖼️ Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model 💗
> Show Lab released ShowUI-2B: new vision-language-action model to build GUI/web automation agents 🤖
> Rhymes AI has released the base model of Aria: Aria-Base-64K and Aria-Base-8K with their respective context length
> ViDoRe team released ColSmolVLM: A new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: new dataset labelled using Llava-CoT multimodal reasoning model 📖
> Dataset: LLaVA-CoT-100k dataset used to train Llava-CoT released by creators of Llava-CoT 📕
💬 LLMs
> Qwen team released QwQ-32B-Preview, state-of-the-art open-source reasoning model, broke the internet 🔥
> Alibaba has released Marco-o1, a new open-source reasoning model 💥
> NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)
⏯️ Image/Video Generation
> Qwen2VL-Flux: new image generation model based on Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res ⏯️
> Dataset: Image Preferences is a new image generation preference dataset made with DIBT community effort of Argilla 🏷️
Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B, trained on 5B audio prompt tokens
🌟🌎 Cohere releases Aya Expanse 8B & 32B: SOTA multilingual models for 23 languages!
How did they manage to beat top contenders while also adding 23 languages?
🔄 𝗧𝗿𝗮𝗶𝗻 𝗼𝗻 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗱𝗮𝘁𝗮:
• Synthetic data has been said to cause model collapse after too much training.
• Cohere has introduced "data arbitrage" to prevent this by strategically sampling from a pool of several teacher models instead of a single teacher.
• First, train a pool of models, one for each group of languages, and employ an internal reward model named "Arbiter" to evaluate the generations and select the best one. Only that best generation is kept as the final completion for each prompt.
➡️ This process is particularly effective in the multilingual setting, where no single teacher model performs well in all languages: here "Multilingual Arbitrage" single-handedly improves the win rate of the 8B model vs Gemma-2-9B by 10 points!
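The selection step can be sketched in a few lines of plain Python. This is only an illustration of the idea as described above, not Cohere's pipeline: the teacher functions, the toy scoring rule standing in for the "Arbiter" reward model, and all names below are hypothetical.

```python
def arbitrage(prompt, teachers, arbiter):
    """Ask every teacher for a completion and keep the one the arbiter scores highest."""
    candidates = [(name, generate(prompt)) for name, generate in teachers.items()]
    return max(candidates, key=lambda cand: arbiter(prompt, cand[1]))


# Hypothetical teacher pool: each entry stands in for a model call.
teachers = {
    "teacher_a": lambda prompt: "Bonjour",
    "teacher_b": lambda prompt: "Bonjour, comment puis-je vous aider ?",
    "teacher_c": lambda prompt: "Hello",
}


def toy_arbiter(prompt, completion):
    """Stand-in for a reward model: counts distinct whitespace-separated tokens."""
    return len(set(completion.split()))


prompts = ["Saluez poliment un client."]

# Build the synthetic dataset: one best completion per prompt.
dataset = []
for prompt in prompts:
    best_teacher, best_completion = arbitrage(prompt, teachers, toy_arbiter)
    dataset.append(
        {"prompt": prompt, "completion": best_completion, "teacher": best_teacher}
    )

print(dataset[0]["teacher"], "->", dataset[0]["completion"])
```

A real setup would swap the lambdas for model calls and the toy scorer for an actual reward model; the point is only that the winning teacher can differ from prompt to prompt.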
🧩 𝗨𝘀𝗲 𝗺𝗼𝗱𝗲𝗹 𝗺𝗲𝗿𝗴𝗶𝗻𝗴: Rather than struggling to find the right mix of data for training a single multilingual model, just train language-specific models, then merge them!
• Maximize diversity between merged checkpoints by training each on different language families.
• They experimented with fancy techniques (SLERP, TIES, DARE-TIES) but found weighted averaging to be the most consistent!
➡️ Merging had 3x more gains at the high 35B scale vs the 8B scale, consistent with literature findings that merging is more effective at scale.
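For reference, weighted averaging is the simplest of these merging methods. Here is a minimal sketch on toy checkpoints stored as dicts of flat lists. The parameter names, values, and weights are invented for illustration, and nothing here reflects the weights Cohere actually used.

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted average of checkpoints that share the same parameter names and shapes."""
    total = sum(weights)
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


# Two hypothetical language-family checkpoints with one tiny parameter each.
romance_ckpt = {"layer.weight": [1.0, 2.0]}
slavic_ckpt = {"layer.weight": [3.0, 6.0]}

merged = merge_checkpoints([romance_ckpt, slavic_ckpt], weights=[1, 3])
print(merged)
```

With real models the same loop would run over tensors in each state dict; the merge weights are a hyperparameter to tune on held-out evaluations.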
⚡️ 𝗚𝗿𝗲𝗮𝘁 𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Automatic evaluations on the Arena-Hard-Auto dataset:
➡️ Aya Expanse 8B beats models from its weight class such as Gemma 2 9B, Llama 3.1 8B, and the recent Ministral 8B, with win rates ranging from 60.4% to 70.6%
➡️ Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B (2x its size)
• ⚠️ But this performance eval comes from only one benchmark! Let's wait for Open LLM leaderboard evals.
amazing leaderboard by @rwightman: compare all the image backbones on various metrics against model performance. Below is an example of top-k accuracy against inference samples per second. timm/leaderboard
New sampling strategy dropped in 🤗 transformers -- Min P sampling 🔥
Are you tired of having top_k arbitrarily discarding high-quality continuations? Or top_p forgetting to exclude low-probability tokens, derailing your generation? Try out the new min_p flag in generate, fresh from a PR merged today! 🥬
Min P consists of a dynamic token filter -- as opposed to Top K, which keeps the K most likely tokens, and Top P, which keeps the most likely tokens up to a fixed cumulative probability, both static filters. Min P takes a base probability (defined in the min_p flag) and multiplies it by the probability of the most likely token in the distribution for the next token. All tokens less likely than the resulting value are filtered. What happens with this strategy?
👉 High probability token present -> aggressive filter (we don't want to miss out on that high-probability case and risk derailing generation)
👉 No high probability token present -> relaxed filter (there are many continuation possibilities that the model finds plausible)
You should set min_p to a low value, between 0.05 and 0.1. It behaves particularly well for creative text generation when paired up with temperature > 1.
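To make the filter concrete, here is a minimal pure-Python sketch of the rule described above. It is not the transformers implementation (which operates on the model's logits inside generate); the function name min_p_filter and the two toy distributions are made up for illustration.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max probability, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# One dominant token: the threshold is high, so the filter is aggressive.
peaked = {"the": 0.90, "a": 0.06, "an": 0.03, "zebra": 0.01}

# No dominant token: the threshold is low, so the filter is relaxed.
flat = {"red": 0.30, "blue": 0.28, "green": 0.25, "pink": 0.17}

peaked_kept = min_p_filter(peaked, min_p=0.1)
flat_kept = min_p_filter(flat, min_p=0.1)

print("peaked:", sorted(peaked_kept))
print("flat:  ", sorted(flat_kept))
```

With min_p=0.1, the peaked distribution has a cutoff of 0.09 and keeps only its top token, while the flat one has a cutoff of 0.03 and keeps all four candidates. In transformers itself, the same behavior is switched on through the min_p flag of generate when sampling.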
I have documented my journey of this specific PR in a blog post for everyone to read. The highlight of the PR was when the first author of DoRA reviewed my code.