Omar Sanseviero

osanseviero

AI & ML interests

Llamas, model merging, massive ASR for data collection, 3D ML, on-device ML, quantization, model judging, ML in browser, healthcare applications, education, intersection of art and ML.🦙

osanseviero's activity

replied to their post 14 days ago
posted an update 14 days ago
Diaries of Open Source. Part 15 🤗

🕵️‍♀️Idefics 2 is out, a multimodal open-source model with very nice capabilities
Models, demo, and datasets: HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe
Blog: https://hf.co/blog/idefics2

💾Snowflake released snowflake-arctic-embed, a family of powerful small embedding models
Model: Snowflake/snowflake-arctic-embed-m
Blog: https://www.snowflake.com/blog/introducing-snowflake-arctic-embed-snowflakes-state-of-the-art-text-embedding-family-of-models/

✨Pile-T5, EleutherAI's T5 model trained on 2T tokens
Blog: https://blog.eleuther.ai/pile-t5/
Models: EleutherAI/pile-t5-65a76a0d0022dd270b385a66
GitHub: https://github.com/EleutherAI/improved-t5

🤖CodeQwen1.5-7B base and chat models. Trained on 3T tokens, with strong benchmark results for code generation, editing, and SQL
Blog post: https://qwenlm.github.io/blog/codeqwen1.5/
Demo: Qwen/CodeQwen1.5-7b-Chat-demo
Models: Qwen/CodeQwen1.5-7B and Qwen/CodeQwen1.5-7B-Chat

Misc
🦉 DocOwl1.5: Unified Structure Learning for OCR-free Document Understanding mPLUG/DocOwl
👀Cerule - a tiny Vision LM model Tensoic/Cerule-v0.1
ChemLLM - an LLM for chemistry and molecule science ⚗️https://hf.co/AI4Chem/ChemLLM-7B-Chat-1.5-DPO
Distil Whisper Large
📝New pdf/OCR datasets with 19 samples pixparse/pdf-document-ocr-datasets-660701430b0346f97c4bc628
🔥Gretel AI high quality text-to-sql synthetic dataset gretelai/synthetic_text_to_sql
·
replied to clem's post 16 days ago

I had missed this one, thanks for sharing!

posted an update 21 days ago
Diaries of Open Source. Part 14 🤗

🔥CohereForAI releases Command R+, an open 104B model with:
- Tool usage capabilities
- Specialized in RAG
- Multilingual
It's one of the first models to surpass GPT-4 in the lmsys arena, check it out!
Model: CohereForAI/c4ai-command-r-plus
Official demo: CohereForAI/c4ai-command-r-plus
Quantized: CohereForAI/c4ai-command-r-plus-4bit

🎉Google releases a new version of their Gemma instruct models, with improved quality, a nicer conversational style, and a fancier RL algorithm. The model is similar to Llama 2 70B in the Chat Arena!
Models: google/gemma-release-65d5efbccdbb8c4202ec078b
Try it out in HuggingChat https://hf.co/chat/models/google/gemma-1.1-7b-it

🪄VoiceCraft, a speech editing and TTS SOTA open model
Paper: VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973)
Model: pyp1/VoiceCraft

💻Google released CodeGemma, a family of code generation, completion, and chat models
Blog post: https://hf.co/blog/codegemma
Models: google/codegemma-release-66152ac7b683e2667abdee11
Report: https://storage.googleapis.com/deepmind-media/gemma/codegemma_report.pdf

Misc models:
🦖T-Rex2, a very powerful object detection model for many applications https://github.com/IDEA-Research/T-Rex
👀 CT-RATE : A 3D dataset paired with text reports ibrahimhamamci/CT-RATE
🐙Octopus v2: a Gemma-based model trained for Android API - extremely fast, better than Llama+RAG, great results NexaAIDev/Octopus-v2
·
replied to louisbrulenaudet's post 21 days ago
replied to their post 30 days ago
replied to their post 30 days ago
posted an update 30 days ago
Diaries of Open Source. Part 13 🤗

🤏Two different bitnet 1.5 open-source replications
Original paper: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (2402.17764)
1bitllm experiment: https://hf.co/blog/joey00072/experiments-with-bitnet-1-5
NousResearch experiment NousResearch/OLMo-Bitnet-1B

🥳Tiny and large multimodal models great for embeddings
GitHub: https://github.com/unum-cloud/uform
Encoders: https://hf.co/collections/unum-cloud/multimodal-encoders-660553903617c5297eb16838
ONNX weights: https://hf.co/collections/unum-cloud/uform-vl-english-large-onnx-66055a57c182d846f3bc1949

📜 SMPLer-X: Expressive Human Pose and Shape Estimation
Project website: https://caizhongang.com/projects/SMPLer-X/
Demo: caizhongang/SMPLer-X
Paper: SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation (2309.17448)

🧙GeoWizard: 3D Geometry Estimation
Project website: https://fuxiao0719.github.io/projects/geowizard/
Demo: lemonaddie/geowizard

Misc models and datasets
- Dolphin-2.8-mistral-7b-v0.2 cognitivecomputations/dolphin-2.8-mistral-7b-v02
- Hermes-2-Pro-11B, a self-frankenmerge 11B variant mattshumer/Hermes-2-Pro-11B
- Large conversational dataset based on Usenet data in the Italian language mii-community/UsenetArchiveIT-conversations
·
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 12 🤗

🚀Alibaba releases Qwen1.5-MoE-A2.7B, an interesting MoE with 2.7B activated parameters and 64 experts
Blog https://qwenlm.github.io/blog/qwen-moe/
Demo: Qwen/qwen1.5-MoE-A2.7B-Chat-demo
Models: https://hf.co/Qwen
GitHub: https://github.com/QwenLM/Qwen1.5

🎵VoiceCraft, SOTA speech editing and text to speech
GitHub: https://github.com/jasonppy/VoiceCraft
Model: pyp1/VoiceCraft

🐍 AI21Labs releases Jamba, an SSM-Transformer, pretrained MoE which allows a large context window (256K) and high throughput
Blog https://www.ai21.com/blog/announcing-jamba
Model ai21labs/Jamba-v0.1

✨ Berkeley releases Starling-LM-7B, an RLHF-ed model, and -RM-34B, a Yi-based reward model very good for its size
Starling Beta: Nexusflow/Starling-LM-7B-beta
Starling RM: Nexusflow/Starling-RM-34B

🖥️Stability releases Stable Code Instruct 3B, an instruct model for code generation
Blog: https://stability.ai/news/introducing-stable-code-instruct-3b
Demo: stabilityai/stable-code-instruct-3b
Report: https://stability.ai/s/Stable_Code_TechReport_release.pdf

📚Common Corpus: the largest public domain dataset for training LLMs
Blog: https://hf.co/blog/Pclanglais/common-corpus
Dataset: PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

Misc:
⚡GaLore: a very memory-efficient technique that allows pretraining models in consumer GPUs https://hf.co/blog/galore
📈Moirai, foundation models for time series forecasting Salesforce/moirai-10-r-models-65c8d3a94c51428c300e0742
🔥 Mistral-ORPO-Capybara-7K, a high-quality Mistral fine-tune using ORPO, a new alignment technique kaist-ai/mistral-orpo-capybara-7k
🤯APISR, an anime super-resolution upscaling model HikariDawn/APISR
·
replied to JustinLin610's post about 1 month ago

This is super exciting! Congrats for the release!

replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 11 🚀

🚀Databricks releases DBRX, potentially the best open access model! A 132B Mixture of Experts with 36B active params, trained on 12 trillion tokens
Blog: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Base and instruct models: databricks/dbrx-6601c0852a0cdd3c59f71962
Demo: databricks/dbrx-instruct

🤏1-bit and 2-bit quantization exploration using HQQ+
Blog post: https://mobiusml.github.io/1bit_blog/
Models: mobiuslabsgmbh/llama2-7b-hqq-6604257a96fc8b9c4e13e0fe
GitHub: https://github.com/mobiusml/hqq

📚Cosmopedia: a large-scale synthetic dataset for pre-training - it includes 25 billion tokens and 30 million files
Dataset: HuggingFaceTB/cosmopedia
Blog: https://hf.co/blog/cosmopedia

⭐Mini-Gemini: multi-modal VLMs, from 2B to 34B
Models: https://hf.co/collections/YanweiLi/mini-gemini-6603c50b9b43d044171d0854
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (2403.18814)
GitHub: https://github.com/dvlab-research/MiniGemini

🔥VILA - On Pre-training for VLMs
Models: Efficient-Large-Model/vila-on-pre-training-for-visual-language-models-65d8022a3a52cd9bcd62698e
Paper: VILA: On Pre-training for Visual Language Models (2312.07533)

Misc
👀 FeatUp: a framework for image features at any resolution: mhamilton723/FeatUp FeatUp: A Model-Agnostic Framework for Features at Any Resolution (2403.10516)
🍞ColBERTus Maximus, a colbertialized embedding model mixedbread-ai/mxbai-colbert-large-v1
🖌️Semantic Palette, a new drawing paradigm ironjr/SemanticPalette
🧑‍⚕️HistoGPT, a vision model that generates accurate pathology reports marr-peng-lab/histogpt https://www.medrxiv.org/content/10.1101/2024.03.15.24304211v1
·
replied to monsoon-nlp's post about 1 month ago
replied to Locutusque's post about 1 month ago
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 10 🚀

🌼Marigold-LCM: A super fast SOTA Depth Estimator
Demo: prs-eth/marigold-lcm
Original paper: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (2312.02145)
Model: prs-eth/marigold-lcm-v1-0

🌟Quiet-STaR: A self-teaching technique via internal monologue
Paper: Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (2403.09629)
GitHub: https://github.com/ezelikman/quiet-star
Tweetutorial: https://twitter.com/ericzelikman/status/1768663835106513041

🖼️ WebSight v0.2: An image-to-code dataset containing tailwind CSS, images in screenshots, and more!
Dataset: HuggingFaceM4/WebSight
Paper: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
Blog: https://hf.co/blog/websight

🕵️Agent-FLAN - effective agent tuning for LLMs
Paper: Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models (2403.12881)
Model: internlm/Agent-FLAN-7b
Dataset: internlm/Agent-FLAN
Website: https://internlm.github.io/Agent-FLAN/

🔥HPT, a family of multimodal LLMs from HyperGAI
Blog post: https://hypergai.com/blog/introducing-hpt-a-family-of-leading-multimodal-llms
Model: HyperGAI/HPT
GitHub: https://github.com/hyperGAI/HPT

🌏Models and datasets around the world
- Tess-70B, a MiQu-70B fine-tune with high-quality data migtissera/Tess-70B-v1.6
- UNI, a model trained on 100 million pathology images from 100k+ slides MahmoodLab/UNI
- CONCH, a VLM trained on 1.17 million pathology image-text pairs MahmoodLab/CONCH
·
replied to their post about 1 month ago
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 9!

⏰Amazon releases Chronos, a family of models for time series
Base model: amazon/chronos-t5-large
Paper: Chronos: Learning the Language of Time Series (2403.07815)
Models: amazon/chronos-models-65f1791d630a8d57cb718444

💡ORPO Alignment: align without a reference model or SFT!
Paper: ORPO: Monolithic Preference Optimization without Reference Model (2403.07691)
Models: kaist-ai/orpo-65efef87544ba100aef30013
GitHub: https://github.com/xfactlab/orpo

🇺🇳Cohere releases 250M Wikipedia Embeddings in 300+ languages
Data: Cohere/wikipedia-2023-11-embed-multilingual-v3
Announcement: https://twitter.com/Nils_Reimers/status/1767891859207057618

🧬SegmentNT: an LLM for annotating DNA at single nucleotide resolution
Models: InstaDeepAI/segmentnt-65eb4941c57808b4a3fe1319
GitHub repo: https://github.com/instadeepai/nucleotide-transformer
Paper: https://www.biorxiv.org/content/10.1101/2024.03.14.584712v1

🚀DynamiCrafter: video generation models for interpolation and looping are out!
Project page: https://doubiiu.github.io/projects/DynamiCrafter/
GitHub: https://github.com/Doubiiu/DynamiCrafter
Demo: Doubiiu/DynamiCrafter_interp_loop

🚀Stanford releases Anticipatory Music Transformer:
GitHub: https://github.com/jthickstun/anticipation/
Models: https://hf.co/stanford-crfm
Original blog announcement: https://crfm.stanford.edu/2023/06/16/anticipatory-music-transformer.html
·
replied to giux78's post about 1 month ago
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 8!

🤯CRM: Image-to-3D Textured Mesh
Demo: Zhengyi/CRM
Model: Zhengyi/CRM
Project page: https://ml.cs.tsinghua.edu.cn/~zhengyi/CRM/
Paper: CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model (2403.05034)

🤏Half Quadratic Quantization: super-fast quantization of very large models
Blog post: https://mobiusml.github.io/hqq_blog/
Colab: https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
Repo: https://github.com/mobiusml/hqq

🤗GemMoE -Gemma + MoE
Model: Crystalcareai/GemMoE-Base-Random
Collection: Crystalcareai/gemmoe-65f11f4922af97ebe9943591

👀VeCLIP and MOFI, new 0-shot and image retrieval models by Apple, are now open-source!
GitHub: https://github.com/apple/ml-veclip/ and https://github.com/apple/ml-mofi
VeCLIP paper: From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions (2310.07699)
MOFI paper: MOFI: Learning Image Representations from Noisy Entity Annotated Images (2306.07952)

⚡SPIN: Recipe for alignment with very little data
Collection: argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3
Tweetutorial: https://twitter.com/argilla_io/status/1767608154697699455

👀ViT Prisma - an interpretability library for vision models
GitHub: https://github.com/soniajoseph/ViT-Prisma

☕OpenLRM: full model and training code are open-sourced
Codebase: https://github.com/3DTopia/OpenLRM
Demo: zxhezexin/OpenLRM
Models: https://huggingface.co/zxhezexin

⚗️Oxford releases an extensive PEFT evaluation for bio models
Model: NTaylor/bio-mobilebert-mimic-mp-lora
GitHub: https://github.com/nlpie-research/efficient-ml
Paper: Efficiency at Scale: Investigating the Performance of Diminutive Language Models in Clinical Tasks (2402.10597)

🌍Data and models around the world
Hermes 2 Pro 7B: an upgraded Nous Hermes 2 model with strong function calling and JSON capabilities NousResearch/Hermes-2-Pro-Mistral-7B
Navarasa-2.0: Gemma fine-tuned in 15 Indian languages Telugu-LLM-Labs/navarasa-65f5e6ffdf29f02c6d7767ce
·
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 7!

🔥Sakana releases Evolutionary Model Merge
Blog post: https://sakana.ai/evolutionary-model-merge/
Paper: Evolutionary Optimization of Model Merging Recipes (2403.13187)
Models and demo: https://hf.co/SakanaAI

🍞MixedBread releases new SoTA sentence embedding model
Announcement: https://www.mixedbread.ai/blog/mxbai-embed-large-v1
Model: mixedbread-ai/mxbai-embed-large-v1

🎥VideoMamba, a Mamba-based model for video understanding
Blog: https://hf.co/blog/vladbogo/video-mamba
Demo: OpenGVLab/VideoMamba
Model: OpenGVLab/VideoMamba

🔍 MathVerse, a visual math benchmark for multimodal LLMs
Paper page: https://mathverse-cuhk.github.io/
Dataset: AI4Math/MathVerse
Paper: MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? (2403.14624)

🧠GraphWiz, a family of instruct-tuned LLMs to solve graph problems
Repos: https://hf.co/GraphWiz
Paper: GraphWiz: An Instruction-Following Language Model for Graph Problems (2402.16029)

🪆NLLB-SigLIP-MRL: a combination of NLLB and SigLIP trained with Matryoshka representation learning
Model: visheratin/nllb-siglip-mrl-large
Tweet: https://twitter.com/visheratin/status/1766643219909984734?s=46

🧍HDM and ProciGen: Template-free reconstruction of human-object interactions
Paper page: https://virtualhumans.mpi-inf.mpg.de/procigen-hdm/
Demo: xiexh20/HDM-interaction-recon
Models: xiexh20/HDM-models

🌎Models and data around the world
EagleX 7B, multi-lingual RNN-based model https://hf.co/spaces/recursal/EagleX-7B-1.7T-Gradio-Demo
Tamil LLM mervinpraison/tamil-large-language-model-7b-v1.0
·
replied to lorraine2's post about 1 month ago
replied to their post about 1 month ago
posted an update about 1 month ago
Diaries of Open Source. Part 6!

🏎️xAI releases Grok-1, a 314B MoE
Blog: https://x.ai/blog/grok-os
GH repo: https://github.com/xai-org/grok-1
Model: xai-org/grok-1

🕺MusicLang, a model for controllable music generation
Demo: musiclang/musiclang-predict
GH repo: https://github.com/musiclang/musiclang_predict

🔬BioT5: a family of models for biology and chemical text tasks
Base model: QizhiPei/biot5-base
Model for molecule captioning and design: QizhiPei/biot5-base-mol2text and QizhiPei/biot5-base-text2mol
GH Repo: https://github.com/QizhiPei/BioT5
Paper: BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations (2310.07276)

🤏Check out the AQLM and QMoE official weights from ISTA-DAS lab
Org: https://hf.co/ISTA-DASLab
Papers: Extreme Compression of Large Language Models via Additive Quantization (2401.06118) and QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (2310.16795)

🚀Community releases
Einstein-v4-7B, a Mistral fine-tune on high-quality data Weyaxi/Einstein-v4-7B
IL-7B, a Mistral fine-tune merge for rheumatology cmcmaster/il_7b
Caselaw Access Project, a collaboration to digitize 40 million US court decisions from 6.7 million cases spanning 360 years TeraflopAI/Caselaw_Access_Project

🌍Data and models around the world
HPLT Monolingual, a dataset of 75 languages with over 40TB of data HPLT/hplt_monolingual_v1_2
OpenLLM Turkish Benchmarks & Leaderboard malhajar/openllmturkishleadboard-datasets-65e5854490a87c0f2670ec18 and malhajar/OpenLLMTurkishLeaderboard
Occiglot, a collaborative effort for European LLMs with an initial release of 7B models for French, German, Spanish, and Italian occiglot/occiglot-eu5-7b-v01-65dbed502a6348b052695e01
Guftagoo, a Hindi+Hinglish multi-turn conversational dataset https://hf.co/datasets/Tensoic/gooftagoo
AryaBhatta-Orca-Maths-Hindi dataset https://hf.co/datasets/GenVRadmin/Aryabhatta-Orca-Maths-Hindi
·
posted an update about 2 months ago
Diaries of Open Source. Part 5!

🤯Contextual KTO Mistral PairRM: this model combines iterative KTO, SnorkelAI DPO dataset, Allenai PairRM for ranking, Mistral for the base model, and is a very strong model with Claude 3 quality on AlpacaEval 2.0
Final model: ContextualAI/Contextual_KTO_Mistral_PairRM
Dataset: snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
Leaderboard: https://tatsu-lab.github.io/alpaca_eval/
Base model: mistralai/Mistral-7B-Instruct-v0.2

🤏 tinyBenchmarks: Quick and cheap LLM evaluation!
Code: https://github.com/felipemaiapolo/tinyBenchmarks
Paper: tinyBenchmarks: evaluating LLMs with fewer examples (2402.14992)
Data: tinyBenchmarks/tinyMMLU

🎨Transformers.js 2.16 includes StableLM, speaker verification and diarization, and better chat templating. Try some fun demos!
- Xenova/video-object-detection
- Xenova/cross-encoder-web
- Xenova/the-tokenizer-playground

🏴‍☠️ Abacus Liberated-Qwen1.5-72B, a Qwen 72B-based model that strongly follows system prompts
Model: abacusai/Liberated-Qwen1.5-72B

👀Design2Code: benchmark of webpage screenshots to code
Data: SALT-NLP/Design2Code
Project https://salt-nlp.github.io/Design2Code/
Paper Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)

🌎Data and models around the world
- One of the biggest Italian datasets https://hf.co/datasets/manalog/UsenetArchiveIT
- IndicLLMSuite: Largest Pre-training and Instruction Fine-tuning dataset collection across 22 Indic languages ai4bharat/indicllmsuite-65ee7d225c337fcfa0991707
- Hebrew-Gemma-11B, the best base Hebrew model yam-peleg/Hebrew-Gemma-11B
- Komodo-7B, a family of LLMs for multiple Indonesian languages Yellow-AI-NLP/komodo-7b-base

You can find the previous part at https://huggingface.co/posts/osanseviero/127895284909100
replied to chiphuyen's post about 2 months ago

Thanks for sharing! Btw Gradio is a separate org but is also HF :)

posted an update about 2 months ago
Diaries of Open Source. Part 4!

🌏Cohere and Cohere4AI release Command-R, a 35B model that is multilingual, RAG-optimized, and can manage tools!
Model: CohereForAI/c4ai-command-r-v01
Blog post: https://txt.cohere.com/command-r/

🧑‍🍳StarChat2: A powerful code model that is conversational
Try it out: HuggingFaceH4/starchat2-playground
Repos: HuggingFaceH4/starchat2-15b-65f068417b330fafad751fce
Training code: https://github.com/huggingface/alignment-handbook/tree/main/recipes/starchat2-15b

🐲Yi-9B: trained on 3 trillion tokens, this English-Chinese LLM is quite good and comes with a very nice, detailed report!
Model: 01-ai/Yi-9B
Paper: Yi: Open Foundation Models by 01.AI (2403.04652)

🐋DeepSeek-VL, 1.3B and 7B VLMs
Paper: DeepSeek-VL: Towards Real-World Vision-Language Understanding (2403.05525)
Large model: deepseek-ai/deepseek-vl-7b-chat

✍️Writer releases OmniACT: a dataset for multimodal agents for desktop and web.
Dataset: Writer/omniact
Paper: OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web (2402.17553)

🍎Apple releases MobileCLIP: fast image-text models! https://github.com/apple/ml-mobileclip

🦙💪LlamaGym - fine-tune LLM agents with RL in just a few lines of code! https://github.com/KhoomeiK/LlamaGym

🖼️New multimodal leaderboard ConTextual https://huggingface.co/blog/leaderboard-contextual

🎁 Design2Code: benchmark for multimodal LLMs for automating front-end development.
Dataset SALT-NLP/Design2Code
Paper Design2Code: How Far Are We From Automating Front-End Engineering? (2403.03163)
Project https://salt-nlp.github.io/Design2Code/

You can find the previous part at https://huggingface.co/posts/osanseviero/633758457910104
replied to Jaward's post about 2 months ago
posted an update about 2 months ago
Diaries of Open Source. Part 3! OS goes to the moon!

💻 OpenCodeInterpreter, a family of very powerful code generation models
Models: m-a-p/opencodeinterpreter-65d312f6f88da990a64da456
Paper: OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement (2402.14658)
Demo m-a-p/OpenCodeInterpreter_demo

🔷🔶Zephyr 7B Gemma, Gemma fine-tuned with the Zephyr recipe
Model: HuggingFaceH4/zephyr-7b-gemma-v0.1
Demo: HuggingFaceH4/zephyr-7b-gemma-chat
GH Repo: https://github.com/huggingface/alignment-handbook

🪆The MixedBread folks released a 2D Matryoshka text embedding model, which means you can dynamically change the embedding size and layer counts
Model: mixedbread-ai/mxbai-embed-2d-large-v1
Release blog post: https://www.mixedbread.ai/blog/mxbai-embed-2d-large-v1
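
To make the "change the embedding size" part concrete, here is a minimal sketch of how a Matryoshka-style embedding is typically shortened: keep the first k dimensions and re-normalize. The vectors are made-up toy values, not outputs of the model, and the helper names are hypothetical. The layer-count dimension of the 2D approach needs the model itself and is not covered here.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values of an embedding and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity for vectors that are already unit-normalized."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional "embeddings" (illustrative values only).
full_a = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01]
full_b = [0.8, 0.4, 0.1, 0.2, 0.01, 0.03, 0.02, 0.00]

# Compare similarity at several embedding sizes.
for dim in (8, 4, 2):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    print(dim, round(cosine(a, b), 3))
```

Smaller sizes trade some accuracy for cheaper storage and faster search; how much is lost depends on the model and the task.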

🐋Microsoft released Orca Math, which includes 200K grade school math problems
Dataset: microsoft/orca-math-word-problems-200k

🥷IBM silently released Merlinite, a cool model trained on Mixtral-generated synthetic data using a novel LAB method ibm/merlinite-7b

🌚 Moondream2 - a small vision language model to run on-device!
Model: vikhyatk/moondream2
Demo: vikhyatk/moondream2

🏙️CityDreamer: 3D City Generation
Demo: hzxie/city-dreamer
Repo: https://github.com/hzxie/city-dreamer
Model: hzxie/city-dreamer

🌏ML in all languages
Sailor, a family of models for South-East Asian languages sail/sailor-language-models-65e19a749f978976f1959825
Samvaad dataset, which includes 140k QA pairs in Hindi, Bengali, Marathi, Tamil, Telugu, Oriya, Punjabi, and Gujarati GenVRadmin/Samvaad-Mixed-Language-2

You can see the previous part at https://huggingface.co/posts/osanseviero/674644082063278
·
replied to mayank-mishra's post about 2 months ago
replied to DmitryRyumin's post about 2 months ago

Very cool! It would be great to have the checkpoints on the Hub, too :)
Congrats on getting accepted at ICLR

cc @dylanebert

replied to their post about 2 months ago
replied to urchade's post about 2 months ago

Very cool! Are the model and the data somewhere on Hugging Face to easily download?

posted an update about 2 months ago
Diaries of Open Source. Part 2. Open Source is going brrrrr

🚀The European Space Agency releases MajorTOM, a dataset of earth observation covering half the earth. The dataset has 2.5 trillion pixels! Congrats @aliFrancis and @mikonvergence !
Dataset: Major-TOM/Core-S2L2A
Viewer: Major-TOM/MajorTOM-Core-Viewer

🍞Re-ranking models by MixedBreadAI, with very high quality, Apache 2 license, and easy to use!
Models: https://huggingface.co/models?other=reranker&sort=trending&search=mixedbread-ai
Blog: https://www.mixedbread.ai/blog/mxbai-rerank-v1

🧊StabilityAI and TripoAI release TripoSR, a super-fast MIT-licensed image-to-3D model!
Model: stabilityai/TripoSR
Demo: stabilityai/TripoSR

🤝Together AI and HazyResearch release Based
Models and datasets: hazyresearch/based-65d77fb76f9c813c8b94339c
GH repo: https://github.com/HazyResearch/based

🌊LaVague: an open-source pipeline to turn natural language into browser actions! It can run locally with HuggingFaceH4/zephyr-7b-gemma-v0.1
Read more about it at https://huggingface.co/posts/dhuynh95/717319217106504

🏆Berkeley Function-Calling Leaderboard
Read about it: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html
Leaderboard: https://gorilla.cs.berkeley.edu/leaderboard.html

🐬Sailor-Chat: chat models built on top of OpenOrca and @sarahooker CohereForAI Aya project. They can be used for South-East Asia languages such as Indonesian, Thai, Vietnamese, Malay and Lao!
Models: sail/sailor-language-models-65e19a749f978976f1959825
Demo: sail/Sailor-7B-Chat

🤗Arabic-OpenHermes-2.5: OpenHermes dataset translated to Arabic 2A2I/Arabic-OpenHermes-2.5

See the previous part here https://huggingface.co/posts/osanseviero/622788932781684
·
replied to robmarkcole's post about 2 months ago
posted an update about 2 months ago
Diaries of Open Source. Part 1.

What a week! Here are some of the exciting Open Source releases of the week!

1. BigCode releases The Stack v2 and StarCoder 2
Resources in https://huggingface.co/posts/loubnabnl/596860170283496
Blog https://huggingface.co/blog/starcoder2
Collection: bigcode/starcoder2-65de6da6e87db3383572be1a

2. Playground v2.5, a very powerful new text-to-image model
Model: playgroundai/playground-v2.5-1024px-aesthetic
Demo: playgroundai/playground-v2.5
Blog: https://playground.com/blog/playground-v2-5

3. Evo: DNA foundation models
Blog: https://arcinstitute.org/news/blog/evo
Models: togethercomputer/evo-1-131k-base

4. OpenHermesPreferences: a dataset of ~1 million AI Preferences argilla/OpenHermesPreferences

5. SpeechBrain 1.0: a toolkit with hundreds of recipes and pretrained models for audio-related tasks, such as speech recognition, diarization, and enhancement. New major release!
HF repos: https://huggingface.co/speechbrain
Website: https://speechbrain.github.io/

6. Tower: a suite of Llama-based multilingual translation models Unbabel/tower-659eaedfe36e6dd29eb1805c

7. AllenAI releases OLMo-7B-Instruct
allenai/olmo-suite-65aeaae8fe5b6b2122b46778

8. DIBT - A crowdsourced effort to human-rate prompts. Its 10k prompts dataset is released https://huggingface.co/datasets/DIBT/10k_prompts_ranked

9. ChatMusician: A Llama 2 fine-tuned model for music generation m-a-p/ChatMusician

10. Bonito, a model that converts data into synthetic instruction datasets
GitHub: https://github.com/BatsResearch/bonito
Model: BatsResearch/bonito-v1
Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334)
·
posted an update 2 months ago
Introducing: Zephyr Gemma!

The community has struggled to do a good preference-tune of Gemma, so the amazing @lewtun and @philschmid built an open-source recipe and trained a model to help people get started.

Handbook: https://github.com/huggingface/alignment-handbook/blob/main/recipes/zephyr-7b-gemma/README.md
Model: HuggingFaceH4/zephyr-7b-gemma-v0.1
Demo: HuggingFaceH4/zephyr-7b-gemma-chat

Some interesting details
- Fine-tuned on DEITA and DPOed with Argilla DPO dataset
- Very strong MT Bench results (7.81), better than Zephyr Beta (mistral based) and Gemma Instruct
- Can run locally with tools such as llama.cpp on a Mac
- Not so good AGIEval results compared to mistral-based tunes
- All training code is open-sourced
- Trained for 105 minutes on 8x H100
- No system message

Big kudos to the team! Super exciting to see a good fine-tune for Gemma
·
replied to chiphuyen's post 2 months ago
replied to vladbogo's post 2 months ago
replied to trisfromgoogle's post 2 months ago

This is such an exciting release!! Amazing work from Google DeepMind and all the team!

posted an update 3 months ago
Mixture of experts: beware 🛡️⚔️

New paper by DeepMind on buffer overflow in MoE: Buffer Overflow in Mixture of Experts (2402.05526)

The paper shows an adversarial attack strategy in which a user sends malicious queries that can affect the output of other user queries from the same batch.

So if in the same batch we have
- User A benign query
- User B malicious query
The response for A might be altered!😱

How is this possible?
One approach is to fill the token buffers with adversarial data, forcing the gating to use non-ideal experts or to drop the benign tokens entirely (when the buffers have a finite capacity).

This assumes that the adversary can use the model as a black box, can observe the logit outputs, and can ensure that the data is always grouped in the same batch.
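
A toy sketch of the buffer-filling idea, assuming top-1 routing, one shared expert, and a buffer that drops tokens once it is full. The function name and the numbers are made up for illustration; this is not the paper's code or a real MoE implementation.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (owner of the query, expert the gate picks)

def route_with_capacity(batch: List[Token], capacity: int):
    """Assign tokens to experts in batch order; drop tokens once a buffer is full."""
    load = {}
    kept, dropped = [], []
    for owner, expert in batch:
        if load.get(expert, 0) < capacity:
            load[expert] = load.get(expert, 0) + 1
            kept.append((owner, expert))
        else:
            dropped.append((owner, expert))
    return kept, dropped

benign = [("A", 0), ("A", 0)]
adversarial = [("B", 0)] * 4  # B crafts tokens that all route to expert 0

# A alone: both tokens fit in the buffer.
kept_alone, dropped_alone = route_with_capacity(benign, capacity=4)

# B's tokens sit ahead of A's in the same batch and fill the buffer first.
kept_mixed, dropped_mixed = route_with_capacity(adversarial + benign, capacity=4)

print(dropped_alone)  # []
print(dropped_mixed)  # [('A', 0), ('A', 0)]
```

In this toy setup A's tokens are dropped only because of what B sent, which is the cross-user effect described above.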

How to mitigate this?
- Randomize batch order (and even run twice if some queries are very sensitive)
- Use a large capacity slack
- Sample from gate weights instead of top-k (not great IMO, as that requires more memory for inference)
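
A rough sketch of why the first two mitigations help, under the same kind of toy assumptions (top-1 routing, a single expert, tokens dropped once the buffer is full). The helper name, batch, and trial count are hypothetical and chosen only for illustration.

```python
import random

def dropped_owners(batch, capacity):
    """Return the owners of tokens dropped by in-order routing with a finite buffer."""
    load = {}
    dropped = []
    for owner, expert in batch:
        if load.get(expert, 0) < capacity:
            load[expert] = load.get(expert, 0) + 1
        else:
            dropped.append(owner)
    return dropped

# Four adversarial tokens from B placed ahead of two benign tokens from A.
batch = [("B", 0)] * 4 + [("A", 0)] * 2

# Fixed order: B fills the buffer, so both of A's tokens are dropped.
fixed_order = dropped_owners(batch, capacity=4)

# Randomized order: count how often A still loses both tokens.
rng = random.Random(0)
trials = 2000
fully_dropped = 0
for _ in range(trials):
    shuffled = batch[:]
    rng.shuffle(shuffled)
    if dropped_owners(shuffled, capacity=4).count("A") == 2:
        fully_dropped += 1
shuffle_rate = fully_dropped / trials

# Capacity slack: a buffer large enough for the whole batch drops nothing.
with_slack = dropped_owners(batch, capacity=6)

print(fixed_order, round(shuffle_rate, 3), with_slack)
```

Shuffling does not stop tokens from being dropped; it only makes it much harder for the adversary to control whose tokens are dropped. Extra capacity removes the drops in this example, at the cost of more memory.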

Very cool paper!!
replied to akhaliq's post 3 months ago
replied to victor's post 3 months ago
replied to hunkim's post 3 months ago
replied to yuchenlin's post 3 months ago

This is amazing! As so many new VLLMs are being launched, this can be quite impactful!

replied to satpalsr's post 3 months ago

Any plans to make the SFT dataset public?

replied to soldni's post 3 months ago

This is amazing! Congratulations on the launch!

replied to clem's post 3 months ago
replied to their post 3 months ago

At the same time, there was discussion when BLOOM was trained that the multilingualism was hurting the English performance