Merve Noyan (merve)

AI & ML interests: Natural language understanding

merve's activity

posted an update 3 days ago
Just landed on the Hugging Face Hub: a community-led computer vision course 📖🤍
Learn everything from the fundamentals to the details of bleeding-edge vision transformers!
posted an update 4 days ago
I have built a Space to compare the outputs of different vision language models. Which model should I add next? 👀
Try them yourself here merve/compare_VLMs
posted an update 8 days ago
replied to xiaotianhan's post 9 days ago

Hiya, are you planning to open-source the models?

posted an update 9 days ago
posted an update 11 days ago
I see you all sending your documents to closed-source APIs, and this is not ok 👎 it breaks my heart 💔
I have seen many open-source document models, and I am amazed by what IDEFICS2 has done with document understanding 🤯🤩 it's not something you've ever seen before! HuggingFaceM4/idefics-8b

Please use it! It has an Apache 2.0 license ❤️
posted an update 12 days ago
The demo for IDEFICS-8B is out! HuggingFaceM4/idefics-8b

This checkpoint is not optimized for chat, but rather works very well for various tasks, including visual question answering and document tasks 💬📑
The chatty one is coming soon!
posted an update about 1 month ago
SegGPT is a vision generalist for image segmentation, quite like GPT for computer vision ✨
It comes with the latest release of transformers 🎁 Demo and more in this post!
SegGPT is an extension of Painter, where you speak to images with images: the model takes an image prompt, a transformed version of that prompt, and the actual image you want the same transform applied to, and it is expected to output the transformed image.
SegGPT consists of a vanilla ViT with a decoder on top (linear, conv, linear).
The model is trained on diverse segmentation examples: the authors provide example image-mask pairs and the actual input to be segmented, and the decoder head learns to reconstruct the mask output.
This generalizes pretty well!
The authors do not claim state-of-the-art results, as the model is mainly used for zero-shot and few-shot inference. They also do prompt tuning, where they freeze the parameters of the model and only optimize the image tensor (the input context).
Thanks to 🤗 transformers you can use this model easily!
See here https://huggingface.co/docs/transformers/en/model_doc/seggpt
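Here is a minimal sketch under a few assumptions: prompt_image, prompt_mask and input_image are PIL images you loaded yourself, the prompt mask has a single foreground class, and the checkpoint id follows the docs. Treat it as a sketch, not an official snippet.

import torch
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

# prompt_image + prompt_mask show the model an example transform,
# input_image is the image you want segmented the same way
inputs = processor(images=input_image, prompt_images=prompt_image, prompt_masks=prompt_mask, num_labels=1, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize the predicted mask back to the original resolution
mask = processor.post_process_semantic_segmentation(outputs, target_sizes=[input_image.size[::-1]], num_labels=1)[0]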
I have built an app for you to try it out. I combined SegGPT with the Depth Anything model, so you don't have to upload the mask of your image-mask prompt pair yourself 🤗
Try it here merve/seggpt-depth-anything
Also check out the collection merve/seggpt-660466a303bc3cd7559d271b
replied to davanstrien's post about 1 month ago

I think it would be nice if the tl;dr included what the data looks like, how it was curated, the license, what type of model it can be trained with, and so on. It would be very useful for me 🤩

posted an update about 1 month ago
LLaVA-NeXT was recently merged into Hugging Face transformers, and it outperforms many closed-source models like Gemini on various benchmarks 🤩 Let's take a look!
Demo: merve/llava-next
Notebook: https://colab.research.google.com/drive/1afNudu72SNWZCYtCVrRlb9T9Vj9CFJEK?usp=sharing
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
LLaVA 1.5 was released with Vicuna, but LLaVA-NeXT (1.6) is released with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
Mistral and Nous-Hermes-Yi-34B perform better and are more permissive for commercial use.
Moreover, according to the authors' findings, the improvements come from a more diverse, higher-quality data mixture and dynamic high resolution.
LLaVA-NeXT based on Nous-Hermes-Yi-34B outperforms many other models, including Gemini, on various multimodal understanding and generation benchmarks 😊
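If you want to try it from transformers directly, here is a rough sketch with the Mistral-7B conversion (the checkpoint id and the [INST] prompt template are the ones used by the llava-hf conversions, and image is a PIL image you load yourself):

import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

checkpoint = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(checkpoint)
model = LlavaNextForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto")

# [INST] ... [/INST] is the Mistral chat format; <image> marks where the image goes
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))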
replied to vikhyatk's post about 1 month ago

I really like your work, and I did check the moondream GitHub repository. I was wondering if you'd like to share your training details and findings on aligning the text decoder, the vision encoder and the projection layer.

posted an update about 2 months ago
I love vision language models 💗
My favorite is KOSMOS-2, because it's a grounded model (it doesn't hallucinate).
In this demo you can,
- ask a question about the image,
- do detailed/brief captioning,
- localize the objects! 🤯
It's just amazing for a VLM to return bounding boxes 🤩
Try it here merve/kosmos2
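If you'd rather call it from transformers than the demo, a minimal grounded-captioning sketch looks roughly like this (the checkpoint id is the KOSMOS-2 release, and image is a PIL image you load yourself):

from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

checkpoint = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(checkpoint)
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)

# the <grounding> token asks the model to tie phrases to image regions
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# split the caption from the entities and their bounding boxes
caption, entities = processor.post_process_generation(generated_text)
print(caption, entities)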
replied to akhaliq's post about 2 months ago

one of the best research questions I've seen recently 😊

posted an update about 2 months ago
New foundation model for document understanding and generation in transformers 🤩
UDOP by MSFT is a bleeding-edge model that is capable of many tasks, including question answering, document editing and more! 🤯
Demo 👉 merve/UDOP
It is a model that combines vision, text and layout. 📝
This model is very interesting because the input representation truly captures the nature of the document modality: the text, where the text is, and the layout of the document all matter!
If you know T5, it resembles that: it's pre-trained on both self-supervised and supervised objectives over text, image and layout.
To switch between tasks, one simply changes the task-specific prompt at the beginning; e.g. for QA, one prepends the input with "Question answering."
As for the architecture, it's like T5, except it has a single encoder that takes in text, image and layout, and two decoders (text-layout and vision decoders) combined into one.
The vision decoder is a masked autoencoder (thus the capabilities of document editing).
For me, the most interesting capabilities are document reconstruction, document editing and layout re-arrangement. This decoder isn't released, though, because it could be used maliciously to forge documents.
Overall, the model performs very well on the document understanding benchmark (DUE), as well as on information extraction (FUNSD, CORD) and classification (RVL-CDIP), across vision, text and layout modalities.
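As a rough illustration of the prompt-based interface (not an official snippet: the checkpoint, question, words and boxes below are made up, and in practice the words and boxes would come from an OCR engine or the processor's built-in OCR):

from transformers import AutoProcessor, UdopForConditionalGeneration

# apply_ocr=False so we can pass our own (here hypothetical) words and boxes
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

prompt = "Question answering. What is the total amount?"  # task prefix + question
words = ["Invoice", "Total:", "$120.00"]                   # hypothetical OCR words
boxes = [[100, 80, 250, 110], [100, 700, 220, 730], [240, 700, 360, 730]]  # boxes normalized to 0-1000

# image is a PIL.Image of the document page
encoding = processor(image, prompt, words, boxes=boxes, return_tensors="pt")
predicted_ids = model.generate(**encoding, max_new_tokens=20)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])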
You can learn more about the model from the resources below (h/t to
@nielsr ), thanks a lot for reading 🤗
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/udop 📚
Checkpoints: microsoft/udop-65e625124aee97415b88b513
Demo notebooks: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP 📕
posted an update 2 months ago
I've tried DoRA (https://arxiv.org/abs/2402.09353) with SDXL using PEFT, and the outputs are quite detailed 🤩🌟
As usual I trained on a LEGO dataset I compiled, and I compared the results with a previously trained pivotal-tuned model and the plain DreamBooth model before that 😊

Notebook by @linoyts https://colab.research.google.com/drive/134mt7bCMKtCYyYzETfEGKXT1J6J50ydT?usp=sharing
Integration to PEFT by @BenjaminB https://github.com/huggingface/peft/pull/1474 (more info in the PR)
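For reference, enabling DoRA in PEFT is just a flag on LoraConfig; here's a minimal sketch (the rank, alpha and target modules are illustrative choices for an SDXL UNet, not the exact training config):

from peft import LoraConfig

# use_dora=True switches the adapter from plain LoRA to DoRA (weight-decomposed LoRA)
unet_lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    use_dora=True,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # SDXL attention projections
)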
replied to vladbogo's post 2 months ago

Thanks a lot for the blog post, it's very informative 🤗

posted an update 2 months ago
There's a new leaderboard for vision language models 🤩
The models are ranked based on ELO, and you can rate the responses to preselected examples or try it with your own input 🤗
WildVision/vision-arena
replied to ivanfioravanti's post 3 months ago
posted an update 3 months ago
Google released a paper on chess that doesn't rely on MCTS (the approach behind AlphaZero) ♟️
Their secret sauce is.. synthetic data pseudo-labeled by the Stockfish engine 😀
2024 really is the year of synthetic data across all domains!
There's a nice discussion here, join us Grandmaster-Level Chess Without Search (2402.04494)
posted an update 3 months ago
replied to victor's post 3 months ago
replied to xianbao's post 3 months ago

What is the limitation that kept you from using your own EVA-CLIP here?

posted an update 3 months ago
EVA-CLIP 🦖 is the CLIP scaled to the moon! 🔥
The new SotA CLIP-like model 🏆
Highlights ✨
- Performs better in linear probing
- Outperforms prior CLIP-like models in zero-shot image-text retrieval
- Higher zero-shot accuracy on IN-1K

As usual, try it with the notebook I built for you https://colab.research.google.com/drive/1K7DdCORC3x4qyhwhuB4fT4wcfJ_BQLKw?usp=sharing#scrollTo=0ZS_lJ7SK6Ys
I also built a Space for you to compare the output probabilities with CLIP; it seems that EVA-CLIP is more "sure" of its results 😊 merve/EVACLIP
The authors have openly shared the 8B checkpoints with an Apache 2.0 license 💜 and they're built on top of transformers, so they're super easy to use! BAAI/EVA-CLIP-8B
Read the paper EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (2402.04252) 📄
posted an update 3 months ago
Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨ 🧶
Before we begin: Depth Anything was recently integrated into 🤗 transformers, and you can use it with three lines of code! ✨
from transformers import pipeline

# image can be a PIL.Image, a local path or a URL
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]

We have also built an app for you to compare different depth estimation models 🐝 🌸 merve/compare_depth_models
Check out Depth Anything in Web by @Xenova Xenova/depth-anything-web

The model's success heavily depends on unlocking the use of unlabeled datasets, although the authors' initial self-training attempt failed.
What the authors have done:
➰ Train a teacher model on labelled dataset
➰ Guide the student using teacher and also use unlabelled datasets pseudolabelled by the teacher
However, this was the cause of the initial failure: since both architectures were similar, the outputs were essentially the same.
So the authors added a more difficult optimization target: the student learns from unlabeled images that went through color jittering, distortions, Gaussian blurring and spatial distortion, so it can learn more invariant representations from them.
The architecture consists of a DINOv2 encoder to extract the features, followed by a DPT decoder. At first, they train the teacher model on labelled images, and then they jointly train the student model while adding in the dataset pseudo-labelled by ViT-L.
Thanks to this, Depth Anything performs very well! I have also benchmarked its inference duration against different models here. I also ran torch.compile benchmarks across them and got nice speed-ups 🚀 https://huggingface2.notion.site/DPT-Benchmarks-1e516b0ba193460e865c47b3a5681efb?pvs=4
replied to gsarti's post 3 months ago

Thanks a lot for sharing these papers!

posted an update 3 months ago
TURNA: the biggest Turkish encoder-decoder model to date, based on the UL2 architecture, with 1.1B params 🐦 😍
The researchers also released models fine-tuned on various downstream tasks including text categorization, NER, summarization and more! 🤯 Great models @onurgu @gokceuludogan @yirmibesogluz @furkanakkurt1618 @uskudarli 👏
Fine-tuned models are in this collection 👉 boun-tabi-LMG/turna-ft-65b3f20aff5235e6cad07c1b
Pre-trained models are in this collection 👉 boun-tabi-LMG/turna-65ad340e5df673eec66e48c7
replied to gsarti's post 3 months ago
replied to philschmid's post 3 months ago

This is so cool, thanks a lot! Added to my reading list :)

replied to akhaliq's post 3 months ago
replied to gsarti's post 3 months ago

Is it the same intuition as catastrophic forgetting?

replied to their post 3 months ago
posted an update 3 months ago
Migrated all my GPU-consuming Spaces to ZeroGPU; it was super easy to do (add three lines of code and voilà!), and the start-up time decreased dramatically as well 💜
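For context, the three lines are roughly: import the spaces package, decorate your GPU-heavy function, and leave the rest of the app unchanged. A minimal sketch (pipe stands for whatever model or pipeline your Space already loads):

import spaces  # available inside ZeroGPU Spaces

@spaces.GPU  # a GPU is attached only while this function runs
def generate(prompt):
    return pipe(prompt).images[0]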
replied to their post 3 months ago
posted an update 3 months ago
Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉
OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first. 📝
OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training. 👀
What's cool is that it can take both image and text queries! This is thanks to the fact that the image and text features aren't fused together.

Taking a look at the architecture, the authors firstly do contrastive pre-training of a vision and a text encoder (just like CLIP). They take that model, remove the final pooling layer and attach a lightweight classification and box detection head and fine-tune.
During fine-tuning for object detection, they calculate the loss over bipartite matches. Simply put, loss is calculated over the predicted objects against ground truth objects and the goal is to find a perfect match of these two sets where each object is matched to one object in ground truth.

OWL-ViT is very scalable. You can easily scale most language models or vision-language models because they require no supervision, but this isn't the case for object detection: you still need weak supervision. Moreover, only scaling the encoders creates a bottleneck after a while.

The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT to generate labels, "self-trained" a new detector on those labels, and then fine-tuned the model on human-annotated data.

Thanks to this, OWLv2 scaled very well and topped leaderboards on open vocabulary object detection 👑
If you'd like to try it out, I will leave a couple of links with apps, notebooks and more in the comments! 🤗
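In the meantime, here's roughly what text-queried detection looks like in transformers (the checkpoint id, labels and threshold are just examples, and image is a PIL image you load yourself):

import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

checkpoint = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(checkpoint)
model = Owlv2ForObjectDetection.from_pretrained(checkpoint)

texts = [["a photo of a cat", "a photo of a remote"]]  # one list of queries per image
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map predictions back to the original image size and keep confident boxes
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][label], round(score.item(), 3), box.tolist())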
posted an update 3 months ago
Posting about a very underrated model that tops paperswithcode across different segmentation benchmarks: OneFormer 👑

OneFormer is a "truly universal" model for semantic, instance and panoptic segmentation tasks ⚔️
What makes it truly universal is that it's a single model, trained only once, that can be used across all three tasks.
The enabler here is text conditioning: the model is given a text query stating the task type along with the appropriate input, and using a contrastive loss, it learns the difference between the task types 👇 (see the image below)

It's also super easy to use with transformers.
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_large")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_large")

# image is a PIL.Image; swap the post-processing and task_inputs for other types of segmentation
semantic_inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
semantic_outputs = model(**semantic_inputs)
predicted_semantic_map = processor.post_process_semantic_segmentation(semantic_outputs, target_sizes=[image.size[::-1]])[0]

I have drafted a notebook for you to try right away ✨ https://colab.research.google.com/drive/1wfJhoTFqUqcTAYAOUc6TXUubBTmOYaVa?usp=sharing
You can also check out the Space without checking out the code itself 👉 shi-labs/OneFormer
replied to isidentical's post 3 months ago
replied to isidentical's post 3 months ago
posted an update 3 months ago
Google's SigLIP is another alternative to OpenAI's CLIP; it just got merged into 🤗 transformers and it's super easy to use!
To celebrate this, I have created a repository including notebooks and a bunch of Spaces for various SigLIP-based projects 🥳
Search for art 👉 merve/draw_to_search_art
Compare SigLIP with CLIP 👉 merve/compare_clip_siglip

How does SigLIP work?
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is largest for matching text-image pairs.
The image below is taken from CLIP, where this contrastive pre-training takes place with a softmax; SigLIP replaces the softmax with a sigmoid. 📎

Highlights from the paper on why you should use it ✨
🖼️📝 The authors used a medium-sized B/16 ViT for the image encoder and a B-sized transformer for the text encoder
😍 More performant than CLIP on zero-shot
🗣️ Authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batches of up to 1M items, but the authors chose 32k because performance saturates beyond that

It's super easy to use thanks to transformers 👇
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-256-i18n")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For all the SigLIP notebooks on similarity search and indexing, you can check this [repository](https://github.com/merveenoyan/siglip) out. 🤗
replied to Tonic's post 3 months ago

Hey @Tonic, you can actually link to your Spaces here and only the slugs will show up, which is a cool formatting feature 😊

posted an update 4 months ago
Sharing a super-fast segmentation model today 💨
SlimSAM is a pruned-distilled version of the SAM model; it's up to 8.6x faster and smaller, yet very powerful! ⚡️
It has the same architecture as SAM, meaning you can use the 🤗 transformers code for SAM with SlimSAM models ⬇️ (yes, only 3 lines of code!)
from transformers import pipeline

# image is a PIL.Image (or a local path/URL)
generator = pipeline(model="nielsr/slimsam-50-uniform", task="mask-generation")
outputs = generator(image)

Lastly, I have built an app for you to compare SlimSAM and SAM outputs
merve/slimsam
replied to julien-c's post 4 months ago

@burkaygur I run away to sunny Izmir when winter arrives in Paris, @julien-c always makes fun of me saying I don't live in France 😂

posted an update 4 months ago
Last month was great for faster/smaller segmentation models, and I wanted to dedicate my first post to compile the recently released SAM variants! 🤗
📚 All models and their demos can be found in this collection 👉🏼 merve/segment-anything-model-6585835fc76915aa14e2bcbd
The ideas behind them are mostly about making the heavy image encoder lighter, either through distillation or by changing the pre-training. 💡
⚡️ MobileSAM: It decouples the heavy image encoder of SAM and distills it into a TinyViT to make SAM smaller. The architecture is the same except for the encoder.
⚡️TinySAM: It distills the whole model with online hard prompt sampling. The authors also quantized it and released Q-TinySAM.
⚡️ EfficientSAM: This model uses masked image pre-training (like ViTMAE, it learns to reconstruct images) to train a lightweight image encoder, combined with a mask decoder.
⚡️ FastSAM: It's a CNN-based model where the problem is modeled as generating all segments at once; at inference everything is segmented in one pass, and then you can prompt with boxes, points or text (this is how it is similar to SAM). So the architecture is nothing like the original SAM itself.
✨ [NEW] SlimSAM: It's a pruned-distilled version of pre-trained SAM. The architecture is the same, so @nielsr recently converted the weights and you can use it with the same API you use for SAM models (see the sketch below). You can find the available checkpoints in the collection.
I hope you liked it!
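And since these variants keep SAM's API, point-prompting works the same way across them; a rough sketch with the SlimSAM checkpoint mentioned above (the point coordinates are arbitrary, and image is a PIL image you load yourself):

import torch
from transformers import SamModel, SamProcessor

checkpoint = "nielsr/slimsam-50-uniform"  # any SAM-compatible checkpoint works the same way
model = SamModel.from_pretrained(checkpoint)
processor = SamProcessor.from_pretrained(checkpoint)

# one (x, y) point prompt on the object of interest
input_points = [[[450, 600]]]
inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize the predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)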