Merve Noyan (merve)

AI & ML interests: Natural language understanding

merve's activity

posted an update 3 days ago
Just landed on the Hugging Face Hub: a community-led computer vision course 📖🤍
Learn everything from the fundamentals to the details of bleeding-edge vision transformers!
posted an update 4 days ago
I have built a Space to compare the outputs of different vision language models. Which model should I add next? 👀
Try them yourself here merve/compare_VLMs
posted an update 8 days ago
replied to xiaotianhan's post 9 days ago

Hiya, are you planning to open-source the models?

posted an update 9 days ago
posted an update 11 days ago
I see you all sending your documents to closed-source APIs, and this is not ok 👎 it breaks my heart 💔
I have seen many open-source document models, and I am amazed by what IDEFICS2 has done with document understanding 🤯🤩 it's not something you've ever seen before! HuggingFaceM4/idefics-8b

Please use it! It has an Apache 2.0 license ❤️
posted an update 12 days ago
The demo for IDEFICS-8B is out! HuggingFaceM4/idefics-8b

This checkpoint is not optimized for chat, but rather works very well for various tasks, including visual question answering and document tasks 💬📑
The chatty one is coming soon!
posted an update about 1 month ago
SegGPT is a vision generalist for image segmentation, quite like GPT for computer vision ✨
It comes with the latest release of transformers 🎁 Demo and more in this post!
SegGPT is an extension of Painter, where you speak to images with images: the model takes an image prompt, a transformed version of that prompt, and the actual image you want the same transform applied to, and it is expected to output the transformed image.
SegGPT consists of a vanilla ViT with a decoder on top (linear, conv, linear).
The model is trained on diverse segmentation examples: the authors provide example image-mask pairs and the actual input to be segmented, and the decoder head learns to reconstruct the mask output.
This generalizes pretty well!
The authors do not claim state-of-the-art results, as the model is mainly used for zero-shot and few-shot inference. They also do prompt tuning, where they freeze the parameters of the model and only optimize the image tensor (the input context).
Thanks to 🤗 transformers you can use this model easily!
See here https://huggingface.co/docs/transformers/en/model_doc/seggpt
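Here is a minimal sketch under a few assumptions: prompt_image, prompt_mask and input_image are PIL images you loaded yourself, the prompt mask has a single foreground class, and the checkpoint id follows the docs. Treat it as a sketch, not an official snippet.

import torch
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

# prompt_image + prompt_mask show the model an example transform,
# input_image is the image you want segmented the same way
inputs = processor(images=input_image, prompt_images=prompt_image, prompt_masks=prompt_mask, num_labels=1, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize the predicted mask back to the original resolution
mask = processor.post_process_semantic_segmentation(outputs, target_sizes=[input_image.size[::-1]], num_labels=1)[0]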
I have built an app for you to try it out. I combined SegGPT with the Depth Anything model, so you don't have to upload the mask of your image-mask prompt pair yourself 🤗
Try it here merve/seggpt-depth-anything
Also check out the collection merve/seggpt-660466a303bc3cd7559d271b
replied to davanstrien's post about 1 month ago

I think it would be nice if the tl;dr included what the data looks like, how it was curated, the license, what type of model it can be trained with, and so on. It would be very useful for me 🤩

posted an update about 1 month ago
LLaVA-NeXT was recently merged into Hugging Face transformers, and it outperforms many closed-source models like Gemini on various benchmarks 🤩 Let's take a look!
Demo: merve/llava-next
Notebook: https://colab.research.google.com/drive/1afNudu72SNWZCYtCVrRlb9T9Vj9CFJEK?usp=sharing
LLaVA is essentially a vision-language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
LLaVA 1.5 was released with Vicuna, but LLaVA-NeXT (1.6) is released with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
Mistral and Nous-Hermes-Yi-34B perform better and are more permissive for commercial use.
Moreover, according to the authors' findings, the improvements come from a more diverse, higher-quality data mixture and dynamic high resolution.
LLaVA-NeXT based on Nous-Hermes-Yi-34B outperforms many other models, including Gemini, on various multimodal understanding and generation benchmarks 😊
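If you want to try it from transformers directly, here is a rough sketch with the Mistral-7B conversion (the checkpoint id and the [INST] prompt template are the ones used by the llava-hf conversions, and image is a PIL image you load yourself):

import torch
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

checkpoint = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(checkpoint)
model = LlavaNextForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto")

# [INST] ... [/INST] is the Mistral chat format; <image> marks where the image goes
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))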
replied to vikhyatk's post about 1 month ago

I really like your work, and I did check the moondream GitHub repository. I was wondering if you'd like to share your training details and findings on aligning the text decoder, the vision encoder and the projection layer.

posted an update about 2 months ago
I love vision language models 💗
My favorite is KOSMOS-2, because it's a grounded model (it doesn't hallucinate).
In this demo you can,
- ask a question about the image,
- do detailed/brief captioning,
- localize the objects! 🤯
It's just amazing for a VLM to return bounding boxes 🤩
Try it here merve/kosmos2
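If you'd rather call it from transformers than the demo, a minimal grounded-captioning sketch looks roughly like this (the checkpoint id is the KOSMOS-2 release, and image is a PIL image you load yourself):

from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

checkpoint = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(checkpoint)
model = Kosmos2ForConditionalGeneration.from_pretrained(checkpoint)

# the <grounding> token asks the model to tie phrases to image regions
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# split the caption from the entities and their bounding boxes
caption, entities = processor.post_process_generation(generated_text)
print(caption, entities)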
replied to akhaliq's post about 2 months ago

one of the best research questions I've seen recently 😊

posted an update about 2 months ago
New foundation model for document understanding and generation in transformers 🤩
UDOP by MSFT is a bleeding-edge model that is capable of many tasks, including question answering, document editing and more! 🤯
Demo 👉 merve/UDOP
It is a model that combines vision, text and layout. 📝
This model is very interesting because the input representation truly captures the nature of the document modality: the text, where the text is, and the layout of the document all matter!
If you know T5, it resembles that: it's pre-trained on both self-supervised and supervised objectives over text, image and layout.
To switch between tasks, one simply changes the task-specific prompt at the beginning; e.g. for QA, one prepends the input with "Question answering."
As for the architecture, it's like T5, except it has a single encoder that takes in text, image and layout, and two decoders (text-layout and vision decoders) combined into one.
The vision decoder is a masked autoencoder (thus the capabilities of document editing).
For me, the most interesting capabilities are document reconstruction, document editing and layout re-arrangement. This decoder isn't released, though, because it could be used maliciously to forge documents.
Overall, the model performs very well on the document understanding benchmark (DUE), as well as on information extraction (FUNSD, CORD) and classification (RVL-CDIP), across vision, text and layout modalities.
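As a rough illustration of the prompt-based interface (not an official snippet: the checkpoint, question, words and boxes below are made up, and in practice the words and boxes would come from an OCR engine or the processor's built-in OCR):

from transformers import AutoProcessor, UdopForConditionalGeneration

# apply_ocr=False so we can pass our own (here hypothetical) words and boxes
processor = AutoProcessor.from_pretrained("microsoft/udop-large", apply_ocr=False)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

prompt = "Question answering. What is the total amount?"  # task prefix + question
words = ["Invoice", "Total:", "$120.00"]                   # hypothetical OCR words
boxes = [[100, 80, 250, 110], [100, 700, 220, 730], [240, 700, 360, 730]]  # boxes normalized to 0-1000

# image is a PIL.Image of the document page
encoding = processor(image, prompt, words, boxes=boxes, return_tensors="pt")
predicted_ids = model.generate(**encoding, max_new_tokens=20)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])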
You can learn more about the model from the resources below (h/t to
@nielsr ), thanks a lot for reading 🤗
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/udop 📚
Checkpoints: microsoft/udop-65e625124aee97415b88b513
Demo notebooks: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP 📕
posted an update 2 months ago
I've tried DoRA (https://arxiv.org/abs/2402.09353) with SDXL using PEFT, and the outputs are quite detailed 🤩🌟
As usual I trained on a LEGO dataset I compiled, and I compared the results with a previously trained pivotal-tuned model and the plain DreamBooth model before that 😊

Notebook by @linoyts https://colab.research.google.com/drive/134mt7bCMKtCYyYzETfEGKXT1J6J50ydT?usp=sharing
Integration to PEFT by @BenjaminB https://github.com/huggingface/peft/pull/1474 (more info in the PR)
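For reference, enabling DoRA in PEFT is just a flag on LoraConfig; here's a minimal sketch (the rank, alpha and target modules are illustrative choices for an SDXL UNet, not the exact training config):

from peft import LoraConfig

# use_dora=True switches the adapter from plain LoRA to DoRA (weight-decomposed LoRA)
unet_lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    use_dora=True,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # SDXL attention projections
)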
replied to vladbogo's post 2 months ago

Thanks a lot for the blog post, it's very informative 🤗

posted an update 2 months ago
There's a new leaderboard for vision language models 🤩
The models are ranked based on ELO, and you can rate the responses to preselected examples or try it with your own input 🤗
WildVision/vision-arena
replied to ivanfioravanti's post 3 months ago
posted an update 3 months ago
Google released a paper on chess that doesn't rely on MCTS (the approach behind AlphaZero) ♟️
Their secret sauce is.. synthetic data pseudo-labeled by the Stockfish engine 😀
2024 really is the year of synthetic data across all domains!
There's a nice discussion here, join us Grandmaster-Level Chess Without Search (2402.04494)
posted an update 3 months ago
replied to victor's post 3 months ago
replied to xianbao's post 3 months ago

What is the limitation that kept you from using your own EVA-CLIP here?

posted an update 3 months ago
EVA-CLIP 🦖 is the CLIP scaled to the moon! 🔥
The new SotA CLIP-like model 🏆
Highlights ✨
- Performs better in linear probing
- Outperforms prior CLIP-like models in zero-shot image-text retrieval
- Higher zero-shot accuracy on IN-1K

As usual, try it with the notebook I built for you https://colab.research.google.com/drive/1K7DdCORC3x4qyhwhuB4fT4wcfJ_BQLKw?usp=sharing#scrollTo=0ZS_lJ7SK6Ys
I also built a Space for you to compare the output probabilities with CLIP; it seems that EVA-CLIP is more "sure" of its results 😊 merve/EVACLIP
The authors have openly shared the 8B checkpoints with an Apache 2.0 license 💜 and they're built on top of transformers, so they're super easy to use! BAAI/EVA-CLIP-8B
Read the paper EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters (2402.04252) 📄
posted an update 3 months ago
Explaining a new state-of-the-art monocular depth estimation model: Depth Anything ✨ 🧶
Before we begin: Depth Anything was recently integrated into 🤗 transformers, and you can use it with three lines of code! ✨
from transformers import pipeline

# image can be a PIL.Image, a local path or a URL
pipe = pipeline(task="depth-estimation", model="LiheYoung/depth-anything-small-hf")
depth = pipe(image)["depth"]

We have also built an app for you to compare different depth estimation models 🐝 🌸 merve/compare_depth_models
Check out Depth Anything in Web by @Xenova Xenova/depth-anything-web

The model's success heavily depends on unlocking the use of unlabeled datasets, although the authors' initial self-training attempt failed.
What the authors have done:
➰ Train a teacher model on labelled dataset
➰ Guide the student using teacher and also use unlabelled datasets pseudolabelled by the teacher
However, this was the cause of the initial failure: since both architectures were similar, the outputs were essentially the same.
So the authors added a more difficult optimization target: the student learns from unlabeled images that went through color jittering, distortions, Gaussian blurring and spatial distortion, so it can learn more invariant representations from them.
The architecture consists of a DINOv2 encoder to extract the features, followed by a DPT decoder. At first, they train the teacher model on labelled images, and then they jointly train the student model while adding in the dataset pseudo-labelled by ViT-L.
Thanks to this, Depth Anything performs very well! I have also benchmarked its inference duration against different models here. I also ran torch.compile benchmarks across them and got nice speed-ups 🚀 https://huggingface2.notion.site/DPT-Benchmarks-1e516b0ba193460e865c47b3a5681efb?pvs=4
replied to gsarti's post 3 months ago

Thanks a lot for sharing these papers!

posted an update 3 months ago
TURNA: the biggest Turkish encoder-decoder model to date, based on the UL2 architecture, with 1.1B params 🐦 😍
The researchers also released models fine-tuned on various downstream tasks including text categorization, NER, summarization and more! 🤯 Great models @onurgu @gokceuludogan @yirmibesogluz @furkanakkurt1618 @uskudarli 👏
Fine-tuned models are in this collection 👉 boun-tabi-LMG/turna-ft-65b3f20aff5235e6cad07c1b
Pre-trained models are in this collection 👉 boun-tabi-LMG/turna-65ad340e5df673eec66e48c7
replied to gsarti's post 3 months ago
replied to philschmid's post 3 months ago

This is so cool, thanks a lot! Added to my reading list :)

replied to akhaliq's post 3 months ago
replied to gsarti's post 3 months ago

Is it the same intuition as catastrophic forgetting?

replied to their post 3 months ago
posted an update 3 months ago
Migrated all my GPU-consuming Spaces to ZeroGPU; it was super easy to do (add three lines of code and voilà!), and the start-up time decreased dramatically as well 💜
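For context, the three lines are roughly: import the spaces package, decorate your GPU-heavy function, and leave the rest of the app unchanged. A minimal sketch (pipe stands for whatever model or pipeline your Space already loads):

import spaces  # available inside ZeroGPU Spaces

@spaces.GPU  # a GPU is attached only while this function runs
def generate(prompt):
    return pipe(prompt).images[0]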
replied to their post 3 months ago
posted an update 3 months ago
Explaining the 👑 of zero-shot open-vocabulary object detection: OWLv2 🦉
OWLv2 is a scaled-up version of a model called OWL-ViT, so let's take a look at that first. 📝
OWL-ViT is an open-vocabulary object detector, meaning it can detect objects it didn't explicitly see during training. 👀
What's cool is that it can take both image and text queries! This is thanks to the fact that the image and text features aren't fused together.

Taking a look at the architecture, the authors firstly do contrastive pre-training of a vision and a text encoder (just like CLIP). They take that model, remove the final pooling layer and attach a lightweight classification and box detection head and fine-tune.
During fine-tuning for object detection, they calculate the loss over bipartite matches. Simply put, loss is calculated over the predicted objects against ground truth objects and the goal is to find a perfect match of these two sets where each object is matched to one object in ground truth.

OWL-ViT is very scalable. You can easily scale most language models or vision-language models because they require no supervision, but this isn't the case for object detection: you still need weak supervision. Moreover, only scaling the encoders creates a bottleneck after a while.

The authors wanted to scale OWL-ViT with more data, so they used OWL-ViT to generate labels, "self-trained" a new detector on those labels, and then fine-tuned the model on human-annotated data.

Thanks to this, OWLv2 scaled very well and topped leaderboards on open vocabulary object detection 👑
If you'd like to try it out, I will leave a couple of links with apps, notebooks and more in the comments! 🤗
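In the meantime, here's roughly what text-queried detection looks like in transformers (the checkpoint id, labels and threshold are just examples, and image is a PIL image you load yourself):

import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

checkpoint = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(checkpoint)
model = Owlv2ForObjectDetection.from_pretrained(checkpoint)

texts = [["a photo of a cat", "a photo of a remote"]]  # one list of queries per image
inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map predictions back to the original image size and keep confident boxes
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(texts[0][label], round(score.item(), 3), box.tolist())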
posted an update 3 months ago
Posting about a very underrated model that tops paperswithcode across different segmentation benchmarks: OneFormer 👑

OneFormer is a "truly universal" model for semantic, instance and panoptic segmentation tasks ⚔️
What makes it truly universal is that it's a single model, trained only once, that can be used across all three tasks.
The enabler here is text conditioning: the model is given a text query stating the task type along with the appropriate input, and using a contrastive loss, it learns the difference between the task types 👇 (see the image below)

It's also super easy to use with transformers.
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_large")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_large")

# image is a PIL.Image; swap the post-processing and task_inputs for other types of segmentation
semantic_inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
semantic_outputs = model(**semantic_inputs)
predicted_semantic_map = processor.post_process_semantic_segmentation(semantic_outputs, target_sizes=[image.size[::-1]])[0]

I have drafted a notebook for you to try right away ✨ https://colab.research.google.com/drive/1wfJhoTFqUqcTAYAOUc6TXUubBTmOYaVa?usp=sharing
You can also check out the Space without checking out the code itself 👉 shi-labs/OneFormer
replied to isidentical's post 3 months ago
replied to isidentical's post 3 months ago
posted an update 3 months ago
Google's SigLIP is another alternative to OpenAI's CLIP; it just got merged into 🤗 transformers and it's super easy to use!
To celebrate this, I have created a repository including notebooks and a bunch of Spaces for various SigLIP-based projects 🥳
Search for art 👉 merve/draw_to_search_art
Compare SigLIP with CLIP 👉 merve/compare_clip_siglip

How does SigLIP work?
SigLIP is a vision-text pre-training technique based on contrastive learning. It jointly trains an image encoder and a text encoder such that the dot product of the embeddings is largest for matching text-image pairs.
The image below is taken from CLIP, where this contrastive pre-training takes place with a softmax; SigLIP replaces the softmax with a sigmoid. 📎

Highlights from the paper on why you should use it ✨
🖼️📝 The authors used a medium-sized B/16 ViT for the image encoder and a B-sized transformer for the text encoder
😍 More performant than CLIP on zero-shot
🗣️ Authors trained a multilingual model too!
⚡️ Super efficient: the sigmoid loss enables batches of up to 1M items, but the authors chose 32k because performance saturates beyond that

It's super easy to use thanks to transformers 👇
from transformers import pipeline
from PIL import Image
import requests

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-base-patch16-256-i18n")

# load image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# inference
outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

For all the SigLIP notebooks on similarity search and indexing, you can check this [repository](https://github.com/merveenoyan/siglip) out. 🤗
replied to Tonic's post 3 months ago

Hey @Tonic, you can actually link to your Spaces here and only the slugs will show up, which is a cool formatting feature 😊

posted an update 4 months ago
Sharing a super-fast segmentation model today 💨
SlimSAM is a pruned-distilled version of the SAM model; it's up to 8.6x faster and smaller, yet very powerful! ⚡️
It has the same architecture as SAM, meaning you can use the 🤗 transformers code for SAM with SlimSAM models ⬇️ (yes, only 3 lines of code!)
from transformers import pipeline

# image is a PIL.Image (or a local path/URL)
generator = pipeline(model="nielsr/slimsam-50-uniform", task="mask-generation")
outputs = generator(image)

Lastly, I have built an app for you to compare SlimSAM and SAM outputs
merve/slimsam
replied to julien-c's post 4 months ago

@burkaygur I run away to sunny Izmir when winter arrives in Paris, @julien-c always makes fun of me saying I don't live in France 😂

posted an update 4 months ago
Last month was great for faster/smaller segmentation models, and I wanted to dedicate my first post to compile the recently released SAM variants! 🤗
📚 All models and their demos can be found in this collection 👉🏼 merve/segment-anything-model-6585835fc76915aa14e2bcbd
The ideas behind them are mostly about making the heavy image encoder lighter, either through distillation or by changing the pre-training. 💡
⚡️ MobileSAM: It decouples the heavy image encoder of SAM and distills it into a TinyViT to make SAM smaller. The architecture is the same except for the encoder.
⚡️TinySAM: It distills the whole model with online hard prompt sampling. The authors also quantized it and released Q-TinySAM.
⚡️ EfficientSAM: This model uses masked image pre-training (like ViTMAE, it learns to reconstruct images) to train a lightweight image encoder, combined with a mask decoder.
⚡️ FastSAM: It's a CNN-based model where the problem is modeled as generating all segments at once; at inference everything is segmented in one pass, and then you can prompt with boxes, points or text (this is how it is similar to SAM). So the architecture is nothing like the original SAM itself.
✨ [NEW] SlimSAM: It's a pruned-distilled version of pre-trained SAM. The architecture is the same, so @nielsr recently converted the weights and you can use it with the same API you use for SAM models (see the sketch below). You can find the available checkpoints in the collection.
I hope you liked it!
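And since these variants keep SAM's API, point-prompting works the same way across them; a rough sketch with the SlimSAM checkpoint mentioned above (the point coordinates are arbitrary, and image is a PIL image you load yourself):

import torch
from transformers import SamModel, SamProcessor

checkpoint = "nielsr/slimsam-50-uniform"  # any SAM-compatible checkpoint works the same way
model = SamModel.from_pretrained(checkpoint)
processor = SamProcessor.from_pretrained(checkpoint)

# one (x, y) point prompt on the object of interest
input_points = [[[450, 600]]]
inputs = processor(image, input_points=input_points, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# resize the predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)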