Merve Noyan (merve)

AI & ML interests: VLMs, vision & co

merve's activity

posted an update 1 day ago
OWLSAM2: text-promptable SAM2 🦉 merve/OWLSAM2

Marrying cutting-edge zero-shot object detector OWLv2 🤝 mask generator SAM2 (small checkpoint)
Zero-shot segmentation with insane precision

I also uploaded all models with usage snippets and made a collection of SAM2 models and demos merve/sam2-66ac9deac6fca3bc5482fe30
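Here's a minimal sketch of the chaining idea (not the exact Space code): OWLv2's text-prompted boxes become box prompts for the mask generator. It uses the original SAM checkpoint from transformers for illustration, and the image URL and label are placeholders; the SAM2 checkpoints in the collection are box-prompted the same way.

```python
# Sketch of the OWLv2 -> SAM chaining idea (not the exact Space code);
# the image URL is a placeholder and the SAM checkpoint stands in for SAM2.
import requests, torch
from PIL import Image
from transformers import pipeline, SamModel, SamProcessor

image = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)

# 1. Text-promptable zero-shot detection with OWLv2
detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
detections = detector(image, candidate_labels=["a cat"])

# 2. Use the detected boxes as box prompts for the mask generator
sam = SamModel.from_pretrained("facebook/sam-vit-base")
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
boxes = [[[d["box"]["xmin"], d["box"]["ymin"], d["box"]["xmax"], d["box"]["ymax"]] for d in detections]]
inputs = sam_processor(image, input_boxes=boxes, return_tensors="pt")
with torch.no_grad():
    outputs = sam(**inputs)
masks = sam_processor.image_processor.post_process_masks(
    outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
)
```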
posted an update 8 days ago
At Hugging Face we have an open-source Cookbook with many applied AI recipes 📖
Here are some of the latest recipes contributed 👇

- "Information Extraction with Haystack and NuExtract": Use Haystack and transformers to build structured data extraction pipelines using LLMs by @anakin87 https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract

- "Build RAG with Hugging Face and Milvus": Learn how to use Milvus with sentence transformers to build RAG pipelines https://huggingface.co/learn/cookbook/rag_with_hf_and_milvus

- "Code Search with Vector Embeddings and Qdrant": Search a codebase by building a retrieval pipeline using Qdrant and sentence transformers https://huggingface.co/learn/cookbook/code_search

- Data analyst agent: get your data’s insights in the blink of an eye ✨: great recipe by our own @m-ric showing how to build an agent that can do data analysis! 😱 https://huggingface.co/learn/cookbook/agent_data_analyst
replied to their post 9 days ago

I think it's not about the Space; it's the model output, and the Space can't do anything about it. Maybe try another VLM that was fine-tuned for this type of task? Maybe google/paligemma-3b-mix-224

posted an update 9 days ago
We have recently merged Video-LLaVA into transformers! 🤗🎞️
What makes this model different?

Demo: llava-hf/video-llava
Model: LanguageBind/Video-LLaVA-7B-hf

Compared to other models that take image and video input and either project them separately or downsample the video and project selected frames, Video-LLaVA converts images and videos into a unified representation and projects them through a shared projection layer.

It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP; these encoders map the modalities into a unified representation before passing them to the projection layer.


I feel like one of the coolest features of this model is joint understanding, which many recent models have also introduced.

It's a relatively older model, but it was ahead of its time and works very well! This means you can, for example, pass the model an image of a cat and a video of a cat and ask whether the cat in the image also appears in the video 🤩
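To make the joint prompting concrete, here's a minimal sketch with the transformers API (the placeholder image, placeholder video frames and the question are illustrative, not from the original post):

```python
# Sketch of joint image + video prompting with Video-LLaVA in transformers.
# The image and frames below are dummy placeholders; in practice you'd load a
# real image and sample ~8 frames from a video (e.g. with PyAV or decord).
import numpy as np
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.new("RGB", (336, 336))                  # placeholder image
frames = np.zeros((8, 336, 336, 3), dtype=np.uint8)   # placeholder video frames

prompt = "USER: <image>\n<video>\nIs the cat in the image also in the video? ASSISTANT:"
inputs = processor(text=prompt, images=image, videos=frames, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```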
replied to their post 12 days ago

It is a vision language model; these models use a text decoder (here it's built on Llama 2, another model from Meta) as one of their components. VLMs differ substantially from LLMs; if you read the post above you can understand the difference.

posted an update 15 days ago
Chameleon 🦎 by Meta is now available in Hugging Face transformers 😍
A vision language model that comes in 7B and 34B sizes 🤩
But what makes this model so special?

Demo: merve/chameleon-7b
Models: facebook/chameleon-668da9663f80d483b4c61f58

keep reading 👇

Chameleon is a unique model: it attempts to scale early fusion 🤨
But what is early fusion?
Modern vision language models use a vision encoder with a projection layer that projects image embeddings so they can be used as prompts for a text decoder (LLM).

Early fusion, on the other hand, attempts to fuse all features together (image patches and text) by using an image tokenizer; all tokens are projected into a shared space, which enables seamless generation 😏

The authors also introduced architectural improvements (QK norm and revised placement of layer norms) for scalable and stable training, and they were able to increase the token count (5x the tokens compared to Llama 3, which is a must with early fusion IMO).

This model is an any-to-any model thanks to early fusion: it can take image and text input and output image and text, but image generation is disabled to prevent malicious use.

One can also do text-only prompting: the authors note the model catches up with larger LLMs (like Mixtral 8x7B or the larger Llama-2 70B), and image-pair prompting is competitive with larger VLMs like IDEFICS2-80B (see the paper for the benchmarks: Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818)).
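For reference, a minimal interleaved multi-image prompting sketch with the transformers integration (the placeholder images and the prompt are illustrative, not from the original post):

```python
# Sketch of interleaved multi-image prompting with Chameleon in transformers;
# the two blank images are placeholders for, e.g., two photos you want to compare.
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image_a, image_b = Image.new("RGB", (512, 512)), Image.new("RGB", (512, 512))
prompt = "What do these two images have in common?<image><image>"

inputs = processor(text=prompt, images=[image_a, image_b], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```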
Thanks for reading!
posted an update 24 days ago
Forget any document retrievers, use ColPali 💥💥

Document retrieval is traditionally done through OCR + layout detection, but you lose a lot of information along the way. Stop doing that! 🤓

ColPali uses a vision language model, which is better at document understanding 📑
ColPali: vidore/colpali (mit license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard

ColPali marries the idea of modern vision language models with retrieval 🤝

The authors apply contrastive fine-tuning to SigLIP on documents and pool the outputs (they call it BiSigLIP). They then feed the patch embedding outputs to PaliGemma to create BiPali 🖇️
BiPali natively feeds image patch embeddings to an LLM, which enables ColBERT-like late-interaction computations between text tokens and image patches (hence the name ColPali!) 🤩
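To make the late-interaction step concrete, here's an illustrative sketch of ColBERT-style MaxSim scoring, with random tensors standing in for ColPali's actual query-token and page-patch embeddings:

```python
# Illustrative ColBERT-style late-interaction (MaxSim) scoring; random tensors
# stand in for ColPali's query-token and document-patch embeddings.
import torch

num_query_tokens, num_patches, dim = 12, 1024, 128
query_emb = torch.nn.functional.normalize(torch.randn(num_query_tokens, dim), dim=-1)
page_emb = torch.nn.functional.normalize(torch.randn(num_patches, dim), dim=-1)

# For each query token, take its best-matching patch, then sum over query tokens.
sim = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
score = sim.max(dim=-1).values.sum()  # late-interaction relevance score for this page
print(score)
```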

The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude 3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, it is also way faster than traditional PDF parsers!
posted an update about 1 month ago
Real-time DEtection Transformer (RT-DETR) landed in transformers 🤩 with Apache 2.0 license 😍

🔖 models: https://huggingface.co/PekingU
🔖 demo: merve/RT-DETR-tracking-coco
📝 paper: DETRs Beat YOLOs on Real-time Object Detection (2304.08069)
📖 notebook: https://github.com/merveenoyan/example_notebooks/blob/main/RT_DETR_Notebook.ipynb

YOLO models are known to be super fast for real-time computer vision, but their reliance on NMS post-processing makes their latency volatile 🥲

Transformer-based models, on the other hand, are not as computationally efficient 🥲

Isn't there something in between? Enter RT-DETR!

The authors combine a CNN backbone and a multi-stage hybrid encoder (mixing convolutions and attention) with a transformer decoder. In the paper, the authors also claim one can adjust the speed by changing the number of decoder layers without retraining altogether.
The authors find that the model outperforms the previous state of the art in both speed and accuracy. 🤩
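Here's a minimal inference sketch with the transformers classes (the image URL and threshold are placeholders):

```python
# Minimal RT-DETR inference sketch with transformers; the image URL is a placeholder.
import requests, torch
from PIL import Image
from transformers import RTDetrImageProcessor, RTDetrForObjectDetection

image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)

processor = RTDetrImageProcessor.from_pretrained("PekingU/rtdetr_r50vd")
model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in the original image coordinates
results = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```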
posted an update about 1 month ago
Fine-tune Florence-2 on any task 🔥

Today we release a notebook and a walkthrough blog on fine-tuning Florence-2 on the DocVQA dataset, with @andito @SkalskiP

Blog: https://huggingface.co/blog 📕
Notebook: https://colab.research.google.com/drive/1hKDrJ5AH_o7I95PtZ9__VlCTNAo1Gjpf?usp=sharing 📖
Florence-2 is a great vision language model thanks to its massive dataset and small size!

This model requires conditioning through task prefixes, and it's not as generalist: a new task such as DocVQA requires fine-tuning 📝

We fine-tuned the model on an A100 (one can also use a smaller GPU with a smaller batch size) and saw that the model picks up new tasks 🥹

See below how it looks before and after fine-tuning 🤩
Play with the demo here andito/Florence-2-DocVQA 🏄‍♀️
posted an update about 1 month ago
EPFL and Apple (at @EPFL-VILAB ) just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input and outputs image and text 🤩

Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)

This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data: input and output tokens are decoded to produce bounding boxes, generated image pixels, captions and more!

This model also learned to generate Canny edge maps, SAM edges and other control signals for steerable text-to-image generation 🖼️

The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️
posted an update about 1 month ago
Florence-2 is a new vision foundation model capable of a wide variety of tasks 🤯
Demo 👉🏻 gokaygokay/Florence-2
Collection 👉🏻 microsoft/florence-6669f44df0d87d9c3bfb76de

This model can handle tasks that vary from OCR to semantic segmentation.

What sets it apart from previous models is its dataset: the authors compiled 126M images with 5.4B annotations, labelled with their own data engine and pseudolabelled by smaller specialized models and APIs.

The model has a similar architecture to previous models: an image encoder and a multimodal encoder with a text decoder. The authors compiled the multitask dataset with prompts for each task.

You can also fine-tune this model on any task of your choice. The authors report results on downstream tasks with the vision encoder both frozen and unfrozen 🤓📉
They have released fine-tuned models too; you can find them in the collection above 🤗
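For reference, a minimal sketch of task-prefix prompting through the remote-code integration; the task prefix and image URL are illustrative, and the prompt/post-processing follow the model card:

```python
# Sketch of task-prefix conditioned inference with Florence-2 (remote code);
# the image URL is a placeholder and the task prefix is one of the documented ones.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/doc.png", stream=True).raw)
task = "<OD>"  # swap the prefix to switch tasks, e.g. "<OCR>" or "<CAPTION>"

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"],
    max_new_tokens=1024, num_beams=3
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(generated_text, task=task, image_size=image.size))
```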
posted an update about 2 months ago
Forget about all the captioning datasets you've tried before!

PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and more detailed captions ✨
tomg-group-umd/pixelprose

The existing suite of captioning datasets consists of web scrapes with alt text that is either irrelevant or not descriptive. The authors of this paper took those datasets, filtered them for CSAM, and passed the images with a prompt to Gemini Vision Pro. They also removed PII and detoxified the resulting dataset.
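If you want to peek at the data without downloading all 16M pairs, a minimal streaming sketch (column names may differ, so inspect the first example):

```python
# Stream a few examples from PixelProse without downloading the whole dataset;
# check the keys of the first sample, since the exact column names may differ.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/pixelprose", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example.keys())  # e.g. an image/URL field and a caption field
    if i == 2:
        break
```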
posted an update about 2 months ago
I love Depth Anything V2 😍
It's Depth Anything, but scaled up with both a larger teacher model and a gigantic dataset!

Here's a small TL;DR of the paper, which has a lot of findings, experiments and more.
I have also created a collection that has the models, the dataset, the demo and CoreML converted model 😚 merve/depth-anything-v2-release-6671902e798cd404513ffbf5

The authors analyzed Marigold, a diffusion-based model, against Depth Anything and found out what's up with using synthetic vs. real images for monocular depth estimation (MDE):

🔖 Real data has a lot of label noise and inaccurate depth maps (caused by depth sensors missing transparent objects, etc.), and many details are overlooked

🔖 Synthetic data has more precise and detailed depth labels that are truly ground truth, but there's a distribution shift between real and synthetic images, and it has restricted scene coverage

The authors train different image encoders only on synthetic images and find that unless the encoder is very large, the model can't generalize well (though large models generalize inherently anyway) 🧐
Even then, the models still fail on real images that have a wide distribution in labels (e.g. diverse instances of objects) 🥲

The Depth Anything V2 framework is to:

🦖 Train a DINOv2-G-based teacher model on 595K synthetic images
🏷️ Label 62M real images using the teacher model
🦕 Train a student model using the real images labelled by the teacher
Result: 10x faster and more accurate than Marigold!

The authors also construct a new benchmark called DA-2K that is less noisy, highly detailed and more diverse!
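A minimal inference sketch via the transformers depth-estimation pipeline; the small transformers-converted checkpoint name and the image URL are assumptions, so check the collection above for the exact repo ids:

```python
# Minimal Depth Anything V2 inference sketch via the depth-estimation pipeline.
# The checkpoint name assumes the transformers-converted small model; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import pipeline

pipe = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open(requests.get("https://example.com/room.jpg", stream=True).raw)

result = pipe(image)
result["depth"].save("depth.png")  # PIL image with the predicted depth map
```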
posted an update about 2 months ago
Finally @CVPR2024 is here! 🩷
Have you claimed your papers and linked your models/datasets/demos?
This will increase the visibility and impact of your paper 💫

To index your papers, go here
CVPR2024/CVPR2024-papers
Find your paper, click on the paper page link, index the paper, then click on your name (the workflow is shown below 👇🏻)
If you'd like to add links to your paper, go here: CVPR2024/update-CVPR2024-papers
Log in, find your paper's ID, retrieve the paper, fill in the info and submit!
posted an update about 2 months ago
releasing: smol vision 🌼

A repository with notebooks on shrinking, optimizing, speeding-up, customizing large vision models! https://github.com/merveenoyan/smol-vision
replied to Tonic's post about 2 months ago

thank you for all you do for good open-source <3

posted an update about 2 months ago
THUDM has released GLM-4V-9B and it's... chatty! 😂
I asked it to describe my favorite Howl's Moving Castle scene and here's how it went 👇🏻

Jokes aside, it seems to outperform the previous VLMs. However, the license isn't open source 📈
model repo: THUDM/glm-4v-9b
a community member has built a demo: vilarin/VL-Chatbox
posted an update about 2 months ago
A great vision language benchmark: MM-UPD evaluates how models respond to unsolvable problems 🤓
LLaVA 1.6 is outperforming proprietary VLMs, making it a very robust choice for production!

It is now hosted as a leaderboard MM-UPD/MM-UPD_Leaderboard 🏆💕
replied to their post 2 months ago

Hello @anothercoder2, interesting. Can you see the files through the CLI, though? Is this your local setup? I think you need to find the correct path inside /downloads and pass that to load_from_disk. Because many datasets are cached in the same folder, it needs the exact path (which is often a folder under ~/.cache/huggingface/datasets/downloads with a unique ID assigned).
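Roughly something like this illustrative sketch (the paths are placeholders):

```python
# Illustrative sketch of the suggestion above: list the uniquely-named cache
# folders, then pass the exact folder for your dataset to load_from_disk.
import os
from datasets import load_from_disk

cache_root = os.path.expanduser("~/.cache/huggingface/datasets/downloads")
print(os.listdir(cache_root))  # find the folder that belongs to your dataset

dataset = load_from_disk(os.path.join(cache_root, "<your-dataset-folder>"))
```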

posted an update 2 months ago
Do we fully leverage ViT encoders in vision language models?

A new paper (by @HuanjinYao et al) built a dense connector that does it better! HuanjinYao/DenseConnector-v1.5-8B
HuanjinYao/denseconnector-66500e173fc8c9f05dc98dea

VLMs consist of an image encoder, a projection layer that maps image embeddings to the text embedding space, and a text decoder, connected sequentially 📖
This paper explores using intermediate states of the image encoder rather than a single output 🤩
The authors explore three different ways of instantiating the dense connector: sparse token integration, sparse channel integration and dense channel integration (see the paper for how they do it: Dense Connector for MLLMs (2405.13800)).

They integrate all three into LLaVA 1.5 and find that each of the new models is superior to the original LLaVA 1.5 🥹 I tried the model and it seems to work very well. As part of the release, the authors have published various checkpoints based on different decoders (Vicuna 7/13B and Llama 3 8B) that you can find in the collection 🤗

replied to their post 2 months ago

You can use Colab's instances to do QLoRA fine-tuning, and then for the Space we will give a ZeroGPU grant :)

posted an update 2 months ago
We will be providing ZeroGPU grants (for Spaces inference) to those who want to fine-tune PaliGemma and build a Space 🔥

You can pick any dataset of your choice!

Example code: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing (you can use a smaller GPU with QLoRA)

Datasets:
https://huggingface.co/datasets?task_categories=task_categories:text-to-image&sort=trending
https://huggingface.co/datasets?task_categories=task_categories:image-to-text&sort=trending
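For reference, a rough QLoRA setup sketch with transformers + PEFT; the rank, target modules and other hyperparameters are illustrative, not the exact notebook code:

```python
# Rough QLoRA setup sketch for PaliGemma (illustrative hyperparameters,
# not the exact example notebook); the training loop / Trainer is omitted.
import torch
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```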
replied to hakunamatata1997's post 2 months ago

@HakunaMatata1997 hello!
Off the top of my head I can't think of an OCR model specifically; I was mostly using EasyOCR. OCR is a problem that is pretty much solved, so most of the AI work around docs is focused on understanding documents (because it's more than image -> text: it involves text, charts, tables, the whole layout and more).
If you really want OCR, there are models like https://huggingface.co/facebook/nougat-base, which converts PDFs to markdown, for instance.
I can also recommend some models for document understanding in general (which work on text + charts + images + layout), either zero-shot or as a backbone to fine-tune.

posted an update 2 months ago
We recently shipped fine-grained access tokens on the Hugging Face Hub, which let you create tokens with super specific permissions.

For instance, if you want to collaborate with an external organization, you don't want to use your write token, since it can access everything you can access. Instead, you can scope the token's access to repositories under that org only, like below.
posted an update 3 months ago
I got asked about PaliGemma's document understanding capabilities, so I built a Space that has all the PaliGemma fine-tuned doc models 📄📊📖
merve/paligemma-doc
replied to their post 3 months ago

@Cuiunbo ah yes, right. These types of models are "OCR-free", meaning the model understands and responds to the image directly rather than running a separate OCR step on it. Those datasets are also OCR-free, I think. The good thing about the OCR-free approach is that features like layout, charts, tables etc. are also understood. Maybe try prompts that do pure OCR? High resolution also works well on handwriting etc.

replied to their post 3 months ago

@Cuiunbo I think in the model card you can see the OCR (and document understanding in general) fine-tuned models with the associated benchmarks on the test datasets.

posted an update 3 months ago
it's raining vision language models ☔️
CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part 🤓
You can try it yourself here: shi-labs/CuMo-7b-zero

The authors first pre-train the MLP while freezing the image encoder and text decoder, then warm up the whole network by unfreezing and fine-tuning, which they state stabilizes the visual instruction tuning when bringing in the experts. 🤓

The mixture-of-experts MLP blocks above are simply the same MLP block replicated: each expert is initialized from the single MLP that was trained during pre-training and fine-tuned in pre-finetuning.
It works very well (I also tested it myself): it outperforms the previous state of the art of its size, LLaVA-NeXT and IDEFICS2-8B, on several benchmarks! 😍
replied to their post 3 months ago

@Cuiunbo I think @giffmana et al. will release a technical report in the upcoming days. For the mix and fine-tuned models, the details should be in the model cards. As for a chatty model, I think that's not the intention of this release.

replied to their post 3 months ago

@MoonRide if you check the model card you can see the scores. The mix models are trained on a mix of academic benchmark datasets (COCO captions, VQAv2, OCR-VQA etc.), where you just say e.g. "caption" and it captions. These datasets often have shorter descriptions rather than long prompts, but they're grounded, so the models do well on the test sets of those benchmarks and can be used in many industry use cases (document AI etc., since they hardly hallucinate). For your prompt, I just input "caption" and it came up with a very grounded caption, for instance.

The main point of the PaliGemma release is to provide fine-tuneable models, not heavy models with wide zero-shot capabilities (where you input super long instructions or chat-like prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.

replied to their post 3 months ago

@MoonRide it's not about benchmarks; the training dataset of the mix checkpoint is simply different from your use case. I responded to your issue with more details.

posted an update 3 months ago
New open Vision Language Model by @Google : PaliGemma 💙🤍

📝 Comes as a 3B model, with pretrained, mix and fine-tuned checkpoints at 224, 448 and 896 resolution
🧩 Combination of Gemma 2B LLM and SigLIP image encoder
🤗 Supported in transformers

PaliGemma can do..
🧩 Image segmentation and detection! 🤯
📑 Detailed document understanding and reasoning
🙋 Visual question answering, captioning and any other VLM task!

Read our blog 🔖 hf.co/blog/paligemma
Try the demo 🪀 hf.co/spaces/google/paligemma
Check out the Spaces and the models all in the collection 📚 google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models google/paligemma-ft-models-6643b03efb769dad650d2dda
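A minimal inference sketch with the mix checkpoint (the image URL and the short task prompt are placeholders):

```python
# Minimal PaliGemma inference sketch with a mix checkpoint; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
inputs = processor(text="detect cat", images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
# Strip the prompt tokens before decoding the completion
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```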
posted an update 3 months ago
Just landed on the Hugging Face Hub: the community-led computer vision course 📖🤝
Learn everything from the fundamentals to the details of bleeding-edge vision transformers!
posted an update 3 months ago
I have built a Space to compare outputs from different vision language models. Which model should I add next? 👀
Try them yourself here merve/compare_VLMs
replied to xiaotianhan's post 4 months ago

Hiya, are you planning to open-source the models?

posted an update 4 months ago
I see you all sending your documents to closed-source APIs. This is not OK 👎 It breaks my heart 💔
I have seen many open-source document models, and I am amazed by what IDEFICS2 has done with document understanding 🤯🤩 It's not something you've ever seen before! HuggingFaceM4/idefics-8b

Please use it! It has an Apache 2.0 license ❤️
posted an update 4 months ago
The demo for IDEFICS-8B is out! HuggingFaceM4/idefics-8b

This checkpoint is not optimized for chat, but it works very well for various tasks, including visual question answering and document tasks 💬📑
The chatty one is coming soon!
posted an update 4 months ago
SegGPT is a vision generalist on image segmentation, quite like GPTs for computer vision ✨
It comes with the last release of transformers 🎁 Demo and more in this post!
SegGPT is an extension of Painter, where you speak to images with images: the model takes an image prompt, a transformed version of that image prompt, and the actual image you want the same transform applied to, and it is expected to output the transformed image.
SegGPT consists of a vanilla ViT with a decoder on top (linear, conv, linear).
The model is trained on diverse segmentation examples: the authors provide example image-mask pairs along with the actual input to be segmented, and the decoder head learns to reconstruct the mask output.
This generalizes pretty well!
The authors do not claim state-of-the-art results, as the model is mainly meant for zero-shot and few-shot inference. They also do prompt tuning, where they freeze the parameters of the model and only optimize the image tensor (the input context).
Thanks to 🤗 transformers you can use this model easily!
See here: https://huggingface.co/docs/transformers/en/model_doc/seggpt
I have built an app for you to try it out. I combined SegGPT with the Depth Anything model, so you don't have to upload image-mask prompts as your prompt pair 🤗
Try it here merve/seggpt-depth-anything
Also check out the collection merve/seggpt-660466a303bc3cd7559d271b
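For reference, a rough sketch following the transformers docs linked above; the placeholder images and mask stand in for a real prompt pair, so double-check the docs page for the exact pre- and post-processing:

```python
# Rough SegGPT sketch following the transformers docs; the input image,
# prompt image and prompt mask below are placeholders for real data.
import torch
from PIL import Image
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

checkpoint = "BAAI/seggpt-vit-large"
processor = SegGptImageProcessor.from_pretrained(checkpoint)
model = SegGptForImageSegmentation.from_pretrained(checkpoint)

image_input = Image.new("RGB", (448, 448))   # image you want segmented (placeholder)
image_prompt = Image.new("RGB", (448, 448))  # example image (placeholder)
mask_prompt = Image.new("RGB", (448, 448))   # example mask for the example image (placeholder)

inputs = processor(
    images=image_input, prompt_images=image_prompt, prompt_masks=mask_prompt, return_tensors="pt"
)
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = [image_input.size[::-1]]
mask = processor.post_process_semantic_segmentation(outputs, target_sizes)[0]
```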
replied to davanstrien's post 4 months ago

I think it would be nice if we could have what the data looks like in the TL;DR, how it was curated, the license, what type of model it can be trained with and so on. It would be very useful for me 🤩

posted an update 4 months ago
LLaVA-NeXT was recently merged into Hugging Face transformers, and it outperforms many of the closed-source models like Gemini on various benchmarks 🤩 Let's take a look!
Demo: merve/llava-next
Notebook: https://colab.research.google.com/drive/1afNudu72SNWZCYtCVrRlb9T9Vj9CFJEK?usp=sharing
LLaVA is essentially a vision language model that consists of a ViT-based CLIP encoder, an MLP projection and Vicuna as the decoder ✨
LLaVA 1.5 was released with Vicuna, but LLaVA NeXT (1.6) is released with four different LLMs:
- Nous-Hermes-Yi-34B
- Mistral-7B
- Vicuna 7B & 13B
Mistral and Nous-Hermes-Yi-34B perform better and come with friendlier licenses for commercial use.
Moreover, according to the authors' findings, the improvements come from a more diverse, higher-quality data mixture and dynamic high resolution.
The LLaVA variant based on Nous-Hermes-Yi-34B outperforms many other models, including Gemini, on various multimodal understanding and generation benchmarks 😊
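A minimal inference sketch with the Mistral-7B variant (the image URL and question are placeholders):

```python
# Minimal LLaVA-NeXT (1.6) inference sketch with the Mistral-7B variant;
# the image URL is a placeholder.
import requests, torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```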
replied to vikhyatk's post 5 months ago

I really like your work, and I did check the moondream GitHub repository. I was wondering if you'd like to share your training details and findings on aligning the text decoder, vision encoder and projection layer.

posted an update 5 months ago
I love vision language models 💗
My favorite is KOSMOS-2, because it's a grounded model (it doesn't hallucinate).
In this demo you can:
- ask a question about the image,
- do detailed/brief captioning,
- localize the objects! 🤯
It's just amazing for a VLM to return bounding boxes 🤩
Try it here merve/kosmos2
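A minimal grounded-captioning sketch with transformers (the image URL is a placeholder):

```python
# Minimal KOSMOS-2 grounded captioning sketch; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoProcessor, Kosmos2ForConditionalGeneration

model_id = "microsoft/kosmos-2-patch14-224"
processor = AutoProcessor.from_pretrained(model_id)
model = Kosmos2ForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/snowman.jpg", stream=True).raw)
prompt = "<grounding>An image of"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

caption, entities = processor.post_process_generation(generated_text)
print(caption)
print(entities)  # phrases with their character spans and normalized bounding boxes
```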
replied to akhaliq's post 5 months ago

one of the best research questions I've seen recently 😊

posted an update 5 months ago
New foundation model for document understanding and generation in transformers 🤩
UDOP by MSFT is a bleeding-edge model that is capable of many tasks, including question answering, document editing and more! 🤯
Demo 👉 merve/UDOP
It is a model that combines vision, text and layout. 📝
This model is very interesting because the input representation truly captures the nature of the document modality: text, where the text is, and the layout of the document matters!
If you know T5, it resembles that: it's pre-trained on both self-supervised and supervised objectives over text, image and layout.
To switch between tasks, one simply needs to change the task-specific prompt at the beginning, e.g. for QA, one prepends the input with "Question answering."
As for the architecture, it's like T5, except it has a single encoder that takes in text, image and layout, and two decoders (text-layout and vision decoders) combined into one.
The vision decoder is a masked autoencoder (thus the capabilities of document editing).
For me, the most interesting capability is document reconstruction, document editing and layout re-arrangement. This decoder isn't released though because it could be used maliciously to fake document editing.
Overall, the model performs very well on the document understanding benchmark (DUE) as well as on information extraction (FUNSD, CORD) and classification (RVL-CDIP) across the vision, text and layout modalities.
You can learn more about the model from the resources below (h/t to @nielsr). Thanks a lot for reading 🤗
Docs: https://huggingface.co/docs/transformers/main/en/model_doc/udop 📚
Checkpoints: microsoft/udop-65e625124aee97415b88b513
Demo notebooks: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/UDOP 📕
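For reference, a rough document-QA sketch; it assumes the processor's default OCR path handles the words and boxes, and the image URL and question are placeholders, so treat it as a starting point rather than the exact API usage:

```python
# Rough UDOP document-QA sketch; assumes the processor runs OCR internally by
# default, and the document image URL and question are placeholders.
import requests
from PIL import Image
from transformers import UdopProcessor, UdopForConditionalGeneration

model_id = "microsoft/udop-large"
processor = UdopProcessor.from_pretrained(model_id)
model = UdopForConditionalGeneration.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/invoice.png", stream=True).raw).convert("RGB")
prompt = "Question answering. What is the invoice date?"  # task prefix + question

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```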
posted an update 5 months ago
I've tried DoRA (https://arxiv.org/abs/2402.09353) with SDXL using PEFT, and the outputs are quite detailed 🤩🌟
As usual, I trained on a LEGO dataset I compiled, and I compared the results with a previously trained pivotal-tuned model and the plain DreamBooth model before that 😊

Notebook by @linoyts https://colab.research.google.com/drive/134mt7bCMKtCYyYzETfEGKXT1J6J50ydT?usp=sharing
Integration to PEFT by @BenjaminB https://github.com/huggingface/peft/pull/1474 (more info in the PR)
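For reference, enabling DoRA in PEFT is just a flag on LoraConfig; a minimal sketch (the rank and the SDXL attention target modules are illustrative):

```python
# Minimal sketch of enabling DoRA in PEFT: it's a flag on LoraConfig.
# Rank and target modules are illustrative for SDXL attention layers;
# pass this config to get_peft_model or your DreamBooth training script.
from peft import LoraConfig

dora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    use_dora=True,  # decompose weights into magnitude + direction (DoRA)
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
```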
replied to vladbogo's post 5 months ago

Thanks a lot for the blog post, it's very informative 🤗

posted an update 6 months ago
There's a new leaderboard for vision language models 🤩
The models are ranked based on ELO; you can rate the responses to preselected examples or try it with your own input 🤗
WildVision/vision-arena
posted an update 6 months ago
Google released a paper on chess that doesn't rely on MCTS (the approach behind AlphaZero) ♟️
Their secret sauce is... synthetic data pseudolabeled by the Stockfish engine 😀
2024 really is the year of synthetic data across all domains!
There's a nice discussion here, join us Grandmaster-Level Chess Without Search (2402.04494)
replied to victor's post 6 months ago

➕ for comment upvotes or ⬆️

replied to xianbao's post 6 months ago

What is the limitation that kept you from using your own EVA-CLIP here?