Aman Kapoor

ak0601

https://linktr.ee/aman_kapoor

AI & ML interests

Deep Learning, Transformers, Computer Vision, Natural Language Processing

Recent Activity

updated a Space 3 days ago

ak0601/Eng_tutor

updated a Space 21 days ago

ak0601/Chat_api

updated a Space 21 days ago

ak0601/Percentile_rank

ak0601's activity

updated a Space 3 days ago

Build error

👀

Gemini Chatbot

updated 2 Spaces about 1 month ago

Running

👁

Precollege Scraper

reacted to merve's post with 🔥 4 months ago

I have put together a notebook on Multimodal RAG, where we do not process the documents with hefty pipelines but natively use:
- vidore/colpali for retrieval 📖 it doesn't need indexing with image-text pairs but just images!
- Qwen/Qwen2-VL-2B-Instruct for generation 💬 directly feed images as is to a vision language model with no processing to text!
I used the ColPali implementation from the new 🐭 Byaldi library by @bclavie 🤗
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
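
The retrieve-then-generate flow described in this post can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the notebook: the function names (`maxsim_score`, `retrieve`, `answer`), the toy vectors, and the `generate` callback are all hypothetical. In practice the embeddings would come from a ColPali model and `generate` would wrap a vision language model such as Qwen2-VL. The scoring follows the late-interaction (MaxSim) idea that ColPali-style retrievers use: each query token vector is matched to its best page patch vector, and the matches are summed.

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take the best dot
    product against any page patch vector, then sum over query vectors."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def retrieve(query_vecs, pages, k=1):
    """Rank pages by MaxSim score and return the top k.

    Each page is a dict with an 'image' reference and its 'patch_vecs'.
    Only images are indexed; no extracted text is involved.
    """
    ranked = sorted(
        pages,
        key=lambda page: maxsim_score(query_vecs, page["patch_vecs"]),
        reverse=True,
    )
    return ranked[:k]


def answer(question, query_vecs, pages, generate, k=1):
    """Retrieve the best page images and pass them, unprocessed, to a
    generator callback (a stand-in for a vision language model)."""
    images = [page["image"] for page in retrieve(query_vecs, pages, k)]
    return generate(question, images)


# Toy data (hypothetical): two "pages", each with two 2-d patch vectors.
pages = [
    {"image": "page_1.png", "patch_vecs": [[1.0, 0.0], [0.0, 1.0]]},
    {"image": "page_2.png", "patch_vecs": [[0.5, 0.5], [0.2, 0.1]]},
]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

# Stand-in generator that just reports which image it was handed.
reply = answer(
    "What does the chart show?",
    query_vecs,
    pages,
    generate=lambda question, images: f"{question} -> {images[0]}",
)
print(reply)
```

The point of the design is that the page image itself is what gets retrieved and handed to the generator, so no OCR or layout-parsing step sits between the document and the model.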

reacted to merve's post with ❤️ 8 months ago

Just landed on the Hugging Face Hub: a community-led computer vision course 📖🤍
Learn everything from the fundamentals to the details of bleeding-edge vision transformers!

reacted to xiaotianhan's post with 🚀👍 8 months ago

🎉 🎉 🎉 Happy to share our recent work. We noticed that image resolution plays an important role, both in improving the performance of multimodal large language models (MLLMs) and in Sora-style any-resolution encoder-decoders. We hope this work can help lift the 224x224 resolution limit in ViT.

ViTAR: Vision Transformer with Any Resolution (2403.18361)