Aman Kapoor

ak0601

AI & ML interests

Deep Learning, Transformers, Computer Vision, Natural Language Processing

Recent Activity

updated a Space 3 days ago
ak0601/Eng_tutor
updated a Space 21 days ago
ak0601/Chat_api
updated a Space 21 days ago
ak0601/Percentile_rank

Organizations

Hugging Face for Computer Vision, Hugging Face Discord Community

ak0601's activity

reacted to merve's post with 🔥 4 months ago
I have put together a notebook on Multimodal RAG, where we do not process the documents with hefty pipelines but natively use:
- vidore/colpali for retrieval 📖 it doesn't need indexing with image-text pairs, just images!
- Qwen/Qwen2-VL-2B-Instruct for generation 💬 images are fed as-is to a vision language model, with no conversion to text!
I used the ColPali implementation in the new 🐭 Byaldi library by @bclavie 🤗
https://github.com/answerdotai/byaldi
Link to notebook: https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb
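
To illustrate why retrieval over page images alone can work, here is a minimal sketch of late-interaction (MaxSim) scoring, the ColBERT-style matching that ColPali is built on: each query token embedding is compared against every image patch embedding of a page, the best match per token is kept, and the matches are summed. This is not Byaldi's API or the notebook's code. The function names (`maxsim_score`, `rank_pages`) and the tiny 2-dimensional embeddings are hypothetical, chosen only to keep the example self-contained; real embeddings come from the model.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    page patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page names sorted from highest to lowest MaxSim score."""
    scores = {name: maxsim_score(query_tokens, patches) for name, patches in pages.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy embeddings (assumption: real ones are model outputs).
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "page_a": [[1.0, 0.0], [0.0, 1.0]],  # has a strong match for each token
        "page_b": [[0.5, 0.5], [0.5, 0.5]],  # only weak matches
    }
    for name in rank_pages(query, pages):
        print(name, maxsim_score(query, pages[name]))
```

The top-ranked page images would then be passed directly to the vision language model together with the question, which is the generation step described in the post.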
reacted to merve's post with ❤️ 8 months ago
Just landed on the Hugging Face Hub: a community-led computer vision course 📖
Learn everything from the fundamentals to the details of bleeding-edge vision transformers!
reacted to xiaotianhan's post with 🚀👍 8 months ago
🎉 🎉 🎉 Happy to share our recent work. We noticed that image resolution plays an important role, whether in improving the performance of multimodal large language models (MLLMs) or in Sora-style any-resolution encoder-decoders. We hope this work can help lift the 224x224 resolution limit in ViT.

ViTAR: Vision Transformer with Any Resolution (2403.18361)
updated a Space 11 months ago
updated a model 12 months ago
New activity in hf-vision/course-assets about 1 year ago
New activity in hf-vision/course-assets about 1 year ago