---
title: Conversational Image Recognition Chatbot
emoji: 🌖
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
short_description: Conversational Image Recognition chatbot
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Conversational Image Recognition Chatbot

## Introduction

This application combines natural language processing with image recognition to answer questions about images. Users upload an image, the app detects the objects in it and extracts visual information, and users then ask questions about the image's content in a chat. The goal is an interactive way to understand and analyze visual content through natural-language dialogue.

## Solution we offer
- A chatbot that combines a Vision Language Model (VLM) with a zero-shot object detection model.
- The user uploads an image, detects the objects in it, and starts a chat session.
- Improved spatial understanding of objects in the image, which comes from communication between the two models.
- Image question answering combined with object detection.
- The system exposes both the chat history and the detection output for interaction.
- VLM choice informed by 9 representative VLMs evaluated on 10 benchmarks in the OpenCompass multimodal leaderboard.
- Uses well-performing, widely used object detection models such as OWLv2 and Grounding DINO.
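
The README does not say how the two models communicate. One plausible approach, shown here as a minimal sketch and not as the app's actual code, is to turn the detector's output into text and prepend it to the user's question before it reaches the VLM. The `Detection` class, the function names, and the `min_score` cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical structure, not taken from the app)."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def describe_detections(detections: List[Detection], min_score: float = 0.3) -> str:
    """Turn detector output into plain-text lines that a VLM prompt can include."""
    kept = [d for d in detections if d.score >= min_score]
    # Put the highest-confidence objects first so the most reliable context leads.
    kept.sort(key=lambda d: d.score, reverse=True)
    if not kept:
        return "No objects were detected with sufficient confidence."
    lines = []
    for d in kept:
        x0, y0, x1, y1 = (round(v) for v in d.box)
        lines.append(
            f"- {d.label} (confidence {d.score:.2f}) at box [{x0}, {y0}, {x1}, {y1}]"
        )
    return "\n".join(lines)


def build_prompt(detections: List[Detection], question: str) -> str:
    """Combine the detection summary and the user's question into one prompt."""
    context = describe_detections(detections)
    return f"Objects detected in the image:\n{context}\n\nQuestion: {question}"
```

Passing box coordinates as text is one way to give the VLM explicit spatial hints; the assumed cutoff keeps low-confidence boxes from adding noise to the prompt.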

## How it addresses the problem
- Both models were pretrained on household datasets, which helps them address the problem.
- Strong results, with fast inference when a GPU is available.
- The approach is simple and can be implemented with minimal effort.

## Unique value propositions
- Responses that are lexically and grammatically correct.
- Uses state-of-the-art, novel research in image understanding. Everything we used is open source.
- Flexibility to swap in different models, with reference documentation.
- A working, hosted application.

## Technologies used
- **Programming languages**: Python
- **Libraries**: Transformers, PyTorch, image libraries such as PIL
- **Hardware**: NVIDIA T4 medium (8 vCPU, 30 GB RAM)
- **Platforms**: Hugging Face, arXiv, other research articles
- **Deployment tools**: Gradio, Streamlit, Hugging Face Spaces hardware


## Application demo

- Uses google/owlv2-base-patch16-ensemble as the zero-shot object detection model and Qwen/Qwen2-VL-2B-Instruct as the VLM.
- Built with the Gradio framework and the Transformers library.

Demo links: Hugging Face Space and Google Colab demo (URLs under Live Links).

The free-tier Hugging Face Space has 2 vCPUs and 16 GB RAM with no GPU, so the chatbot's inference is very slow there. We strongly recommend the Google Colab demo instead, which takes minimal effort to start. Inference is drastically faster on Google Colab with a T4 GPU runtime.
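
The detection step and the CPU-versus-GPU difference described above can be sketched as follows. This is an illustration under assumptions, not the app's code: it uses the Transformers `zero-shot-object-detection` pipeline (the app may load the model differently), and the function names, candidate labels, and threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def pick_device() -> int:
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helpers work without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


def detect_objects(image_path: str, labels: List[str], threshold: float = 0.1) -> List[Dict]:
    """Run zero-shot detection and return the pipeline's list of result dicts."""
    from transformers import pipeline  # heavy import, kept inside the function

    detector = pipeline(
        "zero-shot-object-detection",
        model="google/owlv2-base-patch16-ensemble",
        device=pick_device(),
    )
    # Each result is a dict with "score", "label" and "box" keys.
    return detector(image_path, candidate_labels=labels, threshold=threshold)


def count_labels(results: List[Dict]) -> Dict[str, int]:
    """Count how many boxes were returned per label."""
    return dict(Counter(r["label"] for r in results))
```

A call such as `detect_objects("kitchen.jpg", ["cup", "chair"])` (the file name is a placeholder) downloads the model weights on first use. On a CPU-only runtime like the free-tier Space, `pick_device()` returns -1 and the same code runs noticeably slower.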

### Image Demos
- [Demo Image 1](https://drive.google.com/file/d/1AZNrdTZMSDdGAPtgacQ4V1b8UWKBqaAY/view?usp=sharing)
- [Demo Image 2](https://drive.google.com/file/d/1aUi75v0I3qwcHA2HBx6zls1JX0s0ZLyg/view?usp=sharing)


### Live Links


- **Hugging Face Spaces**: [Conversational Image Recognition Chatbot](https://huggingface.co/spaces/vigneshwar472/Conversational-image-recognition-chatbot)  
  ***Note**: Hugging Face Spaces has GPU constraints, so performance may be slower.*

- **Google Colab Notebook**: [Google Colab Demo](https://colab.research.google.com/drive/1UcY1X5AV5yy9jTuxBnmDWdAjETCi-ni2?usp=sharing)  
  ***Recommended**: Use the Google Colab notebook for the demo; because of the GPU constraints on Hugging Face Spaces, the application may run slowly there.*





