UniquePratham
committed on
Upload 5 files
DualTextOCRFusion
- .gitignore +55 -0
- README.md +150 -13
- app.py +72 -0
- ocr_cpu.py +97 -0
- requirements.txt +14 -0
.gitignore
ADDED
@@ -0,0 +1,55 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Virtual environment
venv/
env/
.venv/
.env/
ENV/
.env.bak/
*.env
__pycache__

# VS Code
.vscode/
.history/

# PyCharm
.idea/

# Jupyter Notebook
.ipynb_checkpoints/

# Logs
*.log

# Mac OS files
.DS_Store

# Streamlit Cache (Optional)
streamlit_cache/
README.md
CHANGED
@@ -1,13 +1,150 @@
# 🔍 DualTextOCRFusion

**DualTextOCRFusion** is a web-based Optical Character Recognition (OCR) application that lets users upload images containing both Hindi and English text, extract the text, and search it for keywords. The app uses models such as **ColPali’s Byaldi + Qwen2-VL** or **General OCR Theory (GOT)** for multilingual text extraction.

## Features

- **Multilingual OCR**: Extract text from images containing both **Hindi** and **English**.
- **Keyword Search**: Search for specific keywords in the extracted text.
- **User-Friendly Interface**: Simple, intuitive interface for uploading images and searching.
- **Deployed Online**: Accessible through a live URL.

## Technologies Used

- **Python**: Backend logic.
- **Streamlit**: For building the web interface.
- **Hugging Face Transformers**: For integrating OCR models (Qwen2-VL or GOT).
- **PyTorch**: For deep learning inference.
- **Pytesseract**: Optional OCR engine.
- **OpenCV**: For image preprocessing.

## Project Structure

```
DualTextOCRFusion/
│
├── app.py                  # Main Streamlit application
├── ocr_cpu.py              # Handles OCR extraction using the selected model (CPU)
├── .gitignore              # Files and directories to ignore in Git
├── .streamlit/
│   └── config.toml         # Streamlit theme configuration
├── requirements.txt        # Dependencies for the project
└── README.md               # This file
```

## How to Run Locally

### Prerequisites

- Python 3.8 or above installed on your machine.
- Tesseract installed if you want to use `pytesseract` (optional when using Hugging Face models). You can download Tesseract from [its GitHub repository](https://github.com/tesseract-ocr/tesseract).

### Steps

1. **Clone the Repository**:

   ```bash
   git clone https://github.com/yourusername/dual-text-ocr-fusion.git
   cd dual-text-ocr-fusion
   ```

2. **Install Dependencies**:

   Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. **Run the Application**:

   Start the Streamlit app:

   ```bash
   streamlit run app.py
   ```

4. **Open the App**:

   Once the server starts, the app will be available in your browser at:

   ```
   http://localhost:8501
   ```

### Usage

1. **Upload an Image**: Upload an image containing Hindi and English text in JPG, JPEG, or PNG format.
2. **View Extracted Text**: The app extracts the text from the image and displays it.
3. **Search for Keywords**: Enter any keyword to search within the extracted text.

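
The keyword search in step 3 can be sketched in plain Python. This is a minimal illustration rather than the app's actual code: the function name `find_keyword` and the `context` parameter are hypothetical, and matching is done with a case-insensitive regular expression so that the reported positions refer to the original text.

```python
import re
from typing import List


def find_keyword(text: str, keyword: str, context: int = 20) -> List[str]:
    """Return a snippet of `text` around every case-insensitive match of `keyword`."""
    if not keyword:
        return []
    # re.escape makes the keyword match literally, even if it contains
    # characters that are special in regular expressions.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        lo = max(0, match.start() - context)
        hi = min(len(text), match.end() + context)
        snippets.append(text[lo:hi])
    return snippets


sample = "Invoice चालान number 42. INVOICE total due."
for snippet in find_keyword(sample, "invoice", context=10):
    print(snippet)
```

Returning snippets instead of a yes/no answer lets the interface show where each match occurs. Hindi has no letter case, so the case-insensitive flag only affects the English part of the text.
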

## Deployment

The app is deployed on **Streamlit Sharing** and can be accessed via the live URL:

**[Live Application](https://your-app-link.streamlit.app)**

## Customization

### Changing the OCR Model

As uploaded, `app.py` imports `extract_text_got` from `ocr_cpu.py`, so the app runs the **General OCR Theory (GOT)** model. To switch models, change that import in `app.py`.

- **For Qwen2-VL** (assumes an `ocr.py` module that defines `extract_text_byaldi`; no such module is among the uploaded files):

  ```python
  from ocr import extract_text_byaldi
  ```

- **For General OCR Theory (GOT)**:

  ```python
  from ocr_cpu import extract_text_got
  ```

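
One way to make the model switch a one-line change is a small registry that maps a model name to its extractor. The sketch below is hypothetical throughout: `EXTRACTORS`, `get_extractor`, and the two stub functions are illustrative names, and the stubs only stand in for the real `extract_text_got` and `extract_text_byaldi`, which load large models and are not reproduced here.

```python
from typing import Callable, Dict


def extract_text_got_stub(image_path: str) -> str:
    """Placeholder for the GOT extractor (hypothetical stub)."""
    return f"[GOT] text from {image_path}"


def extract_text_byaldi_stub(image_path: str) -> str:
    """Placeholder for the Byaldi + Qwen2-VL extractor (hypothetical stub)."""
    return f"[Qwen2-VL] text from {image_path}"


# Registry of available extractors, keyed by a lowercase model name.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "got": extract_text_got_stub,
    "qwen2-vl": extract_text_byaldi_stub,
}


def get_extractor(name: str) -> Callable[[str], str]:
    """Look up an extractor by name, with a readable error for unknown names."""
    try:
        return EXTRACTORS[name.lower()]
    except KeyError:
        choices = ", ".join(sorted(EXTRACTORS))
        raise ValueError(f"Unknown OCR model {name!r}; choose from: {choices}") from None


extract = get_extractor("GOT")
print(extract("sample.png"))
```

With a registry like this, the model name could come from a configuration value or a sidebar selection instead of an edited import.
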

### Custom UI Theme

You can customize the look and feel of the application by modifying the `.streamlit/config.toml` file. Adjust colors, fonts, and layout options to suit your preferences.

## Example Images

Here are some sample images you can use to test the OCR functionality:

1. **Sample 1**: A document with mixed Hindi and English text.
2. **Sample 2**: An image with only Hindi text for multilingual OCR testing.

## Contributing

If you'd like to contribute to this project, feel free to fork the repository and submit a pull request. Follow these steps:

1. Fork the project.
2. Create a feature branch:

   ```bash
   git checkout -b feature-branch
   ```

3. Commit your changes:

   ```bash
   git commit -am 'Add new feature'
   ```

4. Push to the branch:

   ```bash
   git push origin feature-branch
   ```

5. Open a pull request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Credits

- **Streamlit**: For the easy-to-use web interface.
- **Hugging Face Transformers**: For the powerful OCR models.
- **Tesseract**: For optional OCR functionality.
- **ColPali & GOT Models**: For the multilingual OCR support.
app.py
ADDED
@@ -0,0 +1,72 @@
import streamlit as st
from ocr_cpu import extract_text_got  # The updated OCR function
import json

# --- UI Styling ---
st.set_page_config(page_title="DualTextOCRFusion",
                   layout="centered", page_icon="🔍")

st.markdown(
    """
    <style>
    .reportview-container {
        background: #f4f4f4;
    }
    .sidebar .sidebar-content {
        background: #e0e0e0;
    }
    h1 {
        color: #007BFF;
    }
    .upload-btn {
        background-color: #007BFF;
        color: white;
        padding: 10px;
        border-radius: 5px;
        text-align: center;
    }
    </style>
    """, unsafe_allow_html=True
)

# --- Title ---
st.title("🔍 DualTextOCRFusion")
st.write("Upload an image with **Hindi** and **English** text to extract and search for keywords.")

# --- Image Upload Section ---
uploaded_file = st.file_uploader(
    "Choose an image file", type=["jpg", "jpeg", "png"])

if uploaded_file is not None:
    st.image(uploaded_file, caption='Uploaded Image', use_column_width=True)

    # Extract text from the image using the selected OCR function (GOT)
    with st.spinner("Extracting text using the model..."):
        try:
            extracted_text = extract_text_got(
                uploaded_file)  # Pass uploaded_file directly
            if not extracted_text.strip():
                st.warning("No text extracted from the image.")
        except Exception as e:
            st.error(f"Error during text extraction: {str(e)}")
            extracted_text = ""

    # Display extracted text
    st.subheader("Extracted Text")
    st.text_area("Text", extracted_text, height=250)

    # Save extracted text for search
    if extracted_text:
        with open("extracted_text.json", "w") as json_file:
            json.dump({"text": extracted_text}, json_file)

    # --- Keyword Search ---
    st.subheader("Search for Keywords")
    keyword = st.text_input(
        "Enter a keyword to search in the extracted text")

    if keyword:
        if keyword.lower() in extracted_text.lower():
            st.success(f"Keyword **'{keyword}'** found in the text!")
        else:
            st.error(f"Keyword **'{keyword}'** not found.")
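
The save step in `app.py` calls `json.dump` with its default settings, which write non-ASCII characters such as Hindi text as `\uXXXX` escape sequences. The file still loads correctly, but it is hard to read by eye. A minimal sketch of a round trip that keeps the file human-readable follows. The helper names `save_text` and `load_text` are hypothetical, and passing `ensure_ascii=False` together with an explicit UTF-8 encoding is a suggestion, not what the committed code does.

```python
import json
import os
import tempfile


def save_text(path: str, text: str) -> None:
    """Write the extracted text as UTF-8 JSON without escaping non-ASCII characters."""
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump({"text": text}, json_file, ensure_ascii=False)


def load_text(path: str) -> str:
    """Read the extracted text back from the JSON file."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)["text"]


# Round trip in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "extracted_text.json")
    save_text(json_path, "नमस्ते Hello")
    print(load_text(json_path))
```
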
ocr_cpu.py
ADDED
@@ -0,0 +1,97 @@
import os
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "ucaslcl/GOT-OCR2_0"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True, return_tensors='pt'
)

# Load the model
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Ensure the model is in evaluation mode and loaded on CPU
device = torch.device("cpu")
dtype = torch.float32  # Use float32 on CPU
model = model.eval().to(device)


def _as_text(outputs):
    """Normalize a model result to a stripped string.

    The result may be a plain string or a list of strings. Indexing a
    string with [0] would keep only its first character, so both cases
    are handled explicitly.
    """
    if isinstance(outputs, (list, tuple)):
        outputs = outputs[0] if outputs else ""
    return outputs.strip() if isinstance(outputs, str) else ""


# OCR function
def extract_text_got(uploaded_file):
    """Use GOT-OCR2.0 model to extract text from the uploaded image."""
    # Defined before the try block so the finally clause can always see it
    temp_file_path = 'temp_image.jpg'
    try:
        with open(temp_file_path, 'wb') as temp_file:
            temp_file.write(uploaded_file.read())  # Save file

        # OCR attempts
        ocr_types = ['ocr', 'format']
        fine_grained_options = ['ocr', 'format']
        color_options = ['red', 'green', 'blue']
        box = [10, 10, 100, 100]  # Example box for demonstration
        multi_crop_types = ['ocr', 'format']

        # Run the model without autocast (not necessary for CPU)
        for ocr_type in ocr_types:
            with torch.no_grad():
                outputs = model.chat(
                    tokenizer, temp_file_path, ocr_type=ocr_type
                )
            text = _as_text(outputs)
            if text:
                return text  # Return if successful

        # Try FINE-GRAINED OCR with box options
        for ocr_type in fine_grained_options:
            with torch.no_grad():
                outputs = model.chat(
                    tokenizer, temp_file_path, ocr_type=ocr_type, ocr_box=box
                )
            text = _as_text(outputs)
            if text:
                return text  # Return if successful

        # Try FINE-GRAINED OCR with color options
        for ocr_type in fine_grained_options:
            for color in color_options:
                with torch.no_grad():
                    outputs = model.chat(
                        tokenizer, temp_file_path, ocr_type=ocr_type, ocr_color=color
                    )
                text = _as_text(outputs)
                if text:
                    return text  # Return if successful

        # Try MULTI-CROP OCR
        for ocr_type in multi_crop_types:
            with torch.no_grad():
                outputs = model.chat_crop(
                    tokenizer, temp_file_path, ocr_type=ocr_type
                )
            text = _as_text(outputs)
            if text:
                return text  # Return if successful

        # If no attempt produced any text
        return "No text extracted."

    except Exception as e:
        return f"Error during text extraction: {str(e)}"

    finally:
        if os.path.exists(temp_file_path):
            os.remove(temp_file_path)
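
`extract_text_got` writes every upload to the same fixed path, `temp_image.jpg`, so two sessions running at the same time could overwrite each other's file. One possible alternative, sketched here with the standard `tempfile` module, gives each upload its own path. The context manager name `saved_upload` is hypothetical, and `io.BytesIO` stands in for Streamlit's uploaded-file object, which is assumed only to provide a `read()` method.

```python
import contextlib
import io
import os
import tempfile


@contextlib.contextmanager
def saved_upload(uploaded_file, suffix=".jpg"):
    """Write the upload to a unique temporary file, yield its path, then delete it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # os.fdopen takes ownership of the descriptor and closes it on exit.
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(uploaded_file.read())
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


# The file exists only inside the with block.
with saved_upload(io.BytesIO(b"fake image bytes")) as image_path:
    print(os.path.basename(image_path), os.path.getsize(image_path))
```

Inside the `with` block, `image_path` could be passed to the model in place of `temp_file_path`; cleanup happens even if the OCR call raises.
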
requirements.txt
ADDED
@@ -0,0 +1,14 @@
torch==2.0.1
torchvision==0.15.2
transformers==4.37.2
megfile==3.1.2
tiktoken
verovio
opencv-python
cairosvg
accelerate
numpy==1.26.4
loadimg
pillow
markdown
shutils
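
Only five of the fourteen entries pin a version (`torch`, `torchvision`, `transformers`, `megfile`, and `numpy`); the rest resolve to whatever release is current at install time. A small sketch for listing the unpinned entries follows. The function name `parse_pins` is hypothetical, and it handles only the two forms used in this file, `name` and `name==version`, not the full requirements syntax.

```python
from typing import Dict, Optional


def parse_pins(requirements_text: str) -> Dict[str, Optional[str]]:
    """Map each requirement name to its pinned version, or None if unpinned."""
    pins: Dict[str, Optional[str]] = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, separator, version = line.partition("==")
        pins[name.strip()] = version.strip() if separator else None
    return pins


sample = """torch==2.0.1
tiktoken
numpy==1.26.4
pillow
"""
pins = parse_pins(sample)
unpinned = [name for name, version in pins.items() if version is None]
print("Unpinned:", ", ".join(unpinned))
```
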