---

title: Web-Based-Text-Extraction-and-Retrieval-System
emoji: 📄  # You can choose any emoji that represents your app
colorFrom: blue  # Start color for the gradient background
colorTo: green  # End color for the gradient background
sdk: streamlit  # Your app uses Streamlit
sdk_version: "1.21.0"  # Version of Streamlit you are using
app_file: app.py  # Entry point of your application
pinned: false
---


# Web-Based Text Extraction and Retrieval System

This project is a web application that performs Optical Character Recognition (OCR) on images and highlights keywords within the extracted text. The system supports both English and Hindi languages, allowing users to upload images, extract text, and search for specific keywords within the extracted content.

## Features
- **Language Support**: English and Hindi
- **OCR**: Extracts text from uploaded images.
- **Keyword Search**: Highlights specified keywords in the extracted text.
- **Multiple Image Formats**: Supports PNG, JPG, and JPEG image formats.
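
The format restriction can be checked before any OCR runs. The snippet below is a minimal sketch, not the app's actual code: `SUPPORTED_EXTENSIONS` and `is_supported_image` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Hypothetical constant mirroring the formats listed above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def is_supported_image(filename):
    """Return True if the file name ends in one of the supported extensions."""
    # Compare case-insensitively so "scan.PNG" is accepted too.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

In a Streamlit app the same restriction is usually expressed through the `type` argument of `st.file_uploader`, for example `type=["png", "jpg", "jpeg"]`.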

## Tech Stack
- **Python**
- **Streamlit**: Web interface for interactive image upload and keyword search.
- **Hugging Face Transformers**: Used for text extraction in English.
- **EasyOCR**: For Hindi text extraction from images.
- **PIL (Pillow)**: To open and handle uploaded images.
- **PyTorch (`torch`)**: For working with the model and tokenizers.
- **NumPy**: For image processing.

## How it Works
### English OCR Flow:
1. Upload an image containing text.
2. The application uses a Hugging Face pre-trained model to extract text.
3. The extracted text is displayed, and users can search for keywords.
4. The keywords are highlighted within the extracted text.
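
The highlighting step can be sketched with a regular expression. This is an illustrative example under stated assumptions, not the code in `app.py`: the function name `highlight_keywords` and the Markdown bold markers are choices made here, and matching is assumed to be case-insensitive.

```python
import re


def highlight_keywords(text, keywords, start="**", end="**"):
    """Wrap every case-insensitive occurrence of a keyword in start/end markers."""
    # Drop blank entries; try longer keywords first so they win over
    # shorter keywords that they contain.
    terms = sorted({k.strip() for k in keywords if k.strip()}, key=len, reverse=True)
    if not terms:
        return text
    # re.escape keeps characters such as "." or "+" in a keyword literal.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"{start}{m.group(0)}{end}", text)


print(highlight_keywords("OCR reads text. ocr is fun.", ["ocr"]))
```

Because the function works on Unicode strings, the same call also handles Hindi keywords. Passing `start="<mark>"` and `end="</mark>"` and rendering with `st.markdown(..., unsafe_allow_html=True)` is one way to get coloured highlights in Streamlit.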

### Hindi OCR Flow:
1. Upload an image with Hindi text.
2. EasyOCR is used to detect and extract Hindi text from the image.
3. Users can search for Hindi keywords, which will be highlighted in the extracted content.
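
With its default settings, EasyOCR's `readtext` returns one `(bounding_box, text, confidence)` entry per detected fragment, so the fragments have to be joined before keyword search. The sketch below shows one way to do that. The helper name `join_ocr_fragments` and the `min_confidence` threshold are assumptions introduced here, and the sample data is hand-written rather than real OCR output.

```python
def join_ocr_fragments(results, min_confidence=0.3):
    """Join recognised fragments into one string, skipping low-confidence ones."""
    kept = [text for _bbox, text, confidence in results if confidence >= min_confidence]
    return " ".join(kept)


# Hand-written stand-in for OCR output. Real results would come from a call such as
#   easyocr.Reader(["hi", "en"]).readtext(image_array)
sample = [
    ([[0, 0], [10, 0], [10, 5], [0, 5]], "नमस्ते", 0.92),
    ([[12, 0], [30, 0], [30, 5], [12, 5]], "दुनिया", 0.88),
    ([[0, 8], [5, 8], [5, 12], [0, 12]], "~", 0.05),  # noise, filtered out
]
print(join_ocr_fragments(sample))
```

The threshold value is arbitrary. Lowering it keeps more text at the cost of more noise in the searchable output.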

## Installation

1. **Clone the Repository**:
    ```bash
    git clone https://github.com/SrisuryaTeja/Web-Based-Text-Extraction-and-Retrieval-System
    cd Web-Based-Text-Extraction-and-Retrieval-System
    ```


2. **Create and Activate a Virtual Environment**:
    ```bash
    python -m venv myenv
    source myenv/bin/activate  # On Windows, use: myenv\Scripts\activate
    ```


3. **Install Dependencies**:
    Install the required packages listed in the `requirements.txt` file:

    ```bash
    pip install -r requirements.txt
    ```


4. **Run the Application**:
    ```bash
    streamlit run app.py
    ```