# Sparrow Data
## Description
This module implements the data structure for Sparrow ML model fine-tuning. It uses a list of invoices to build a Hugging Face dataset.
## Install
1. Install Python dependencies
```
pip install -r requirements.txt
```
2. Install Poppler, which pdf2image requires (macOS example)
```
brew install poppler
```
3. Install Mindee docTR OCR with its dependencies
```
pip install torch torchvision torchaudio
pip install python-doctr
```
## Usage
1. Run OCR on invoices with PDF conversion to JPG
```
python run_ocr.py
```
2. Run data conversion to Sparrow format
```
python run_converter.py
```
Run Sparrow UI to annotate the documents and create key/value pairs.
3. Run the data preparation task for Donut model fine-tuning. This task creates the metadata and a Hugging Face dataset with train, validation and test splits
```
python run_donut.py
```
4. Push the dataset to Hugging Face Hub. You need a Hugging Face account and a Hugging Face Hub token. Read more: https://huggingface.co/docs/datasets/main/en/image_dataset
```
python run_donut_upload.py
```
5. Test the dataset by fetching it from Hugging Face Hub with load_dataset
```
python run_donut_test.py
```
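The metadata created in step 3 is not shown in this README. As a rough illustration only, the sketch below assumes the `metadata.jsonl` layout commonly used for Donut fine-tuning (one JSON object per image, with a `file_name` and a `ground_truth` string wrapping a `gt_parse` object) and a simple shuffled split. The function names, the 80/10/10 ratios and the sample fields are hypothetical and are not taken from `run_donut.py`.

```python
import json
import random


def make_metadata_line(file_name, fields):
    """Serialize one annotated document as a metadata.jsonl line (assumed layout)."""
    # Donut-style datasets store the ground truth as a JSON string, not a nested object.
    ground_truth = json.dumps({"gt_parse": fields})
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth})


def split_documents(names, train=0.8, validation=0.1, seed=42):
    """Shuffle names deterministically and cut them into train/validation/test."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train)
    n_val = int(len(names) * validation)
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # whatever remains
    }


if __name__ == "__main__":
    # Hypothetical key/value pairs, as they might come out of annotation.
    print(make_metadata_line("invoice_0.jpg", {"invoice_no": "12345", "total": "100.00"}))
    splits = split_documents([f"invoice_{i}.jpg" for i in range(10)])
    print({name: len(items) for name, items in splits.items()})
```

A fixed seed keeps the splits reproducible between runs, which makes it easier to compare fine-tuning results.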
## FastAPI Service
Set environment variables in **set_env_vars.sh**
1. Run the service locally
```
cd api
```
```
RUN_LOCALLY=true ./start.sh
```
2. FastAPI Swagger
```
http://127.0.0.1:8000/api/v1/sparrow-data/docs
```
**Run in Docker container**
1. Build Docker image
```
docker build --tag katanaml/sparrow-data .
```
2. Run Docker container
```
docker run -e RUN_LOCALLY=true -it --name sparrow-data -p 7860:7860 katanaml/sparrow-data:latest
```
## Endpoints
1. Info
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/dataset_info' \
-H 'accept: application/json'
```
Replace URL with your own
2. Ground truth
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/ground_truth' \
-H 'accept: application/json'
```
Replace URL with your own
3. OCR service
```
curl -X 'POST' \
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=' \
-F 'image_url=https://raw.githubusercontent.com/katanaml/sparrow/main/sparrow-data/docs/input/invoices/processed/images/invoice_10.jpg' \
-F 'post_processing=false' \
-F 'sparrow_key=your_key'
```
Replace URL with your own
4. OCR statistics
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/statistics' \
-H 'accept: application/json'
```
Replace URL with your own
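The same calls can be made from Python. The sketch below uses only the standard library and mirrors the `dataset_info` and `ocr` curl commands above. It builds the requests without sending them, and the helper names (`dataset_info_request`, `encode_multipart`, `ocr_request`) are illustrative and not part of Sparrow.

```python
import urllib.request
import uuid

# Placeholder base URL taken from the examples above; replace with your own.
BASE_URL = "https://katanaml-org-sparrow-data.hf.space"


def dataset_info_request(base_url=BASE_URL):
    """Build (but do not send) the GET request for dataset_info."""
    url = f"{base_url}/api-dataset/v1/sparrow-data/dataset_info"
    return urllib.request.Request(url, headers={"accept": "application/json"}, method="GET")


def encode_multipart(fields):
    """Encode plain text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def ocr_request(image_url, sparrow_key, post_processing=False, base_url=BASE_URL):
    """Build the POST request for the OCR endpoint with the same form fields as the curl call."""
    fields = {
        "file": "",  # left empty because the image is passed by URL
        "image_url": image_url,
        "post_processing": "true" if post_processing else "false",
        "sparrow_key": sparrow_key,
    }
    body, content_type = encode_multipart(fields)
    url = f"{base_url}/api-ocr/v1/sparrow-data/ocr"
    headers = {"accept": "application/json", "Content-Type": content_type}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = ocr_request("https://example.com/invoice.jpg", "your_key")
    print(req.get_method(), req.full_url)
    # To actually call the service, uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```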
## Endpoints - ChatGPT Plugin
1. Get OCR content for receipt
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_by_id?receipt_id=34563&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
2. Post Receipt JSON content to DB
```
curl -X 'POST' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/store_receipt_db' \
-H 'accept: application/json' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-d 'chatgpt_user=user&receipt_id=12345&receipt_content=%7Breceipt%7D&sparrow_key=your_key'
```
Replace URL with your own
3. Get receipt JSON from DB by ID
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=12345&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
4. Delete receipt JSON from DB by ID
```
curl -X 'DELETE' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=13456&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
5. Get all IDs for receipts stored in DB
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_ids_by_user?chatgpt_user=user&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
6. Get all receipts content stored in DB
```
curl -X 'GET' \
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_content_by_user?chatgpt_user=user&sparrow_key=your_key' \
-H 'accept: application/json'
```
Replace URL with your own
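The query strings and the form body in the examples above are URL-encoded; for instance, `%7Breceipt%7D` is the encoded form of `{receipt}`. A minimal sketch of producing them with Python's standard library follows. The helper names are hypothetical, and sending the receipt as a JSON string is an assumption based on the "Receipt JSON content" wording above.

```python
import json
from urllib.parse import urlencode


def store_receipt_body(chatgpt_user, receipt_id, receipt, sparrow_key):
    """Build the application/x-www-form-urlencoded body for store_receipt_db."""
    # Accept either a ready-made string or a dict that is serialized to JSON (assumption).
    content = receipt if isinstance(receipt, str) else json.dumps(receipt)
    return urlencode({
        "chatgpt_user": chatgpt_user,
        "receipt_id": receipt_id,
        "receipt_content": content,
        "sparrow_key": sparrow_key,
    })


def receipt_query_url(base_url, endpoint, **params):
    """Append URL-encoded query parameters to a ChatGPT plugin endpoint."""
    return f"{base_url}/api-chatgpt-plugin/v1/sparrow-data/{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    print(store_receipt_body("user", 12345, "{receipt}", "your_key"))
    print(receipt_query_url(
        "https://katanaml-org-sparrow-data.hf.space",  # replace with your own
        "receipt_db_by_id",
        chatgpt_user="user",
        receipt_id=12345,
        sparrow_key="your_key",
    ))
```

Letting `urlencode` handle escaping avoids hand-editing characters such as braces, quotes and spaces in the receipt content.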
## CLI
Navigate to the `cli` folder and run `chmod +x sparrowdata`. Add the folder to your system PATH to make the command available globally.
1. OCR
```
./sparrowdata --api_url https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr \
--file_path ../docs/models/donut/data/img/test/invoice_2.jpg \
--post_processing false \
--sparrow_key your_key
```
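To run the CLI over several images, it can be driven from a small Python script. This is only a sketch: it assumes nothing about the CLI beyond the four flags shown above, and `build_ocr_command` is a hypothetical helper.

```python
import shlex
import subprocess


def build_ocr_command(api_url, file_path, sparrow_key, post_processing=False, cli="./sparrowdata"):
    """Assemble the argument list for one sparrowdata OCR call."""
    return [
        cli,
        "--api_url", api_url,
        "--file_path", file_path,
        "--post_processing", "true" if post_processing else "false",
        "--sparrow_key", sparrow_key,
    ]


if __name__ == "__main__":
    cmd = build_ocr_command(
        "https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr",
        "../docs/models/donut/data/img/test/invoice_2.jpg",
        "your_key",
    )
    print(shlex.join(cmd))  # shows the equivalent shell command
    # To execute it from the cli folder, uncomment:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.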
## Deploy to Hugging Face Spaces
1. Create a new Space at https://huggingface.co/spaces and follow the instructions in its README
2. Create a `huggingface_key` secret in the Space settings
3. In config.py, replace the `huggingface_key` variable with this line of code
```
huggingface_key: str = os.environ.get("huggingface_key")
```
4. Commit and push the code to the Space, following the README instructions. The Docker container will be deployed automatically. Example:
```
https://huggingface.co/spaces/katanaml-org/sparrow-data
```
5. The Sparrow Data API will be accessible at a URL that you can get from the Space info. Example:
```
https://katanaml-org-sparrow-data.hf.space/api/v1/sparrow-data/docs
```
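Note that `os.environ.get("huggingface_key")` from step 3 returns `None` when the secret is missing, which may only surface later as a confusing authentication error. A minimal sketch of failing early instead is shown below; `load_huggingface_key` is a hypothetical helper and is not part of `config.py`.

```python
import os


def load_huggingface_key(env=None):
    """Return the huggingface_key secret, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("huggingface_key")
    if not key:
        raise RuntimeError(
            "huggingface_key is not set; add it as a secret in the Space settings"
        )
    return key


if __name__ == "__main__":
    try:
        load_huggingface_key()
        print("huggingface_key found")
    except RuntimeError as err:
        print(f"Configuration problem: {err}")
```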
## MongoDB connection
If `post_processing` is set to true, OCR results are saved to MongoDB. You need a MongoDB Atlas account and a MongoDB Atlas token. Read more: https://docs.atlas.mongodb.com/configure-api-access/
1. Set the environment variable for the MongoDB Atlas connection before starting the FastAPI service
```
export MONGODB_URL="mongodb+srv://sparrow:<password>@<url>/?retryWrites=true&w=majority"
```
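If the password contains characters such as `@`, `/` or `:`, it has to be percent-encoded before it goes into the connection string. A small sketch of building the value for `MONGODB_URL` is shown below; `build_mongodb_url` and the host name are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongodb_url(user, password, host):
    """Build a mongodb+srv connection string with percent-encoded credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}"
        "/?retryWrites=true&w=majority"
    )


if __name__ == "__main__":
    # Hypothetical host; use the one shown for your Atlas cluster.
    print(build_mongodb_url("sparrow", "p@ss/word", "cluster0.example.mongodb.net"))
```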
## Dataset info
- [Samples of electronic invoices](https://data.mendeley.com/datasets/tnj49gpmtz)
- [Receipts](https://www.kaggle.com/jenswalter/receipts)
- [SROIE](https://github.com/zzzDavid/ICDAR-2019-SROIE)
## Author
[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)
## License
Licensed under the Apache License, Version 2.0. Copyright 2020-2023 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).