Spaces:
Runtime error
Runtime error
# Sparrow Data | |
## Description | |
This module implements data structure for Sparrow ML model fine-tuning. We are using list of invoices to build Hugging Face dataset. | |
## Install | |
1. Install | |
``` | |
pip install -r requirements.txt | |
``` | |
2. Install Poppler, required for pdf2image to work (macos example) | |
``` | |
brew install poppler | |
``` | |
3. Mindee docTR OCR installation with dependencies | |
``` | |
pip install torch torchvision torchaudio | |
pip install python-doctr | |
``` | |
## Usage | |
1. Run OCR on invoices with PDF conversion to JPG | |
``` | |
python run_ocr.py | |
``` | |
2. Run data conversion to Sparrow format | |
``` | |
python run_converter.py | |
``` | |
Run Sparrow UI to annotate the documents and create key/value pairs. | |
3. Run data preparation task for Donut model fine-tuning. This task will create metadata. It will create Hugging Face dataset with train, validation and test splits for Donut model fine-tuning | |
``` | |
python run_donut.py | |
``` | |
4. Push dataset to Huggung Face Hub. You need to have Hugging Face account and Hugging Face Hub token. Read more: https://huggingface.co/docs/datasets/main/en/image_dataset | |
``` | |
python run_donut_upload.py | |
``` | |
5. Test dataset by using load_dataset and fetching data from Hugging Face Hub | |
``` | |
python run_donut_test.py | |
``` | |
## FastAPI Service | |
Set environment variables in **set_env_vars.sh** | |
1. Run | |
``` | |
cd api | |
``` | |
``` | |
RUN_LOCALLY=true ./start.sh | |
``` | |
2. FastAPI Swagger | |
``` | |
http://127.0.0.1:8000/api/v1/sparrow-data/docs | |
``` | |
**Run in Docker container** | |
1. Build Docker image | |
``` | |
docker build --tag katanaml/sparrow-data . | |
``` | |
2. Run Docker container | |
``` | |
docker run -e RUN_LOCALLY=true -it --name sparrow-data -p 7860:7860 katanaml/sparrow-data:latest | |
``` | |
## Endpoints | |
1. Info | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/dataset_info' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
2. Ground truth | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/ground_truth' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
3. OCR service | |
``` | |
curl -X 'POST' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr' \ | |
-H 'accept: application/json' \ | |
-H 'Content-Type: multipart/form-data' \ | |
-F 'file=' \ | |
-F 'image_url=https://raw.githubusercontent.com/katanaml/sparrow/main/sparrow-data/docs/input/invoices/processed/images/invoice_10.jpg' \ | |
-F 'post_processing=false' \ | |
-F 'sparrow_key=your_key' | |
``` | |
Replace URL with your own | |
4. OCR statistics | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/statistics' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
## Endpoints - ChatGPT Plugin | |
1. Get OCR content for receipt | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_by_id?receipt_id=34563&sparrow_key=your_key' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
2. Post Receipt JSON content to DB | |
``` | |
curl -X 'POST' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/store_receipt_db' \ | |
-H 'accept: application/json' \ | |
-H 'Content-Type: application/x-www-form-urlencoded' \ | |
-d 'chatgpt_user=user&receipt_id=12345&receipt_content=%7Breceipt%7D&sparrow_key=your_key' | |
``` | |
Replace URL with your own | |
3. Get receipt JSON from DB by ID | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=12345&sparrow_key=your_key' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
4. Delete receipt JSON from DB by ID | |
``` | |
curl -X 'DELETE' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=13456&sparrow_key=your_key' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
5. Get all IDs for receipts stored in DB | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_ids_by_user?chatgpt_user=user&sparrow_key=your_key' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
6. Get all receipts content stored in DB | |
``` | |
curl -X 'GET' \ | |
'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_content_by_user?chatgpt_user=user&sparrow_key=your_key' \ | |
-H 'accept: application/json' | |
``` | |
Replace URL with your own | |
## CLI | |
Navigate to 'cli' folder and run 'chmod +x sparrowdata'. Add to system path to make it executable globally on the system. | |
1. OCR | |
``` | |
./sparrowdata --api_url https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr \ | |
--file_path ../docs/models/donut/data/img/test/invoice_2.jpg \ | |
--post_processing false \ | |
--sparrow_key your_key | |
``` | |
## Deploy to Hugging Face Spaces | |
1. Create new space - https://huggingface.co/spaces. Follow instructions from readme doc | |
2. Create huggingface_key secret in space settings | |
3. In config.py, replace huggingface_key variable with this line of code | |
``` | |
huggingface_key: str = os.environ.get("huggingface_key") | |
``` | |
4. Commit and push code to the space, follow readme instructions. Docker container will be deployed automatically. Example: | |
``` | |
https://huggingface.co/spaces/katanaml-org/sparrow-data | |
``` | |
5. Sparrow Data API will be accessible by URL, you can get it from space info. Example: | |
``` | |
https://katanaml-org-sparrow-data.hf.space/api/v1/sparrow-data/docs | |
``` | |
## MongoDB connection | |
If post_processing is set to True, then OCR results will be saved to MongoDB. You need to have MongoDB Atlas account and MongoDB Atlas token. Read more: https://docs.atlas.mongodb.com/configure-api-access/ | |
1. Set environment variable for MongoDB Atlas connection, before starting FastAPI service | |
``` | |
export MONGODB_URL="mongodb+srv://sparrow:<password>@<url>/?retryWrites=true&w=majority" | |
``` | |
## Dataset info | |
- [Samples of electronic invoices](https://data.mendeley.com/datasets/tnj49gpmtz) | |
- [Receipts](https://www.kaggle.com/jenswalter/receipts) | |
- [SROIE](https://github.com/zzzDavid/ICDAR-2019-SROIE) | |
## Author | |
[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai) | |
## License | |
Licensed under the Apache License, Version 2.0. Copyright 2020-2023 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE). | |