ITNovaML committed on
Commit bfe03ac
1 Parent(s): 3d2a3a3

Upload 8 files
README.MD ADDED
# Sparrow Data

## Description

This module implements the data structure for Sparrow ML model fine-tuning. We use a list of invoices to build a Hugging Face dataset.

## Install

1. Install the Python dependencies

```
pip install -r requirements.txt
```

2. Install Poppler, required for pdf2image to work (macOS example)

```
brew install poppler
```

3. Install Mindee docTR OCR with its dependencies

```
pip install torch torchvision torchaudio
pip install python-doctr
```
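To verify the OCR setup, here is a minimal docTR sketch (the detection and recognition model names match those used by `run_ocr.py`; the image path is a placeholder):

```
# Minimal docTR check; assumes the docTR 0.6 API pinned in requirements.txt.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Same detection/recognition pair as run_ocr.py; downloads weights on first use
model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)

# Placeholder path - point it at any converted invoice JPG
doc = DocumentFile.from_images(['docs/input/invoices/processed/images/invoice_10.jpg'])
result = model(doc)
print(result.export())  # nested dict of pages, blocks, lines and words
```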
## Usage

1. Run OCR on invoices, with PDF conversion to JPG

```
python run_ocr.py
```

2. Run data conversion to Sparrow format

```
python run_converter.py
```

Run Sparrow UI to annotate the documents and create key/value pairs.

3. Run the data preparation task for Donut model fine-tuning. This task creates metadata and builds a Hugging Face dataset with train, validation, and test splits for Donut model fine-tuning

```
python run_donut.py
```

4. Push the dataset to the Hugging Face Hub. You need a Hugging Face account and a Hugging Face Hub token. Read more: https://huggingface.co/docs/datasets/main/en/image_dataset. A rough sketch of the upload flow is shown below.

```
python run_donut_upload.py
```
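The upload itself is handled by `DonutDatasetUploader`; conceptually it follows the imagefolder flow from the Hugging Face docs linked above. A hedged sketch (the dataset name comes from `run_donut_upload.py`; the exact loader used internally may differ):

```
# Hedged sketch of the imagefolder upload flow; the actual logic lives in
# tools/donut/dataset_uploader.py and may differ from this.
from datasets import load_dataset

# Expects split folders with images plus metadata.jsonl files, per the HF docs
dataset = load_dataset("imagefolder", data_dir="docs/models/donut/data")
dataset.push_to_hub("katanaml-org/invoices-donut-data-v1")  # requires a Hub token
```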
5. Test the dataset by using load_dataset to fetch the data from the Hugging Face Hub

```
python run_donut_test.py
```
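Under the hood this is only a few lines of `datasets` code (the dataset name comes from `run_donut_test.py`):

```
# Fetch the published dataset from the Hugging Face Hub and inspect it.
from datasets import load_dataset

dataset = load_dataset("katanaml-org/invoices-donut-data-v1")
print(dataset)  # shows the train/validation/test splits and their features
```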
## FastAPI Service

Set environment variables in **set_env_vars.sh**

1. Run

```
cd api
```

```
RUN_LOCALLY=true ./start.sh
```

2. FastAPI Swagger

```
http://127.0.0.1:8000/api/v1/sparrow-data/docs
```

**Run in Docker container**

1. Build the Docker image

```
docker build --tag katanaml/sparrow-data .
```

2. Run the Docker container

```
docker run -e RUN_LOCALLY=true -it --name sparrow-data -p 7860:7860 katanaml/sparrow-data:latest
```

## Endpoints

1. Info

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/dataset_info' \
  -H 'accept: application/json'
```

Replace the URL with your own

2. Ground truth

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-dataset/v1/sparrow-data/ground_truth' \
  -H 'accept: application/json'
```

Replace the URL with your own

3. OCR service

```
curl -X 'POST' \
  'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=' \
  -F 'image_url=https://raw.githubusercontent.com/katanaml/sparrow/main/sparrow-data/docs/input/invoices/processed/images/invoice_10.jpg' \
  -F 'post_processing=false' \
  -F 'sparrow_key=your_key'
```

Replace the URL with your own; a Python equivalent is shown below.
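For scripting, the same call can be made with Python's `requests` library (an assumption - `requests` is not in requirements.txt; the URL and key are placeholders):

```
# Hedged Python equivalent of the curl call above; assumes `requests` is installed.
import requests

response = requests.post(
    "https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr",
    files={
        "file": (None, ""),  # empty file field, as in the curl example
        "image_url": (None, "https://raw.githubusercontent.com/katanaml/sparrow/main/sparrow-data/docs/input/invoices/processed/images/invoice_10.jpg"),
        "post_processing": (None, "false"),
        "sparrow_key": (None, "your_key"),
    },
)
print(response.json())
```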
4. OCR statistics

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/statistics' \
  -H 'accept: application/json'
```

Replace the URL with your own

## Endpoints - ChatGPT Plugin

1. Get OCR content for a receipt

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_by_id?receipt_id=34563&sparrow_key=your_key' \
  -H 'accept: application/json'
```

Replace the URL with your own

2. Post receipt JSON content to the DB

```
curl -X 'POST' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/store_receipt_db' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'chatgpt_user=user&receipt_id=12345&receipt_content=%7Breceipt%7D&sparrow_key=your_key'
```

Replace the URL with your own; a Python equivalent is shown below.
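The same form-encoded POST from Python (again assuming `requests`; `%7Breceipt%7D` in the curl example is just the URL-encoded placeholder `{receipt}`):

```
# Hedged Python equivalent of the form-encoded curl call above.
import requests

response = requests.post(
    "https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/store_receipt_db",
    data={
        "chatgpt_user": "user",
        "receipt_id": "12345",
        "receipt_content": "{receipt}",  # placeholder payload, as in the curl example
        "sparrow_key": "your_key",
    },
)
print(response.json())
```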
3. Get receipt JSON from the DB by ID

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=12345&sparrow_key=your_key' \
  -H 'accept: application/json'
```

Replace the URL with your own

4. Delete receipt JSON from the DB by ID

```
curl -X 'DELETE' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_by_id?chatgpt_user=user&receipt_id=13456&sparrow_key=your_key' \
  -H 'accept: application/json'
```

Replace the URL with your own

5. Get all IDs for receipts stored in the DB

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_ids_by_user?chatgpt_user=user&sparrow_key=your_key' \
  -H 'accept: application/json'
```

Replace the URL with your own

6. Get all receipt content stored in the DB

```
curl -X 'GET' \
  'https://katanaml-org-sparrow-data.hf.space/api-chatgpt-plugin/v1/sparrow-data/receipt_db_content_by_user?chatgpt_user=user&sparrow_key=your_key' \
  -H 'accept: application/json'
```

Replace the URL with your own

## CLI

Navigate to the 'cli' folder and run 'chmod +x sparrowdata'. Add it to the system path to make it executable globally.

1. OCR

```
./sparrowdata --api_url https://katanaml-org-sparrow-data.hf.space/api-ocr/v1/sparrow-data/ocr \
  --file_path ../docs/models/donut/data/img/test/invoice_2.jpg \
  --post_processing false \
  --sparrow_key your_key
```

## Deploy to Hugging Face Spaces

1. Create a new space - https://huggingface.co/spaces. Follow the instructions from the readme doc

2. Create a huggingface_key secret in the space settings

3. In config.py, replace the huggingface_key variable with this line of code

```
huggingface_key: str = os.environ.get("huggingface_key")
```

4. Commit and push the code to the space, following the readme instructions. The Docker container will be deployed automatically. Example:

```
https://huggingface.co/spaces/katanaml-org/sparrow-data
```

5. The Sparrow Data API will be accessible by URL; you can get it from the space info. Example:

```
https://katanaml-org-sparrow-data.hf.space/api/v1/sparrow-data/docs
```

## MongoDB connection

If post_processing is set to True, OCR results will be saved to MongoDB. You need a MongoDB Atlas account and a MongoDB Atlas token. Read more: https://docs.atlas.mongodb.com/configure-api-access/

1. Set the environment variable for the MongoDB Atlas connection before starting the FastAPI service

```
export MONGODB_URL="mongodb+srv://sparrow:<password>@<url>/?retryWrites=true&w=majority"
```
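To check the connection outside the service, a quick `pymongo` probe can help (an assumption - pymongo is not in requirements.txt, and the service may use a different driver):

```
# Hedged connectivity check; assumes pymongo is installed separately.
import os
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URL"])
print(client.server_info()["version"])  # raises an exception if the connection fails
```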
## Dataset info

- [Samples of electronic invoices](https://data.mendeley.com/datasets/tnj49gpmtz)
- [Receipts](https://www.kaggle.com/jenswalter/receipts)
- [SROIE](https://github.com/zzzDavid/ICDAR-2019-SROIE)

## Author

[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)

## License

Licensed under the Apache License, Version 2.0. Copyright 2020-2023 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).
requirements.txt ADDED
```
pdf2image==1.16.2
torch==1.13.1
torchvision
torchaudio
datasets==2.10.1
fastapi==0.96.0
python-doctr==0.6.0
paddleocr==2.6.1.3
paddlepaddle==2.4.2
uvicorn[standard]
rapidfuzz<3.0
```
run_converter.py ADDED
```
from tools.data_converter import DataConverter
import os
import shutil


def main():
    # Convert to Sparrow format
    data_converter = DataConverter()
    data_converter.convert_to_sparrow_format('docs/input/invoices/processed/ocr',
                                             'docs/input/invoices/processed/output')

    # define the source and destination directory
    src_dir = 'docs/input/invoices/processed/output'
    dst_dir = '../sparrow-ui/docs/json'

    # Get list of files in source directory
    files = os.listdir(src_dir)

    # Loop through all files in source directory and copy to destination directory
    for f in files:
        src_file = os.path.join(src_dir, f)
        dst_file = os.path.join(dst_dir, f)
        shutil.copy(src_file, dst_file)


if __name__ == '__main__':
    main()
```
run_donut.py ADDED
```
from tools.donut.metadata_generator import DonutMetadataGenerator
from tools.donut.dataset_generator import DonutDatasetGenerator
from pathlib import Path
import os
import shutil


def main():
    # define the source and destination directories
    src_dir_json = '../sparrow-ui/docs/json/key'
    src_dir_img = '../sparrow-ui/docs/images'
    dst_dir_json = 'docs/models/donut/data/key'
    dst_dir_img = 'docs/models/donut/data/key/img'

    # copy JSON files from src to dst
    files = os.listdir(src_dir_json)
    for f in files:
        src_file = os.path.join(src_dir_json, f)
        dst_file = os.path.join(dst_dir_json, f)
        shutil.copy(src_file, dst_file)

    # copy images from src to dst
    files = os.listdir(src_dir_img)
    for f in files:
        # copy the img file only if a file with the same name exists in dst_dir_json
        if os.path.isfile(os.path.join(dst_dir_json, f[:-4] + '.json')):
            src_file = os.path.join(src_dir_img, f)
            dst_file = os.path.join(dst_dir_img, f)
            shutil.copy(src_file, dst_file)

    # Convert to Donut format
    base_path = 'docs/models/donut/data'
    data_dir_path = Path(base_path).joinpath("key")
    files = data_dir_path.glob("*.json")
    files_list = [file for file in files]
    # split files_list into 3 parts: 85% train, 10% validation, 5% test
    train_files_list = files_list[:int(len(files_list) * 0.85)]
    print("Train set size:", len(train_files_list))
    validation_files_list = files_list[int(len(files_list) * 0.85):int(len(files_list) * 0.95)]
    print("Validation set size:", len(validation_files_list))
    test_files_list = files_list[int(len(files_list) * 0.95):]
    print("Test set size:", len(test_files_list))

    metadata_generator = DonutMetadataGenerator()
    metadata_generator.generate(base_path, train_files_list, "train")
    metadata_generator.generate(base_path, validation_files_list, "validation")
    metadata_generator.generate(base_path, test_files_list, "test")

    # Generate dataset
    dataset_generator = DonutDatasetGenerator()
    dataset_generator.generate(base_path)


if __name__ == '__main__':
    main()
```
run_donut_data_generator.py ADDED
```
import cv2


def main():
    # Helper to create additional synthetic samples by duplicating existing files.
    # The commented-out block below duplicates a JSON annotation the same way.
    # file_name = "docs/models/donut/data/key/invoice_0.json"
    # for i in range(2, 250):
    #     with open(file_name, "r") as file:
    #         # create new file name
    #         new_file_name = file_name.replace("invoice_0", f"invoice_{i}")
    #         # open new file
    #         with open(new_file_name, "w") as outfile:
    #             # write to new file
    #             outfile.write(file.read())

    # duplicate a test image to generate invoice_250..invoice_499
    file_name = "docs/models/donut/data/img/test/invoice_1.jpg"
    img = cv2.imread(file_name)
    for i in range(250, 500):
        new_file_name = file_name.replace("invoice_1", f"invoice_{i}")
        cv2.imwrite(new_file_name, img)


if __name__ == '__main__':
    main()
```
run_donut_test.py ADDED
```
from tools.donut.dataset_tester import DonutDatasetTester


def main():
    dataset_tester = DonutDatasetTester()
    dataset_tester.test("katanaml-org/invoices-donut-data-v1")


if __name__ == '__main__':
    main()
```
run_donut_upload.py ADDED
```
from tools.donut.dataset_uploader import DonutDatasetUploader


def main():
    dataset_uploader = DonutDatasetUploader()
    dataset_uploader.upload('docs/models/donut/data', "katanaml-org/invoices-donut-data-v1")


if __name__ == '__main__':
    main()
```
run_ocr.py ADDED
```
from tools.pdf_converter import PDFConverter
from tools.ocr_extractor import OCRExtractor
import os
import shutil


def main():
    # Convert pdf to jpg
    pdf_converter = PDFConverter()
    pdf_converter.convert_to_jpg('docs/input/invoices/Dataset with valid information',
                                 'docs/input/invoices/processed/images')

    # define the source and destination directory
    src_dir = 'docs/input/invoices/processed/images'
    dst_dir = '../sparrow-ui/docs/images'

    # Get list of files in source directory
    files = os.listdir(src_dir)

    # Loop through all files in source directory and copy to destination directory
    for f in files:
        src_file = os.path.join(src_dir, f)
        dst_file = os.path.join(dst_dir, f)
        shutil.copy(src_file, dst_file)

    # Run OCR with docTR detection and recognition models
    ocr_extractor = OCRExtractor('db_resnet50', 'crnn_vgg16_bn', pretrained=True)
    ocr_extractor.extract('docs/input/invoices/processed', show_prediction=False)


if __name__ == '__main__':
    main()
```