metadata
title: camelot-pg
app_file: src/app/run.py
sdk: gradio
sdk_version: 4.32.2
PDF Table Parser
This script extracts tables from PDF files and saves them as CSV files. It supports a command-line interface (CLI) for batch processing and provides an optional web UI for interactive processing.
Features
- Multi-page PDF support
- Progress display per row, per page, and per file
- CSV output encoded as UTF-8 with BOM
- Customizable edge and row tolerances for table detection
- Optional web UI for interactive processing using Gradio
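As an illustration of the UTF-8 with BOM output mentioned above, the following minimal sketch writes rows with Python's standard `csv` module. The function name `write_csv_with_bom` is hypothetical and not taken from the script, which may produce its CSV files differently (for example through polars).

```python
import csv
from typing import Iterable, Sequence


def write_csv_with_bom(path: str, rows: Iterable[Sequence[str]], delimiter: str = ",") -> None:
    """Write rows to a CSV file encoded as UTF-8 with a byte order mark.

    Hypothetical helper for illustration only.
    """
    # The "utf-8-sig" codec prepends the BOM bytes EF BB BF to the file.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
```

The BOM mainly helps spreadsheet applications such as Excel detect UTF-8 when opening the file directly.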
Installation
- Clone the repository or download the script.
- Install the required dependencies:
pip install rich camelot-py polars gradio gradio_pdf
Usage
Command-Line Interface (CLI)
To run the script via CLI, use the following command:
python src/app/parser.py input1.pdf input2.pdf output1.csv output2.csv
Arguments:
- input_files: List of input PDF files
- output_files: List of output CSV files (must match the number of input files)
Optional Arguments:
- --delimiter: Output file delimiter (default: ,)
- --edge_tol: Tolerance parameter used to specify the distance between text and table edges (default: 50)
- --row_tol: Tolerance parameter used to specify the distance between table rows (default: 10)
- --webui: Launch the web UI
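The arguments above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the script's actual parser: the README does not say how inputs are told apart from outputs, so the sketch assumes a single positional list whose first half holds the PDFs and whose second half holds the CSVs. The names `build_parser` and `split_files` are hypothetical.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Extract tables from PDF files to CSV.")
    # Assumption: inputs and outputs arrive as one positional list.
    parser.add_argument("files", nargs="+", help="input PDFs followed by as many output CSVs")
    parser.add_argument("--delimiter", default=",", help="output file delimiter")
    parser.add_argument("--edge_tol", type=int, default=50, help="text-to-table-edge tolerance")
    parser.add_argument("--row_tol", type=int, default=10, help="row grouping tolerance")
    parser.add_argument("--webui", action="store_true", help="launch the web UI")
    return parser


def split_files(files: List[str]) -> Tuple[List[str], List[str]]:
    """Split the positional list into (inputs, outputs) of equal length."""
    if len(files) % 2 != 0:
        raise ValueError("the number of output files must match the number of input files")
    half = len(files) // 2
    return files[:half], files[half:]
```

With this layout, `input1.pdf input2.pdf output1.csv output2.csv` yields two inputs paired in order with two outputs, and an odd number of paths is rejected.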
Web UI
To run the script with the web UI, use the following command:
python src/app/run.py
This will launch a Gradio-based web application where you can upload PDFs and view the extracted tables interactively.
Example
CLI Example
python src/app/parser.py data/demo.pdf data/output.csv --delimiter ";" --edge_tol 60 --row_tol 40
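For orientation, `edge_tol` and `row_tol` are also the names of tuning parameters that Camelot's `read_pdf` accepts for its stream flavor. The sketch below shows how the values from the example might reach Camelot. It assumes the script uses `flavor="stream"` and processes all pages, neither of which this README confirms, and the helper names `stream_kwargs` and `extract_tables` are hypothetical.

```python
from typing import Any, Dict, List


def stream_kwargs(edge_tol: int = 50, row_tol: int = 10) -> Dict[str, Any]:
    """Keyword arguments for camelot.read_pdf (assumed stream flavor, all pages)."""
    return {"pages": "all", "flavor": "stream", "edge_tol": edge_tol, "row_tol": row_tol}


def extract_tables(pdf_path: str, edge_tol: int = 50, row_tol: int = 10) -> List[Any]:
    """Return one pandas DataFrame per table that Camelot detects in the PDF."""
    # Imported lazily so stream_kwargs stays usable without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, **stream_kwargs(edge_tol, row_tol))
    return [table.df for table in tables]
```

Under these assumptions, the CLI example corresponds to `extract_tables("data/demo.pdf", edge_tol=60, row_tol=40)`.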
License
This project is licensed under the MIT License.