---
title: README
emoji: 💻
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
---
# Hi there 👋
StabRise - Document Processing Solutions
# Our projects
## PDF DataSource for Apache Spark
<a href="https://stabrise.com/spark-pdf/"><img alt="Spark Pdf" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/16/d6/16d6a0d6-f162-42ad-a5a3-7dc20361ad24/sparkpdf.png__1000x300_subsampling-2.webp" height="120"></a>
---
**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)
**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)
**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)
---
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
### Key features:
- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Handle large files, up to 10k pages
- Support scanned PDF files via OCR
- No need to install Tesseract OCR separately; it is included in the package
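As a rough illustration of how a custom Spark data source like this is typically used from PySpark, here is a minimal sketch. The format name `"pdf"` and the option keys `pagePerPartition` and `resolution` are assumptions made for this example, and `read_pdfs` is a hypothetical helper; check the project's home page and Quick Start notebook for the actual names. The only API relied on is Spark's standard `spark.read.format(...).option(...).load(...)` chain.

```python
from typing import Any, Mapping, Optional

# Assumed values: the format name and option keys below are illustrative
# and should be verified against the spark-pdf documentation.
PDF_FORMAT = "pdf"
DEFAULT_OPTIONS = {
    "pagePerPartition": "2",  # assumed key: pages handled per partition
    "resolution": "200",      # assumed key: DPI used when rendering pages
}


def read_pdfs(
    spark: Any,
    path: str,
    options: Optional[Mapping[str, str]] = None,
) -> Any:
    """Load PDF files under `path` through the PDF data source.

    `spark` is expected to be an active SparkSession that already has the
    spark-pdf package on its classpath. Caller-supplied options override
    the defaults above.
    """
    merged = {**DEFAULT_OPTIONS, **(options or {})}
    reader = spark.read.format(PDF_FORMAT)
    for key, value in merged.items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a real session, a call such as `df = read_pdfs(spark, "path/to/pdfs/*.pdf")` followed by `df.printSchema()` would show which columns the data source produces.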
## ScaleDP
<a href="https://stabrise.com/scaledp/"><img alt="ScaleDP" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/4a/7d/4a7d97c2-50d7-4b7a-9902-af2df9b574da/scaledplogo.png__1000x300_subsampling-2.webp" height="120" /></a>
---
**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)
**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)
**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)
---
ScaleDP is an open-source library for processing documents using Apache Spark.
### Key features:
- Load PDF documents/Images
- Extract text from PDF documents/Images
- Extract images from PDF documents
- OCR Images/PDF documents
- Run NER on text extracted from PDF documents/Images
- Visualize NER results
## De-Identify
<a href="https://deidentify.online"><img alt="De-Identify" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/fb/fe/fbfe4b0c-dadb-4878-88ad-1c0ece0dc053/deidentifylogo.png__1000x300_subsampling-2.webp" height="120" /></a>
De-Identify is a tool for de-identifying and anonymizing data.
### Supported formats
- text
- images
- PDF documents
- DICOM files