---
title: README
emoji: 💻
colorFrom: indigo
colorTo: indigo
sdk: static
pinned: false
---

# Hi there 👋

StabRise - Document Processing Solutions

# Our projects

## PDF DataSource for Apache Spark

<a href="https://stabrise.com/spark-pdf/"><img alt="Spark Pdf" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/16/d6/16d6a0d6-f162-42ad-a5a3-7dc20361ad24/sparkpdf.png__1000x300_subsampling-2.webp" height="120"></a>

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Home page**: [https://stabrise.com/spark-pdf/](https://stabrise.com/spark-pdf/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb](https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb)

---

The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

### Key features:

- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Support for large files, up to 10k pages
- Support for scanned PDF files (via OCR)
- No need to install Tesseract OCR separately; it is included in the package
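
As a rough illustration of the features above, the sketch below wires a PDF read through Spark's generic `DataFrameReader` API. The format name `"pdf"` and the option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) are assumptions here, as are the helper names `pdf_reader_options` and `read_pdfs`; check them against the Quick Start notebook linked above before relying on them. It also assumes the spark-pdf package is already on the Spark classpath.

```python
from typing import Any, Dict


def pdf_reader_options(resolution: int = 200,
                       pages_per_partition: int = 2) -> Dict[str, str]:
    """Build reader options as strings.

    Hypothetical helper: the option names below are assumed, not confirmed
    by this README.
    """
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "imageType": "BINARY",                         # assumed option name
        "resolution": str(resolution),                 # assumed: render DPI for OCR
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per partition
        "reader": "pdfBox",                            # assumed option name and value
    }


def read_pdfs(spark: Any, path: str, **kwargs: int) -> Any:
    """Return whatever the reader's load() gives back for the PDF(s) at `path`.

    `spark` is expected to be a SparkSession; only the standard
    read.format(...).option(...).load(...) chain is used.
    """
    reader = spark.read.format("pdf")  # assumed data source short name
    for key, value in pdf_reader_options(**kwargs).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a live session this would be called as `read_pdfs(spark, "path/to/files/*.pdf", resolution=300)`. Because Spark DataFrames are evaluated lazily, no pages should be processed until an action such as `show()` or `count()` runs.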

## ScaleDP

<a href="https://stabrise.com/scaledp/"><img alt="ScaleDP" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/4a/7d/4a7d97c2-50d7-4b7a-9902-af2df9b574da/scaledplogo.png__1000x300_subsampling-2.webp" height="120" /></a>

---

**Source Code**: [https://github.com/StabRise/scaledp](https://github.com/StabRise/scaledp)

**Home page**: [https://stabrise.com/scaledp/](https://stabrise.com/scaledp/)

**Quick Start Jupyter Notebook**: [https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb](https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb)

---

ScaleDP is an open-source library for processing documents with Apache Spark.

### Key features:

- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results


## De-Identify

<a href="https://deidentify.online"><img alt="De-Identify" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/fb/fe/fbfe4b0c-dadb-4878-88ad-1c0ece0dc053/deidentifylogo.png__1000x300_subsampling-2.webp" height="120" /></a>

De-Identify is a tool for de-identifying and anonymizing data.

### Supported formats:

- Text
- Images
- PDF documents
- DICOM files