Paragraph classifier

This classifier performs binary classification of text lines in PDF or scanned documents.

For each document line, it determines whether:

  • the line begins a new paragraph, or

  • the line continues the previous paragraph.
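As an illustration of how such per-line binary labels can be used downstream, here is a minimal sketch that groups lines into paragraphs. The function name `group_into_paragraphs` and the boolean label format are assumptions made for this example; they are not part of dedoc's API.

```python
from typing import List


def group_into_paragraphs(lines: List[str], starts_paragraph: List[bool]) -> List[List[str]]:
    """Group lines into paragraphs using one binary label per line.

    Hypothetical helper: starts_paragraph[i] is True when line i begins a new
    paragraph and False when it continues the previous one.
    """
    if len(lines) != len(starts_paragraph):
        raise ValueError("lines and starts_paragraph must have the same length")

    paragraphs: List[List[str]] = []
    for line, is_start in zip(lines, starts_paragraph):
        # The very first line always opens a paragraph, whatever its label says.
        if is_start or not paragraphs:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return paragraphs


if __name__ == "__main__":
    doc_lines = ["First paragraph, line one", "line two.", "Second paragraph."]
    labels = [True, False, True]
    for paragraph in group_into_paragraphs(doc_lines, labels):
        print(" ".join(paragraph))
```

Treating the first line as a paragraph start regardless of its label keeps the grouping well defined even if the classifier mislabels it.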

For each line, a feature vector is formed from the line's text and formatting; see dedoc/structure_extractors/feature_extractors/paragraph_feature_extractor.py in dedoc.
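To make the idea of a per-line feature vector concrete, the sketch below builds one from a line's text and formatting. The `Line` class, its fields, and the specific features are hypothetical choices for illustration only; the features actually used are defined in the dedoc file named above and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical line representation; not dedoc's own line class."""
    text: str
    indent: float       # left indentation, in arbitrary units
    bold: bool = False


def line_features(line: Line, prev: Optional[Line]) -> List[float]:
    """Build an illustrative feature vector from text and formatting."""
    stripped = line.text.strip()
    prev_text = prev.text.strip() if prev is not None else ""
    return [
        # Formatting: indentation change relative to the previous line.
        float(line.indent - prev.indent) if prev is not None else 0.0,
        # Text: does the line start with an uppercase letter?
        1.0 if stripped[:1].isupper() else 0.0,
        # Text: did the previous line end with terminal punctuation?
        1.0 if prev_text.endswith((".", "!", "?", ":")) else 0.0,
        # Text: line length in characters.
        float(len(stripped)),
        # Formatting: is the line bold?
        1.0 if line.bold else 0.0,
    ]


if __name__ == "__main__":
    previous = Line("End of sentence.", indent=0.0)
    current = Line("New paragraph starts", indent=2.0)
    print(line_features(current, previous))
```

Vectors of this kind can be passed to any binary classifier; which model dedoc trains is not stated here, so none is assumed.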

  • Training data are available at the link.

  • The training script is here.
