HURIDOCS/pdf-reading-order

ali6parmak committed on Dec 1, 2023

Commit 5560bdc · 1 Parent(s): 03d5263

Update README.md

Files changed (1)

README.md +35 -0

@@ -1,3 +1,38 @@
 ---
 license: openrail
 ---

+<h3 align="center">PDF Reading Order</h3>
+<p align="center">A model for determining the correct reading order of PDF files.</p>
+This model uses features from a given PDF to determine its correct reading order.
+## Quick Start
+This model builds on two of our other models, pdf-token-type and pdf-paragraphs-extraction.
+The paragraph extraction model is used here to find and extract "figure" and "table" tokens, which reduces the complexity of a given PDF page, since figures and tables contain many tokens.
+For details on our paragraph extraction model, refer to these links:
+    https://huggingface.co/HURIDOCS/pdf-segmetation
+    https://github.com/huridocs/pdf_paragraphs_extraction.git
+You can clone the repository for this model from this link:
+    https://github.com/huridocs/pdf-reading-order
+First, the candidate selector model selects the tokens that could come next in the reading order.
+Then, we pass the best 18 tokens chosen by the candidate selector model to the reading order model.
+The reading order model decides the final reading order of the tokens.
+## Performance
+Test Accuracy: 16 mistakes / 11438 labels (99.86%)
+Average Accuracy: 431 mistakes / 184995 labels (99.77%)
+Speed: ~0.65 seconds per page.
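
The two-stage flow described under Quick Start (a candidate selector narrows the choice to 18 tokens, then a reading order model picks among them) can be illustrated with a minimal sketch. This is not the repository's code: the `Token` class, `order_tokens`, both toy scoring functions, the greedy token-by-token loop, and the choice of the top-left token as the starting point are all assumptions made for illustration. Only the two-stage idea and the limit of 18 candidates come from the README.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Token:
    """Hypothetical token: text plus the top-left corner of its box."""
    text: str
    x: float  # left edge
    y: float  # top edge, growing downward


Scorer = Callable[[Token, Token], float]


def order_tokens(
    tokens: Sequence[Token],
    candidate_score: Scorer,
    next_token_score: Scorer,
    max_candidates: int = 18,  # the README passes the best 18 tokens
) -> List[Token]:
    """Greedy two-stage ordering (an assumed control flow, not the real one)."""
    if not tokens:
        return []
    remaining = list(tokens)
    # Assumption: start from the top-most, then left-most, token.
    current = min(remaining, key=lambda t: (t.y, t.x))
    remaining.remove(current)
    ordered = [current]
    while remaining:
        # Stage 1: keep only the best-scoring candidates for the next token.
        candidates = sorted(
            remaining, key=lambda t: candidate_score(current, t), reverse=True
        )[:max_candidates]
        # Stage 2: pick the next token from that shortlist only.
        current = max(candidates, key=lambda t: next_token_score(current, t))
        remaining.remove(current)
        ordered.append(current)
    return ordered


def toy_candidate_score(current: Token, other: Token) -> float:
    """Stand-in for the candidate selector: closer tokens score higher."""
    return -math.hypot(other.x - current.x, other.y - current.y)


def toy_next_token_score(current: Token, other: Token) -> float:
    """Stand-in for the reading order model: prefer the nearest token
    that lies further right on the same line or on a later line."""
    offset = (other.y - current.y) * 1000 + (other.x - current.x)
    return -offset if offset > 0 else float("-inf")


page = [
    Token("line", 60, 20),
    Token("world", 50, 0),
    Token("second", 0, 20),
    Token("Hello", 0, 0),
]
ordered = order_tokens(page, toy_candidate_score, toy_next_token_score)
print(" ".join(token.text for token in ordered))
```

In the real project both scorers would be trained models working on PDF features. The point of the sketch is the division of labor: the second stage never sees tokens that the first stage filtered out, so shrinking `max_candidates` can change the final order.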