ali6parmak committed on
Commit 5560bdc · 1 Parent(s): 03d5263

Update README.md

Files changed (1)
  1. README.md +35 -0
README.md CHANGED
@@ -1,3 +1,38 @@
  ---
  license: openrail
  ---
+
+
+ <h3 align="center">PDF Reading Order</h3>
+ <p align="center">A model for determining the correct reading order of PDF files.</p>
+
+ This model uses features from a given PDF to determine its correct reading order.
+
+
+
+ ## Quick Start
+
+ This model builds on two of our other models, pdf-token-type and pdf-paragraphs-extraction.
+ The paragraph extraction model is used here to find and extract "figure" and "table" tokens, which reduces the complexity of a given PDF page, since figures and tables contain many tokens.
+
+ For details on our paragraph extraction model, refer to these links:
+
+ https://huggingface.co/HURIDOCS/pdf-segmetation
+ https://github.com/huridocs/pdf_paragraphs_extraction.git
+
+ You can clone the repo via this link:
+
+ https://github.com/huridocs/pdf-reading-order
+
+
+ First, the candidate selector model selects the tokens that could be the next token.
+ Then, we pass the best 18 tokens chosen by the candidate selector model to the reading order model.
+ The reading order model decides the final reading order of the tokens.
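
The two-stage flow described in the Quick Start text can be sketched in Python as follows. This is a minimal illustration, not the repository's implementation: `Token`, `candidate_score`, `reading_order_score`, and `order_tokens` are hypothetical names, both scoring functions are simple geometric heuristics standing in for the trained models, and the greedy token-by-token loop is an assumption about how the two stages are combined. Only the candidate limit of 18 comes from the README.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_CANDIDATES = 18  # from the README: the best 18 tokens are passed on


@dataclass(frozen=True)
class Token:
    """Hypothetical token: text plus the top-left corner of its box."""
    text: str
    left: float
    top: float


def candidate_score(current: Token, other: Token) -> float:
    # Stand-in for the candidate selector model (assumption):
    # tokens closer to the current token score higher.
    return -(abs(other.top - current.top) + abs(other.left - current.left))


def reading_order_score(current: Token, other: Token) -> float:
    # Stand-in for the reading order model (assumption):
    # prefer the topmost, then leftmost, candidate.
    return -(other.top * 10_000 + other.left)


def order_tokens(
    tokens: List[Token],
    first: Token,
    select: Callable[[Token, Token], float] = candidate_score,
    rank: Callable[[Token, Token], float] = reading_order_score,
) -> List[Token]:
    """Greedily build a reading order, starting from `first`."""
    remaining = [t for t in tokens if t is not first]
    ordered = [first]
    while remaining:
        current = ordered[-1]
        # Stage 1: keep only the best-scoring candidates for "next token".
        candidates = sorted(
            remaining, key=lambda t: select(current, t), reverse=True
        )[:MAX_CANDIDATES]
        # Stage 2: the finer model picks the next token among them.
        next_token = max(candidates, key=lambda t: rank(current, t))
        ordered.append(next_token)
        remaining.remove(next_token)
    return ordered


if __name__ == "__main__":
    hello = Token("Hello", 0, 0)
    page = [Token("line", 60, 20), hello, Token("Second", 0, 20), Token("world", 50, 0)]
    print(" ".join(t.text for t in order_tokens(page, hello)))
```

Limiting the second stage to a fixed number of candidates keeps its input size constant no matter how many tokens a page contains, which is one plausible reason for the two-stage design.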
+
+
+ ## Performance
+
+ Test Accuracy: 16 Mistakes / 11438 Labels (99.86%)
+ Average Accuracy: 431 Mistakes / 184995 Labels (99.77%)
+
+ Speed: ~0.65 seconds per page.
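
As a quick check of the figures in the Performance section, the percentages follow from the mistake and label counts as (labels - mistakes) / labels. The helper names below are hypothetical, and the runtime estimate simply assumes the reported ~0.65 seconds per page scales linearly with page count.

```python
SECONDS_PER_PAGE = 0.65  # approximate figure reported in the README


def accuracy_percent(mistakes: int, labels: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * (labels - mistakes) / labels, 2)


def estimated_seconds(pages: int) -> float:
    """Rough runtime estimate, assuming linear scaling with page count."""
    return SECONDS_PER_PAGE * pages


if __name__ == "__main__":
    print(accuracy_percent(16, 11438))     # 99.86
    print(accuracy_percent(431, 184995))   # 99.77
    print(estimated_seconds(100))
```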