Vik Paruchuri committed
Commit 3c6746a · 1 Parent(s): 9be8438

Better debugging, heading detection

.gitignore CHANGED
@@ -8,6 +8,7 @@ wandb
*.dat
report.json
benchmark_data
+ debug

# Byte-compiled / optimized / DLL files
__pycache__/
README.md CHANGED
@@ -89,6 +89,7 @@ First, some configuration:
- Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. It also doesn't require you to specify the languages in the document. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
+ - Some PDFs, even digital ones, have bad text in them. Set `OCR_ALL_PAGES=true` to force OCR if you find bad output from marker.

## Interactive App

@@ -107,15 +108,15 @@ marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --ma

- `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
+ - `--start_page` is the page to start from (default is None, will start from the first page).
- `--langs` is an optional comma separated list of the languages in the document, for OCR. Optional by default, required if you use tesseract.
- - `--ocr_all_pages` is an optional argument to force OCR on all pages of the PDF. If this or the env var `OCR_ALL_PAGES` are true, OCR will be forced.

The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`. If you don't need OCR, marker can work with any language.

## Convert multiple files

```shell
- marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
+ marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
```

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
@@ -136,7 +137,7 @@ You can use language names or codes. The exact codes depend on the OCR engine.
## Convert multiple files on multiple GPUs

```shell
- MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
+ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -150,15 +151,18 @@ Note that the env variables above are specific to this script, and cannot be set

There are some settings that you may find useful if things aren't working the way you expect:

- - `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
+ - `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if there is garbled text in the output of marker.
- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
- `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
- - `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
- Verify that you set the languages correctly, or passed in a metadata file.
- If you're getting out of memory errors, decrease worker count (increased the `VRAM_PER_TASK` setting). You can also try splitting up long PDFs into multiple files.

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

+ ## Debugging
+
+ Set `DEBUG=true` to save data to the `debug` subfolder in the marker root directory. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
+
## Useful settings

These settings can improve/change output quality:
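
The new Debugging section pairs with the `draw_page_debug_images` and `dump_bbox_debug_data` changes later in this commit: per-page PNGs go to `debug/<doc name>/` and bounding boxes to `debug/<doc name>_bbox.json`. A minimal sketch for inspecting that output after a `DEBUG=true` run (the helper and its paths are illustrative, derived from the diff rather than a documented API):

```python
import os

def list_debug_artifacts(marker_root: str, pdf_name: str):
    # Folder layout follows the code in this commit: debug/<doc_base>/page_<n>.png
    # plus debug/<doc_base>_bbox.json, where <doc_base> is the filename without extension.
    doc_base = os.path.splitext(os.path.basename(pdf_name))[0]
    page_dir = os.path.join(marker_root, "debug", doc_base)
    bbox_json = os.path.join(marker_root, "debug", f"{doc_base}_bbox.json")
    page_images = sorted(os.listdir(page_dir)) if os.path.isdir(page_dir) else []
    return page_images, bbox_json if os.path.exists(bbox_json) else None

# Example (placeholder paths):
print(list_debug_artifacts("/path/to/marker", "file.pdf"))
```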
convert_single.py CHANGED
@@ -2,6 +2,9 @@ import time

import pypdfium2 # Needs to be at the top to avoid warnings
import os
+
+ from marker.settings import settings
+
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # For some reason, transformers decided to use .isin for a simple op, which is not supported on MPS

import argparse
@@ -22,8 +25,6 @@ def main():
parser.add_argument("--start_page", type=int, default=None, help="Page to start processing at")
parser.add_argument("--langs", type=str, help="Optional languages to use for OCR, comma separated", default=None)
parser.add_argument("--batch_multiplier", type=int, default=2, help="How much to increase batch sizes")
- parser.add_argument("--debug", action="store_true", help="Enable debug logging", default=False)
- parser.add_argument("--ocr_all_pages", action="store_true", help="Force OCR on all pages", default=False)
args = parser.parse_args()

langs = args.langs.split(",") if args.langs else None
@@ -31,14 +32,13 @@ def main():
fname = args.filename
model_lst = load_all_models()
start = time.time()
- full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page, ocr_all_pages=args.ocr_all_pages)
+ full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)

fname = os.path.basename(fname)
subfolder_path = save_markdown(args.output, fname, full_text, images, out_meta)

print(f"Saved markdown to the {subfolder_path} folder")
- if args.debug:
- print(f"Total time: {time.time() - start}")
+ print(f"Total time: {time.time() - start}")


if __name__ == "__main__":
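
For context, the call signature used by `convert_single.py` above can also be driven from Python. A hedged sketch (the `marker.models.load_all_models` and `marker.output.save_markdown` import paths are assumptions based on the names in the diff; paths are placeholders):

```python
from marker.convert import convert_single_pdf
from marker.models import load_all_models   # assumed import path
from marker.output import save_markdown     # assumed import path

# Load the model list once, then convert a single PDF with the same keyword
# arguments the CLI passes in the diff above.
model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf(
    "/path/to/file.pdf",
    model_lst,
    max_pages=10,
    langs=["English"],
    batch_multiplier=2,
    start_page=None,
)
subfolder_path = save_markdown("/path/to/output/folder", "file.pdf", full_text, images, out_meta)
print(f"Saved markdown to the {subfolder_path} folder")
```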
marker/cleaners/headings.py CHANGED
@@ -68,7 +68,7 @@ def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
data_labels = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
data_labels = np.sort(data_labels, axis=0)

- cluster_means = {label: np.mean(data[labels == label, 0]) for label in np.unique(labels)}
+ cluster_means = {label: np.mean(data_labels[data_labels[:, 1] == label, 0]) for label in np.unique(labels)}
label_max = None
label_min = None
heading_ranges = []
@@ -95,15 +95,14 @@ def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
return heading_ranges


- def infer_heading_levels(pages: List[Page]):
+ def infer_heading_levels(pages: List[Page], height_tol=.99):
all_line_heights = []
for page in pages:
for block in page.blocks:
if block.block_type not in ["Title", "Section-header"]:
continue

- block_heights = [min(l.height, l.width) for l in block.lines] # Account for rotation
- all_line_heights.extend(block_heights)
+ all_line_heights.extend([l.height for l in block.lines])

heading_ranges = bucket_headings(all_line_heights)

@@ -112,11 +111,11 @@ def infer_heading_levels(pages: List[Page]):
if block.block_type not in ["Title", "Section-header"]:
continue

- block_heights = [min(l.height, l.width) for l in block.lines] # Account for rotation
+ block_heights = [l.height for l in block.lines] # Account for rotation
avg_height = sum(block_heights) / len(block_heights)
for idx, (min_height, max_height) in enumerate(heading_ranges):
- if avg_height >= min_height:
- block.heading_level = len(heading_ranges) - idx
+ if avg_height >= min_height * height_tol:
+ block.heading_level = idx + 1
break

if block.heading_level is None:
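
The level-assignment change above maps each heading block to the first height bucket whose minimum it clears (with a small tolerance), so index 0 becomes level 1. A standalone sketch of that mapping, with illustrative ranges and names (not the repo's functions), assuming `heading_ranges` is ordered from tallest to shortest lines:

```python
def assign_heading_level(avg_height, heading_ranges, height_tol=0.99, default_level=2):
    # heading_ranges: list of (min_height, max_height), largest heights first,
    # mirroring the `idx + 1` mapping in the diff above.
    for idx, (min_height, max_height) in enumerate(heading_ranges):
        if avg_height >= min_height * height_tol:
            return idx + 1
    # Fall back to a default level, analogous to HEADING_DEFAULT_LEVEL in settings.
    return default_level

ranges = [(16.0, 20.0), (12.0, 15.9), (10.0, 11.9)]
print(assign_heading_level(18.0, ranges))  # -> 1 (largest headings)
print(assign_heading_level(11.5, ranges))  # -> 3 (11.5 misses 12*0.99 but clears 10*0.99)
```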
marker/convert.py CHANGED
@@ -10,7 +10,7 @@ from PIL import Image

from marker.utils import flush_cuda_memory
from marker.tables.table import format_tables
- from marker.debug.data import dump_bbox_debug_data
+ from marker.debug.data import dump_bbox_debug_data, draw_page_debug_images
from marker.layout.layout import surya_layout, annotate_block_types
from marker.layout.order import surya_order, sort_blocks_in_reading_order
from marker.ocr.lang import replace_langs_with_codes, validate_langs
@@ -108,7 +108,8 @@ def convert_single_pdf(
annotate_block_types(pages)

# Dump debug data if flags are set
- dump_bbox_debug_data(doc, fname, pages)
+ draw_page_debug_images(fname, pages)
+ dump_bbox_debug_data(fname, pages)

# Find reading order for blocks
# Sort blocks by reading order
marker/debug/data.py CHANGED
@@ -1,74 +1,62 @@
- import base64
import json
+ import math
import os
from typing import List

- from marker.pdf.images import render_image
+ from marker.debug.render import render_on_image
+ from marker.schema.bbox import rescale_bbox
from marker.schema.page import Page
from marker.settings import settings
from PIL import Image
- import io


- def dump_equation_debug_data(doc, images, converted_spans):
- if not settings.DEBUG_DATA_FOLDER or settings.DEBUG_LEVEL == 0:
+ def draw_page_debug_images(fname, pages: List[Page]):
+ if not settings.DEBUG:
return

- if len(images) == 0:
- return
+ # Remove extension from doc name
+ doc_base = os.path.basename(fname).rsplit(".", 1)[0]

- # We attempted one conversion per image
- assert len(converted_spans) == len(images)
-
- data_lines = []
- for idx, (pil_image, converted_span) in enumerate(zip(images, converted_spans)):
- if converted_span is None:
- continue
- # Image is a BytesIO object
- img_bytes = io.BytesIO()
- pil_image.save(img_bytes, format="WEBP", lossless=True)
- b64_image = base64.b64encode(img_bytes.getvalue()).decode("utf-8")
- data_lines.append({
- "image": b64_image,
- "text": converted_span.text,
- "bbox": converted_span.bbox
- })
+ debug_folder = os.path.join(settings.DEBUG_DATA_FOLDER, doc_base)
+ os.makedirs(debug_folder, exist_ok=True)
+ for idx, page in enumerate(pages):
+ img_size = (int(math.ceil(page.text_lines.image_bbox[2])), int(math.ceil(page.text_lines.image_bbox[3])))
+ png_image = Image.new("RGB", img_size, color="white")

- # Remove extension from doc name
- doc_base = os.path.basename(doc.name).rsplit(".", 1)[0]
+ line_bboxes = []
+ line_text = []
+ for block in page.blocks:
+ for line in block.lines:
+ line_bboxes.append(rescale_bbox(page.bbox, page.text_lines.image_bbox, line.bbox))
+ line_text.append(line.prelim_text)

- debug_file = os.path.join(settings.DEBUG_DATA_FOLDER, f"{doc_base}_equations.json")
- with open(debug_file, "w+") as f:
- json.dump(data_lines, f)
+ render_on_image(line_bboxes, png_image, labels=line_text, color="black", draw_bbox=False)
+
+ line_bboxes = [line.bbox for line in page.text_lines.bboxes]
+ render_on_image(line_bboxes, png_image, color="blue")
+
+ layout_boxes = [rescale_bbox(page.layout.image_bbox, page.text_lines.image_bbox, box.bbox) for box in page.layout.bboxes]
+ layout_labels = [box.label for box in page.layout.bboxes]

+ render_on_image(layout_boxes, png_image, labels=layout_labels, color="red")

- def dump_bbox_debug_data(doc, fname, blocks: List[Page]):
- if not settings.DEBUG_DATA_FOLDER or settings.DEBUG_LEVEL < 2:
+ debug_file = os.path.join(debug_folder, f"page_{idx}.png")
+ png_image.save(debug_file)
+
+
+ def dump_bbox_debug_data(fname, pages: List[Page]):
+ if not settings.DEBUG:
return

# Remove extension from doc name
- doc_base = fname.rsplit(".", 1)[0]
+ doc_base = os.path.basename(fname).rsplit(".", 1)[0]

debug_file = os.path.join(settings.DEBUG_DATA_FOLDER, f"{doc_base}_bbox.json")
debug_data = []
- for idx, page_blocks in enumerate(blocks):
- page = doc[idx]
-
- png_image = render_image(page, dpi=settings.TEXIFY_DPI)
- width, height = png_image.size
- max_dimension = 6000
- if width > max_dimension or height > max_dimension:
- scaling_factor = min(max_dimension / width, max_dimension / height)
- png_image = png_image.resize((int(width * scaling_factor), int(height * scaling_factor)), Image.ANTIALIAS)
-
- img_bytes = io.BytesIO()
- png_image.save(img_bytes, format="WEBP", lossless=True, quality=100)
- b64_image = base64.b64encode(img_bytes.getvalue()).decode("utf-8")
-
+ for idx, page_blocks in enumerate(pages):
page_data = page_blocks.model_dump(exclude=["images", "layout", "text_lines"])
page_data["layout"] = page_blocks.layout.model_dump(exclude=["segmentation_map"])
page_data["text_lines"] = page_blocks.text_lines.model_dump(exclude=["heatmap", "affinity_map"])
- #page_data["image"] = b64_image
debug_data.append(page_data)

with open(debug_file, "w+") as f:
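
The new `draw_page_debug_images` repeatedly rescales boxes between coordinate frames (PDF page, text-line image, layout image) before drawing them. A toy sketch of that kind of frame-to-frame rescale, not the repo's `rescale_bbox`, just the linear mapping the calls imply:

```python
def rescale_bbox_demo(src_frame, dst_frame, bbox):
    # Both frames are [x0, y0, x1, y1]; scale each coordinate by the ratio of widths/heights.
    sx = (dst_frame[2] - dst_frame[0]) / (src_frame[2] - src_frame[0])
    sy = (dst_frame[3] - dst_frame[1]) / (src_frame[3] - src_frame[1])
    x0, y0, x1, y1 = bbox
    return [x0 * sx, y0 * sy, x1 * sx, y1 * sy]

# Map a line bbox from a 612x792 PDF page into a 1224x1584 debug image (2x scale).
print(rescale_bbox_demo([0, 0, 612, 792], [0, 0, 1224, 1584], [100, 200, 300, 250]))
# -> [200.0, 400.0, 600.0, 500.0]
```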
marker/equations/equations.py CHANGED
@@ -2,7 +2,6 @@ from collections import defaultdict
from copy import deepcopy
from typing import List

- from marker.debug.data import dump_equation_debug_data
from marker.equations.inference import get_total_texify_tokens, get_latex_batched
from marker.pdf.images import render_bbox_image
from marker.schema.bbox import rescale_bbox
@@ -177,7 +176,4 @@ def replace_equations(doc, pages: List[Page], texify_model, batch_multiplier=1):
successful_ocr += success_count
unsuccessful_ocr += fail_count

- # If debug mode is on, dump out conversions for comparison
- dump_equation_debug_data(doc, images, converted_spans)
-
return pages, {"successful_ocr": successful_ocr, "unsuccessful_ocr": unsuccessful_ocr, "equations": eq_count}
marker/postprocessors/markdown.py CHANGED
@@ -146,22 +146,22 @@ def merge_lines(blocks: List[List[MergedBlock]]):
prev_line = None
block_text = ""
block_type = ""
- block_heading_level = None
+ prev_heading_level = None

for idx, page in enumerate(blocks):
for block in page:
block_type = block.block_type
- if block_type != prev_type and prev_type:
+ if (block_type != prev_type and prev_type) or (block.heading_level != prev_heading_level and prev_heading_level):
text_blocks.append(
FullyMergedBlock(
- text=block_surround(block_text, prev_type, block_heading_level),
+ text=block_surround(block_text, prev_type, prev_heading_level),
block_type=prev_type
)
)
block_text = ""

prev_type = block_type
- block_heading_level = block.heading_level
+ prev_heading_level = block.heading_level
# Join lines in the block together properly
for i, line in enumerate(block.lines):
line_height = line.bbox[3] - line.bbox[1]
@@ -180,7 +180,7 @@ def merge_lines(blocks: List[List[MergedBlock]]):
# Append the final block
text_blocks.append(
FullyMergedBlock(
- text=block_surround(block_text, prev_type, block_heading_level),
+ text=block_surround(block_text, prev_type, prev_heading_level),
block_type=block_type
)
)
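
The markdown postprocessor change makes block grouping key off both the block type and the heading level, so consecutive `Section-header` blocks at different levels are no longer merged into one heading. A simplified sketch of that grouping intent (illustrative tuples, not the repo's `MergedBlock` objects):

```python
def group_blocks(blocks):
    # blocks: list of (block_type, heading_level, text). Start a new group whenever
    # the (type, heading level) pair changes, mirroring the split condition above.
    groups, current, prev_key = [], [], None
    for block_type, heading_level, text in blocks:
        key = (block_type, heading_level)
        if prev_key is not None and key != prev_key and current:
            groups.append((prev_key, " ".join(current)))
            current = []
        current.append(text)
        prev_key = key
    if current:
        groups.append((prev_key, " ".join(current)))
    return groups

demo = [
    ("Section-header", 1, "Introduction"),
    ("Section-header", 2, "Background"),  # same type, different level -> separate heading
    ("Text", None, "Some body text."),
]
print(group_blocks(demo))
```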
marker/settings.py CHANGED
@@ -4,6 +4,7 @@ from dotenv import find_dotenv
from pydantic import computed_field
from pydantic_settings import BaseSettings
import torch
+ import os


class Settings(BaseSettings):
@@ -12,6 +13,7 @@ class Settings(BaseSettings):
IMAGE_DPI: int = 96 # DPI to render images pulled from pdf at
EXTRACT_IMAGES: bool = True # Extract images from pdfs and save them
PAGINATE_OUTPUT: bool = False # Paginate output markdown
+ BASE_DIR: str = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

@computed_field
@property
@@ -84,9 +86,11 @@ class Settings(BaseSettings):
HEADING_DEFAULT_LEVEL: int = 2

# Debug
- DEBUG: bool = False # Enable debug logging
- DEBUG_DATA_FOLDER: Optional[str] = None
- DEBUG_LEVEL: int = 0 # 0 to 2, 2 means log everything
+ DEBUG_DATA_FOLDER: str = os.path.join(BASE_DIR, "debug")
+ DEBUG: bool = False
+ FONT_DIR: str = os.path.join(BASE_DIR, "static", "fonts")
+ DEBUG_RENDER_FONT: str = os.path.join(FONT_DIR, "GoNotoCurrent-Regular.ttf")
+ FONT_DL_BASE: str = "https://github.com/satbyy/go-noto-universal/releases/download/v7.0"

@computed_field
@property
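
The `DEBUG` flag above is a pydantic `BaseSettings` field, so it can be flipped per run with an environment variable, which is how the README's `DEBUG=true` works. A minimal, self-contained sketch of that mechanism with a stand-in settings class (not marker's real `Settings`):

```python
import os
from pydantic_settings import BaseSettings

class DemoSettings(BaseSettings):
    # Mirrors the shape of the fields added in the diff; defaults are illustrative.
    DEBUG: bool = False
    DEBUG_DATA_FOLDER: str = os.path.join(os.getcwd(), "debug")

# Environment variables override field defaults by (case-insensitive) name.
os.environ["DEBUG"] = "true"
print(DemoSettings().DEBUG)               # -> True
print(DemoSettings().DEBUG_DATA_FOLDER)   # -> <cwd>/debug
```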
marker/tables/table.py CHANGED
@@ -62,7 +62,12 @@ def get_table_boxes(pages: List[Page], doc: PdfDocument, fname):
out_img_sizes = []
for i in range(len(table_counts)):
if i in table_idxs:
- text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
+ page_ocred = pages[i].ocr_method is not None
+ if page_ocred:
+ # This will force re-detection of cells if the page was ocred (the text lines are not accurate)
+ text_lines.extend([None] * table_counts[i])
+ else:
+ text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
out_img_sizes.extend([img_sizes[i]] * table_counts[i])

assert len(table_imgs) == len(table_bboxes) == len(text_lines) == len(out_img_sizes)
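
The table change passes `None` instead of the selected text lines whenever a page was OCRed, so cell detection is re-run rather than trusting inaccurate OCR line boxes. A small sketch of that expansion logic with dummy data (names and data are illustrative):

```python
def expand_text_lines(table_counts, table_idxs, sel_text_lines, page_ocred):
    # One entry per table on each page: None forces cell re-detection for OCRed pages,
    # otherwise reuse the pre-selected text lines for that page.
    text_lines = []
    for i in range(len(table_counts)):
        if i not in table_idxs:
            continue
        if page_ocred[i]:
            text_lines.extend([None] * table_counts[i])
        else:
            text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
    return text_lines

print(expand_text_lines([2, 1], {0, 1}, ["page1_lines"], [True, False]))
# -> [None, None, 'page1_lines']
```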
static/fonts/.gitignore ADDED
@@ -0,0 +1,2 @@
+ *
+ !.gitignore