Vik Paruchuri
committed
Commit 3c6746a
Parent(s): 9be8438

Better debugging, heading detection

Browse files
- .gitignore +1 -0
- README.md +9 -5
- convert_single.py +5 -5
- marker/cleaners/headings.py +6 -7
- marker/convert.py +3 -2
- marker/debug/data.py +34 -46
- marker/equations/equations.py +0 -4
- marker/postprocessors/markdown.py +5 -5
- marker/settings.py +7 -3
- marker/tables/table.py +6 -1
- static/fonts/.gitignore +2 -0
.gitignore CHANGED
@@ -8,6 +8,7 @@ wandb
 *.dat
 report.json
 benchmark_data
+debug

 # Byte-compiled / optimized / DLL files
 __pycache__/
README.md CHANGED
@@ -89,6 +89,7 @@ First, some configuration:
 - Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
 - Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
 - By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. It also doesn't require you to specify the languages in the document. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
+- Some PDFs, even digital ones, have bad text in them. Set `OCR_ALL_PAGES=true` to force OCR if you find bad output from marker.

 ## Interactive App

@@ -107,15 +108,15 @@ marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --ma

 - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
 - `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
+- `--start_page` is the page to start from (default is None, will start from the first page).
 - `--langs` is an optional comma separated list of the languages in the document, for OCR. Optional by default, required if you use tesseract.
-- `--ocr_all_pages` is an optional argument to force OCR on all pages of the PDF. If this or the env var `OCR_ALL_PAGES` are true, OCR will be forced.

 The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`. If you don't need OCR, marker can work with any language.

 ## Convert multiple files

 ```shell
-marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
+marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
 ```

 - `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
@@ -136,7 +137,7 @@ You can use language names or codes. The exact codes depend on the OCR engine.
 ## Convert multiple files on multiple GPUs

 ```shell
-
+METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 ```

 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -150,15 +151,18 @@ Note that the env variables above are specific to this script, and cannot be set

 There are some settings that you may find useful if things aren't working the way you expect:

-- `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if
+- `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if there is garbled text in the output of marker.
 - `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
 - `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
-- `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
 - Verify that you set the languages correctly, or passed in a metadata file.
 - If you're getting out of memory errors, decrease worker count (increased the `VRAM_PER_TASK` setting). You can also try splitting up long PDFs into multiple files.

 In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.

+## Debugging
+
+Set `DEBUG=true` to save data to the `debug` subfolder in the marker root directory. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
+
 ## Useful settings

 These settings can improve/change output quality:
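The README notes that any setting in `marker/settings.py` can be overridden with an environment variable. A minimal sketch of that override pattern in plain Python — `load_setting` and its fallback rules are illustrative, not marker's actual pydantic-settings behavior:

```python
import os

def load_setting(name, default, cast=str):
    # Read a setting from the environment, falling back to a default --
    # roughly how env vars override fields like OCR_ALL_PAGES or DEBUG.
    raw = os.environ.get(name)
    if raw is None:
        return default
    if cast is bool:
        return raw.strip().lower() in ("1", "true", "yes")
    return cast(raw)

os.environ["OCR_ALL_PAGES"] = "true"
print(load_setting("OCR_ALL_PAGES", False, bool))  # True: env var wins
print(load_setting("BATCH_MULTIPLIER", 2, int))    # 2: unset, default used
```

Note that the real settings object reads the environment once at import time, so variables must be set before marker is imported.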
convert_single.py CHANGED
@@ -2,6 +2,9 @@ import time

 import pypdfium2 # Needs to be at the top to avoid warnings
 import os
+
+from marker.settings import settings
+
 os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # For some reason, transformers decided to use .isin for a simple op, which is not supported on MPS

 import argparse
@@ -22,8 +25,6 @@ def main():
     parser.add_argument("--start_page", type=int, default=None, help="Page to start processing at")
     parser.add_argument("--langs", type=str, help="Optional languages to use for OCR, comma separated", default=None)
     parser.add_argument("--batch_multiplier", type=int, default=2, help="How much to increase batch sizes")
-    parser.add_argument("--debug", action="store_true", help="Enable debug logging", default=False)
-    parser.add_argument("--ocr_all_pages", action="store_true", help="Force OCR on all pages", default=False)
     args = parser.parse_args()

     langs = args.langs.split(",") if args.langs else None
@@ -31,14 +32,13 @@ def main():
     fname = args.filename
     model_lst = load_all_models()
     start = time.time()
-    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page
+    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)

     fname = os.path.basename(fname)
     subfolder_path = save_markdown(args.output, fname, full_text, images, out_meta)

     print(f"Saved markdown to the {subfolder_path} folder")
-
-    print(f"Total time: {time.time() - start}")
+    print(f"Total time: {time.time() - start}")


 if __name__ == "__main__":
marker/cleaners/headings.py CHANGED
@@ -68,7 +68,7 @@ def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
     data_labels = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
     data_labels = np.sort(data_labels, axis=0)

-    cluster_means = {label: np.mean(
+    cluster_means = {label: np.mean(data_labels[data_labels[:, 1] == label, 0]) for label in np.unique(labels)}
     label_max = None
     label_min = None
     heading_ranges = []
@@ -95,15 +95,14 @@ def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
     return heading_ranges


-def infer_heading_levels(pages: List[Page]):
+def infer_heading_levels(pages: List[Page], height_tol=.99):
     all_line_heights = []
     for page in pages:
         for block in page.blocks:
             if block.block_type not in ["Title", "Section-header"]:
                 continue

-
-            all_line_heights.extend(block_heights)
+            all_line_heights.extend([l.height for l in block.lines])

     heading_ranges = bucket_headings(all_line_heights)

@@ -112,11 +111,11 @@ def infer_heading_levels(pages: List[Page]):
             if block.block_type not in ["Title", "Section-header"]:
                 continue

-            block_heights = [
+            block_heights = [l.height for l in block.lines]  # Account for rotation
             avg_height = sum(block_heights) / len(block_heights)
             for idx, (min_height, max_height) in enumerate(heading_ranges):
-                if avg_height >= min_height:
-                    block.heading_level =
+                if avg_height >= min_height * height_tol:
+                    block.heading_level = idx + 1
                     break

             if block.heading_level is None:
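The patched `infer_heading_levels` buckets the line heights of Title/Section-header blocks into ranges and assigns `block.heading_level` with a height tolerance. A minimal pure-Python sketch of the same idea (equal-width buckets stand in for the KMeans clustering used in the real `bucket_headings`; all names and defaults here are illustrative):

```python
def bucket_headings(line_heights, num_levels=4):
    # Split the observed height range into equal-width buckets, largest
    # heights first. The real bucket_headings clusters with KMeans; this
    # simplified version just illustrates the level assignment.
    if not line_heights:
        return []
    lo, hi = min(line_heights), max(line_heights)
    step = (hi - lo) / num_levels or 1
    return [(hi - (i + 1) * step, hi - i * step) for i in range(num_levels)]

def infer_level(avg_height, heading_ranges, height_tol=.99):
    # Same shape as the patched loop: the first range whose
    # tolerance-scaled lower bound the average height clears wins.
    for idx, (min_height, max_height) in enumerate(heading_ranges):
        if avg_height >= min_height * height_tol:
            return idx + 1  # heading levels are 1-indexed
    return None

ranges = bucket_headings([10, 10.5, 14, 20, 21])
print(infer_level(20.5, ranges))  # 1 -- the tallest lines become H1
```

The `height_tol` factor mirrors the commit's change: a heading slightly shorter than a bucket's lower bound still lands in that bucket instead of falling through to a deeper level.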
marker/convert.py CHANGED
@@ -10,7 +10,7 @@ from PIL import Image

 from marker.utils import flush_cuda_memory
 from marker.tables.table import format_tables
-from marker.debug.data import dump_bbox_debug_data
+from marker.debug.data import dump_bbox_debug_data, draw_page_debug_images
 from marker.layout.layout import surya_layout, annotate_block_types
 from marker.layout.order import surya_order, sort_blocks_in_reading_order
 from marker.ocr.lang import replace_langs_with_codes, validate_langs
@@ -108,7 +108,8 @@ def convert_single_pdf(
     annotate_block_types(pages)

     # Dump debug data if flags are set
-
+    draw_page_debug_images(fname, pages)
+    dump_bbox_debug_data(fname, pages)

     # Find reading order for blocks
     # Sort blocks by reading order
marker/debug/data.py CHANGED
@@ -1,74 +1,62 @@
-import base64
 import json
+import math
 import os
 from typing import List

-from marker.
+from marker.debug.render import render_on_image
+from marker.schema.bbox import rescale_bbox
 from marker.schema.page import Page
 from marker.settings import settings
 from PIL import Image
-import io


-def
-    if not settings.
+def draw_page_debug_images(fname, pages: List[Page]):
+    if not settings.DEBUG:
         return

-        if converted_span is None:
-            continue
-        # Image is a BytesIO object
-        img_bytes = io.BytesIO()
-        pil_image.save(img_bytes, format="WEBP", lossless=True)
-        b64_image = base64.b64encode(img_bytes.getvalue()).decode("utf-8")
-        data_lines.append({
-            "image": b64_image,
-            "text": converted_span.text,
-            "bbox": converted_span.bbox
-        })
-
+    # Remove extension from doc name
+    doc_base = os.path.basename(fname).rsplit(".", 1)[0]
+
+    debug_folder = os.path.join(settings.DEBUG_DATA_FOLDER, doc_base)
+    os.makedirs(debug_folder, exist_ok=True)
+    for idx, page in enumerate(pages):
+        img_size = (int(math.ceil(page.text_lines.image_bbox[2])), int(math.ceil(page.text_lines.image_bbox[3])))
+        png_image = Image.new("RGB", img_size, color="white")
+
+        line_bboxes = []
+        line_text = []
+        for block in page.blocks:
+            for line in block.lines:
+                line_bboxes.append(rescale_bbox(page.bbox, page.text_lines.image_bbox, line.bbox))
+                line_text.append(line.prelim_text)
+
+        render_on_image(line_bboxes, png_image, labels=line_text, color="black", draw_bbox=False)
+
+        line_bboxes = [line.bbox for line in page.text_lines.bboxes]
+        render_on_image(line_bboxes, png_image, color="blue")
+
+        layout_boxes = [rescale_bbox(page.layout.image_bbox, page.text_lines.image_bbox, box.bbox) for box in page.layout.bboxes]
+        layout_labels = [box.label for box in page.layout.bboxes]
+
+        render_on_image(layout_boxes, png_image, labels=layout_labels, color="red")
+
+        debug_file = os.path.join(debug_folder, f"page_{idx}.png")
+        png_image.save(debug_file)
+
+
+def dump_bbox_debug_data(fname, pages: List[Page]):
+    if not settings.DEBUG:
         return

     # Remove extension from doc name
-    doc_base = fname.rsplit(".", 1)[0]
+    doc_base = os.path.basename(fname).rsplit(".", 1)[0]

     debug_file = os.path.join(settings.DEBUG_DATA_FOLDER, f"{doc_base}_bbox.json")
     debug_data = []
-    for idx, page_blocks in enumerate(
-        page = doc[idx]
-
-        png_image = render_image(page, dpi=settings.TEXIFY_DPI)
-        width, height = png_image.size
-        max_dimension = 6000
-        if width > max_dimension or height > max_dimension:
-            scaling_factor = min(max_dimension / width, max_dimension / height)
-            png_image = png_image.resize((int(width * scaling_factor), int(height * scaling_factor)), Image.ANTIALIAS)
-
-        img_bytes = io.BytesIO()
-        png_image.save(img_bytes, format="WEBP", lossless=True, quality=100)
-        b64_image = base64.b64encode(img_bytes.getvalue()).decode("utf-8")
-
+    for idx, page_blocks in enumerate(pages):
         page_data = page_blocks.model_dump(exclude=["images", "layout", "text_lines"])
         page_data["layout"] = page_blocks.layout.model_dump(exclude=["segmentation_map"])
         page_data["text_lines"] = page_blocks.text_lines.model_dump(exclude=["heatmap", "affinity_map"])
-        #page_data["image"] = b64_image
         debug_data.append(page_data)

     with open(debug_file, "w+") as f:
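The new `draw_page_debug_images` relies on `rescale_bbox` to map line and layout boxes between coordinate spaces (PDF page vs. rendered image). A rough sketch of what such a rescale does, assuming origin-aligned boxes — the helper below is illustrative, not marker's actual implementation:

```python
def rescale_bbox(src_bbox, dst_bbox, bbox):
    # Scale an [x0, y0, x1, y1] box from the src coordinate space to the
    # dst space. Simplified: assumes both spaces share the origin.
    sx = (dst_bbox[2] - dst_bbox[0]) / (src_bbox[2] - src_bbox[0])
    sy = (dst_bbox[3] - dst_bbox[1]) / (src_bbox[3] - src_bbox[1])
    return [bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy]

# A line detected on a 612x792 PDF page, redrawn on a 1224x1584 debug image
page_bbox = [0, 0, 612, 792]
image_bbox = [0, 0, 1224, 1584]
print(rescale_bbox(page_bbox, image_bbox, [100, 50, 200, 70]))
# → [200.0, 100.0, 400.0, 140.0]
```

This is why the debug images can overlay text lines (detected in text-line image space) and layout boxes (detected in layout image space) on the same canvas.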
marker/equations/equations.py CHANGED
@@ -2,7 +2,6 @@ from collections import defaultdict
 from copy import deepcopy
 from typing import List

-from marker.debug.data import dump_equation_debug_data
 from marker.equations.inference import get_total_texify_tokens, get_latex_batched
 from marker.pdf.images import render_bbox_image
 from marker.schema.bbox import rescale_bbox
@@ -177,7 +176,4 @@ def replace_equations(doc, pages: List[Page], texify_model, batch_multiplier=1):
     successful_ocr += success_count
     unsuccessful_ocr += fail_count

-    # If debug mode is on, dump out conversions for comparison
-    dump_equation_debug_data(doc, images, converted_spans)
-
     return pages, {"successful_ocr": successful_ocr, "unsuccessful_ocr": unsuccessful_ocr, "equations": eq_count}
marker/postprocessors/markdown.py CHANGED
@@ -146,22 +146,22 @@ def merge_lines(blocks: List[List[MergedBlock]]):
     prev_line = None
     block_text = ""
     block_type = ""
-
+    prev_heading_level = None

     for idx, page in enumerate(blocks):
         for block in page:
             block_type = block.block_type
-            if block_type != prev_type and prev_type:
+            if (block_type != prev_type and prev_type) or (block.heading_level != prev_heading_level and prev_heading_level):
                 text_blocks.append(
                     FullyMergedBlock(
-                        text=block_surround(block_text, prev_type,
+                        text=block_surround(block_text, prev_type, prev_heading_level),
                         block_type=prev_type
                     )
                 )
                 block_text = ""

             prev_type = block_type
-
+            prev_heading_level = block.heading_level
             # Join lines in the block together properly
             for i, line in enumerate(block.lines):
                 line_height = line.bbox[3] - line.bbox[1]
@@ -180,7 +180,7 @@ def merge_lines(blocks: List[List[MergedBlock]]):
     # Append the final block
     text_blocks.append(
         FullyMergedBlock(
-            text=block_surround(block_text, prev_type,
+            text=block_surround(block_text, prev_type, prev_heading_level),
             block_type=block_type
         )
     )
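The markdown change starts a new output block when either the block type or the heading level changes, so two adjacent headings of different levels are no longer merged into one. A self-contained sketch of that loop (hypothetical tuple-based blocks instead of marker's `MergedBlock`/`FullyMergedBlock`):

```python
def merge_blocks(blocks):
    # Each block is (block_type, heading_level, text). Start a new chunk
    # when the type OR the heading level changes, so an H1 followed by an
    # H2 stays two separate blocks.
    chunks = []
    prev_type = None
    prev_heading_level = None
    text = ""
    for block_type, heading_level, block_text in blocks:
        if (block_type != prev_type and prev_type) or \
                (heading_level != prev_heading_level and prev_heading_level):
            chunks.append((prev_type, prev_heading_level, text))
            text = ""
        prev_type = block_type
        prev_heading_level = heading_level
        text += block_text
    chunks.append((prev_type, prev_heading_level, text))
    return chunks

blocks = [
    ("Section-header", 1, "Intro"),
    ("Section-header", 2, "Background"),  # level change forces a split
    ("Text", None, "Body. "),
    ("Text", None, "More."),              # same type and level: merged
]
print(merge_blocks(blocks))
```

Before this commit, the two section headers above would have been glued into one block because only the type was compared.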
marker/settings.py CHANGED
@@ -4,6 +4,7 @@ from dotenv import find_dotenv
 from pydantic import computed_field
 from pydantic_settings import BaseSettings
 import torch
+import os


 class Settings(BaseSettings):
@@ -12,6 +13,7 @@ class Settings(BaseSettings):
     IMAGE_DPI: int = 96 # DPI to render images pulled from pdf at
     EXTRACT_IMAGES: bool = True # Extract images from pdfs and save them
     PAGINATE_OUTPUT: bool = False # Paginate output markdown
+    BASE_DIR: str = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

     @computed_field
     @property
@@ -84,9 +86,11 @@ class Settings(BaseSettings):
     HEADING_DEFAULT_LEVEL: int = 2

     # Debug
-
-
-
+    DEBUG_DATA_FOLDER: str = os.path.join(BASE_DIR, "debug")
+    DEBUG: bool = False
+    FONT_DIR: str = os.path.join(BASE_DIR, "static", "fonts")
+    DEBUG_RENDER_FONT: str = os.path.join(FONT_DIR, "GoNotoCurrent-Regular.ttf")
+    FONT_DL_BASE: str = "https://github.com/satbyy/go-noto-universal/releases/download/v7.0"

     @computed_field
     @property
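The new `BASE_DIR` setting resolves paths relative to the repository root by climbing two directory levels up from `settings.py`, so `debug/` and `static/fonts/` work regardless of the current working directory. A sketch of that derivation (the file path below is a made-up example):

```python
import os

# Made-up absolute path standing in for marker/settings.py
settings_file = "/repo/marker/settings.py"

# Two dirname() calls climb from marker/settings.py to the repo root
base_dir = os.path.dirname(os.path.dirname(os.path.abspath(settings_file)))
debug_data_folder = os.path.join(base_dir, "debug")
font_dir = os.path.join(base_dir, "static", "fonts")

print(base_dir)           # /repo (on POSIX)
print(debug_data_folder)  # /repo/debug
```

Anchoring on `__file__` rather than the working directory is what lets `DEBUG_DATA_FOLDER` and `FONT_DIR` point at the same folders no matter where marker is invoked from.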
marker/tables/table.py CHANGED
@@ -62,7 +62,12 @@ def get_table_boxes(pages: List[Page], doc: PdfDocument, fname):
     out_img_sizes = []
     for i in range(len(table_counts)):
         if i in table_idxs:
-
+            page_ocred = pages[i].ocr_method is not None
+            if page_ocred:
+                # This will force re-detection of cells if the page was ocred (the text lines are not accurate)
+                text_lines.extend([None] * table_counts[i])
+            else:
+                text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
             out_img_sizes.extend([img_sizes[i]] * table_counts[i])

     assert len(table_imgs) == len(table_bboxes) == len(text_lines) == len(out_img_sizes)
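The table change keeps `text_lines` aligned with the table images: OCRed pages contribute `None` (forcing cell re-detection), while other pages consume the pre-extracted lines. A standalone sketch of that alignment logic (function name and argument shapes are illustrative, not marker's real signature):

```python
def build_table_inputs(table_counts, table_idxs, page_ocred, sel_text_lines, img_sizes):
    # Mirrors the patched loop: OCRed pages get None text lines so table
    # cells are re-detected; other pages consume pre-extracted lines.
    text_lines, out_img_sizes = [], []
    sel_text_lines = list(sel_text_lines)  # copy, since we pop from the front
    for i in range(len(table_counts)):
        if i in table_idxs:
            if page_ocred[i]:
                text_lines.extend([None] * table_counts[i])
            else:
                text_lines.extend([sel_text_lines.pop(0)] * table_counts[i])
            out_img_sizes.extend([img_sizes[i]] * table_counts[i])
    return text_lines, out_img_sizes

# Page 0 was OCRed (2 tables), page 1 was not (1 table)
tl, sizes = build_table_inputs([2, 1], {0, 1}, [True, False], ["p1_lines"], [(612, 792), (612, 792)])
print(tl)  # [None, None, 'p1_lines']
```

Keeping the four parallel lists the same length is what the `assert` at the end of the real function checks.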
static/fonts/.gitignore ADDED
@@ -0,0 +1,2 @@
+*
+!.gitignore