derful committed on
Commit 240e0a0 · verified · 1 parent: b584747

Upload folder using huggingface_hub

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full list.

Files changed (50)
  1. .gitattributes +1 -0
  2. .github/ISSUE_TEMPLATE/bug_report.yml +85 -0
  3. .github/ISSUE_TEMPLATE/feature_request.md +28 -0
  4. .github/workflows/cla.yml +43 -0
  5. .github/workflows/cli.yml +46 -0
  6. .github/workflows/python-package.yml +126 -0
  7. .github/workflows/rerun.yml +23 -0
  8. .github/workflows/update_base.yml +22 -0
  9. .gitignore +37 -0
  10. LICENSE.md +661 -0
  11. README.md +286 -6
  12. README_zh-CN.md +277 -0
  13. demo/app.py +67 -0
  14. demo/demo.py +31 -0
  15. demo/demo1.json +0 -0
  16. demo/demo1.pdf +0 -0
  17. demo/demo2.json +0 -0
  18. demo/demo2.pdf +3 -0
  19. docs/FAQ_zh_cn.md +85 -0
  20. docs/how_to_download_models_en.md +60 -0
  21. docs/how_to_download_models_zh_cn.md +61 -0
  22. docs/images/flowchart_en.png +0 -0
  23. docs/images/flowchart_zh_cn.png +0 -0
  24. docs/images/project_panorama_en.png +0 -0
  25. docs/images/project_panorama_zh_cn.png +0 -0
  26. magic-pdf.template.json +9 -0
  27. magic_pdf/__init__.py +0 -0
  28. magic_pdf/cli/__init__.py +0 -0
  29. magic_pdf/cli/magicpdf.py +359 -0
  30. magic_pdf/dict2md/__init__.py +0 -0
  31. magic_pdf/dict2md/mkcontent.py +397 -0
  32. magic_pdf/dict2md/ocr_mkcontent.py +363 -0
  33. magic_pdf/filter/__init__.py +0 -0
  34. magic_pdf/filter/pdf_classify_by_type.py +393 -0
  35. magic_pdf/filter/pdf_meta_scan.py +388 -0
  36. magic_pdf/layout/__init__.py +0 -0
  37. magic_pdf/layout/bbox_sort.py +681 -0
  38. magic_pdf/layout/layout_det_utils.py +182 -0
  39. magic_pdf/layout/layout_sort.py +732 -0
  40. magic_pdf/layout/layout_spiler_recog.py +101 -0
  41. magic_pdf/layout/mcol_sort.py +336 -0
  42. magic_pdf/libs/Constants.py +11 -0
  43. magic_pdf/libs/MakeContentConfig.py +10 -0
  44. magic_pdf/libs/ModelBlockTypeEnum.py +9 -0
  45. magic_pdf/libs/__init__.py +0 -0
  46. magic_pdf/libs/boxbase.py +408 -0
  47. magic_pdf/libs/calc_span_stats.py +239 -0
  48. magic_pdf/libs/commons.py +204 -0
  49. magic_pdf/libs/config_reader.py +73 -0
  50. magic_pdf/libs/convert_utils.py +5 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ demo/demo2.pdf filter=lfs diff=lfs merge=lfs -text
.github/ISSUE_TEMPLATE/bug_report.yml ADDED
@@ -0,0 +1,85 @@
+ name: Bug Report | 反馈 Bug
+ description: Create a bug report for MinerU | MinerU 的 Bug 反馈
+ labels: bug
+
+ # We omit `title: "..."` so that the field defaults to blank. If we set it to
+ # an empty string, GitHub seems to reject this .yml file.
+
+ body:
+
+   - type: textarea
+     id: description
+     attributes:
+       label: Description of the bug | 错误描述
+       description: |
+         A clear and concise description of the bug. | 简单描述遇到的问题
+     validations:
+       required: true
+
+   - type: textarea
+     id: reproduce
+     attributes:
+       label: How to reproduce the bug | 如何复现
+
+       # Should not word-wrap this description here.
+       description: |
+         * Explain the steps required to reproduce the bug. | 说明复现此错误所需的步骤。
+         * Include required code snippets, example files, etc. | 包含必要的代码片段、示例文件等。
+         * Describe what you expected to happen (if not obvious). | 描述你期望发生的情况。
+         * If applicable, add screenshots to help explain the problem. | 添加截图以帮助解释问题。
+         * Include any other information that could be relevant, for example information about the Python environment. | 包括任何其他可能相关的信息。
+
+         For problems when building or installing MinerU: | 在构建或安装 MinerU 时遇到的问题:
+         * Give the **exact** build/install commands that were run. | 提供**确切**的构建/安装命令。
+         * Give the **complete** output from these commands. | 提供这些命令的**完整**输出。
+
+     validations:
+       required: true
+
+   # - type: markdown
+   #   attributes:
+   #     value: |
+   #       # The information below is required.
+
+   - type: dropdown
+     id: os_name
+     attributes:
+       label: Operating system | 操作系统
+       #multiple: true
+       options:
+         -
+         - Windows
+         - Linux
+         - MacOS
+     validations:
+       required: true
+
+   - type: dropdown
+     id: python_version
+     attributes:
+       label: Python version | Python 版本
+       #multiple: true
+       # Need quotes around `3.10`, otherwise it is treated as a number and shows as `3.1`.
+       options:
+         -
+         - "3.12"
+         - "3.11"
+         - "3.10"
+         - "3.9"
+     validations:
+       required: true
+
+   - type: dropdown
+     id: device_mode
+     attributes:
+       label: Device mode | 设备模式
+       #multiple: true
+       options:
+         -
+         - cpu
+         - cuda
+         - mps
+     validations:
+       required: true
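An aside on the quoting comment in the template above: the behavior is easy to confirm with PyYAML. This snippet is illustrative only and is not part of the commit:

```python
# Demonstrates the YAML gotcha noted in the template comment: an unquoted 3.10
# parses as the float 3.1, while the quoted form survives as the string "3.10".
import yaml  # PyYAML

print(yaml.safe_load("version: 3.10"))    # {'version': 3.1}   -- float, trailing zero lost
print(yaml.safe_load('version: "3.10"'))  # {'version': '3.10'} -- string, preserved
```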
.github/ISSUE_TEMPLATE/feature_request.md ADDED
@@ -0,0 +1,28 @@
+ ---
+ name: Feature request | 功能需求
+ about: Suggest an idea for this project | 提出一个有价值的idea
+ title: ''
+ labels: enhancement
+ assignees: ''
+
+ ---
+
+ **Is your feature request related to a problem? Please describe.**
+ **您的特性请求是否与某个问题相关?请描述。**
+ A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+ 对存在的问题进行清晰且简洁的描述。例如:我一直很困扰的是 [...]
+
+ **Describe the solution you'd like**
+ **描述您期望的解决方案**
+ A clear and concise description of what you want to happen.
+ 清晰且简洁地描述您希望实现的内容。
+
+ **Describe alternatives you've considered**
+ **描述您已考虑的替代方案**
+ A clear and concise description of any alternative solutions or features you've considered.
+ 清晰且简洁地描述您已经考虑过的任何替代解决方案。
+
+ **Additional context**
+ **提供更多细节**
+ Add any other context or screenshots about the feature request here.
+ 请附上任何相关截图、链接或文件,以帮助我们更好地理解您的请求。
.github/workflows/cla.yml ADDED
@@ -0,0 +1,43 @@
+ name: "CLA Assistant"
+ on:
+   issue_comment:
+     types: [created]
+   pull_request_target:
+     types: [opened,closed,synchronize]
+
+ # explicitly configure permissions, in case your GITHUB_TOKEN workflow permissions are set to read-only in repository settings
+ permissions:
+   actions: write
+   contents: write # this can be 'read' if the signatures are in a remote repository
+   pull-requests: write
+   statuses: write
+
+ jobs:
+   CLAAssistant:
+     runs-on: ubuntu-latest
+     steps:
+       - name: "CLA Assistant"
+         if: (github.event.comment.body == 'recheck' || github.event.comment.body == 'I have read the CLA Document and I hereby sign the CLA') || github.event_name == 'pull_request_target'
+         uses: contributor-assistant/github-action@v2.4.0
+         env:
+           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+           # the token below should have repo scope and must be added manually as a repository secret
+           # it is required only if you have configured signatures to be stored in a remote repository/organization
+           PERSONAL_ACCESS_TOKEN: ${{ secrets.RELEASE_TOKEN }}
+         with:
+           path-to-signatures: 'signatures/version1/cla.json'
+           path-to-document: 'https://github.com/cla-assistant/github-action/blob/master/SAPCLA.md' # e.g. a CLA or a DCO document
+           # branch should not be protected
+           branch: 'main'
+           allowlist: user1,bot*
+
+           # the following inputs are optional - if they are not given, default values are used
+           #remote-organization-name: the remote organization where the signatures should be stored (default: the same repository)
+           #remote-repository-name: the remote repository where the signatures should be stored (default: the same repository)
+           #create-file-commit-message: 'For example: Creating file for storing CLA Signatures'
+           #signed-commit-message: 'For example: $contributorName has signed the CLA in $owner/$repo#$pullRequestNo'
+           #custom-notsigned-prcomment: 'pull request comment with an introductory message asking new contributors to sign'
+           #custom-pr-sign-comment: 'the signature to be committed in order to sign the CLA'
+           #custom-allsigned-prcomment: 'pull request comment when all contributors have signed; defaults to **CLA Assistant Lite bot** All Contributors have signed the CLA.'
+           #lock-pullrequest-aftermerge: false - if you don't want this bot to automatically lock the pull request after merging (default: true)
+           #use-dco-flag: true - if you are using a DCO instead of a CLA
.github/workflows/cli.yml ADDED
@@ -0,0 +1,46 @@
+ # This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+ # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+ name: mineru
+ on:
+   push:
+     branches:
+       - "master"
+     paths-ignore:
+       - "cmds/**"
+       - "**.md"
+   pull_request:
+     branches:
+       - "master"
+     paths-ignore:
+       - "cmds/**"
+       - "**.md"
+   workflow_dispatch:
+ jobs:
+   cli-test:
+     runs-on: ubuntu-latest
+     timeout-minutes: 40
+     strategy:
+       fail-fast: true
+
+     steps:
+       - name: PDF cli
+         uses: actions/checkout@v3
+         with:
+           fetch-depth: 2
+
+       - name: check-requirements
+         run: |
+           pip install -r requirements.txt
+           pip install -r requirements-qa.txt
+           pip install magic-pdf
+       - name: test_cli
+         run: |
+           cp magic-pdf.template.json ~/magic-pdf.json
+           echo $GITHUB_WORKSPACE
+           cd $GITHUB_WORKSPACE && export PYTHONPATH=. && pytest -s -v tests/test_unit.py
+           cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
+
+       - name: benchmark
+         run: |
+           cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_bench.py
.github/workflows/python-package.yml ADDED
@@ -0,0 +1,126 @@
+ # This workflow will install Python dependencies, run tests and lint with a variety of Python versions
+ # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+ name: Python package
+
+ on:
+   push:
+     tags:
+       - '*released'
+   workflow_dispatch:
+
+ jobs:
+
+   update-version:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           ref: master
+           fetch-depth: 0
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: "3.10"
+
+       - name: Update version.py
+         run: |
+           python update_version.py
+
+       - name: Verify version.py
+         run: |
+           ls -l magic_pdf/libs/version.py
+           cat magic_pdf/libs/version.py
+
+       - name: Commit changes
+         run: |
+           git config --local user.email "moe@myhloli.com"
+           git config --local user.name "myhloli"
+           git add magic_pdf/libs/version.py
+           if git diff-index --quiet HEAD; then
+             echo "No changes to commit"
+           else
+             git commit -m "Update version.py with new version"
+           fi
+         id: commit_changes
+
+       - name: Push changes
+         if: steps.commit_changes.outcome == 'success'
+         env:
+           GITHUB_TOKEN: ${{ secrets.RELEASE_TOKEN }}
+         run: |
+           git push origin HEAD:master
+
+   build:
+     needs: [ update-version ]
+     runs-on: ubuntu-latest
+     strategy:
+       fail-fast: false
+       matrix:
+         python-version: ["3.10"]
+
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+         with:
+           ref: master
+           fetch-depth: 0
+
+       - name: Verify version.py
+         run: |
+           ls -l magic_pdf/libs/version.py
+           cat magic_pdf/libs/version.py
+
+       - name: Set up Python ${{ matrix.python-version }}
+         uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+
+       - name: Install wheel
+         run: |
+           python -m pip install wheel
+
+       - name: Build wheel
+         run: |
+           python setup.py bdist_wheel
+
+       - name: Upload artifact
+         uses: actions/upload-artifact@v4
+         with:
+           name: wheel-file
+           path: dist/*.whl
+           retention-days: 30
+
+   release:
+     needs: [ build ]
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+
+       - name: Download artifact
+         uses: actions/download-artifact@v4
+         with:
+           name: wheel-file
+           path: dist
+
+       - name: Create and Upload Release
+         id: create_release
+         uses: softprops/action-gh-release@4634c16e79c963813287e889244c50009e7f0981
+         with:
+           files: './dist/*.whl'
+         env:
+           GITHUB_TOKEN: ${{ secrets.RELEASE_TOKEN }}
+
+       - name: Publish distribution to PyPI
+         run: |
+           pip install twine
+           twine upload dist/* -u __token__ -p ${{ secrets.PYPI_TOKEN }}
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: check-status
2
+
3
+ on:
4
+ workflow_run:
5
+ workflows: [ci]
6
+ types: [completed]
7
+
8
+ jobs:
9
+ on-failure:
10
+ runs-on: pdf
11
+ permissions:
12
+ actions: write
13
+ if: ${{ (github.event.workflow_run.head_branch == 'master') && github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.run_attempt < 3 }}
14
+ steps:
15
+ - run: |
16
+ echo 'The triggering workflow failed'
17
+ sleep 600
18
+ curl -L \
19
+ -X POST \
20
+ -H "Accept: application/vnd.github+json" \
21
+ -H "Authorization: Bearer ${{ github.token }}" \
22
+ -H "X-GitHub-Api-Version: 2022-11-28" \
23
+ https://api.github.com/repos/${{ github.repository }}/actions/runs/${{ github.event.workflow_run.id }}/rerun-failed-jobs
.github/workflows/update_base.yml ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # This workflow will install Python dependencies, run tests and lint with a variety of Python versions
2
+ # For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
3
+
4
+ name: update-base
5
+ on:
6
+ push:
7
+ tags:
8
+ - '*released'
9
+ workflow_dispatch:
10
+ jobs:
11
+ pdf-test:
12
+ runs-on: pdf
13
+ timeout-minutes: 40
14
+
15
+
16
+ steps:
17
+ - name: update-base
18
+ uses: actions/checkout@v3
19
+ - name: start-update
20
+ run: |
21
+ echo "start test"
22
+
.gitignore ADDED
@@ -0,0 +1,37 @@
+ *.tar
+ *.tar.gz
+ venv*/
+ envs/
+ slurm_logs/
+
+ sync1.sh
+ data_preprocess_pj1
+ data-preparation1
+ __pycache__
+ *.log
+ *.pyc
+ .vscode
+ debug/
+ *.ipynb
+ .idea
+
+ # vscode history
+ .history
+
+ .DS_Store
+ .env
+
+ bad_words/
+ bak/
+
+ app/tests/*
+ temp/
+ tmp/
+ tmp
+ .vscode
+ .vscode/
+ /tests/
+ ocr_demo
+
+ /app/common/__init__.py
+ /magic_pdf/config/__init__.py
LICENSE.md ADDED
@@ -0,0 +1,661 @@
+ GNU AFFERO GENERAL PUBLIC LICENSE
+ Version 3, 19 November 2007
+
+ Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The GNU Affero General Public License is a free, copyleft license for
+ software and other kinds of works, specifically designed to ensure
+ cooperation with the community in the case of network server software.
+
+ The licenses for most software and other practical works are designed
+ to take away your freedom to share and change the works. By contrast,
+ our General Public Licenses are intended to guarantee your freedom to
+ share and change all versions of a program--to make sure it remains free
+ software for all its users.
+
+ When we speak of free software, we are referring to freedom, not
+ price. Our General Public Licenses are designed to make sure that you
+ have the freedom to distribute copies of free software (and charge for
+ them if you wish), that you receive source code or can get it if you
+ want it, that you can change the software or use pieces of it in new
+ free programs, and that you know you can do these things.
+
+ Developers that use our General Public Licenses protect your rights
+ with two steps: (1) assert copyright on the software, and (2) offer
+ you this License which gives you legal permission to copy, distribute
+ and/or modify the software.
+
+ A secondary benefit of defending all users' freedom is that
+ improvements made in alternate versions of the program, if they
+ receive widespread use, become available for other developers to
+ incorporate. Many developers of free software are heartened and
+ encouraged by the resulting cooperation. However, in the case of
+ software used on network servers, this result may fail to come about.
+ The GNU General Public License permits making a modified version and
+ letting the public access it on a server without ever releasing its
+ source code to the public.
+
+ The GNU Affero General Public License is designed specifically to
+ ensure that, in such cases, the modified source code becomes available
+ to the community. It requires the operator of a network server to
+ provide the source code of the modified version running there to the
+ users of that server. Therefore, public use of a modified version, on
+ a publicly accessible server, gives the public access to the source
+ code of the modified version.
+
+ An older license, called the Affero General Public License and
+ published by Affero, was designed to accomplish similar goals. This is
+ a different license, not a version of the Affero GPL, but Affero has
+ released a new version of the Affero GPL which permits relicensing under
+ this license.
+
+ The precise terms and conditions for copying, distribution and
+ modification follow.
+
+ TERMS AND CONDITIONS
+
+ 0. Definitions.
+
+ "This License" refers to version 3 of the GNU Affero General Public License.
+
+ "Copyright" also means copyright-like laws that apply to other kinds of
+ works, such as semiconductor masks.
+
+ "The Program" refers to any copyrightable work licensed under this
+ License. Each licensee is addressed as "you". "Licensees" and
+ "recipients" may be individuals or organizations.
+
+ To "modify" a work means to copy from or adapt all or part of the work
+ in a fashion requiring copyright permission, other than the making of an
+ exact copy. The resulting work is called a "modified version" of the
+ earlier work or a work "based on" the earlier work.
+
+ A "covered work" means either the unmodified Program or a work based
+ on the Program.
+
+ To "propagate" a work means to do anything with it that, without
+ permission, would make you directly or secondarily liable for
+ infringement under applicable copyright law, except executing it on a
+ computer or modifying a private copy. Propagation includes copying,
+ distribution (with or without modification), making available to the
+ public, and in some countries other activities as well.
+
+ To "convey" a work means any kind of propagation that enables other
+ parties to make or receive copies. Mere interaction with a user through
+ a computer network, with no transfer of a copy, is not conveying.
+
+ An interactive user interface displays "Appropriate Legal Notices"
+ to the extent that it includes a convenient and prominently visible
+ feature that (1) displays an appropriate copyright notice, and (2)
+ tells the user that there is no warranty for the work (except to the
+ extent that warranties are provided), that licensees may convey the
+ work under this License, and how to view a copy of this License. If
+ the interface presents a list of user commands or options, such as a
+ menu, a prominent item in the list meets this criterion.
+
+ 1. Source Code.
+
+ The "source code" for a work means the preferred form of the work
+ for making modifications to it. "Object code" means any non-source
+ form of a work.
+
+ A "Standard Interface" means an interface that either is an official
+ standard defined by a recognized standards body, or, in the case of
+ interfaces specified for a particular programming language, one that
+ is widely used among developers working in that language.
+
+ The "System Libraries" of an executable work include anything, other
+ than the work as a whole, that (a) is included in the normal form of
+ packaging a Major Component, but which is not part of that Major
+ Component, and (b) serves only to enable use of the work with that
+ Major Component, or to implement a Standard Interface for which an
+ implementation is available to the public in source code form. A
+ "Major Component", in this context, means a major essential component
+ (kernel, window system, and so on) of the specific operating system
+ (if any) on which the executable work runs, or a compiler used to
+ produce the work, or an object code interpreter used to run it.
+
+ The "Corresponding Source" for a work in object code form means all
+ the source code needed to generate, install, and (for an executable
+ work) run the object code and to modify the work, including scripts to
+ control those activities. However, it does not include the work's
+ System Libraries, or general-purpose tools or generally available free
+ programs which are used unmodified in performing those activities but
+ which are not part of the work. For example, Corresponding Source
+ includes interface definition files associated with source files for
+ the work, and the source code for shared libraries and dynamically
+ linked subprograms that the work is specifically designed to require,
+ such as by intimate data communication or control flow between those
+ subprograms and other parts of the work.
+
+ The Corresponding Source need not include anything that users
+ can regenerate automatically from other parts of the Corresponding
+ Source.
+
+ The Corresponding Source for a work in source code form is that
+ same work.
+
+ 2. Basic Permissions.
+
+ All rights granted under this License are granted for the term of
+ copyright on the Program, and are irrevocable provided the stated
+ conditions are met. This License explicitly affirms your unlimited
+ permission to run the unmodified Program. The output from running a
+ covered work is covered by this License only if the output, given its
+ content, constitutes a covered work. This License acknowledges your
+ rights of fair use or other equivalent, as provided by copyright law.
+
+ You may make, run and propagate covered works that you do not
+ convey, without conditions so long as your license otherwise remains
+ in force. You may convey covered works to others for the sole purpose
+ of having them make modifications exclusively for you, or provide you
+ with facilities for running those works, provided that you comply with
+ the terms of this License in conveying all material for which you do
+ not control copyright. Those thus making or running the covered works
+ for you must do so exclusively on your behalf, under your direction
+ and control, on terms that prohibit them from making any copies of
+ your copyrighted material outside their relationship with you.
+
+ Conveying under any other circumstances is permitted solely under
+ the conditions stated below. Sublicensing is not allowed; section 10
+ makes it unnecessary.
+
+ 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
+
+ No covered work shall be deemed part of an effective technological
+ measure under any applicable law fulfilling obligations under article
+ 11 of the WIPO copyright treaty adopted on 20 December 1996, or
+ similar laws prohibiting or restricting circumvention of such
+ measures.
+
+ When you convey a covered work, you waive any legal power to forbid
+ circumvention of technological measures to the extent such circumvention
+ is effected by exercising rights under this License with respect to
+ the covered work, and you disclaim any intention to limit operation or
+ modification of the work as a means of enforcing, against the work's
+ users, your or third parties' legal rights to forbid circumvention of
+ technological measures.
+
+ 4. Conveying Verbatim Copies.
+
+ You may convey verbatim copies of the Program's source code as you
+ receive it, in any medium, provided that you conspicuously and
+ appropriately publish on each copy an appropriate copyright notice;
+ keep intact all notices stating that this License and any
+ non-permissive terms added in accord with section 7 apply to the code;
+ keep intact all notices of the absence of any warranty; and give all
+ recipients a copy of this License along with the Program.
+
+ You may charge any price or no price for each copy that you convey,
+ and you may offer support or warranty protection for a fee.
+
+ 5. Conveying Modified Source Versions.
+
+ You may convey a work based on the Program, or the modifications to
+ produce it from the Program, in the form of source code under the
+ terms of section 4, provided that you also meet all of these conditions:
+
+ a) The work must carry prominent notices stating that you modified
+ it, and giving a relevant date.
+
+ b) The work must carry prominent notices stating that it is
+ released under this License and any conditions added under section
+ 7. This requirement modifies the requirement in section 4 to
+ "keep intact all notices".
+
+ c) You must license the entire work, as a whole, under this
+ License to anyone who comes into possession of a copy. This
+ License will therefore apply, along with any applicable section 7
+ additional terms, to the whole of the work, and all its parts,
+ regardless of how they are packaged. This License gives no
+ permission to license the work in any other way, but it does not
+ invalidate such permission if you have separately received it.
+
+ d) If the work has interactive user interfaces, each must display
+ Appropriate Legal Notices; however, if the Program has interactive
+ interfaces that do not display Appropriate Legal Notices, your
+ work need not make them do so.
+
+ A compilation of a covered work with other separate and independent
+ works, which are not by their nature extensions of the covered work,
+ and which are not combined with it such as to form a larger program,
+ in or on a volume of a storage or distribution medium, is called an
+ "aggregate" if the compilation and its resulting copyright are not
+ used to limit the access or legal rights of the compilation's users
+ beyond what the individual works permit. Inclusion of a covered work
+ in an aggregate does not cause this License to apply to the other
+ parts of the aggregate.
+
+ 6. Conveying Non-Source Forms.
+
+ You may convey a covered work in object code form under the terms
+ of sections 4 and 5, provided that you also convey the
+ machine-readable Corresponding Source under the terms of this License,
+ in one of these ways:
+
+ a) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by the
+ Corresponding Source fixed on a durable physical medium
+ customarily used for software interchange.
+
+ b) Convey the object code in, or embodied in, a physical product
+ (including a physical distribution medium), accompanied by a
+ written offer, valid for at least three years and valid for as
+ long as you offer spare parts or customer support for that product
+ model, to give anyone who possesses the object code either (1) a
+ copy of the Corresponding Source for all the software in the
+ product that is covered by this License, on a durable physical
+ medium customarily used for software interchange, for a price no
+ more than your reasonable cost of physically performing this
+ conveying of source, or (2) access to copy the
+ Corresponding Source from a network server at no charge.
+
+ c) Convey individual copies of the object code with a copy of the
+ written offer to provide the Corresponding Source. This
+ alternative is allowed only occasionally and noncommercially, and
+ only if you received the object code with such an offer, in accord
+ with subsection 6b.
+
+ d) Convey the object code by offering access from a designated
+ place (gratis or for a charge), and offer equivalent access to the
+ Corresponding Source in the same way through the same place at no
+ further charge. You need not require recipients to copy the
+ Corresponding Source along with the object code. If the place to
+ copy the object code is a network server, the Corresponding Source
+ may be on a different server (operated by you or a third party)
+ that supports equivalent copying facilities, provided you maintain
+ clear directions next to the object code saying where to find the
+ Corresponding Source. Regardless of what server hosts the
+ Corresponding Source, you remain obligated to ensure that it is
+ available for as long as needed to satisfy these requirements.
+
+ e) Convey the object code using peer-to-peer transmission, provided
+ you inform other peers where the object code and Corresponding
+ Source of the work are being offered to the general public at no
+ charge under subsection 6d.
+
+ A separable portion of the object code, whose source code is excluded
+ from the Corresponding Source as a System Library, need not be
+ included in conveying the object code work.
+
+ A "User Product" is either (1) a "consumer product", which means any
+ tangible personal property which is normally used for personal, family,
+ or household purposes, or (2) anything designed or sold for incorporation
+ into a dwelling. In determining whether a product is a consumer product,
+ doubtful cases shall be resolved in favor of coverage. For a particular
+ product received by a particular user, "normally used" refers to a
+ typical or common use of that class of product, regardless of the status
+ of the particular user or of the way in which the particular user
+ actually uses, or expects or is expected to use, the product. A product
+ is a consumer product regardless of whether the product has substantial
+ commercial, industrial or non-consumer uses, unless such uses represent
+ the only significant mode of use of the product.
+
+ "Installation Information" for a User Product means any methods,
+ procedures, authorization keys, or other information required to install
+ and execute modified versions of a covered work in that User Product from
+ a modified version of its Corresponding Source. The information must
+ suffice to ensure that the continued functioning of the modified object
+ code is in no case prevented or interfered with solely because
+ modification has been made.
+
+ If you convey an object code work under this section in, or with, or
+ specifically for use in, a User Product, and the conveying occurs as
+ part of a transaction in which the right of possession and use of the
+ User Product is transferred to the recipient in perpetuity or for a
+ fixed term (regardless of how the transaction is characterized), the
+ Corresponding Source conveyed under this section must be accompanied
+ by the Installation Information. But this requirement does not apply
+ if neither you nor any third party retains the ability to install
+ modified object code on the User Product (for example, the work has
+ been installed in ROM).
+
+ The requirement to provide Installation Information does not include a
+ requirement to continue to provide support service, warranty, or updates
+ for a work that has been modified or installed by the recipient, or for
+ the User Product in which it has been modified or installed. Access to a
+ network may be denied when the modification itself materially and
+ adversely affects the operation of the network or violates the rules and
+ protocols for communication across the network.
+
+ Corresponding Source conveyed, and Installation Information provided,
+ in accord with this section must be in a format that is publicly
+ documented (and with an implementation available to the public in
+ source code form), and must require no special password or key for
+ unpacking, reading or copying.
+
+ 7. Additional Terms.
+
+ "Additional permissions" are terms that supplement the terms of this
+ License by making exceptions from one or more of its conditions.
+ Additional permissions that are applicable to the entire Program shall
+ be treated as though they were included in this License, to the extent
+ that they are valid under applicable law. If additional permissions
+ apply only to part of the Program, that part may be used separately
+ under those permissions, but the entire Program remains governed by
+ this License without regard to the additional permissions.
+
+ When you convey a copy of a covered work, you may at your option
+ remove any additional permissions from that copy, or from any part of
+ it. (Additional permissions may be written to require their own
+ removal in certain cases when you modify the work.) You may place
+ additional permissions on material, added by you to a covered work,
+ for which you have or can give appropriate copyright permission.
+
+ Notwithstanding any other provision of this License, for material you
+ add to a covered work, you may (if authorized by the copyright holders of
+ that material) supplement the terms of this License with terms:
+
+ a) Disclaiming warranty or limiting liability differently from the
+ terms of sections 15 and 16 of this License; or
+
+ b) Requiring preservation of specified reasonable legal notices or
+ author attributions in that material or in the Appropriate Legal
+ Notices displayed by works containing it; or
+
+ c) Prohibiting misrepresentation of the origin of that material, or
+ requiring that modified versions of such material be marked in
+ reasonable ways as different from the original version; or
+
+ d) Limiting the use for publicity purposes of names of licensors or
+ authors of the material; or
+
+ e) Declining to grant rights under trademark law for use of some
+ trade names, trademarks, or service marks; or
+
+ f) Requiring indemnification of licensors and authors of that
+ material by anyone who conveys the material (or modified versions of
+ it) with contractual assumptions of liability to the recipient, for
+ any liability that these contractual assumptions directly impose on
+ those licensors and authors.
+
+ All other non-permissive additional terms are considered "further
+ restrictions" within the meaning of section 10. If the Program as you
+ received it, or any part of it, contains a notice stating that it is
+ governed by this License along with a term that is a further
+ restriction, you may remove that term. If a license document contains
+ a further restriction but permits relicensing or conveying under this
+ License, you may add to a covered work material governed by the terms
+ of that license document, provided that the further restriction does
+ not survive such relicensing or conveying.
+
+ If you add terms to a covered work in accord with this section, you
+ must place, in the relevant source files, a statement of the
+ additional terms that apply to those files, or a notice indicating
+ where to find the applicable terms.
+
+ Additional terms, permissive or non-permissive, may be stated in the
+ form of a separately written license, or stated as exceptions;
+ the above requirements apply either way.
+
+ 8. Termination.
+
+ You may not propagate or modify a covered work except as expressly
+ provided under this License. Any attempt otherwise to propagate or
+ modify it is void, and will automatically terminate your rights under
+ this License (including any patent licenses granted under the third
+ paragraph of section 11).
+
+ However, if you cease all violation of this License, then your
+ license from a particular copyright holder is reinstated (a)
+ provisionally, unless and until the copyright holder explicitly and
+ finally terminates your license, and (b) permanently, if the copyright
+ holder fails to notify you of the violation by some reasonable means
+ prior to 60 days after the cessation.
+
+ Moreover, your license from a particular copyright holder is
+ reinstated permanently if the copyright holder notifies you of the
+ violation by some reasonable means, this is the first time you have
+ received notice of violation of this License (for any work) from that
+ copyright holder, and you cure the violation prior to 30 days after
+ your receipt of the notice.
+
+ Termination of your rights under this section does not terminate the
+ licenses of parties who have received copies or rights from you under
+ this License. If your rights have been terminated and not permanently
+ reinstated, you do not qualify to receive new licenses for the same
+ material under section 10.
+
+ 9. Acceptance Not Required for Having Copies.
+
+ You are not required to accept this License in order to receive or
+ run a copy of the Program. Ancillary propagation of a covered work
+ occurring solely as a consequence of using peer-to-peer transmission
+ to receive a copy likewise does not require acceptance. However,
+ nothing other than this License grants you permission to propagate or
+ modify any covered work. These actions infringe copyright if you do
+ not accept this License. Therefore, by modifying or propagating a
+ covered work, you indicate your acceptance of this License to do so.
+
+ 10. Automatic Licensing of Downstream Recipients.
+
+ Each time you convey a covered work, the recipient automatically
+ receives a license from the original licensors, to run, modify and
+ propagate that work, subject to this License. You are not responsible
+ for enforcing compliance by third parties with this License.
+
+ An "entity transaction" is a transaction transferring control of an
+ organization, or substantially all assets of one, or subdividing an
+ organization, or merging organizations. If propagation of a covered
+ work results from an entity transaction, each party to that
+ transaction who receives a copy of the work also receives whatever
+ licenses to the work the party's predecessor in interest had or could
+ give under the previous paragraph, plus a right to possession of the
+ Corresponding Source of the work from the predecessor in interest, if
+ the predecessor has it or can get it with reasonable efforts.
+
+ You may not impose any further restrictions on the exercise of the
+ rights granted or affirmed under this License. For example, you may
+ not impose a license fee, royalty, or other charge for exercise of
+ rights granted under this License, and you may not initiate litigation
+ (including a cross-claim or counterclaim in a lawsuit) alleging that
+ any patent claim is infringed by making, using, selling, offering for
+ sale, or importing the Program or any portion of it.
+
+ 11. Patents.
+
+ A "contributor" is a copyright holder who authorizes use under this
+ License of the Program or a work on which the Program is based. The
+ work thus licensed is called the contributor's "contributor version".
+
+ A contributor's "essential patent claims" are all patent claims
+ owned or controlled by the contributor, whether already acquired or
+ hereafter acquired, that would be infringed by some manner, permitted
+ by this License, of making, using, or selling its contributor version,
+ but do not include claims that would be infringed only as a
+ consequence of further modification of the contributor version. For
+ purposes of this definition, "control" includes the right to grant
+ patent sublicenses in a manner consistent with the requirements of
+ this License.
+
+ Each contributor grants you a non-exclusive, worldwide, royalty-free
+ patent license under the contributor's essential patent claims, to
+ make, use, sell, offer for sale, import and otherwise run, modify and
+ propagate the contents of its contributor version.
+
+ In the following three paragraphs, a "patent license" is any express
+ agreement or commitment, however denominated, not to enforce a patent
+ (such as an express permission to practice a patent or covenant not to
+ sue for patent infringement). To "grant" such a patent license to a
+ party means to make such an agreement or commitment not to enforce a
+ patent against the party.
+
+ If you convey a covered work, knowingly relying on a patent license,
+ and the Corresponding Source of the work is not available for anyone
+ to copy, free of charge and under the terms of this License, through a
+ publicly available network server or other readily accessible means,
+ then you must either (1) cause the Corresponding Source to be so
+ available, or (2) arrange to deprive yourself of the benefit of the
+ patent license for this particular work, or (3) arrange, in a manner
+ consistent with the requirements of this License, to extend the patent
+ license to downstream recipients. "Knowingly relying" means you have
+ actual knowledge that, but for the patent license, your conveying the
+ covered work in a country, or your recipient's use of the covered work
+ in a country, would infringe one or more identifiable patents in that
+ country that you have reason to believe are valid.
+
+ If, pursuant to or in connection with a single transaction or
+ arrangement, you convey, or propagate by procuring conveyance of, a
+ covered work, and grant a patent license to some of the parties
+ receiving the covered work authorizing them to use, propagate, modify
+ or convey a specific copy of the covered work, then the patent license
+ you grant is automatically extended to all recipients of the covered
+ work and works based on it.
+
+ A patent license is "discriminatory" if it does not include within
+ the scope of its coverage, prohibits the exercise of, or is
+ conditioned on the non-exercise of one or more of the rights that are
+ specifically granted under this License. You may not convey a covered
+ work if you are a party to an arrangement with a third party that is
+ in the business of distributing software, under which you make payment
+ to the third party based on the extent of your activity of conveying
+ the work, and under which the third party grants, to any of the
+ parties who would receive the covered work from you, a discriminatory
+ patent license (a) in connection with copies of the covered work
+ conveyed by you (or copies made from those copies), or (b) primarily
+ for and in connection with specific products or compilations that
+ contain the covered work, unless you entered into that arrangement,
+ or that patent license was granted, prior to 28 March 2007.
+
+ Nothing in this License shall be construed as excluding or limiting
+ any implied license or other defenses to infringement that may
+ otherwise be available to you under applicable patent law.
+
+ 12. No Surrender of Others' Freedom.
+
+ If conditions are imposed on you (whether by court order, agreement or
+ otherwise) that contradict the conditions of this License, they do not
+ excuse you from the conditions of this License. If you cannot convey a
+ covered work so as to satisfy simultaneously your obligations under this
+ License and any other pertinent obligations, then as a consequence you may
+ not convey it at all. For example, if you agree to terms that obligate you
+ to collect a royalty for further conveying from those to whom you convey
+ the Program, the only way you could satisfy both those terms and this
+ License would be to refrain entirely from conveying the Program.
+
+ 13. Remote Network Interaction; Use with the GNU General Public License.
+
+ Notwithstanding any other provision of this License, if you modify the
+ Program, your modified version must prominently offer all users
+ interacting with it remotely through a computer network (if your version
+ supports such interaction) an opportunity to receive the Corresponding
+ Source of your version by providing access to the Corresponding Source
+ from a network server at no charge, through some standard or customary
+ means of facilitating copying of software. This Corresponding Source
+ shall include the Corresponding Source for any work covered by version 3
+ of the GNU General Public License that is incorporated pursuant to the
+ following paragraph.
+
+ Notwithstanding any other provision of this License, you have
+ permission to link or combine any covered work with a work licensed
+ under version 3 of the GNU General Public License into a single
+ combined work, and to convey the resulting work. The terms of this
+ License will continue to apply to the part which is the covered work,
+ but the work with which it is combined will remain governed by version
+ 3 of the GNU General Public License.
+
+ 14. Revised Versions of this License.
+
+ The Free Software Foundation may publish revised and/or new versions of
+ the GNU Affero General Public License from time to time. Such new versions
+ will be similar in spirit to the present version, but may differ in detail to
+ address new problems or concerns.
+
+ Each version is given a distinguishing version number. If the
+ Program specifies that a certain numbered version of the GNU Affero General
+ Public License "or any later version" applies to it, you have the
+ option of following the terms and conditions either of that numbered
+ version or of any later version published by the Free Software
+ Foundation. If the Program does not specify a version number of the
+ GNU Affero General Public License, you may choose any version ever published
+ by the Free Software Foundation.
+
+ If the Program specifies that a proxy can decide which future
+ versions of the GNU Affero General Public License can be used, that proxy's
+ public statement of acceptance of a version permanently authorizes you
+ to choose that version for the Program.
+
+ Later license versions may give you additional or different
+ permissions. However, no additional obligations are imposed on any
+ author or copyright holder as a result of your choosing to follow a
+ later version.
+
+ 15. Disclaimer of Warranty.
+
+ THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
+ APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
+ HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
+ OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
+ THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
+ IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
+ ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. Limitation of Liability.
+
+ IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+ WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
+ THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
+ GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
+ USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
+ DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
+ PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
+ EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
+ SUCH DAMAGES.
+
+ 17. Interpretation of Sections 15 and 16.
+
+ If the disclaimer of warranty and limitation of liability provided
+ above cannot be given local legal effect according to their terms,
+ reviewing courts shall apply local law that most closely approximates
+ an absolute waiver of all civil liability in connection with the
+ Program, unless a warranty or assumption of liability accompanies a
+ copy of the Program in return for a fee.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+ possible use to the public, the best way to achieve this is to make it
+ free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+ to attach them to the start of each source file to most effectively
+ state the exclusion of warranty; and each file should have at least
+ the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU Affero General Public License as published
+ by the Free Software Foundation, either version 3 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU Affero General Public License for more details.
+
+ You should have received a copy of the GNU Affero General Public License
+ along with this program. If not, see <https://www.gnu.org/licenses/>.
+
+ Also add information on how to contact you by electronic and paper mail.
+
+ If your software can interact with users remotely through a computer
+ network, you should also make sure that it provides a way for users to
+ get its source. For example, if your program is a web application, its
+ interface could display a "Source" link that leads users to an archive
+ of the code. There are many ways you could offer source, and different
+ solutions will be better for different programs; see section 13 for the
+ specific requirements.
+
+ You should also get your employer (if you work as a programmer) or school,
+ if any, to sign a "copyright disclaimer" for the program, if necessary.
+ For more information on this, and how to apply and follow the GNU AGPL, see
+ <https://www.gnu.org/licenses/>.
README.md CHANGED
@@ -1,12 +1,292 @@
  ---
  title: MinerU
- emoji: 📈
- colorFrom: red
- colorTo: pink
+ app_file: ./demo/app.py
  sdk: gradio
  sdk_version: 4.39.0
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
7
+ <div id="top"></div>
8
+ <div align="center">
9
 
10
+ [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
11
+ [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
12
+ [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
13
+ [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
14
+ [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
15
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
16
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
17
+
18
+
19
+
20
+
21
+ [English](README.md) | [简体中文](README_zh-CN.md)
22
+
23
+ </div>
24
+
25
+ <div align="center">
26
+
27
+ </div>
28
+
29
+ # MinerU
30
+
31
+
32
+ ## Introduction
33
+
34
+ MinerU is a one-stop, open-source, high-quality data extraction tool, includes the following primary features:
35
+
36
+ - [Magic-PDF](#Magic-PDF) PDF Document Extraction
37
+ - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
38
+
39
+
40
+ # Magic-PDF
41
+
42
+
43
+ ## Introduction
44
+
45
+ Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on S3-compatible object storage.
46
+
47
+ Key features include:
48
+
49
+ - Support for multiple front-end model inputs
50
+ - Removal of headers, footers, footnotes, and page numbers
51
+ - Human-readable layout formatting
52
+ - Retention of the original document's structure and formatting, including headings, paragraphs, lists, and more
53
+ - Extraction and display of images and tables within markdown
54
+ - Conversion of equations into LaTeX format
55
+ - Automatic detection and conversion of garbled PDFs
56
+ - Compatibility with CPU and GPU environments
57
+ - Available for Windows, Linux, and macOS platforms
58
+
59
+
60
+ https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070
61
+
62
+
63
+
64
+ ## Project Panorama
65
+
66
+ ![Project Panorama](docs/images/project_panorama_en.png)
67
+
68
+
69
+ ## Flowchart
70
+
71
+ ![Flowchart](docs/images/flowchart_en.png)
72
+
73
+ ### Dependency repositories
74
+
75
+ - [PDF-Extract-Kit : A Comprehensive Toolkit for High-Quality PDF Content Extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
76
+
77
+ ## Getting Started
78
+
79
+ ### Requirements
80
+
81
+ - Python >= 3.9
82
+
83
+ Using a virtual environment is recommended to avoid potential dependency conflicts; both venv and conda are suitable.
84
+ For example:
85
+ ```bash
86
+ conda create -n MinerU python=3.10
87
+ conda activate MinerU
88
+ ```
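+ Development is based on Python 3.10; if you run into problems on another Python version, switch to 3.10.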
89
+
90
+ ### Installation and Configuration
91
+
92
+ #### 1. Install Magic-PDF
93
+
94
+ Install the full-feature package with pip:
95
+ >Note: The pip-installed package is CPU-only and is best suited for quick tests.
96
+ >
97
+ >For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
98
+
99
+ ```bash
100
+ pip install magic-pdf[full-cpu]
101
+ ```
102
+ The full-feature package depends on detectron2, which must be compiled during installation.
103
+ If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
104
+ Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
105
+
106
+ ```bash
107
+ pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
108
+ ```
109
+
110
+
111
+ #### 2. Downloading model weights files
112
+
113
+ For detailed instructions, see [how_to_download_models](docs/how_to_download_models_en.md).
114
+
115
+ After downloading the model weights, move the 'models' directory to a location with ample disk space, preferably on an SSD.
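+
+ For example (hypothetical paths; adjust them to your own machine), you can move the directory and capture the absolute path to use as "models-dir" in the next step:
+ ```bash
+ mv ./PDF-Extract-Kit/models /mnt/data/models   # assuming the weights were cloned via PDF-Extract-Kit
+ cd /mnt/data/models && pwd                     # "models-dir" must be an absolute path like this
+ ```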
116
+
117
+
118
+ #### 3. Copy the Configuration File and Make Configurations
119
+ You can get the [magic-pdf.template.json](magic-pdf.template.json) file in the repository root directory.
120
+ ```bash
121
+ cp magic-pdf.template.json ~/magic-pdf.json
122
+ ```
123
+ In magic-pdf.json, configure "models-dir" to point to the directory containing the model weight files.
124
+
125
+ ```json
126
+ {
127
+ "models-dir": "/tmp/models"
128
+ }
129
+ ```
130
+
131
+
132
+ #### 4. Acceleration Using CUDA or MPS
133
+ If you have an NVIDIA GPU available, or are using a Mac with Apple Silicon, you can use CUDA or MPS acceleration, respectively.
134
+ ##### CUDA
135
+
136
+ You need to install the corresponding PyTorch version according to your CUDA version.
137
+ This example installs a CUDA 11.8 build. For more information, see https://pytorch.org/get-started/locally/
138
+ ```bash
139
+ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
140
+ ```
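+
+ As an optional sanity check (not part of the official steps), you can confirm that the CUDA build of PyTorch is active:
+ ```bash
+ python -c "import torch; print(torch.cuda.is_available())"  # should print True
+ ```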
141
+ Also, you need to modify the value of "device-mode" in the configuration file magic-pdf.json.
142
+ ```json
143
+ {
144
+ "device-mode":"cuda"
145
+ }
146
+ ```
147
+
148
+ ##### MPS
149
+
150
+ On macOS devices with M-series chips, you can use MPS for inference acceleration.
151
+ You also need to modify the value of "device-mode" in the configuration file magic-pdf.json.
152
+ ```json
153
+ {
154
+ "device-mode":"mps"
155
+ }
156
+ ```
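+
+ As with CUDA, here is an optional check that PyTorch can see the MPS backend:
+ ```bash
+ python -c "import torch; print(torch.backends.mps.is_available())"  # should print True
+ ```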
157
+
158
+
159
+ ### Usage
160
+
161
+ #### 1. Usage via Command Line
162
+
163
+ ###### Simple
164
+
165
+ ```bash
166
+ magic-pdf pdf-command --pdf "pdf_path" --inside_model true
167
+ ```
168
+ After the program finishes, you can find the generated markdown files under "/tmp/magic-pdf".
169
+ The corresponding xxx_model.json file is located in the same directory as the markdown.
170
+ If you intend to do secondary development on the post-processing pipeline, you can use the command:
171
+ ```bash
172
+ magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
173
+ ```
174
+ This way, you won't need to re-run the model inference, making debugging more convenient.
175
+
176
+
177
+ ###### More
178
+
179
+ ```bash
180
+ magic-pdf --help
181
+ ```
182
+
183
+
184
+ #### 2. Usage via API
185
+
186
+ ###### Local
187
+ ```python
188
+ import os
+ from magic_pdf.pipe.UNIPipe import UNIPipe
+ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+
+ image_writer = DiskReaderWriter(local_image_dir)
189
+ image_dir = str(os.path.basename(local_image_dir))
190
+ jso_useful_key = {"_pdf_type": "", "model_list": []}
191
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
192
+ pipe.pipe_classify()
193
+ pipe.pipe_parse()
194
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
195
+ ```
196
+
197
+ ###### Object Storage
198
+ ```python
199
+ from magic_pdf.pipe.UNIPipe import UNIPipe
+ from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
+
+ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
200
+ image_dir = "s3://img_bucket/"
201
+ s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
202
+ pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
203
+ jso_useful_key = {"_pdf_type": "", "model_list": []}
204
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
205
+ pipe.pipe_classify()
206
+ pipe.pipe_parse()
207
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
208
+ ```
209
+
210
+ For a complete demo, see [demo.py](demo/demo.py).
211
+
212
+
213
+ # Magic-Doc
214
+
215
+
216
+ ## Introduction
217
+
218
+ Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
219
+
220
+ Key features include:
221
+
222
+ - Web Page Extraction
223
+ - Cross-modal precise parsing of text, images, tables, and formula information.
224
+
225
+ - E-Book Document Extraction
226
+ - Supports various document formats, including EPUB and MOBI, with full adaptation for text and images.
227
+
228
+ - Language Type Identification
229
+ - Accurate recognition of 176 languages.
230
+
231
+ https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
232
+
233
+
234
+
235
+ https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
236
+
237
+
238
+
239
+ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
240
+
241
+
242
+
243
+
244
+ ## Project Repository
245
+
246
+ - [Magic-Doc](https://github.com/InternLM/magic-doc)
247
+ Outstanding Webpage and E-book Extraction Tool
248
+
249
+
250
+ # All Thanks To Our Contributors
251
+
252
+ <a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
253
+ <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
254
+ </a>
255
+
256
+
257
+ # License Information
258
+
259
+ [LICENSE.md](LICENSE.md)
260
+
261
+ The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
262
+
263
+
264
+ # Acknowledgments
265
+
266
+ - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
267
+ - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
268
+ - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
269
+ - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
270
+
271
+
272
+ # Citation
273
+
274
+ ```bibtex
275
+ @misc{2024mineru,
276
+ title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
277
+ author={MinerU Contributors},
278
+ howpublished = {\url{https://github.com/opendatalab/MinerU}},
279
+ year={2024}
280
+ }
281
+ ```
282
+
283
+
284
+ # Star History
285
+
286
+ <a>
287
+ <picture>
288
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
289
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
290
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
291
+ </picture>
292
+ </a>
README_zh-CN.md ADDED
@@ -0,0 +1,277 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <div id="top"></div>
2
+ <div align="center">
3
+
4
+ [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
5
+ [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
6
+ [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
7
+ [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
8
+ [![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
9
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
10
+ [![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
11
+
12
+ [English](README.md) | [简体中文](README_zh-CN.md)
13
+
14
+ </div>
15
+
16
+ <div align="center">
17
+
18
+ </div>
19
+
20
+ # MinerU
21
+
22
+
23
+ ## 简介
24
+
25
+ MinerU 是一款一站式、开源、高质量的数据提取工具,主要包含以下功能:
26
+
27
+ - [Magic-PDF](#Magic-PDF) PDF文档提取
28
+ - [Magic-Doc](#Magic-Doc) 网页与电子书提取
29
+
30
+ # Magic-PDF
31
+
32
+
33
+ ## 简介
34
+
35
+ Magic-PDF 是一款将 PDF 转化为 markdown 格式的工具。支持转换本地文档或者位于支持S3协议对象存储上的文件。
36
+
37
+ 主要功能包含
38
+
39
+ - 支持多种前端模型输入
40
+ - 删除页眉、页脚、脚注、页码等元素
41
+ - 符合人类阅读顺序的排版格式
42
+ - 保留原文档的结构和格式,包括标题、段落、列表等
43
+ - 提取图像和表格并在markdown中展示
44
+ - 将公式转换成latex
45
+ - 乱码PDF自动识别并转换
46
+ - 支持cpu和gpu环境
47
+ - 支持windows/linux/mac平台
48
+
49
+
50
+ https://github.com/opendatalab/MinerU/assets/11393164/618937cb-dc6a-4646-b433-e3131a5f4070
51
+
52
+
53
+
54
+ ## 项目全景
55
+
56
+ ![项目全景图](docs/images/project_panorama_zh_cn.png)
57
+
58
+ ## 流程图
59
+
60
+ ![流程图](docs/images/flowchart_zh_cn.png)
61
+
62
+ ### 子模块仓库
63
+
64
+ - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
65
+ - 高质量的PDF内容提取工具包
66
+
67
+ ## 上手指南
68
+
69
+ ### 配置要求
70
+
71
+ python >= 3.9
72
+
73
+ 推荐使用虚拟环境,以避免可能发生的依赖冲突,venv和conda均可使用。
74
+ 例如:
75
+ ```bash
76
+ conda create -n MinerU python=3.10
77
+ conda activate MinerU
78
+ ```
79
+ 开发基于python 3.10,如果在其他版本python出现问题请切换至3.10。
80
+
81
+ ### 安装配置
82
+
83
+ #### 1. 安装Magic-PDF
84
+
85
+ 使用pip安装完整功能包:
86
+ >受pypi限制,pip安装的完整功能包仅支持cpu推理,建议只用于快速测试解析能力。
87
+ >
88
+ >如需在生产环境使用CUDA/MPS加速请参考[使用CUDA或MPS加速推理](#4-使用CUDA或MPS加速推理)
89
+ ```bash
90
+ pip install magic-pdf[full-cpu]
91
+ ```
92
+ 完整功能包依赖detectron2,该库需要编译安装,如需自行编译,请参考 https://github.com/facebookresearch/detectron2/issues/5114
93
+ 或是直接使用我们预编译的whl包(仅限python 3.10):
94
+ ```bash
95
+ pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
96
+ ```
97
+
98
+ #### 2. 下载模型权重文件
99
+
100
+ 详细参考 [如何下载模型文件](docs/how_to_download_models_zh_cn.md)
101
+ 下载后请将models目录移动到空间较大的ssd磁盘目录
102
+
103
+ #### 3. 拷贝配置文件并进行配置
104
+ 在仓库根目录可以获得 [magic-pdf.template.json](magic-pdf.template.json) 文件
105
+ ```bash
106
+ cp magic-pdf.template.json ~/magic-pdf.json
107
+ ```
108
+ 在magic-pdf.json中配置"models-dir"为模型权重文件所在目录
109
+ ```json
110
+ {
111
+ "models-dir": "/tmp/models"
112
+ }
113
+ ```
114
+
115
+ #### 4. 使用CUDA或MPS加速推理
116
+ 如您有可用的Nvidia显卡或在使用Apple Silicon的Mac,可以使用CUDA或MPS进行加速
117
+ ##### CUDA
118
+
119
+ 需要根据自己的CUDA版本安装对应的pytorch版本
120
+ 以下是对应CUDA 11.8版本的安装命令,更多信息请参考 https://pytorch.org/get-started/locally/
121
+ ```bash
122
+ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
123
+ ```
124
+
125
+ 同时需要修改配置文件magic-pdf.json中"device-mode"的值
126
+ ```json
127
+ {
128
+ "device-mode":"cuda"
129
+ }
130
+ ```
131
+
132
+ ##### MPS
133
+ 使用macOS(M系列芯片设备)可以使用MPS进行推理加速
134
+ 需要修改配置文件magic-pdf.json中"device-mode"的值
135
+ ```json
136
+ {
137
+ "device-mode":"mps"
138
+ }
139
+ ```
140
+
141
+
142
+ ### 使用说明
143
+
144
+ #### 1. 通过命令行使用
145
+
146
+ ###### 直接使用
147
+
148
+ ```bash
149
+ magic-pdf pdf-command --pdf "pdf_path" --inside_model true
150
+ ```
151
+ 程序运行完成后,你可以在"/tmp/magic-pdf"目录下看到生成的markdown文件,markdown目录中可以找到对应的xxx_model.json文件
152
+ 如果您有意对后处理pipeline进行二次开发,可以使用命令
153
+ ```bash
154
+ magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
155
+ ```
156
+ 这样就不需要重跑模型数据,调试起来更方便
157
+
158
+ ###### 更多用法
159
+
160
+ ```bash
161
+ magic-pdf --help
162
+ ```
163
+
164
+
165
+ #### 2. 通过接口调用
166
+
167
+ ###### 本地使用
168
+ ```python
169
+ import os
+ from magic_pdf.pipe.UNIPipe import UNIPipe
+ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+
+ image_writer = DiskReaderWriter(local_image_dir)
170
+ image_dir = str(os.path.basename(local_image_dir))
171
+ jso_useful_key = {"_pdf_type": "", "model_list": model_json}
172
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
173
+ pipe.pipe_classify()
174
+ pipe.pipe_parse()
175
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
176
+ ```
177
+
178
+ ###### 在对象存储上使用
179
+ ```python
180
+ from magic_pdf.pipe.UNIPipe import UNIPipe
+ from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
+
+ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
181
+ image_dir = "s3://img_bucket/"
182
+ s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
183
+ pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
184
+ jso_useful_key = {"_pdf_type": "", "model_list": model_json}
185
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
186
+ pipe.pipe_classify()
187
+ pipe.pipe_parse()
188
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
189
+ ```
190
+
191
+ 详细实现可参考 [demo.py](demo/demo.py)
192
+
193
+
194
+ ### 常见问题处理解答
195
+
196
+ 参考 [FAQ](docs/FAQ_zh_cn.md)
197
+
198
+
199
+ # Magic-Doc
200
+
201
+
202
+ ## 简介
203
+
204
+ Magic-Doc 是一款支持将网页或多格式电子书转换为 markdown 格式的工具。
205
+
206
+ 主要功能包含
207
+
208
+ - Web网页提取
209
+ - 跨模态精准解析图文、表格、公式信息
210
+
211
+ - 电子书文献提取
212
+ - 支持 epub,mobi等多格式文献,文本图片全适配
213
+
214
+ - 语言类型鉴定
215
+ - 支持176种语言的准确识别
216
+
217
+ https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
218
+
219
+
220
+
221
+ https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
222
+
223
+
224
+
225
+ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
226
+
227
+
228
+
229
+
230
+ ## 项目仓库
231
+
232
+ - [Magic-Doc](https://github.com/InternLM/magic-doc)
233
+ 优秀的网页与电子书提取工具
234
+
235
+
236
+ ## 感谢我们的贡献者
237
+
238
+ <a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
239
+ <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
240
+ </a>
241
+
242
+
243
+ ## 版权说明
244
+
245
+ [LICENSE.md](LICENSE.md)
246
+
247
+ 本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
248
+
249
+
250
+ ## 致谢
251
+ - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
252
+ - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
253
+ - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
254
+ - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
255
+
256
+
257
+ # 引用
258
+
259
+ ```bibtex
260
+ @misc{2024mineru,
261
+ title={MinerU: A One-stop, Open-source, High-quality Data Extraction Tool},
262
+ author={MinerU Contributors},
263
+ howpublished = {\url{https://github.com/opendatalab/MinerU}},
264
+ year={2024}
265
+ }
266
+ ```
267
+
268
+
269
+ # Star History
270
+
271
+ <a>
272
+ <picture>
273
+ <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
274
+ <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
275
+ <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
276
+ </picture>
277
+ </a>
demo/app.py ADDED
@@ -0,0 +1,67 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ import gradio as gr
4
+ from loguru import logger
5
+ from magic_pdf.pipe.UNIPipe import UNIPipe
6
+ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
7
+ import magic_pdf.model as model_config
8
+
9
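+ # Use the built-in model analysis (pipe_analyze) so parsing works without an external model JSON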
+ model_config.__use_inside_model__ = True
10
+
11
+
12
+ def process_pdf(file_path):
13
+ try:
14
+ pdf_bytes = open(file_path, "rb").read()
15
+ model_json = [] # model_json传空list使用内置模型解析
16
+ jso_useful_key = {"_pdf_type": "", "model_list": model_json}
17
+ local_image_dir = os.path.join('uploads', 'images')
18
+ if not os.path.exists(local_image_dir):
19
+ os.makedirs(local_image_dir)
20
+ image_dir = str(os.path.basename(local_image_dir))
21
+ image_writer = DiskReaderWriter(local_image_dir)
22
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
23
+ pipe.pipe_classify()
24
+ if len(model_json) == 0:
25
+ if model_config.__use_inside_model__:
26
+ pipe.pipe_analyze()
27
+ else:
28
+ logger.error("need model list input")
29
+ return None
30
+ pipe.pipe_parse()
31
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
32
+ return md_content
33
+ except Exception as e:
34
+ logger.exception(e)
35
+ return None
36
+
37
+
38
+ def extract_markdown_from_pdf(pdf):
39
+ # 保存上传的PDF文件
40
+ os.makedirs('uploads', exist_ok=True)  # ensure the upload directory exists before writing
+ file_path = os.path.join('uploads', os.path.basename(pdf.name))  # basename in case the upload carries a full temp path
41
+ with open(file_path, 'wb') as f:
42
+ f.write(pdf.read())
43
+
44
+ # 处理PDF文件并生成Markdown内容
45
+ md_content = process_pdf(file_path)
46
+ return md_content
47
+
48
+
49
+ def main():
50
+ # 创建Gradio接口
51
+ with gr.Blocks() as demo:
52
+ gr.Markdown("# PDF to Markdown Converter")
53
+
54
+ with gr.Row():
55
+ with gr.Column():
56
+ pdf_file = gr.File(label="Upload PDF", file_types=['.pdf'])
57
+ md_output = gr.Markdown(label="Extracted Markdown")
58
+
59
+ extract_button = gr.Button("Extract Markdown")
60
+ extract_button.click(extract_markdown_from_pdf, inputs=[
61
+ pdf_file], outputs=[md_output])
62
+
63
+ demo.launch(share=True)
64
+
65
+
66
+ if __name__ == '__main__':
67
+ main()
demo/demo.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+
4
+ from loguru import logger
5
+
6
+ from magic_pdf.pipe.UNIPipe import UNIPipe
7
+ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
8
+
9
+ import magic_pdf.model as model_config
10
+ model_config.__use_inside_model__ = True
11
+
12
+ try:
13
+ current_script_dir = os.path.dirname(os.path.abspath(__file__))
14
+ demo_name = "demo1"
15
+ pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
16
+ model_path = os.path.join(current_script_dir, f"{demo_name}.json")
17
+ pdf_bytes = open(pdf_path, "rb").read()
18
+ # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
19
+ model_json = [] # model_json传空list使用内置模型解析
20
+ jso_useful_key = {"_pdf_type": "", "model_list": model_json}
21
+ local_image_dir = os.path.join(current_script_dir, 'images')
22
+ image_dir = str(os.path.basename(local_image_dir))
23
+ image_writer = DiskReaderWriter(local_image_dir)
24
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
25
+ pipe.pipe_classify()
+ # when model_json is empty, run the built-in model analysis first (mirrors demo/app.py)
+ if len(model_json) == 0:
+ if model_config.__use_inside_model__:
+ pipe.pipe_analyze()
+ else:
+ logger.error("need model list input")
+ exit(1)
26
+ pipe.pipe_parse()
27
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
28
+ with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
29
+ f.write(md_content)
30
+ except Exception as e:
31
+ logger.exception(e)
demo/demo1.json ADDED
The diff for this file is too large to render. See raw diff
 
demo/demo1.pdf ADDED
Binary file (337 kB). View file
 
demo/demo2.json ADDED
The diff for this file is too large to render. See raw diff
 
demo/demo2.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e94e95637356e1599510436278747d1150a3dfb822233bdc77a9dcb9a4fc6e4
3
+ size 1808096
docs/FAQ_zh_cn.md ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # 常见问题解答
2
+
3
+ ### 1.离线部署首次运行,报错urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
4
+
5
+ 首次运行需要在线下载一个小的语言检测模型,如果是离线部署需要手动下载该模型并放到指定目录。
6
+ 参考:https://github.com/opendatalab/MinerU/issues/121
7
+
8
+ ### 2.在较新版本的mac上使用命令安装pip install magic-pdf[full-cpu] zsh: no matches found: magic-pdf[full-cpu]
9
+
10
+ 在 macOS 上,默认的 shell 从 Bash 切换到了 Z shell,而 Z shell 对于某些类型的字符串匹配有特殊的处理逻辑,这可能导致no matches found错误。
11
+ 可以通过在命令行禁用globbing特性,再尝试运行安装命令
12
+ ```bash
13
+ setopt no_nomatch
14
+ pip install magic-pdf[full-cpu]
15
+ ```
16
+
17
+ ### 3.在intel cpu 的mac上 安装最新版的完整功能包 magic-pdf[full-cpu] (0.6.x) 不成功
18
+
19
+ 完整功能包依赖的公式解析库unimernet限制了pytorch的最低版本为2.3.0,而pytorch官方没有为intel cpu的macOS 提供2.3.0版本的预编译包,所以会产生依赖不兼容的问题。
20
+ 可以先尝试安装unimernet的老版本之后再尝试安装完整功能包的其他依赖。(为避免依赖冲突,请激活一个全新的虚拟环境)
21
+ ```bash
22
+ pip install magic-pdf
23
+ pip install unimernet==0.1.0
24
+ pip install matplotlib ultralytics paddleocr==2.7.3 paddlepaddle
25
+ ```
26
+
27
+ ### 4.在部分较新的M芯片macOS设备上,MPS加速开启失败
28
+
29
+ 卸载torch和torchvision,重新安装nightly构建版torch和torchvision
30
+ ```bash
31
+ pip uninstall torch torchvision
32
+ pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
33
+ ```
34
+ 参考: https://github.com/opendatalab/PDF-Extract-Kit/issues/23
35
+
36
+ ### 5.使用过程中遇到paddle相关的报错FatalError: Illegal instruction is detected by the operating system.
37
+
38
+ paddlepaddle 2.6.1与部分linux系统环境存在兼容性问题。
39
+ 可尝试降级到2.5.2使用,
40
+ ```bash
41
+ pip install paddlepaddle==2.5.2
42
+ ```
43
+ 或卸载paddlepaddle,重新安装paddlepaddle-gpu
44
+ ```bash
45
+ pip uninstall paddlepaddle
46
+ pip install paddlepaddle-gpu
47
+ ```
48
+ 参考:https://github.com/opendatalab/MinerU/issues/146
49
+
50
+ ### 6.使用过程中遇到_pickle.UnpicklingError: invalid load key, 'v'.错误
51
+
52
+ 可能是由于模型文件未下载完整导致,可尝试重新下载模型文件后再试
53
+ 参考:https://github.com/opendatalab/MinerU/issues/143
54
+
55
+ ### 7.程序运行完成后,找不到tmp目录
56
+
57
+ 程序输出目录是在"magic-pdf.json"中通过
58
+ ```json
59
+ {
60
+ "temp-output-dir": "/tmp"
61
+ }
62
+ ```
63
+ 进行配置的。
64
+ 如果没有更改这个参数,使用默认的配置执行程序,在linux/macOS会在绝对路径"/tmp"下创建一个"magic-pdf"文件夹作为输出路径。
65
+ 而在windows下,默认的输出路径与执行命令时,命令行所在的盘符相关,如果命令行在C盘,则默认输出路径为"C://tmp/magic-pdf"。
66
+ 参考:https://github.com/opendatalab/MinerU/issues/149
67
+
68
+ ### 8.模型文件应该下载到哪里/models-dir的配置应该怎么填
69
+
70
+ 模型文件的路径输入是在"magic-pdf.json"中通过
71
+ ```json
72
+ {
73
+ "models-dir": "/tmp/models"
74
+ }
75
+ ```
76
+ 进行配置的。
77
+ 这个路径是绝对路径而不是相对路径,绝对路径的获取可在models目录中通过命令 "pwd" 获取。
78
+ 参考:https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
79
+
80
+ ### 9.命令行中 --model "model_json_path" 指的是什么?
81
+
82
+ model_json 指的是通过模型分析后生成的一种有特定格式的json文件。
83
+ 如果使用 https://github.com/opendatalab/PDF-Extract-Kit 项目生成,该文件一般在项目的output目录下。
84
+ 如果使用 MinerU 的命令行调用内置的模型分析,该文件一般在输出路径"/tmp/magic-pdf/pdf-name"下。
85
+ 参考:https://github.com/opendatalab/MinerU/issues/128
docs/how_to_download_models_en.md ADDED
@@ -0,0 +1,60 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ### Install Git LFS
2
+ Before you begin, make sure Git Large File Storage (Git LFS) is installed on your system. Install it using the following command:
3
+
4
+ ```bash
5
+ git lfs install
6
+ ```
7
+
8
+ ### Download the Model from Hugging Face
9
+ To download the `PDF-Extract-Kit` model from Hugging Face, use the following command:
10
+
11
+ ```bash
12
+ git lfs clone https://huggingface.co/wanderkid/PDF-Extract-Kit
13
+ ```
14
+
15
+ Ensure that Git LFS is enabled during the clone to properly download all large files.
16
+
17
+
18
+
19
+ ### Download the Model from ModelScope
20
+
21
+ #### SDK Download
22
+
23
+ ```bash
24
+ # First, install the ModelScope library using pip:
25
+ pip install modelscope
26
+ ```
27
+
28
+ ```python
29
+ # Use the following Python code to download the model using the ModelScope SDK:
30
+ from modelscope import snapshot_download
31
+ model_dir = snapshot_download('wanderkid/PDF-Extract-Kit')
32
+ ```
33
+
34
+ #### Git Download
35
+ Alternatively, you can use Git to clone the model repository from ModelScope:
36
+
37
+ ```bash
38
+ git clone https://www.modelscope.cn/wanderkid/PDF-Extract-Kit.git
39
+ ```
40
+
41
+
42
+ After downloading, the model folder should have the following structure, containing configuration files and weights for each component:
43
+
44
+ ```
45
+ ./
46
+ ├── Layout
47
+ │ ├── config.json
48
+ │ └── weights.pth
49
+ ├── MFD
50
+ │ └── weights.pt
51
+ ├── MFR
52
+ │ └── UniMERNet
53
+ │ ├── config.json
54
+ │ ├── preprocessor_config.json
55
+ │ ├── pytorch_model.bin
56
+ │ ├── README.md
57
+ │ ├── tokenizer_config.json
58
+ │ └── tokenizer.json
59
+ └── README.md
60
+ ```
docs/how_to_download_models_zh_cn.md ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ### 安装 Git LFS
2
+ 开始之前,请确保您的系统上已安装 Git 大文件存储 (Git LFS)。使用以下命令进行安装
3
+
4
+ ```bash
5
+ git lfs install
6
+ ```
7
+
8
+ ### 从 Hugging Face 下载模型
9
+ 请使用以下命令从 Hugging Face 下载 PDF-Extract-Kit 模型:
10
+
11
+ ```bash
12
+ git lfs clone https://huggingface.co/wanderkid/PDF-Extract-Kit
13
+ ```
14
+
15
+ 确保在克隆过程中启用了 Git LFS,以便正确下载所有大文件。
16
+
17
+
18
+ ### 从 ModelScope 下载模型
19
+
20
+ #### SDK下载
21
+
22
+ ```bash
23
+ # 首先安装modelscope
24
+ pip install modelscope
25
+ ```
26
+
27
+ ```python
28
+ # 使用modelscope sdk下载模型
29
+ from modelscope import snapshot_download
30
+ model_dir = snapshot_download('wanderkid/PDF-Extract-Kit')
31
+ ```
32
+
33
+ #### Git下载
34
+ 也可以使用git clone从 ModelScope 下载模型:
35
+
36
+ ```bash
37
+ git clone https://www.modelscope.cn/wanderkid/PDF-Extract-Kit.git
38
+ ```
39
+
40
+
41
+ 将 'models' 目录移动到具有较大磁盘空间的目录中,最好是在固态硬盘(SSD)上。
42
+
43
+
44
+ 模型文件夹的结构如下,包含了不同组件的配置文件和权重文件:
45
+ ```
46
+ ./
47
+ ├── Layout
48
+ │ ├── config.json
49
+ │ └── model_final.pth
50
+ ├── MFD
51
+ │ └── weights.pt
52
+ ├── MFR
53
+ │ └── UniMERNet
54
+ │ ├── config.json
55
+ │ ├── preprocessor_config.json
56
+ │ ├── pytorch_model.bin
57
+ │ ├── README.md
58
+ │ ├── tokenizer_config.json
59
+ │ └── tokenizer.json
60
+ └── README.md
61
+ ```
docs/images/flowchart_en.png ADDED
docs/images/flowchart_zh_cn.png ADDED
docs/images/project_panorama_en.png ADDED
docs/images/project_panorama_zh_cn.png ADDED
magic-pdf.template.json ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bucket_info":{
3
+ "bucket-name-1":["ak", "sk", "endpoint"],
4
+ "bucket-name-2":["ak", "sk", "endpoint"]
5
+ },
6
+ "temp-output-dir":"/tmp",
7
+ "models-dir":"/tmp/models",
8
+ "device-mode":"cpu"
9
+ }
magic_pdf/__init__.py ADDED
File without changes
magic_pdf/cli/__init__.py ADDED
File without changes
magic_pdf/cli/magicpdf.py ADDED
@@ -0,0 +1,359 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ 这里实现2个click命令:
3
+ 第一个:
4
+ 接收一个完整的s3路径,例如:s3://llm-pdf-text/pdf_ebook_and_paper/pre-clean-mm-markdown/v014/part-660420b490be-000008.jsonl?bytes=0,81350
5
+ 1)根据~/magic-pdf.json里的ak,sk等,构造s3cliReader读取到这个jsonl的对应行,返回json对象。
6
+ 2)根据Json对象里的pdf的s3路径获取到他的ak,sk,endpoint,构造出s3cliReader用来读取pdf
7
+ 3)从magic-pdf.json里读取到本地保存图片、Md等的临时目录位置,构造出LocalImageWriter,用来保存截图
8
+ 4)从magic-pdf.json里读取到本地保存图片、Md等的临时目录位置,构造出LocalIRdWriter,用来读写本地文件
9
+
10
+ 最后把以上步骤准备好的对象传入真正的解析API
11
+
12
+ 第二个:
13
+ 接收1)pdf的本地路径。2)模型json文件(可选)。然后:
14
+ 1)根据~/magic-pdf.json读取到本地保存图片、md等临时目录的位置,构造出LocalImageWriter,用来保存截图
15
+ 2)从magic-pdf.json里读取到本地保存图片、Md等的临时目录位置,构造出LocalIRdWriter,用来读写本地文件
16
+ 3)根据约定,根据pdf本地路径,推导出pdf模型的json,并读入
17
+
18
+
19
+ 效果:
20
+ python magicpdf.py json-command --json s3://llm-pdf-text/scihub/xxxx.json?bytes=0,81350
21
+ python magicpdf.py pdf-command --pdf /home/llm/Downloads/xxxx.pdf --model /home/llm/Downloads/xxxx.json 或者 python magicpdf.py --pdf /home/llm/Downloads/xxxx.pdf
22
+ """
23
+
24
+ import os
25
+ import json as json_parse
26
+ import click
27
+ from loguru import logger
28
+ from pathlib import Path
29
+ from magic_pdf.libs.version import __version__
30
+
31
+ from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
32
+ from magic_pdf.libs.draw_bbox import draw_layout_bbox, draw_span_bbox
33
+ from magic_pdf.pipe.UNIPipe import UNIPipe
34
+ from magic_pdf.pipe.OCRPipe import OCRPipe
35
+ from magic_pdf.pipe.TXTPipe import TXTPipe
36
+ from magic_pdf.libs.path_utils import (
37
+ parse_s3path,
38
+ parse_s3_range_params,
39
+ remove_non_official_s3_args,
40
+ )
41
+ from magic_pdf.libs.config_reader import (
42
+ get_local_dir,
43
+ get_s3_config,
44
+ )
45
+ from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
46
+ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
47
+ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
48
+ import csv
49
+ import copy
50
+ import magic_pdf.model as model_config
51
+
52
+ parse_pdf_methods = click.Choice(["ocr", "txt", "auto"])
53
+
54
+
55
+ def prepare_env(pdf_file_name, method):
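+ """Create the per-document output directories and return (local_image_dir, local_md_dir)."""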
56
+ local_parent_dir = os.path.join(get_local_dir(), "magic-pdf", pdf_file_name, method)
57
+
58
+ local_image_dir = os.path.join(str(local_parent_dir), "images")
59
+ local_md_dir = local_parent_dir
60
+ os.makedirs(local_image_dir, exist_ok=True)
61
+ os.makedirs(local_md_dir, exist_ok=True)
62
+ return local_image_dir, local_md_dir
63
+
64
+
65
+ def write_to_csv(csv_file_path, csv_data):
66
+ with open(csv_file_path, mode="a", newline="", encoding="utf-8") as csvfile:
67
+ # 创建csv writer对象
68
+ csv_writer = csv.writer(csvfile)
69
+ # 写入数据
70
+ csv_writer.writerow(csv_data)
71
+ logger.info(f"数据已成功追加到 '{csv_file_path}'")
72
+
73
+
74
+ def do_parse(
75
+ pdf_file_name,
76
+ pdf_bytes,
77
+ model_list,
78
+ parse_method,
79
+ f_draw_span_bbox=True,
80
+ f_draw_layout_bbox=True,
81
+ f_dump_md=True,
82
+ f_dump_middle_json=True,
83
+ f_dump_model_json=True,
84
+ f_dump_orig_pdf=True,
85
+ f_dump_content_list=True,
86
+ f_make_md_mode=MakeMode.MM_MD,
87
+ ):
88
+
89
+ orig_model_list = copy.deepcopy(model_list)
90
+
91
+ local_image_dir, local_md_dir = prepare_env(pdf_file_name, parse_method)
92
+ logger.info(f"local output dir is {local_md_dir}")
93
+ image_writer, md_writer = DiskReaderWriter(local_image_dir), DiskReaderWriter(local_md_dir)
94
+ image_dir = str(os.path.basename(local_image_dir))
95
+
96
+ if parse_method == "auto":
97
+ jso_useful_key = {"_pdf_type": "", "model_list": model_list}
98
+ pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
99
+ elif parse_method == "txt":
100
+ pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True)
101
+ elif parse_method == "ocr":
102
+ pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True)
103
+ else:
104
+ logger.error("unknown parse method")
105
+ exit(1)
106
+
107
+ pipe.pipe_classify()
108
+
109
+ """如果没有传入有效的模型数据,则使用内置model解析"""
110
+ if len(model_list) == 0:
111
+ if model_config.__use_inside_model__:
112
+ pipe.pipe_analyze()
113
+ orig_model_list = copy.deepcopy(pipe.model_list)
114
+ else:
115
+ logger.error("need model list input")
116
+ exit(1)
117
+
118
+ pipe.pipe_parse()
119
+ pdf_info = pipe.pdf_mid_data["pdf_info"]
120
+ if f_draw_layout_bbox:
121
+ draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir)
122
+ if f_draw_span_bbox:
123
+ draw_span_bbox(pdf_info, pdf_bytes, local_md_dir)
124
+
125
+ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode=DropMode.NONE, md_make_mode=f_make_md_mode)
126
+ if f_dump_md:
127
+ """写markdown"""
128
+ md_writer.write(
129
+ content=md_content,
130
+ path=f"{pdf_file_name}.md",
131
+ mode=AbsReaderWriter.MODE_TXT,
132
+ )
133
+
134
+ if f_dump_middle_json:
135
+ """写middle_json"""
136
+ md_writer.write(
137
+ content=json_parse.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4),
138
+ path=f"{pdf_file_name}_middle.json",
139
+ mode=AbsReaderWriter.MODE_TXT,
140
+ )
141
+
142
+ if f_dump_model_json:
143
+ """写model_json"""
144
+ md_writer.write(
145
+ content=json_parse.dumps(orig_model_list, ensure_ascii=False, indent=4),
146
+ path=f"{pdf_file_name}_model.json",
147
+ mode=AbsReaderWriter.MODE_TXT,
148
+ )
149
+
150
+ if f_dump_orig_pdf:
151
+ """写源pdf"""
152
+ md_writer.write(
153
+ content=pdf_bytes,
154
+ path=f"{pdf_file_name}_origin.pdf",
155
+ mode=AbsReaderWriter.MODE_BIN,
156
+ )
157
+
158
+ content_list = pipe.pipe_mk_uni_format(image_dir, drop_mode=DropMode.NONE)
159
+ if f_dump_content_list:
160
+ """写content_list"""
161
+ md_writer.write(
162
+ content=json_parse.dumps(content_list, ensure_ascii=False, indent=4),
163
+ path=f"{pdf_file_name}_content_list.json",
164
+ mode=AbsReaderWriter.MODE_TXT,
165
+ )
166
+
167
+
168
+ @click.group()
169
+ @click.version_option(__version__, "--version", "-v", help="显示版本信息")
170
+ @click.help_option("--help", "-h", help="显示帮助信息")
171
+ def cli():
172
+ pass
173
+
174
+
175
+ @cli.command()
176
+ @click.option("--json", type=str, help="输入一个S3路径")
177
+ @click.option(
178
+ "--method",
179
+ type=parse_pdf_methods,
180
+ help="指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法",
181
+ default="auto",
182
+ )
183
+ @click.option("--inside_model", type=click.BOOL, default=True, help="使用内置模型测试")
184
+ @click.option("--model_mode", type=click.STRING, default="full",
185
+ help="内置模型选择。lite: 快速解析,精度较低,full: 高精度解析,速度较慢")
186
+ def json_command(json, method, inside_model, model_mode):
187
+ model_config.__use_inside_model__ = inside_model
188
+ model_config.__model_mode__ = model_mode
189
+
190
+ if not json.startswith("s3://"):
191
+ logger.error("usage: magic-pdf json-command --json s3://some_bucket/some_path")
192
+ exit(1)
193
+
194
+ def read_s3_path(s3path):
195
+ bucket, key = parse_s3path(s3path)
196
+
197
+ s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket)
198
+ s3_rw = S3ReaderWriter(
199
+ s3_ak, s3_sk, s3_endpoint, "auto", remove_non_official_s3_args(s3path)
200
+ )
201
+ may_range_params = parse_s3_range_params(s3path)
202
+ if may_range_params is None or 2 != len(may_range_params):
203
+ byte_start, byte_end = 0, None
204
+ else:
205
+ byte_start, byte_end = int(may_range_params[0]), int(may_range_params[1])
206
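+ # the second range value is a length, so convert (start, length) into an inclusive end offset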
+ byte_end += byte_start - 1
207
+ return s3_rw.read_jsonl(
208
+ remove_non_official_s3_args(s3path),
209
+ byte_start,
210
+ byte_end,
211
+ AbsReaderWriter.MODE_BIN,
212
+ )
213
+
214
+ jso = json_parse.loads(read_s3_path(json).decode("utf-8"))
215
+ s3_file_path = jso.get("file_location")
216
+ if s3_file_path is None:
217
+ s3_file_path = jso.get("path")
218
+ pdf_file_name = Path(s3_file_path).stem
219
+ pdf_data = read_s3_path(s3_file_path)
220
+
221
+ do_parse(
222
+ pdf_file_name,
223
+ pdf_data,
224
+ jso["doc_layout_result"],
225
+ method,
226
+ )
227
+
228
+
229
+ @cli.command()
230
+ @click.option("--local_json", type=str, help="输入一个本地jsonl路径")
231
+ @click.option(
232
+ "--method",
233
+ type=parse_pdf_methods,
234
+ help="指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法",
235
+ default="auto",
236
+ )
237
+ @click.option("--inside_model", type=click.BOOL, default=True, help="使用内置模型测试")
238
+ @click.option("--model_mode", type=click.STRING, default="full",
239
+ help="内置模型选择。lite: 快速解析,精度较低,full: 高精度解析,速度较慢")
240
+ def local_json_command(local_json, method, inside_model, model_mode):
241
+ model_config.__use_inside_model__ = inside_model
242
+ model_config.__model_mode__ = model_mode
243
+
244
+ def read_s3_path(s3path):
245
+ bucket, key = parse_s3path(s3path)
246
+
247
+ s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket)
248
+ s3_rw = S3ReaderWriter(
249
+ s3_ak, s3_sk, s3_endpoint, "auto", remove_non_official_s3_args(s3path)
250
+ )
251
+ may_range_params = parse_s3_range_params(s3path)
252
+ if may_range_params is None or 2 != len(may_range_params):
253
+ byte_start, byte_end = 0, None
254
+ else:
255
+ byte_start, byte_end = int(may_range_params[0]), int(may_range_params[1])
256
+ byte_end += byte_start - 1
257
+ return s3_rw.read_jsonl(
258
+ remove_non_official_s3_args(s3path),
259
+ byte_start,
260
+ byte_end,
261
+ AbsReaderWriter.MODE_BIN,
262
+ )
263
+
264
+ with open(local_json, "r", encoding="utf-8") as f:
265
+ for json_line in f:
266
+ jso = json_parse.loads(json_line)
267
+
268
+ s3_file_path = jso.get("file_location")
269
+ if s3_file_path is None:
270
+ s3_file_path = jso.get("path")
271
+ pdf_file_name = Path(s3_file_path).stem
272
+ pdf_data = read_s3_path(s3_file_path)
273
+ do_parse(
274
+ pdf_file_name,
275
+ pdf_data,
276
+ jso["doc_layout_result"],
277
+ method,
278
+ )
279
+
280
+
281
+ @cli.command()
282
+ @click.option(
283
+ "--pdf", type=click.Path(exists=True), required=True,
284
+ help='pdf 文件路径, 支持单个文件或文件列表, 文件列表需要以".list"结尾, 一行一个pdf文件路径')
285
+ @click.option("--model", type=click.Path(exists=True), help="模型的路径")
286
+ @click.option(
287
+ "--method",
288
+ type=parse_pdf_methods,
289
+ help="指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法",
290
+ default="auto",
291
+ )
292
+ @click.option("--inside_model", type=click.BOOL, default=True, help="使用内置模型测试")
293
+ @click.option("--model_mode", type=click.STRING, default="full",
294
+ help="内置模型选择。lite: 快速解析,精度较低,full: 高精度解析,速度较慢")
295
+ def pdf_command(pdf, model, method, inside_model, model_mode):
296
+ model_config.__use_inside_model__ = inside_model
297
+ model_config.__model_mode__ = model_mode
298
+
299
+ def read_fn(path):
300
+ disk_rw = DiskReaderWriter(os.path.dirname(path))
301
+ return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
302
+
303
+ def get_model_json(model_path, doc_path):
304
+ # 这里处理pdf和模型相关的逻辑
305
+ if model_path is None:
306
+ file_name_without_extension, extension = os.path.splitext(doc_path)
307
+ if extension == ".pdf":
308
+ model_path = file_name_without_extension + ".json"
309
+ else:
310
+ raise Exception("pdf_path input error")
311
+ if not os.path.exists(model_path):
312
+ logger.warning(
313
+ f"not found json {model_path} existed"
314
+ )
315
+ # 本地无模型数据则调用内置paddle分析,先传空list,在内部识别到空list再调用paddle
316
+ model_json = "[]"
317
+ else:
318
+ model_json = read_fn(model_path).decode("utf-8")
319
+ else:
320
+ model_json = read_fn(model_path).decode("utf-8")
321
+
322
+ return model_json
323
+
324
+ def parse_doc(doc_path):
325
+ try:
326
+ file_name = str(Path(doc_path).stem)
327
+ pdf_data = read_fn(doc_path)
328
+ jso = json_parse.loads(get_model_json(model, doc_path))
329
+
330
+ do_parse(
331
+ file_name,
332
+ pdf_data,
333
+ jso,
334
+ method,
335
+ )
336
+
337
+ except Exception as e:
338
+ logger.exception(e)
339
+
340
+ if not pdf:
341
+ logger.error(f"Error: Missing argument '--pdf'.")
342
+ exit(f"Error: Missing argument '--pdf'.")
343
+ else:
344
+ '''适配多个文档的list文件输入'''
345
+ if pdf.endswith(".list"):
346
+ with open(pdf, "r") as f:
347
+ for line in f.readlines():
348
+ line = line.strip()
349
+ parse_doc(line)
350
+ else:
351
+ '''适配单个文档的输入'''
352
+ parse_doc(pdf)
353
+
354
+
355
+ if __name__ == "__main__":
356
+ """
357
+ python magic_pdf/cli/magicpdf.py json-command --json s3://llm-pdf-text/pdf_ebook_and_paper/manual/v001/part-660407a28beb-000002.jsonl?bytes=0,63551
358
+ """
359
+ cli()
magic_pdf/dict2md/__init__.py ADDED
File without changes
magic_pdf/dict2md/mkcontent.py ADDED
@@ -0,0 +1,397 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ from loguru import logger
3
+
4
+ from magic_pdf.libs.boxbase import find_bottom_nearest_text_bbox, find_top_nearest_text_bbox
5
+ from magic_pdf.libs.commons import join_path
6
+ from magic_pdf.libs.ocr_content_type import ContentType
7
+
8
+ TYPE_INLINE_EQUATION = ContentType.InlineEquation
9
+ TYPE_INTERLINE_EQUATION = ContentType.InterlineEquation
10
+ UNI_FORMAT_TEXT_TYPE = ['text', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']
11
+
12
+
13
+ @DeprecationWarning
14
+ def mk_nlp_markdown_1(para_dict: dict):
15
+ """
16
+ 对排序后的bboxes拼接内容
17
+ """
18
+ content_lst = []
19
+ for _, page_info in para_dict.items():
20
+ para_blocks = page_info.get("para_blocks")
21
+ if not para_blocks:
22
+ continue
23
+
24
+ for block in para_blocks:
25
+ item = block["paras"]
26
+ for _, p in item.items():
27
+ para_text = p["para_text"]
28
+ is_title = p["is_para_title"]
29
+ title_level = p['para_title_level']
30
+ md_title_prefix = "#"*title_level
31
+ if is_title:
32
+ content_lst.append(f"{md_title_prefix} {para_text}")
33
+ else:
34
+ content_lst.append(para_text)
35
+
36
+ content_text = "\n\n".join(content_lst)
37
+
38
+ return content_text
39
+
40
+
41
+
42
+ # 找到目标字符串在段落中的索引
43
+ def __find_index(paragraph, target):
44
+ index = paragraph.find(target)
45
+ if index != -1:
46
+ return index
47
+ else:
48
+ return None
49
+
50
+
51
+ def __insert_string(paragraph, target, position):
52
+ new_paragraph = paragraph[:position] + target + paragraph[position:]
53
+ return new_paragraph
54
+
55
+
56
+ def __insert_after(content, image_content, target):
57
+ """
58
+ 在content中找到target,将image_content插入到target后面
59
+ """
60
+ index = content.find(target)
61
+ if index != -1:
62
+ content = content[:index+len(target)] + "\n\n" + image_content + "\n\n" + content[index+len(target):]
63
+ else:
64
+ logger.error(f"Can't find the location of image {image_content} in the markdown file, search target is {target}")
65
+ return content
66
+
67
+ def __insert_before(content, image_content, target):
68
+ """
69
+ 在content中找到target,将image_content插入到target前面
70
+ """
71
+ index = content.find(target)
72
+ if index != -1:
73
+ content = content[:index] + "\n\n" + image_content + "\n\n" + content[index:]
74
+ else:
75
+ logger.error(f"Can't find the location of image {image_content} in the markdown file, search target is {target}")
76
+ return content
77
+
78
+
79
+ @DeprecationWarning
80
+ def mk_mm_markdown_1(para_dict: dict):
81
+ """拼装多模态markdown"""
82
+ content_lst = []
83
+ for _, page_info in para_dict.items():
84
+ page_lst = [] # 一个page内的段落列表
85
+ para_blocks = page_info.get("para_blocks")
86
+ pymu_raw_blocks = page_info.get("preproc_blocks")
87
+
88
+ all_page_images = []
89
+ all_page_images.extend(page_info.get("images",[]))
90
+ all_page_images.extend(page_info.get("image_backup", []) )
91
+ all_page_images.extend(page_info.get("tables",[]))
92
+ all_page_images.extend(page_info.get("table_backup",[]) )
93
+
94
+ if not para_blocks or not pymu_raw_blocks: # 只有图片的拼接的场景
95
+ for img in all_page_images:
96
+ page_lst.append(f"![]({img['image_path']})") # TODO 图片顺序
97
+ page_md = "\n\n".join(page_lst)
98
+
99
+ else:
100
+ for block in para_blocks:
101
+ item = block["paras"]
102
+ for _, p in item.items():
103
+ para_text = p["para_text"]
104
+ is_title = p["is_para_title"]
105
+ title_level = p['para_title_level']
106
+ md_title_prefix = "#"*title_level
107
+ if is_title:
108
+ page_lst.append(f"{md_title_prefix} {para_text}")
109
+ else:
110
+ page_lst.append(para_text)
111
+
112
+ """拼装成一个页面的文本"""
113
+ page_md = "\n\n".join(page_lst)
114
+ """插入图片"""
115
+ for img in all_page_images:
116
+ imgbox = img['bbox']
117
+ img_content = f"![]({img['image_path']})"
118
+ # 先看在哪个block内
119
+ for block in pymu_raw_blocks:
120
+ bbox = block['bbox']
121
+ if bbox[0]-1 <= imgbox[0] < bbox[2]+1 and bbox[1]-1 <= imgbox[1] < bbox[3]+1:# 确定在block内
122
+ for l in block['lines']:
123
+ line_box = l['bbox']
124
+ if line_box[0]-1 <= imgbox[0] < line_box[2]+1 and line_box[1]-1 <= imgbox[1] < line_box[3]+1: # 在line内的,插入line前面
125
+ line_txt = "".join([s['text'] for s in l['spans']])
126
+ page_md = __insert_before(page_md, img_content, line_txt)
127
+ break
128
+ break
129
+ else:# 在行与行之间
130
+ # 找到图片x0,y0与line的x0,y0最近的line
131
+ min_distance = 100000
132
+ min_line = None
133
+ for l in block['lines']:
134
+ line_box = l['bbox']
135
+ distance = math.sqrt((line_box[0] - imgbox[0])**2 + (line_box[1] - imgbox[1])**2)
136
+ if distance < min_distance:
137
+ min_distance = distance
138
+ min_line = l
139
+ if min_line:
140
+ line_txt = "".join([s['text'] for s in min_line['spans']])
141
+ img_h = imgbox[3] - imgbox[1]
142
+ if min_distance<img_h: # 文字在图片前面
143
+ page_md = __insert_after(page_md, img_content, line_txt)
144
+ else:
145
+ page_md = __insert_before(page_md, img_content, line_txt)
146
+ else:
147
+ logger.error(f"Can't find the location of image {img['image_path']} in the markdown file #1")
148
+ else:# 应当在两个block之间
149
+ # 找到上方最近的block,如果上方没有就找下方最近的block
150
+ top_txt_block = find_top_nearest_text_bbox(pymu_raw_blocks, imgbox)
151
+ if top_txt_block:
152
+ line_txt = "".join([s['text'] for s in top_txt_block['lines'][-1]['spans']])
153
+ page_md = __insert_after(page_md, img_content, line_txt)
154
+ else:
155
+ bottom_txt_block = find_bottom_nearest_text_bbox(pymu_raw_blocks, imgbox)
156
+ if bottom_txt_block:
157
+ line_txt = "".join([s['text'] for s in bottom_txt_block['lines'][0]['spans']])
158
+ page_md = __insert_before(page_md, img_content, line_txt)
159
+ else:
160
+ logger.error(f"Can't find the location of image {img['image_path']} in the markdown file #2")
161
+
162
+ content_lst.append(page_md)
163
+
164
+ """拼装成全部页面的文本"""
165
+ content_text = "\n\n".join(content_lst)
166
+
167
+ return content_text
168
+
169
+
170
+ def __insert_after_para(text, type, element, content_list):
171
+ """
172
+ 在content_list中找到text,将image_path作为一个新的node插入到text后面
173
+ """
174
+ for i, c in enumerate(content_list):
175
+ content_type = c.get("type")
176
+ if content_type in UNI_FORMAT_TEXT_TYPE and text in c.get("text", ''):
177
+ if type == "image":
178
+ content_node = {
179
+ "type": "image",
180
+ "img_path": element.get("image_path"),
181
+ "img_alt": "",
182
+ "img_title": "",
183
+ "img_caption": "",
184
+ }
185
+ elif type == "table":
186
+ content_node = {
187
+ "type": "table",
188
+ "img_path": element.get("image_path"),
189
+ "table_latex": element.get("text"),
190
+ "table_title": "",
191
+ "table_caption": "",
192
+ "table_quality": element.get("quality"),
193
+ }
194
+ content_list.insert(i+1, content_node)
195
+ break
196
+ else:
197
+ logger.error(f"Can't find the location of image {element.get('image_path')} in the markdown file, search target is {text}")
198
+
199
+
200
+
201
+ def __insert_before_para(text, type, element, content_list):
202
+ """
203
+ 在content_list中找到text,将image_path作为一个新的node插入到text前面
204
+ """
205
+ for i, c in enumerate(content_list):
206
+ content_type = c.get("type")
207
+ if content_type in UNI_FORMAT_TEXT_TYPE and text in c.get("text", ''):
208
+ if type == "image":
209
+ content_node = {
210
+ "type": "image",
211
+ "img_path": element.get("image_path"),
212
+ "img_alt": "",
213
+ "img_title": "",
214
+ "img_caption": "",
215
+ }
216
+ elif type == "table":
217
+ content_node = {
218
+ "type": "table",
219
+ "img_path": element.get("image_path"),
220
+ "table_latex": element.get("text"),
221
+ "table_title": "",
222
+ "table_caption": "",
223
+ "table_quality": element.get("quality"),
224
+ }
225
+ content_list.insert(i, content_node)
226
+ break
227
+ else:
228
+ logger.error(f"Can't find the location of image {element.get('image_path')} in the markdown file, search target is {text}")
229
+
230
+
231
+ def mk_universal_format(pdf_info_list: list, img_buket_path):
232
+ """
233
+ 构造统一格式 https://aicarrier.feishu.cn/wiki/FqmMwcH69iIdCWkkyjvcDwNUnTY
234
+ """
235
+ content_lst = []
236
+ for page_info in pdf_info_list:
237
+ page_lst = [] # 一个page内的段落列表
238
+ para_blocks = page_info.get("para_blocks")
239
+ pymu_raw_blocks = page_info.get("preproc_blocks")
240
+
241
+ all_page_images = []
242
+ all_page_images.extend(page_info.get("images",[]))
243
+ all_page_images.extend(page_info.get("image_backup", []) )
244
+ # all_page_images.extend(page_info.get("tables",[]))
245
+ # all_page_images.extend(page_info.get("table_backup",[]) )
246
+ all_page_tables = []
247
+ all_page_tables.extend(page_info.get("tables", []))
248
+
249
+ if not para_blocks or not pymu_raw_blocks: # 只有图片的拼接的场景
250
+ for img in all_page_images:
251
+ content_node = {
252
+ "type": "image",
253
+ "img_path": join_path(img_buket_path, img['image_path']),
254
+ "img_alt":"",
255
+ "img_title":"",
256
+ "img_caption":""
257
+ }
258
+ page_lst.append(content_node) # TODO 图片顺序
259
+ for table in all_page_tables:
260
+ content_node = {
261
+ "type": "table",
262
+ "img_path": join_path(img_buket_path, table['image_path']),
263
+ "table_latex": table.get("text"),
264
+ "table_title": "",
265
+ "table_caption": "",
266
+ "table_quality": table.get("quality"),
267
+ }
268
+ page_lst.append(content_node) # TODO 图片顺序
269
+ else:
270
+ for block in para_blocks:
271
+ item = block["paras"]
272
+ for _, p in item.items():
273
+ font_type = p['para_font_type']# 对于文本来说,要么是普通文本,要么是个行间公式
274
+ if font_type == TYPE_INTERLINE_EQUATION:
275
+ content_node = {
276
+ "type": "equation",
277
+ "latex": p["para_text"]
278
+ }
279
+ page_lst.append(content_node)
280
+ else:
281
+ para_text = p["para_text"]
282
+ is_title = p["is_para_title"]
283
+ title_level = p['para_title_level']
284
+
285
+ if is_title:
286
+ content_node = {
287
+ "type": f"h{title_level}",
288
+ "text": para_text
289
+ }
290
+ page_lst.append(content_node)
291
+ else:
292
+ content_node = {
293
+ "type": "text",
294
+ "text": para_text
295
+ }
296
+ page_lst.append(content_node)
297
+
298
+ content_lst.extend(page_lst)
299
+
300
+ """插入图片"""
301
+ for img in all_page_images:
302
+ insert_img_or_table("image", img, pymu_raw_blocks, content_lst)
303
+
304
+ """插入表格"""
305
+ for table in all_page_tables:
306
+ insert_img_or_table("table", table, pymu_raw_blocks, content_lst)
307
+ # end for
308
+ return content_lst
309
+
310
+
311
+ def insert_img_or_table(type, element, pymu_raw_blocks, content_lst):
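+ """Insert an image/table node into content_lst at the position implied by its bbox within the page's text blocks."""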
312
+ element_bbox = element['bbox']
313
+ # 先看在哪个block内
314
+ for block in pymu_raw_blocks:
315
+ bbox = block['bbox']
316
+ if bbox[0] - 1 <= element_bbox[0] < bbox[2] + 1 and bbox[1] - 1 <= element_bbox[1] < bbox[
317
+ 3] + 1: # 确定在这个大的block内,然后进入逐行比较距离
318
+ for l in block['lines']:
319
+ line_box = l['bbox']
320
+ if line_box[0] - 1 <= element_bbox[0] < line_box[2] + 1 and line_box[1] - 1 <= element_bbox[1] < line_box[
321
+ 3] + 1: # 在line内的,插入line前面
322
+ line_txt = "".join([s['text'] for s in l['spans']])
323
+ __insert_before_para(line_txt, type, element, content_lst)
324
+ break
325
+ break
326
+ else: # 在行与行之间
327
+ # 找到图片x0,y0与line的x0,y0最近的line
328
+ min_distance = 100000
329
+ min_line = None
330
+ for l in block['lines']:
331
+ line_box = l['bbox']
332
+ distance = math.sqrt((line_box[0] - element_bbox[0]) ** 2 + (line_box[1] - element_bbox[1]) ** 2)
333
+ if distance < min_distance:
334
+ min_distance = distance
335
+ min_line = l
336
+ if min_line:
337
+ line_txt = "".join([s['text'] for s in min_line['spans']])
338
+ img_h = element_bbox[3] - element_bbox[1]
339
+ if min_distance < img_h: # 文字在图片前面
340
+ __insert_after_para(line_txt, type, element, content_lst)
341
+ else:
342
+ __insert_before_para(line_txt, type, element, content_lst)
343
+ break
344
+ else:
345
+ logger.error(f"Can't find the location of image {element.get('image_path')} in the markdown file #1")
346
+ else: # 应当在两个block之间
347
+ # 找到上方最近的block,如果上方没有就找下方最近的block
348
+ top_txt_block = find_top_nearest_text_bbox(pymu_raw_blocks, element_bbox)
349
+ if top_txt_block:
350
+ line_txt = "".join([s['text'] for s in top_txt_block['lines'][-1]['spans']])
351
+ __insert_after_para(line_txt, type, element, content_lst)
352
+ else:
353
+ bottom_txt_block = find_bottom_nearest_text_bbox(pymu_raw_blocks, element_bbox)
354
+ if bottom_txt_block:
355
+ line_txt = "".join([s['text'] for s in bottom_txt_block['lines'][0]['spans']])
356
+ __insert_before_para(line_txt, type, element, content_lst)
357
+ else: # TODO ,图片可能独占一列,这种情况上下是没有图片的
358
+ logger.error(f"Can't find the location of image {element.get('image_path')} in the markdown file #2")
359
+
360
+
361
+ def mk_mm_markdown(content_list):
362
+ """
363
+ 基于同一格式的内容列表,构造markdown,含图片
364
+ """
365
+ content_md = []
366
+ for c in content_list:
367
+ content_type = c.get("type")
368
+ if content_type == "text":
369
+ content_md.append(c.get("text"))
370
+ elif content_type == "equation":
371
+ content = c.get("latex")
372
+ if content.startswith("$$") and content.endswith("$$"):
373
+ content_md.append(content)
374
+ else:
375
+ content_md.append(f"\n$$\n{c.get('latex')}\n$$\n")
376
+ elif content_type in UNI_FORMAT_TEXT_TYPE:
377
+ content_md.append(f"{'#'*int(content_type[1])} {c.get('text')}")
378
+ elif content_type == "image":
379
+ content_md.append(f"![]({c.get('img_path')})")
380
+ return "\n\n".join(content_md)
381
+
382
+ def mk_nlp_markdown(content_list):
383
+ """
384
+ 基于同一格式的内容列表,构造markdown,不含图片
385
+ """
386
+ content_md = []
387
+ for c in content_list:
388
+ content_type = c.get("type")
389
+ if content_type == "text":
390
+ content_md.append(c.get("text"))
391
+ elif content_type == "equation":
392
+ content_md.append(f"$$\n{c.get('latex')}\n$$")
393
+ elif content_type == "table":
394
+ content_md.append(f"$$$\n{c.get('table_latex')}\n$$$")
395
+ elif content_type in UNI_FORMAT_TEXT_TYPE:
396
+ content_md.append(f"{'#'*int(content_type[1])} {c.get('text')}")
397
+ return "\n\n".join(content_md)
magic_pdf/dict2md/ocr_mkcontent.py ADDED
@@ -0,0 +1,363 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from loguru import logger
2
+
3
+ from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
4
+ from magic_pdf.libs.commons import join_path
5
+ from magic_pdf.libs.language import detect_lang
6
+ from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
7
+ from magic_pdf.libs.ocr_content_type import ContentType, BlockType
8
+ import wordninja
9
+ import re
10
+
11
+
12
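+ # OCR sometimes glues words together; re-segment alphanumeric runs longer than 15 chars using wordninja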
+ def split_long_words(text):
13
+ segments = text.split(' ')
14
+ for i in range(len(segments)):
15
+ words = re.findall(r'\w+|[^\w]', segments[i], re.UNICODE)
16
+ for j in range(len(words)):
17
+ if len(words[j]) > 15:
18
+ words[j] = ' '.join(wordninja.split(words[j]))
19
+ segments[i] = ''.join(words)
20
+ return ' '.join(segments)
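A quick illustration of split_long_words; the exact segmentation depends on wordninja's word-frequency model, so the output shown is indicative only:

    from magic_pdf.dict2md.ocr_mkcontent import split_long_words

    # the 26-character first token exceeds the 15-character limit and is re-segmented
    print(split_long_words("Introductiontodeeplearning is short"))
    # e.g. -> "Introduction to deep learning is short"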
def ocr_mk_mm_markdown_with_para(pdf_info_list: list, img_buket_path):
    markdown = []
    for page_info in pdf_info_list:
        paras_of_layout = page_info.get("para_blocks")
        page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
        markdown.extend(page_markdown)
    return '\n\n'.join(markdown)


def ocr_mk_nlp_markdown_with_para(pdf_info_dict: list):
    markdown = []
    for page_info in pdf_info_dict:
        paras_of_layout = page_info.get("para_blocks")
        page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "nlp")
        markdown.extend(page_markdown)
    return '\n\n'.join(markdown)


def ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_dict: list, img_buket_path):
    markdown_with_para_and_pagination = []
    # enumerate keeps page_no aligned with the real page index even when a page without paragraphs is skipped
    for page_no, page_info in enumerate(pdf_info_dict):
        paras_of_layout = page_info.get("para_blocks")
        if not paras_of_layout:
            continue
        page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
        markdown_with_para_and_pagination.append({
            'page_no': page_no,
            'md_content': '\n\n'.join(page_markdown)
        })
    return markdown_with_para_and_pagination


def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=""):
    page_markdown = []
    for paras in paras_of_layout:
        for para in paras:
            para_text = ''
            for line in para:
                for span in line['spans']:
                    span_type = span.get('type')
                    content = ''
                    language = ''
                    if span_type == ContentType.Text:
                        content = span['content']
                        language = detect_lang(content)
                        if language == 'en':  # split long words only for English; word-splitting Chinese text would lose characters
                            content = ocr_escape_special_markdown_char(split_long_words(content))
                        else:
                            content = ocr_escape_special_markdown_char(content)
                    elif span_type == ContentType.InlineEquation:
                        content = f"${span['content']}$"
                    elif span_type == ContentType.InterlineEquation:
                        content = f"\n$$\n{span['content']}\n$$\n"
                    elif span_type in [ContentType.Image, ContentType.Table]:
                        if mode == 'mm':
                            content = f"\n![]({join_path(img_buket_path, span['image_path'])})\n"
                        elif mode == 'nlp':
                            pass
                    if content != '':
                        if language == 'en':  # English needs a space between adjacent pieces of content
                            para_text += content + ' '
                        else:  # Chinese does not
                            para_text += content
            if para_text.strip() == '':
                continue
            else:
                page_markdown.append(para_text.strip() + ' ')
    return page_markdown


def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
    page_markdown = []
    for para_block in paras_of_layout:
        para_text = ''
        para_type = para_block['type']
        if para_type == BlockType.Text:
            para_text = merge_para_with_text(para_block)
        elif para_type == BlockType.Title:
            para_text = f"# {merge_para_with_text(para_block)}"
        elif para_type == BlockType.InterlineEquation:
            para_text = merge_para_with_text(para_block)
        elif para_type == BlockType.Image:
            if mode == 'nlp':
                continue
            elif mode == 'mm':
                for block in para_block['blocks']:  # 1st: append the image body
                    if block['type'] == BlockType.ImageBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Image:
                                    para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
                for block in para_block['blocks']:  # 2nd: append the image caption
                    if block['type'] == BlockType.ImageCaption:
                        para_text += merge_para_with_text(block)
        elif para_type == BlockType.Table:
            if mode == 'nlp':
                continue
            elif mode == 'mm':
                for block in para_block['blocks']:  # 1st: append the table caption
                    if block['type'] == BlockType.TableCaption:
                        para_text += merge_para_with_text(block)
                for block in para_block['blocks']:  # 2nd: append the table body
                    if block['type'] == BlockType.TableBody:
                        for line in block['lines']:
                            for span in line['spans']:
                                if span['type'] == ContentType.Table:
                                    para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
                for block in para_block['blocks']:  # 3rd: append the table footnote
                    if block['type'] == BlockType.TableFootnote:
                        para_text += merge_para_with_text(block)

        if para_text.strip() == '':
            continue
        else:
            page_markdown.append(para_text.strip() + ' ')

    return page_markdown


def merge_para_with_text(para_block):
    para_text = ''
    for line in para_block['lines']:
        line_text = ""
        line_lang = ""
        for span in line['spans']:
            span_type = span['type']
            if span_type == ContentType.Text:
                line_text += span['content'].strip()
        if line_text != "":
            line_lang = detect_lang(line_text)
        for span in line['spans']:
            span_type = span['type']
            content = ''
            if span_type == ContentType.Text:
                content = span['content']
                language = detect_lang(content)
                if language == 'en':  # split long words only for English; word-splitting Chinese text would lose characters
                    content = ocr_escape_special_markdown_char(split_long_words(content))
                else:
                    content = ocr_escape_special_markdown_char(content)
            elif span_type == ContentType.InlineEquation:
                content = f"${span['content']}$"
            elif span_type == ContentType.InterlineEquation:
                content = f"\n$$\n{span['content']}\n$$\n"

            if content != '':
                if 'zh' in line_lang:  # some documents carry one character per span; language detection on a single character is unreliable, so judge by the whole line's text
                    para_text += content  # Chinese needs no space between adjacent pieces of content
                else:
                    para_text += content + ' '  # English needs a space between adjacent pieces of content
    return para_text
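Why the line-level language check matters: some OCR outputs emit one character per span, and single-character language detection is unreliable. A hypothetical example, assuming detect_lang reports a zh variant for the full line:

    from magic_pdf.libs.ocr_content_type import ContentType
    from magic_pdf.dict2md.ocr_mkcontent import merge_para_with_text

    para_block = {"lines": [{"spans": [
        {"type": ContentType.Text, "content": ch} for ch in "深度学习"
    ]}]}
    print(merge_para_with_text(para_block))  # -> "深度学习", joined without spaces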
def para_to_standard_format(para, img_buket_path):
    para_content = {}
    if len(para) == 1:
        para_content = line_to_standard_format(para[0], img_buket_path)
    elif len(para) > 1:
        para_text = ''
        inline_equation_num = 0
        for line in para:
            for span in line['spans']:
                language = ''
                span_type = span.get('type')
                content = ""
                if span_type == ContentType.Text:
                    content = span['content']
                    language = detect_lang(content)
                    if language == 'en':  # split long words only for English; word-splitting Chinese text would lose characters
                        content = ocr_escape_special_markdown_char(split_long_words(content))
                    else:
                        content = ocr_escape_special_markdown_char(content)
                elif span_type == ContentType.InlineEquation:
                    content = f"${span['content']}$"
                    inline_equation_num += 1

                if language == 'en':  # English needs a space between adjacent pieces of content
                    para_text += content + ' '
                else:  # Chinese does not
                    para_text += content
        para_content = {
            'type': 'text',
            'text': para_text,
            'inline_equation_num': inline_equation_num
        }
    return para_content


def para_to_standard_format_v2(para_block, img_buket_path):
    para_content = {}  # fallback for block types not handled below
    para_type = para_block['type']
    if para_type == BlockType.Text:
        para_content = {
            'type': 'text',
            'text': merge_para_with_text(para_block),
        }
    elif para_type == BlockType.Title:
        para_content = {
            'type': 'text',
            'text': merge_para_with_text(para_block),
            'text_level': 1
        }
    elif para_type == BlockType.InterlineEquation:
        para_content = {
            'type': 'equation',
            'text': merge_para_with_text(para_block),
            'text_format': "latex"
        }
    elif para_type == BlockType.Image:
        para_content = {
            'type': 'image',
        }
        for block in para_block['blocks']:
            if block['type'] == BlockType.ImageBody:
                para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
            if block['type'] == BlockType.ImageCaption:
                para_content['img_caption'] = merge_para_with_text(block)
    elif para_type == BlockType.Table:
        para_content = {
            'type': 'table',
        }
        for block in para_block['blocks']:
            if block['type'] == BlockType.TableBody:
                para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
            if block['type'] == BlockType.TableCaption:
                para_content['table_caption'] = merge_para_with_text(block)
            if block['type'] == BlockType.TableFootnote:
                para_content['table_footnote'] = merge_para_with_text(block)

    return para_content


def make_standard_format_with_para(pdf_info_dict: list, img_buket_path: str):
    content_list = []
    for page_info in pdf_info_dict:
        paras_of_layout = page_info.get("para_blocks")
        if not paras_of_layout:
            continue
        for para_block in paras_of_layout:
            para_content = para_to_standard_format_v2(para_block, img_buket_path)
            content_list.append(para_content)
    return content_list


def line_to_standard_format(line, img_buket_path=""):
    # img_buket_path defaults to "" so that ocr_mk_mm_standard_format, which carries no bucket path, can call this too
    line_text = ""
    inline_equation_num = 0
    for span in line['spans']:
        if not span.get('content'):
            if not span.get('image_path'):
                continue
            else:
                if span['type'] == ContentType.Image:
                    content = {
                        'type': 'image',
                        'img_path': join_path(img_buket_path, span['image_path'])
                    }
                    return content
                elif span['type'] == ContentType.Table:
                    content = {
                        'type': 'table',
                        'img_path': join_path(img_buket_path, span['image_path'])
                    }
                    return content
        else:
            if span['type'] == ContentType.InterlineEquation:
                interline_equation = span['content']
                content = {
                    'type': 'equation',
                    'latex': f"$$\n{interline_equation}\n$$"
                }
                return content
            elif span['type'] == ContentType.InlineEquation:
                inline_equation = span['content']
                line_text += f"${inline_equation}$"
                inline_equation_num += 1
            elif span['type'] == ContentType.Text:
                text_content = ocr_escape_special_markdown_char(span['content'])  # escape special markdown characters
                line_text += text_content
    content = {
        'type': 'text',
        'text': line_text,
        'inline_equation_num': inline_equation_num
    }
    return content


def ocr_mk_mm_standard_format(pdf_info_dict: list):
    """
    content_list
    type        string  image/text/table/equation (interline equations are standalone items; inline equations stay merged into the text)
    latex       string  latex source of the content
    text        string  plain-text content
    md          string  markdown-formatted content
    img_path    string  s3://full/path/to/img.jpg
    """
    content_list = []
    for page_info in pdf_info_dict:
        blocks = page_info.get("preproc_blocks")
        if not blocks:
            continue
        for block in blocks:
            for line in block['lines']:
                content = line_to_standard_format(line)
                content_list.append(content)
    return content_list


def union_make(pdf_info_dict: list, make_mode: str, drop_mode: str, img_buket_path: str = ""):
    output_content = []
    for page_info in pdf_info_dict:
        if page_info.get("need_drop", False):
            drop_reason = page_info.get("drop_reason")
            if drop_mode == DropMode.NONE:
                pass
            elif drop_mode == DropMode.WHOLE_PDF:
                raise Exception(f"drop_mode is {DropMode.WHOLE_PDF} , drop_reason is {drop_reason}")
            elif drop_mode == DropMode.SINGLE_PAGE:
                logger.warning(f"drop_mode is {DropMode.SINGLE_PAGE} , drop_reason is {drop_reason}")
                continue
            else:
                raise Exception("drop_mode can not be null")

        paras_of_layout = page_info.get("para_blocks")
        if not paras_of_layout:
            continue
        if make_mode == MakeMode.MM_MD:
            page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
            output_content.extend(page_markdown)
        elif make_mode == MakeMode.NLP_MD:
            page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "nlp")
            output_content.extend(page_markdown)
        elif make_mode == MakeMode.STANDARD_FORMAT:
            for para_block in paras_of_layout:
                para_content = para_to_standard_format_v2(para_block, img_buket_path)
                output_content.append(para_content)
    if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
        return '\n\n'.join(output_content)
    elif make_mode == MakeMode.STANDARD_FORMAT:
        return output_content
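A minimal driving sketch for union_make; the two-page pdf_info_dict below is a hand-built skeleton following how the functions read it, not real pipeline output:

    from magic_pdf.dict2md.ocr_mkcontent import union_make
    from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
    from magic_pdf.libs.ocr_content_type import BlockType, ContentType

    kept_page = {"para_blocks": [{
        "type": BlockType.Text,
        "lines": [{"spans": [{"type": ContentType.Text, "content": "Hello MinerU"}]}],
    }]}
    dropped_page = {"need_drop": True, "drop_reason": "demo", "para_blocks": []}
    md = union_make([kept_page, dropped_page], MakeMode.NLP_MD, DropMode.SINGLE_PAGE)
    print(md)  # the dropped page is skipped with a warning; only "Hello MinerU" remains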
magic_pdf/filter/__init__.py ADDED
File without changes
magic_pdf/filter/pdf_classify_by_type.py ADDED
@@ -0,0 +1,393 @@
"""
Classify whether a pdf is text-based, using the results from meta_scan.
Working definitions:
1. A pdf is text-based if it satisfies any of the following:
   1. sample N pages at random; some page carries more than 100 characters of text
   2. at least one page contains zero images
2. A pdf is scanned if it satisfies any of the following:
   1. ~~the largest image on 80% of the pages has the same size and covers more than 0.6 of the page area~~
   2. most pages carry text of identical length.

"""
import json
import sys
from collections import Counter

import click
import numpy as np
from loguru import logger

from magic_pdf.libs.commons import mymax, get_top_percent_list
from magic_pdf.filter.pdf_meta_scan import scan_max_page, junk_limit_min

TEXT_LEN_THRESHOLD = 100
AVG_TEXT_LEN_THRESHOLD = 100
TEXT_LEN_SAMPLE_RATIO = 0.1  # sample 10% of the pages for the text-length statistics


# A stitching scheme: reassemble the split image strips of certain scanned pdfs into one whole-page image
def merge_images(image_list, page_width, page_height, max_offset=5, max_gap=2):
    # first use a set to drop images whose bboxes coincide
    image_list_result = []
    for page_images in image_list:
        page_result = []
        dedup = set()
        for img in page_images:
            x0, y0, x1, y1, img_bojid = img
            if (x0, y0, x1, y1) in dedup:  # duplicate bboxes occur here and add nothing, so drop them
                continue
            else:
                dedup.add((x0, y0, x1, y1))
                page_result.append([x0, y0, x1, y1, img_bojid])
        image_list_result.append(page_result)

    # next, merge the stitchable images on each page
    merged_images = []
    for page_images in image_list_result:
        if not page_images:
            continue

        # sort the page's images top-to-bottom, left-to-right
        page_images.sort(key=lambda img: (img[1], img[0]))

        merged = [page_images[0]]

        for img in page_images[1:]:
            x0, y0, x1, y1, imgid = img

            last_img = merged[-1]
            last_x0, last_y0, last_x1, last_y1, last_imgid = last_img

            # a precondition for stitching: the image covers at least 90% of the page width or height
            full_width = abs(x1 - x0) >= page_width * 0.9
            full_height = abs(y1 - y0) >= page_height * 0.9

            close1 = close2 = False  # reset per image so stale values from earlier iterations cannot leak in

            # if the width qualifies, check whether the images can be stitched vertically
            if full_width:
                # vertical stitching needs two things: the left and right edges may each drift by at most max_offset, and the gap between the first image's bottom edge and the second image's top edge may be at most max_gap
                close1 = (last_x0 - max_offset) <= x0 <= (last_x0 + max_offset) and (last_x1 - max_offset) <= x1 <= (
                        last_x1 + max_offset) and (last_y1 - max_gap) <= y0 <= (last_y1 + max_gap)

            # if the height qualifies, check whether the images can be stitched horizontally
            if full_height:
                # horizontal stitching needs two things: the top and bottom edges may each drift by at most max_offset, and the gap between the first image's right edge and the second image's left edge may be at most max_gap
                close2 = (last_y0 - max_offset) <= y0 <= (last_y0 + max_offset) and (last_y1 - max_offset) <= y1 <= (
                        last_y1 + max_offset) and (last_x1 - max_gap) <= x0 <= (last_x1 + max_gap)

            # Check if the image can be merged with the last image
            if (full_width and close1) or (full_height and close2):
                # Merge the image with the last image
                merged[-1] = [min(x0, last_x0), min(y0, last_y0),
                              max(x1, last_x1), max(y1, last_y1), imgid]
            else:
                # Add the image as a new image
                merged.append(img)

        merged_images.append(merged)

    return merged_images
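A toy run of merge_images, stitching a page stored as two full-width strips; the coordinates are made up for illustration:

    from magic_pdf.filter.pdf_classify_by_type import merge_images

    page_w, page_h = 600, 800
    strips = [[
        [0, 0, 600, 400, 11],    # top strip, full width
        [0, 401, 600, 800, 12],  # bottom strip, 1pt below the first one's bottom edge
    ]]
    print(merge_images(strips, page_w, page_h))
    # -> [[[0, 0, 600, 800, 12]]]: one merged full-page bbox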
def classify_by_area(total_page: int, page_width, page_height, img_sz_list, text_len_list: list):
    """
    Return False when the largest image on at least half of the pages has the same size and covers more than half of the page area; otherwise return True.
    :param total_page:
    :param page_width:
    :param page_height:
    :param img_sz_list:
    :return:
    """
    # # A pdf is text-based as soon as one page has no images, provided that page also carries no text. Some scanned pdfs contain blank pages with neither images nor text.
    # if any([len(img_sz) == 0 for img_sz in img_sz_list]):  # some page carries no image
    #     # find the indices of those pages
    #     empty_page_index = [i for i, img_sz in enumerate(img_sz_list) if len(img_sz) == 0]
    #     # then check whether those pages carry text
    #     text_len_at_page_idx = [text_len for i, text_len in enumerate(text_len_list) if i in empty_page_index and text_len > 0]
    #     if len(text_len_at_page_idx) > TEXT_LEN_THRESHOLD:  # no images but some text suggests a text pdf; with no text we cannot decide here and defer to the next step; we now require the page's text volume to exceed a threshold
    #         return True

    # drop, by objid, images that recur more than the threshold; they are hidden transparent layers, recognizable because they all share one id
    # first count the occurrences of each id
    objid_cnt = Counter([objid for page_img_sz in img_sz_list for _, _, _, _, objid in page_img_sz])
    # then drop the ones that recur too often
    if total_page >= scan_max_page:  # the new meta_scan only scans the first scan_max_page pages; when the pdf is longer, treat total_page as scan_max_page
        total_page = scan_max_page

    repeat_threshold = 2  # set the bad-image threshold to 2
    # repeat_threshold = min(2, total_page)  # when total_page is 1, repeat_threshold becomes 1 and every img is misjudged as a bad_img
    bad_image_objid = set([objid for objid, cnt in objid_cnt.items() if cnt >= repeat_threshold])
    # bad_image_page_idx = [i for i, page_img_sz in enumerate(img_sz_list) if any([objid in bad_image_objid for _, _, _, _, objid in page_img_sz])]
    # text_len_at_bad_image_page_idx = [text_len for i, text_len in enumerate(text_len_list) if i in bad_image_page_idx and text_len > 0]

    # special case: a text pdf that overlays a huge transparent image on every page, "huge" meaning more than 90% of the page area
    # fake_image_ids = [objid for objid in bad_image_objid if
    #                   any([abs((x1 - x0) * (y1 - y0) / page_width * page_height) > 0.9 for images in img_sz_list for
    #                        x0, y0, x1, y1, _ in images])]  # earlier code; the any() was always true, reason unclear
    # fake_image_ids = [objid for objid in bad_image_objid for images in img_sz_list for x0, y0, x1, y1, img_id in images
    #                   if img_id == objid and abs((x1 - x0) * (y1 - y0)) / (page_width * page_height) > 0.9]

    # if len(fake_image_ids) > 0 and any([l > TEXT_LEN_THRESHOLD for l in text_len_at_bad_image_page_idx]):  # the pages holding these transparent images carry text above the threshold
    #     return True

    img_sz_list = [[img_sz for img_sz in page_img_sz if img_sz[-1] not in bad_image_objid] for page_img_sz in
                   img_sz_list]  # filter out the recurring images

    # some scanned pdfs split one page into many strips; stitch them back together before measuring
    img_sz_list = merge_images(img_sz_list, page_width, page_height)

    # compute the largest image area on each page, then its ratio to the page area
    max_image_area_per_page = [mymax([(x1 - x0) * (y1 - y0) for x0, y0, x1, y1, _ in page_img_sz]) for page_img_sz in
                               img_sz_list]
    page_area = page_width * page_height
    max_image_area_per_page = [area / page_area for area in max_image_area_per_page]
    max_image_area_per_page = [area for area in max_image_area_per_page if area > 0.5]

    if len(max_image_area_per_page) >= 0.5 * total_page:  # threshold lowered from 0.8 to 0.5 to cover the 2-of-3-page and 1-of-2-page cases
        # this holds only because the recurring images were removed above; those are hidden transparent layers that all share one id
        return False
    else:
        return True


def classify_by_text_len(text_len_list: list, total_page: int):
    """
    Sample 10% of the pages at random, or all of them if that is fewer than 5 pages.
    If any sampled page carries more than TEXT_LEN_THRESHOLD characters, the pdf is text-based.
    :param total_page:
    :param text_len_list:
    :return:
    """
    select_page_cnt = int(total_page * TEXT_LEN_SAMPLE_RATIO)  # sample 10% of the pages
    if select_page_cnt < 5:
        select_page_cnt = total_page

    # # skip the first and last 10 pages
    # if total_page > 20:  # if there are more than 20 pages
    #     page_range = list(range(10, total_page - 10))  # from page 11 to the 11th-from-last
    # else:
    #     page_range = list(range(total_page))  # otherwise use all pages
    # page_num = np.random.choice(page_range, min(select_page_cnt, len(page_range)), replace=False)
    # skipping 10 pages at both ends is awkward for pdfs with only 21 or 22 pages: if the one or two sampled middle pages happen to hold no text we misclassify; with the avg_words rule in place this rule can be ignored
    page_num = np.random.choice(total_page, select_page_cnt, replace=False)
    text_len_lst = [text_len_list[i] for i in page_num]
    is_text_pdf = any([text_len > TEXT_LEN_THRESHOLD for text_len in text_len_lst])
    return is_text_pdf


def classify_by_avg_words(text_len_list: list):
    """
    Supplementary rule: if the average per-page character count is below AVG_TEXT_LEN_THRESHOLD, the pdf is not text-based.
    This mainly catches image collections.
    :param text_len_list:
    :return:
    """
    sum_words = sum(text_len_list)
    count_of_numbers = len(text_len_list)
    if count_of_numbers == 0:
        is_text_pdf = False
    else:
        avg_words = round(sum_words / count_of_numbers)
        if avg_words > AVG_TEXT_LEN_THRESHOLD:
            is_text_pdf = True
        else:
            is_text_pdf = False

    return is_text_pdf


def classify_by_img_num(img_sz_list: list, img_num_list: list):
    """
    Supplementary rule: some scanned pdfs embed every scanned page on every page; meta_scan deduplicates them,
    so their fingerprint is an img_sz_list made entirely of empty entries while every page's count in img_num_list is large and identical.
    :param img_sz_list:
    :param img_num_list:
    :return:
    """
    # count the non-empty entries in img_sz_list
    count_img_sz_list_not_none = sum(1 for item in img_sz_list if item)
    # take the top 80% of the entries
    top_eighty_percent = get_top_percent_list(img_num_list, 0.8)
    # at most one non-empty entry in img_sz_list, the top 80% of the counts all equal, and the maximum at least junk_limit_min
    if count_img_sz_list_not_none <= 1 and len(set(top_eighty_percent)) == 1 and max(img_num_list) >= junk_limit_min:

        # use max and min to check whether every value in the list is equal
        # min_imgs = min(img_num_list)
        # max_imgs = max(img_num_list)
        #
        # if count_img_sz_list_not_none == 0 and max_imgs == min_imgs and max_imgs >= junk_limit_min:
        return False  # if this condition holds, the pdf is definitely not text-based
    else:
        return True  # otherwise it may be text-based; let the other rules decide


def classify_by_text_layout(text_layout_per_page: list):
    """
    Judge whether the text layout is predominantly vertical.

    Args:
        text_layout_per_page (list): text layout per page; 'vertical' means vertical text, 'horizontal' means horizontal text.

    Returns:
        bool: False if the layout is predominantly vertical, True otherwise.
    """
    # count the vertical pages in text_layout_per_page
    count_vertical = sum(1 for item in text_layout_per_page if item == 'vertical')
    # count the horizontal pages in text_layout_per_page
    count_horizontal = sum(1 for item in text_layout_per_page if item == 'horizontal')
    # compute the share of vertical pages
    known_layout_cnt = count_vertical + count_horizontal
    if known_layout_cnt != 0:
        ratio = count_vertical / known_layout_cnt
        if ratio >= 0.5:  # threshold set to 0.5 to cover the 2-of-3-page and 1-of-2-page cases
            return False  # predominantly vertical layout: treat as not text-based
        else:
            return True  # predominantly horizontal layout: treat as text-based
    else:
        return False  # unknown layout: default to not text-based


def classify_by_img_narrow_strips(page_width, page_height, img_sz_list):
    """
    Judge whether pages are composed of narrow strips, using two conditions:
    1. an image's width or height reaches 90% of the page's, and its long side is several times its short side
    2. at least 80% of the images on the page satisfy condition 1

    Args:
        page_width (float): page width
        page_height (float): page height
        img_sz_list (list): image rectangles per page; each item is a tuple (x0, y0, x1, y1, size), where (x0, y0) is the top-left corner, (x1, y1) the bottom-right corner, and size the image size

    Returns:
        bool: True if fewer than half of the pages qualify, False otherwise
    """

    def is_narrow_strip(img):
        x0, y0, x1, y1, _ = img
        width, height = x1 - x0, y1 - y0
        return any([
            # the image is at least 90% of the page width and at least 4x as wide as it is tall
            width >= page_width * 0.9 and width >= height * 4,
            # the image is at least 90% of the page height and at least 4x as tall as it is wide
            height >= page_height * 0.9 and height >= width * 4,
        ])

    # number of qualifying pages
    narrow_strip_pages_count = 0

    # iterate over all pages
    for page_img_list in img_sz_list:
        # skip empty pages
        if not page_img_list:
            continue

        # total number of images on the page
        total_images = len(page_img_list)

        # number of narrow-strip images on the page
        narrow_strip_images_count = 0
        for img in page_img_list:
            if is_narrow_strip(img):
                narrow_strip_images_count += 1
        # skip pages with fewer than 5 narrow-strip images
        if narrow_strip_images_count < 5:
            continue
        else:
            # if at least 80% of the images are narrow strips, count the page as qualifying
            if narrow_strip_images_count / total_images >= 0.8:
                narrow_strip_pages_count += 1

    # share of qualifying pages
    narrow_strip_pages_ratio = narrow_strip_pages_count / len(img_sz_list)

    return narrow_strip_pages_ratio < 0.5


def classify(total_page: int, page_width, page_height, img_sz_list: list, text_len_list: list, img_num_list: list,
             text_layout_list: list, invalid_chars: bool):
    """
    Image and page dimensions here are in pts.
    :param total_page:
    :param page_width:
    :param page_height:
    :param img_sz_list:
    :param text_len_list:
    :param img_num_list:
    :param text_layout_list:
    :param invalid_chars:
    :return:
    """
    results = {
        'by_image_area': classify_by_area(total_page, page_width, page_height, img_sz_list, text_len_list),
        'by_text_len': classify_by_text_len(text_len_list, total_page),
        'by_avg_words': classify_by_avg_words(text_len_list),
        'by_img_num': classify_by_img_num(img_sz_list, img_num_list),
        'by_text_layout': classify_by_text_layout(text_layout_list),
        'by_img_narrow_strips': classify_by_img_narrow_strips(page_width, page_height, img_sz_list),
        'by_invalid_chars': invalid_chars,
    }

    if all(results.values()):
        return True, results
    elif not any(results.values()):
        return False, results
    else:
        # loguru already writes to stderr, so the old print-style file=sys.stderr argument is dropped here;
        # this branch makes it easy to spot the unusual pdfs and tune the classifier for them
        logger.warning(
            f"pdf is not classified by area and text_len, by_image_area: {results['by_image_area']},"
            f" by_text: {results['by_text_len']}, by_avg_words: {results['by_avg_words']}, by_img_num: {results['by_img_num']},"
            f" by_text_layout: {results['by_text_layout']}, by_img_narrow_strips: {results['by_img_narrow_strips']},"
            f" by_invalid_chars: {results['by_invalid_chars']}")
        return False, results


@click.command()
@click.option("--json-file", type=str, help="pdf info")
def main(json_file):
    if json_file is None:
        print("json_file is None", file=sys.stderr)
        exit(0)
    try:
        with open(json_file, "r") as f:
            for l in f:
                if l.strip() == "":
                    continue
                o = json.loads(l)
                total_page = o["total_page"]
                page_width = o["page_width_pts"]
                page_height = o["page_height_pts"]
                img_sz_list = o["image_info_per_page"]
                text_len_list = o['text_len_per_page']
                text_layout_list = o['text_layout_per_page']
                pdf_path = o['pdf_path']
                is_encrypted = o['is_encrypted']
                is_needs_password = o['is_needs_password']
                if is_encrypted or total_page == 0 or is_needs_password:  # skip encrypted, password-protected, and zero-page pdfs
                    continue
                # NOTE: this CLI call predates the current classify() signature, which also expects
                # img_num_list and invalid_chars and returns a (bool, results) tuple
                tag = classify(total_page, page_width, page_height, img_sz_list, text_len_list, text_layout_list)
                o['is_text_pdf'] = tag
                print(json.dumps(o, ensure_ascii=False))
    except Exception as e:
        print("ERROR: ", e, file=sys.stderr)


if __name__ == "__main__":
    main()
    # false = False
    # true = True
    # null = None
+ # o = {"pdf_path":"s3://llm-raw-snew/llm-raw-the-eye/raw/World%20Tracker%20Library/worldtracker.org/media/library/Science/Computer%20Science/Shreiner%20-%20OpenGL%20Programming%20Guide%206e%20%5BThe%20Redbook%5D%20%28AW%2C%202008%29.pdf","is_needs_password":false,"is_encrypted":false,"total_page":978,"page_width_pts":368,"page_height_pts":513,"image_info_per_page":[[[0,0,368,513,10037]],[[0,0,368,513,4]],[[0,0,368,513,7]],[[0,0,368,513,10]],[[0,0,368,513,13]],[[0,0,368,513,16]],[[0,0,368,513,19]],[[0,0,368,513,22]],[[0,0,368,513,25]],[[0,0,368,513,28]],[[0,0,368,513,31]],[[0,0,368,513,34]],[[0,0,368,513,37]],[[0,0,368,513,40]],[[0,0,368,513,43]],[[0,0,368,513,46]],[[0,0,368,513,49]],[[0,0,368,513,52]],[[0,0,368,513,55]],[[0,0,368,513,58]],[[0,0,368,513,61]],[[0,0,368,513,64]],[[0,0,368,513,67]],[[0,0,368,513,70]],[[0,0,368,513,73]],[[0,0,368,516,76]],[[0,0,368,516,79]],[[0,0,368,513,82]],[[0,0,368,513,85]],[[0,0,368,513,88]],[[0,0,368,513,91]],[[0,0,368,513,94]],[[0,0,368,513,97]],[[0,0,368,513,100]],[[0,0,368,513,103]],[[0,0,368,513,106]],[[0,0,368,513,109]],[[0,0,368,513,112]],[[0,0,368,513,115]],[[0,0,368,513,118]],[[0,0,368,513,121]],[[0,0,368,513,124]],[[0,0,368,513,127]],[[0,0,368,513,130]],[[0,0,368,513,133]],[[0,0,368,513,136]],[[0,0,368,513,139]],[[0,0,368,513,142]],[[0,0,368,513,145]],[[0,0,368,513,148]],[[0,0,368,513,151]],[[0,0,368,513,154]],[[0,0,368,513,157]],[[0,0,368,513,160]],[[0,0,368,513,163]],[[0,0,368,513,166]],[[0,0,368,513,169]],[[0,0,368,513,172]],[[0,0,368,513,175]],[[0,0,368,513,178]],[[0,0,368,513,181]],[[0,0,368,513,184]],[[0,0,368,513,187]],[[0,0,368,513,190]],[[0,0,368,513,193]],[[0,0,368,513,196]],[[0,0,368,513,199]],[[0,0,368,513,202]],[[0,0,368,513,205]],[[0,0,368,513,208]],[[0,0,368,513,211]],[[0,0,368,513,214]],[[0,0,368,513,217]],[[0,0,368,513,220]],[[0,0,368,513,223]],[[0,0,368,513,226]],[[0,0,368,513,229]],[[0,0,368,513,232]],[[0,0,368,513,235]],[[0,0,368,513,238]],[[0,0,368,513,241]],[[0,0,368,513,244]],[[0,0,368,513,247]],[[0,0,368,513,250]],[[0,0,368,513,253]],[[0,0,368,513,256]],[[0,0,368,513,259]],[[0,0,368,513,262]],[[0,0,368,513,265]],[[0,0,368,513,268]],[[0,0,368,513,271]],[[0,0,368,513,274]],[[0,0,368,513,277]],[[0,0,368,513,280]],[[0,0,368,513,283]],[[0,0,368,513,286]],[[0,0,368,513,289]],[[0,0,368,513,292]],[[0,0,368,513,295]],[[0,0,368,513,298]],[[0,0,368,513,301]],[[0,0,368,513,304]],[[0,0,368,513,307]],[[0,0,368,513,310]],[[0,0,368,513,313]],[[0,0,368,513,316]],[[0,0,368,513,319]],[[0,0,368,513,322]],[[0,0,368,513,325]],[[0,0,368,513,328]],[[0,0,368,513,331]],[[0,0,368,513,334]],[[0,0,368,513,337]],[[0,0,368,513,340]],[[0,0,368,513,343]],[[0,0,368,513,346]],[[0,0,368,513,349]],[[0,0,368,513,352]],[[0,0,368,513,355]],[[0,0,368,513,358]],[[0,0,368,513,361]],[[0,0,368,513,364]],[[0,0,368,513,367]],[[0,0,368,513,370]],[[0,0,368,513,373]],[[0,0,368,513,376]],[[0,0,368,513,379]],[[0,0,368,513,382]],[[0,0,368,513,385]],[[0,0,368,513,388]],[[0,0,368,513,391]],[[0,0,368,513,394]],[[0,0,368,513,397]],[[0,0,368,513,400]],[[0,0,368,513,403]],[[0,0,368,513,406]],[[0,0,368,513,409]],[[0,0,368,513,412]],[[0,0,368,513,415]],[[0,0,368,513,418]],[[0,0,368,513,421]],[[0,0,368,513,424]],[[0,0,368,513,427]],[[0,0,368,513,430]],[[0,0,368,513,433]],[[0,0,368,513,436]],[[0,0,368,513,439]],[[0,0,368,513,442]],[[0,0,368,513,445]],[[0,0,368,513,448]],[[0,0,368,513,451]],[[0,0,368,513,454]],[[0,0,368,513,457]],[[0,0,368,513,460]],[[0,0,368,513,463]],[[0,0,368,513,466]],[[0,0,368,513,469]],[[0,0,368,513,472]],[[0,0,368,513,475]],[[0,0,368,513,478]],[[0,0,368,513,481
]],[[0,0,368,513,484]],[[0,0,368,513,487]],[[0,0,368,513,490]],[[0,0,368,513,493]],[[0,0,368,513,496]],[[0,0,368,513,499]],[[0,0,368,513,502]],[[0,0,368,513,505]],[[0,0,368,513,508]],[[0,0,368,513,511]],[[0,0,368,513,514]],[[0,0,368,513,517]],[[0,0,368,513,520]],[[0,0,368,513,523]],[[0,0,368,513,526]],[[0,0,368,513,529]],[[0,0,368,513,532]],[[0,0,368,513,535]],[[0,0,368,513,538]],[[0,0,368,513,541]],[[0,0,368,513,544]],[[0,0,368,513,547]],[[0,0,368,513,550]],[[0,0,368,513,553]],[[0,0,368,513,556]],[[0,0,368,513,559]],[[0,0,368,513,562]],[[0,0,368,513,565]],[[0,0,368,513,568]],[[0,0,368,513,571]],[[0,0,368,513,574]],[[0,0,368,513,577]],[[0,0,368,513,580]],[[0,0,368,513,583]],[[0,0,368,513,586]],[[0,0,368,513,589]],[[0,0,368,513,592]],[[0,0,368,513,595]],[[0,0,368,513,598]],[[0,0,368,513,601]],[[0,0,368,513,604]],[[0,0,368,513,607]],[[0,0,368,513,610]],[[0,0,368,513,613]],[[0,0,368,513,616]],[[0,0,368,513,619]],[[0,0,368,513,622]],[[0,0,368,513,625]],[[0,0,368,513,628]],[[0,0,368,513,631]],[[0,0,368,513,634]],[[0,0,368,513,637]],[[0,0,368,513,640]],[[0,0,368,513,643]],[[0,0,368,513,646]],[[0,0,368,513,649]],[[0,0,368,513,652]],[[0,0,368,513,655]],[[0,0,368,513,658]],[[0,0,368,513,661]],[[0,0,368,513,664]],[[0,0,368,513,667]],[[0,0,368,513,670]],[[0,0,368,513,673]],[[0,0,368,513,676]],[[0,0,368,513,679]],[[0,0,368,513,682]],[[0,0,368,513,685]],[[0,0,368,513,688]],[[0,0,368,513,691]],[[0,0,368,513,694]],[[0,0,368,513,697]],[[0,0,368,513,700]],[[0,0,368,513,703]],[[0,0,368,513,706]],[[0,0,368,513,709]],[[0,0,368,513,712]],[[0,0,368,513,715]],[[0,0,368,513,718]],[[0,0,368,513,721]],[[0,0,368,513,724]],[[0,0,368,513,727]],[[0,0,368,513,730]],[[0,0,368,513,733]],[[0,0,368,513,736]],[[0,0,368,513,739]],[[0,0,368,513,742]],[[0,0,368,513,745]],[[0,0,368,513,748]],[[0,0,368,513,751]],[[0,0,368,513,754]],[[0,0,368,513,757]],[[0,0,368,513,760]],[[0,0,368,513,763]],[[0,0,368,513,766]],[[0,0,368,513,769]],[[0,0,368,513,772]],[[0,0,368,513,775]],[[0,0,368,513,778]],[[0,0,368,513,781]],[[0,0,368,513,784]],[[0,0,368,513,787]],[[0,0,368,513,790]],[[0,0,368,513,793]],[[0,0,368,513,796]],[[0,0,368,513,799]],[[0,0,368,513,802]],[[0,0,368,513,805]],[[0,0,368,513,808]],[[0,0,368,513,811]],[[0,0,368,513,814]],[[0,0,368,513,817]],[[0,0,368,513,820]],[[0,0,368,513,823]],[[0,0,368,513,826]],[[0,0,368,513,829]],[[0,0,368,513,832]],[[0,0,368,513,835]],[[0,0,368,513,838]],[[0,0,368,513,841]],[[0,0,368,513,844]],[[0,0,368,513,847]],[[0,0,368,513,850]],[[0,0,368,513,853]],[[0,0,368,513,856]],[[0,0,368,513,859]],[[0,0,368,513,862]],[[0,0,368,513,865]],[[0,0,368,513,868]],[[0,0,368,513,871]],[[0,0,368,513,874]],[[0,0,368,513,877]],[[0,0,368,513,880]],[[0,0,368,513,883]],[[0,0,368,513,886]],[[0,0,368,513,889]],[[0,0,368,513,892]],[[0,0,368,513,895]],[[0,0,368,513,898]],[[0,0,368,513,901]],[[0,0,368,513,904]],[[0,0,368,513,907]],[[0,0,368,513,910]],[[0,0,368,513,913]],[[0,0,368,513,916]],[[0,0,368,513,919]],[[0,0,368,513,922]],[[0,0,368,513,925]],[[0,0,368,513,928]],[[0,0,368,513,931]],[[0,0,368,513,934]],[[0,0,368,513,937]],[[0,0,368,513,940]],[[0,0,368,513,943]],[[0,0,368,513,946]],[[0,0,368,513,949]],[[0,0,368,513,952]],[[0,0,368,513,955]],[[0,0,368,513,958]],[[0,0,368,513,961]],[[0,0,368,513,964]],[[0,0,368,513,967]],[[0,0,368,513,970]],[[0,0,368,513,973]],[[0,0,368,513,976]],[[0,0,368,513,979]],[[0,0,368,513,982]],[[0,0,368,513,985]],[[0,0,368,513,988]],[[0,0,368,513,991]],[[0,0,368,513,994]],[[0,0,368,513,997]],[[0,0,368,513,1000]],[[0,0,368,513,1003]],[[0,0,368,513,1006]],[[0,0,368,513,1009]],[[0,0,368,513,1012]],[[0,0,3
68,513,1015]],[[0,0,368,513,1018]],[[0,0,368,513,2797]],[[0,0,368,513,2798]],[[0,0,368,513,2799]],[[0,0,368,513,2800]],[[0,0,368,513,2801]],[[0,0,368,513,2802]],[[0,0,368,513,2803]],[[0,0,368,513,2804]],[[0,0,368,513,2805]],[[0,0,368,513,2806]],[[0,0,368,513,2807]],[[0,0,368,513,2808]],[[0,0,368,513,2809]],[[0,0,368,513,2810]],[[0,0,368,513,2811]],[[0,0,368,513,2812]],[[0,0,368,513,2813]],[[0,0,368,513,2814]],[[0,0,368,513,2815]],[[0,0,368,513,2816]],[[0,0,368,513,2817]],[[0,0,368,513,2818]],[[0,0,368,513,2819]],[[0,0,368,513,2820]],[[0,0,368,513,2821]],[[0,0,368,513,2822]],[[0,0,368,513,2823]],[[0,0,368,513,2824]],[[0,0,368,513,2825]],[[0,0,368,513,2826]],[[0,0,368,513,2827]],[[0,0,368,513,2828]],[[0,0,368,513,2829]],[[0,0,368,513,2830]],[[0,0,368,513,2831]],[[0,0,368,513,2832]],[[0,0,368,513,2833]],[[0,0,368,513,2834]],[[0,0,368,513,2835]],[[0,0,368,513,2836]],[[0,0,368,513,2837]],[[0,0,368,513,2838]],[[0,0,368,513,2839]],[[0,0,368,513,2840]],[[0,0,368,513,2841]],[[0,0,368,513,2842]],[[0,0,368,513,2843]],[[0,0,368,513,2844]],[[0,0,368,513,2845]],[[0,0,368,513,2846]],[[0,0,368,513,2847]],[[0,0,368,513,2848]],[[0,0,368,513,2849]],[[0,0,368,513,2850]],[[0,0,368,513,2851]],[[0,0,368,513,2852]],[[0,0,368,513,2853]],[[0,0,368,513,2854]],[[0,0,368,513,2855]],[[0,0,368,513,2856]],[[0,0,368,513,2857]],[[0,0,368,513,2858]],[[0,0,368,513,2859]],[[0,0,368,513,2860]],[[0,0,368,513,2861]],[[0,0,368,513,2862]],[[0,0,368,513,2863]],[[0,0,368,513,2864]],[[0,0,368,513,2797]],[[0,0,368,513,2798]],[[0,0,368,513,2799]],[[0,0,368,513,2800]],[[0,0,368,513,2801]],[[0,0,368,513,2802]],[[0,0,368,513,2803]],[[0,0,368,513,2804]],[[0,0,368,513,2805]],[[0,0,368,513,2806]],[[0,0,368,513,2807]],[[0,0,368,513,2808]],[[0,0,368,513,2809]],[[0,0,368,513,2810]],[[0,0,368,513,2811]],[[0,0,368,513,2812]],[[0,0,368,513,2813]],[[0,0,368,513,2814]],[[0,0,368,513,2815]],[[0,0,368,513,2816]],[[0,0,368,513,2817]],[[0,0,368,513,2818]],[[0,0,368,513,2819]],[[0,0,368,513,2820]],[[0,0,368,513,2821]],[[0,0,368,513,2822]],[[0,0,368,513,2823]],[[0,0,368,513,2824]],[[0,0,368,513,2825]],[[0,0,368,513,2826]],[[0,0,368,513,2827]],[[0,0,368,513,2828]],[[0,0,368,513,2829]],[[0,0,368,513,2830]],[[0,0,368,513,2831]],[[0,0,368,513,2832]],[[0,0,368,513,2833]],[[0,0,368,513,2834]],[[0,0,368,513,2835]],[[0,0,368,513,2836]],[[0,0,368,513,2837]],[[0,0,368,513,2838]],[[0,0,368,513,2839]],[[0,0,368,513,2840]],[[0,0,368,513,2841]],[[0,0,368,513,2842]],[[0,0,368,513,2843]],[[0,0,368,513,2844]],[[0,0,368,513,2845]],[[0,0,368,513,2846]],[[0,0,368,513,2847]],[[0,0,368,513,2848]],[[0,0,368,513,2849]],[[0,0,368,513,2850]],[[0,0,368,513,2851]],[[0,0,368,513,2852]],[[0,0,368,513,2853]],[[0,0,368,513,2854]],[[0,0,368,513,2855]],[[0,0,368,513,2856]],[[0,0,368,513,2857]],[[0,0,368,513,2858]],[[0,0,368,513,2859]],[[0,0,368,513,2860]],[[0,0,368,513,2861]],[[0,0,368,513,2862]],[[0,0,368,513,2863]],[[0,0,368,513,2864]],[[0,0,368,513,1293]],[[0,0,368,513,1296]],[[0,0,368,513,1299]],[[0,0,368,513,1302]],[[0,0,368,513,1305]],[[0,0,368,513,1308]],[[0,0,368,513,1311]],[[0,0,368,513,1314]],[[0,0,368,513,1317]],[[0,0,368,513,1320]],[[0,0,368,513,1323]],[[0,0,368,513,1326]],[[0,0,368,513,1329]],[[0,0,368,513,1332]],[[0,0,368,513,1335]],[[0,0,368,513,1338]],[[0,0,368,513,1341]],[[0,0,368,513,1344]],[[0,0,368,513,1347]],[[0,0,368,513,1350]],[[0,0,368,513,1353]],[[0,0,368,513,1356]],[[0,0,368,513,1359]],[[0,0,368,513,1362]],[[0,0,368,513,1365]],[[0,0,368,513,1368]],[[0,0,368,513,1371]],[[0,0,368,513,1374]],[[0,0,368,513,1377]],[[0,0,368,513,1380]],[[0,0,368,513,1383]],[[0,0,368,513
,1386]],[[0,0,368,513,1389]],[[0,0,368,513,1392]],[[0,0,368,513,1395]],[[0,0,368,513,1398]],[[0,0,368,513,1401]],[[0,0,368,513,1404]],[[0,0,368,513,1407]],[[0,0,368,513,1410]],[[0,0,368,513,1413]],[[0,0,368,513,1416]],[[0,0,368,513,1419]],[[0,0,368,513,1422]],[[0,0,368,513,1425]],[[0,0,368,513,1428]],[[0,0,368,513,1431]],[[0,0,368,513,1434]],[[0,0,368,513,1437]],[[0,0,368,513,1440]],[[0,0,368,513,1443]],[[0,0,368,513,1446]],[[0,0,368,513,1449]],[[0,0,368,513,1452]],[[0,0,368,513,1455]],[[0,0,368,513,1458]],[[0,0,368,513,1461]],[[0,0,368,513,1464]],[[0,0,368,513,1467]],[[0,0,368,513,1470]],[[0,0,368,513,1473]],[[0,0,368,513,1476]],[[0,0,368,513,1479]],[[0,0,368,513,1482]],[[0,0,368,513,1485]],[[0,0,368,513,1488]],[[0,0,368,513,1491]],[[0,0,368,513,1494]],[[0,0,368,513,1497]],[[0,0,368,513,1500]],[[0,0,368,513,1503]],[[0,0,368,513,1506]],[[0,0,368,513,1509]],[[0,0,368,513,1512]],[[0,0,368,513,1515]],[[0,0,368,513,1518]],[[0,0,368,513,1521]],[[0,0,368,513,1524]],[[0,0,368,513,1527]],[[0,0,368,513,1530]],[[0,0,368,513,1533]],[[0,0,368,513,1536]],[[0,0,368,513,1539]],[[0,0,368,513,1542]],[[0,0,368,513,1545]],[[0,0,368,513,1548]],[[0,0,368,513,1551]],[[0,0,368,513,1554]],[[0,0,368,513,1557]],[[0,0,368,513,1560]],[[0,0,368,513,1563]],[[0,0,368,513,1566]],[[0,0,368,513,1569]],[[0,0,368,513,1572]],[[0,0,368,513,1575]],[[0,0,368,513,1578]],[[0,0,368,513,1581]],[[0,0,368,513,1584]],[[0,0,368,513,1587]],[[0,0,368,513,1590]],[[0,0,368,513,1593]],[[0,0,368,513,1596]],[[0,0,368,513,1599]],[[0,0,368,513,1602]],[[0,0,368,513,1605]],[[0,0,368,513,1608]],[[0,0,368,513,1611]],[[0,0,368,513,1614]],[[0,0,368,513,1617]],[[0,0,368,513,1620]],[[0,0,368,513,1623]],[[0,0,368,513,1626]],[[0,0,368,513,1629]],[[0,0,368,513,1632]],[[0,0,368,513,1635]],[[0,0,368,513,1638]],[[0,0,368,513,1641]],[[0,0,368,513,1644]],[[0,0,368,513,1647]],[[0,0,368,513,1650]],[[0,0,368,513,1653]],[[0,0,368,513,1656]],[[0,0,368,513,1659]],[[0,0,368,513,1662]],[[0,0,368,513,1665]],[[0,0,368,513,1668]],[[0,0,368,513,1671]],[[0,0,368,513,1674]],[[0,0,368,513,1677]],[[0,0,368,513,1680]],[[0,0,368,513,1683]],[[0,0,368,513,1686]],[[0,0,368,513,1689]],[[0,0,368,513,1692]],[[0,0,368,513,1695]],[[0,0,368,513,1698]],[[0,0,368,513,1701]],[[0,0,368,513,1704]],[[0,0,368,513,1707]],[[0,0,368,513,1710]],[[0,0,368,513,1713]],[[0,0,368,513,1716]],[[0,0,368,513,1719]],[[0,0,368,513,1722]],[[0,0,368,513,1725]],[[0,0,368,513,1728]],[[0,0,368,513,1731]],[[0,0,368,513,1734]],[[0,0,368,513,1737]],[[0,0,368,513,1740]],[[0,0,368,513,1743]],[[0,0,368,513,1746]],[[0,0,368,513,1749]],[[0,0,368,513,1752]],[[0,0,368,513,1755]],[[0,0,368,513,1758]],[[0,0,368,513,1761]],[[0,0,368,513,1764]],[[0,0,368,513,1767]],[[0,0,368,513,1770]],[[0,0,368,513,1773]],[[0,0,368,513,1776]],[[0,0,368,513,1779]],[[0,0,368,513,1782]],[[0,0,368,513,1785]],[[0,0,368,513,1788]],[[0,0,368,513,1791]],[[0,0,368,513,1794]],[[0,0,368,513,1797]],[[0,0,368,513,1800]],[[0,0,368,513,1803]],[[0,0,368,513,1806]],[[0,0,368,513,1809]],[[0,0,368,513,1812]],[[0,0,368,513,1815]],[[0,0,368,513,1818]],[[0,0,368,513,1821]],[[0,0,368,513,1824]],[[0,0,368,513,1827]],[[0,0,368,513,1830]],[[0,0,368,513,1833]],[[0,0,368,513,1836]],[[0,0,368,513,1839]],[[0,0,368,513,1842]],[[0,0,368,513,1845]],[[0,0,368,513,1848]],[[0,0,368,513,1851]],[[0,0,368,513,1854]],[[0,0,368,513,1857]],[[0,0,368,513,1860]],[[0,0,368,513,1863]],[[0,0,368,513,1866]],[[0,0,368,513,1869]],[[0,0,368,513,1872]],[[0,0,368,513,1875]],[[0,0,368,513,1878]],[[0,0,368,513,1881]],[[0,0,368,513,1884]],[[0,0,368,513,1887]],[[0,0,368,513,1890]],[[0,0,368,513,1893]
],[[0,0,368,513,1896]],[[0,0,368,513,1899]],[[0,0,368,513,1902]],[[0,0,368,513,1905]],[[0,0,368,513,1908]],[[0,0,368,513,1911]],[[0,0,368,513,1914]],[[0,0,368,513,1917]],[[0,0,368,513,1920]],[[0,0,368,513,1923]],[[0,0,368,513,1926]],[[0,0,368,513,1929]],[[0,0,368,513,1932]],[[0,0,368,513,1935]],[[0,0,368,513,1938]],[[0,0,368,513,1941]],[[0,0,368,513,1944]],[[0,0,368,513,1947]],[[0,0,368,513,1950]],[[0,0,368,513,1953]],[[0,0,368,513,1956]],[[0,0,368,513,1959]],[[0,0,368,513,1962]],[[0,0,368,513,1965]],[[0,0,368,513,1968]],[[0,0,368,513,1971]],[[0,0,368,513,1974]],[[0,0,368,513,1977]],[[0,0,368,513,1980]],[[0,0,368,513,1983]],[[0,0,368,513,1986]],[[0,0,368,513,1989]],[[0,0,368,513,1992]],[[0,0,368,513,1995]],[[0,0,368,513,1998]],[[0,0,368,513,2001]],[[0,0,368,513,2004]],[[0,0,368,513,2007]],[[0,0,368,513,2010]],[[0,0,368,513,2013]],[[0,0,368,513,2016]],[[0,0,368,513,2019]],[[0,0,368,513,2022]],[[0,0,368,513,2025]],[[0,0,368,513,2028]],[[0,0,368,513,2031]],[[0,0,368,513,2034]],[[0,0,368,513,2037]],[[0,0,368,513,2040]],[[0,0,368,513,2043]],[[0,0,368,513,2046]],[[0,0,368,513,2049]],[[0,0,368,513,2052]],[[0,0,368,513,2055]],[[0,0,368,513,2058]],[[0,0,368,513,2061]],[[0,0,368,513,2064]],[[0,0,368,513,2067]],[[0,0,368,513,2070]],[[0,0,368,513,2073]],[[0,0,368,513,2076]],[[0,0,368,513,2079]],[[0,0,368,513,2082]],[[0,0,368,513,2085]],[[0,0,368,513,2088]],[[0,0,368,513,2091]],[[0,0,368,513,2094]],[[0,0,368,513,2097]],[[0,0,368,513,2100]],[[0,0,368,513,2103]],[[0,0,368,513,2106]],[[0,0,368,513,2109]],[[0,0,368,513,2112]],[[0,0,368,513,2115]],[[0,0,368,513,2118]],[[0,0,368,513,2121]],[[0,0,368,513,2124]],[[0,0,368,513,2127]],[[0,0,368,513,2130]],[[0,0,368,513,2133]],[[0,0,368,513,2136]],[[0,0,368,513,2139]],[[0,0,368,513,2142]],[[0,0,368,513,2145]],[[0,0,368,513,2148]],[[0,0,368,513,2151]],[[0,0,368,513,2154]],[[0,0,368,513,2157]],[[0,0,368,513,2160]],[[0,0,368,513,2163]],[[0,0,368,513,2166]],[[0,0,368,513,2169]],[[0,0,368,513,2172]],[[0,0,368,513,2175]],[[0,0,368,513,2178]],[[0,0,368,513,2181]],[[0,0,368,513,2184]],[[0,0,368,513,2187]],[[0,0,368,513,2190]],[[0,0,368,513,2193]],[[0,0,368,513,2196]],[[0,0,368,513,2199]],[[0,0,368,513,2202]],[[0,0,368,513,2205]],[[0,0,368,513,2208]],[[0,0,368,513,2211]],[[0,0,368,513,2214]],[[0,0,368,513,2217]],[[0,0,368,513,2220]],[[0,0,368,513,2223]],[[0,0,368,513,2226]],[[0,0,368,513,2229]],[[0,0,368,513,2232]],[[0,0,368,513,2235]],[[0,0,368,513,2238]],[[0,0,368,513,2241]],[[0,0,368,513,2244]],[[0,0,368,513,2247]],[[0,0,368,513,2250]],[[0,0,368,513,2253]],[[0,0,368,513,2256]],[[0,0,368,513,2259]],[[0,0,368,513,2262]],[[0,0,368,513,2265]],[[0,0,368,513,2268]],[[0,0,368,513,2271]],[[0,0,368,513,2274]],[[0,0,368,513,2277]],[[0,0,368,513,2280]],[[0,0,368,513,2283]],[[0,0,368,513,2286]],[[0,0,368,513,2289]],[[0,0,368,513,2292]],[[0,0,368,513,2295]],[[0,0,368,513,2298]],[[0,0,368,513,2301]],[[0,0,368,513,2304]],[[0,0,368,513,2307]],[[0,0,368,513,2310]],[[0,0,368,513,2313]],[[0,0,368,513,2316]],[[0,0,368,513,2319]],[[0,0,368,513,2322]],[[0,0,368,513,2325]],[[0,0,368,513,2328]],[[0,0,368,513,2331]],[[0,0,368,513,2334]],[[0,0,368,513,2337]],[[0,0,368,513,2340]],[[0,0,368,513,2343]],[[0,0,368,513,2346]],[[0,0,368,513,2349]],[[0,0,368,513,2352]],[[0,0,368,513,2355]],[[0,0,368,513,2358]],[[0,0,368,513,2361]],[[0,0,368,513,2364]],[[0,0,368,513,2367]],[[0,0,368,513,2370]],[[0,0,368,513,2373]],[[0,0,368,513,2376]],[[0,0,368,513,2379]],[[0,0,368,513,2382]],[[0,0,368,513,2385]],[[0,0,368,513,2388]],[[0,0,368,513,2391]],[[0,0,368,513,2394]],[[0,0,368,513,2397]],[[0,0,368,513,2400]],[[0,
0,368,513,2403]],[[0,0,368,513,2406]],[[0,0,368,513,2409]],[[0,0,368,513,2412]],[[0,0,368,513,2415]],[[0,0,368,513,2418]],[[0,0,368,513,2421]],[[0,0,368,513,2424]],[[0,0,368,513,2427]],[[0,0,368,513,2430]],[[0,0,368,513,2433]],[[0,0,368,513,2436]],[[0,0,368,513,2439]],[[0,0,368,513,2442]],[[0,0,368,513,2445]],[[0,0,368,513,2448]],[[0,0,368,513,2451]],[[0,0,368,513,2454]],[[0,0,368,513,2457]],[[0,0,368,513,2460]],[[0,0,368,513,2463]],[[0,0,368,513,2466]],[[0,0,368,513,2469]],[[0,0,368,513,2472]],[[0,0,368,513,2475]],[[0,0,368,513,2478]],[[0,0,368,513,2481]],[[0,0,368,513,2484]],[[0,0,368,513,2487]],[[0,0,368,513,2490]],[[0,0,368,513,2493]],[[0,0,368,513,2496]],[[0,0,368,513,2499]],[[0,0,368,513,2502]],[[0,0,368,513,2505]],[[0,0,368,513,2508]],[[0,0,368,513,2511]],[[0,0,368,513,2514]],[[0,0,368,513,2517]],[[0,0,368,513,2520]],[[0,0,368,513,2523]],[[0,0,368,513,2526]],[[0,0,368,513,2529]],[[0,0,368,513,2532]],[[0,0,368,513,2535]],[[0,0,368,513,2538]],[[0,0,368,513,2541]],[[0,0,368,513,2544]],[[0,0,368,513,2547]],[[0,0,368,513,2550]],[[0,0,368,513,2553]],[[0,0,368,513,2556]],[[0,0,368,513,2559]],[[0,0,368,513,2562]],[[0,0,368,513,2565]],[[0,0,368,513,2568]],[[0,0,368,513,2571]],[[0,0,368,513,2574]],[[0,0,368,513,2577]],[[0,0,368,513,2580]],[[0,0,368,513,2583]],[[0,0,368,513,2586]],[[0,0,368,513,2589]],[[0,0,368,513,2592]],[[0,0,368,513,2595]],[[0,0,368,513,2598]],[[0,0,368,513,2601]],[[0,0,368,513,2604]],[[0,0,368,513,2607]],[[0,0,368,513,2610]],[[0,0,368,513,2613]],[[0,0,368,513,2616]],[[0,0,368,513,2619]],[[0,0,368,513,2622]],[[0,0,368,513,2625]],[[0,0,368,513,2628]],[[0,0,368,513,2631]],[[0,0,368,513,2634]],[[0,0,368,513,2637]],[[0,0,368,513,2640]],[[0,0,368,513,2643]],[[0,0,368,513,2646]],[[0,0,368,513,2649]],[[0,0,368,513,2652]],[[0,0,368,513,2655]],[[0,0,368,513,2658]],[[0,0,368,513,2661]],[[0,0,368,513,2664]],[[0,0,368,513,2667]],[[0,0,368,513,2670]],[[0,0,368,513,2673]],[[0,0,368,513,2676]],[[0,0,368,513,2679]],[[0,0,368,513,2682]],[[0,0,368,513,2685]],[[0,0,368,513,2688]],[[0,0,368,513,2691]],[[0,0,368,513,2694]],[[0,0,368,513,2697]],[[0,0,368,513,2700]],[[0,0,368,513,2703]],[[0,0,368,513,2706]],[[0,0,368,513,2709]],[[0,0,368,513,2712]],[[0,0,368,513,2715]],[[0,0,368,513,2718]],[[0,0,368,513,2721]],[[0,0,368,513,2724]],[[0,0,368,513,2727]],[[0,0,368,513,2730]],[[0,0,368,513,2733]],[[0,0,368,513,2736]],[[0,0,368,513,2739]],[[0,0,368,513,2742]],[[0,0,368,513,2745]],[[0,0,368,513,2748]],[[0,0,368,513,2751]],[[0,0,368,513,2754]],[[0,0,368,513,2757]],[[0,0,368,513,2760]],[[0,0,368,513,2763]],[[0,0,368,513,2766]],[[0,0,368,513,2769]],[[0,0,368,513,2772]],[[0,0,368,513,2775]],[[0,0,368,513,2778]],[[0,0,368,513,2781]],[[0,0,368,513,2784]],[[0,0,368,513,2787]],[[0,0,368,513,2790]],[[0,0,368,513,2793]],[[0,0,368,513,2796]]],"text_len_per_page":[53,53,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,
54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54,54],"metadata":{"format":"PDF 1.6","title":"","author":"","subject":"","keywords":"","creator":"Adobe Acrobat 7.0","producer":"Adobe Acrobat 7.0 Image Conversion Plug-in","creationDate":"D:20080404141457+01'00'","modDate":"D:20080404144821+01'00'","trapped":"","encryption":null}}
    # o = json.loads(json.dumps(o))
    # total_page = o["total_page"]
    # page_width = o["page_width_pts"]
    # page_height = o["page_height_pts"]
    # img_sz_list = o["image_info_per_page"]
    # text_len_list = o['text_len_per_page']
    # pdf_path = o['pdf_path']
    # is_encrypted = o['is_encrypted']
    # is_needs_password = o['is_needs_password']
    # if is_encrypted or total_page == 0 or is_needs_password:  # skip encrypted, password-protected, and zero-page pdfs
    #     print("encrypted")
    #     exit(0)
    # tag = classify(pdf_path, total_page, page_width, page_height, img_sz_list, text_len_list)
    # o['is_text_pdf'] = tag
    # print(json.dumps(o, ensure_ascii=False))
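A toy invocation of classify with fabricated per-page statistics; real inputs come from pdf_meta_scan's output:

    from magic_pdf.filter.pdf_classify_by_type import classify

    total_page = 3
    page_w, page_h = 595, 842                    # A4 in pts
    img_sz = [[], [], [[0, 0, 100, 80, 7]]]      # (x0, y0, x1, y1, objid) per page
    text_len = [1200, 900, 1100]                 # characters per page
    img_num = [0, 0, 1]                          # raw image counts per page
    layout = ['horizontal'] * 3
    is_text_pdf, votes = classify(total_page, page_w, page_h, img_sz, text_len,
                                  img_num, layout, invalid_chars=True)
    print(is_text_pdf)  # -> True: every rule votes text-based for these numbers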
magic_pdf/filter/pdf_meta_scan.py ADDED
@@ -0,0 +1,388 @@
1
+ """
2
+ 输入: s3路径,每行一个
3
+ 输出: pdf文件元信息,包括每一页上的所有图片的长宽高,bbox位置
4
+ """
5
+ import sys
6
+ import click
7
+
8
+ from magic_pdf.libs.commons import read_file, mymax, get_top_percent_list
9
+ from magic_pdf.libs.commons import fitz
10
+ from loguru import logger
11
+ from collections import Counter
12
+
13
+ from magic_pdf.libs.drop_reason import DropReason
14
+ from magic_pdf.libs.language import detect_lang
15
+ from magic_pdf.libs.pdf_check import detect_invalid_chars
16
+
17
+ scan_max_page = 50
18
+ junk_limit_min = 10
19
+
20
+
21
+ def calculate_max_image_area_per_page(result: list, page_width_pts, page_height_pts):
22
    max_image_area_per_page = [mymax([(x1 - x0) * (y1 - y0) for x0, y0, x1, y1, _ in page_img_sz]) for page_img_sz in
                               result]
    page_area = int(page_width_pts) * int(page_height_pts)
    max_image_area_per_page = [area / page_area for area in max_image_area_per_page]
    max_image_area_per_page = [area for area in max_image_area_per_page if area > 0.6]
    return max_image_area_per_page


def process_image(page, junk_img_bojids=[]):
    page_result = []  # collect the bbox quadruples of every image on this page
    items = page.get_images()
    dedup = set()
    for img in items:
        # get_image_rects returns the size at which the image is actually displayed on the page
        img_bojid = img[0]  # globally unique within the pdf; an image repeated across pages is likely junk, e.g. a watermark or a header/footer
        if img_bojid in junk_img_bojids:  # skip junk images
            continue
        recs = page.get_image_rects(img, transform=True)
        if recs:
            rec = recs[0][0]
            x0, y0, x1, y1 = map(int, rec)
            width = x1 - x0
            height = y1 - y0
            if (x0, y0, x1, y1, img_bojid) in dedup:  # duplicate bboxes occur here; keep only one
                continue
            if not all([width, height]):  # neither width nor height may be 0, otherwise the image is invisible and meaningless
                continue
            dedup.add((x0, y0, x1, y1, img_bojid))
            page_result.append([x0, y0, x1, y1, img_bojid])
    return page_result


def get_image_info(doc: fitz.Document, page_width_pts, page_height_pts) -> list:
    """
    Return the image bbox quadruples of every page; a page may hold several images.
    :param doc:
    :return:
    """
    # count the occurrences of each img_bojid with a Counter
    img_bojid_counter = Counter(img[0] for page in doc for img in page.get_images())
    # find the img_bojids that appear on more than half of the pages

    junk_limit = max(len(doc) * 0.5, junk_limit_min)  # exemption for documents with very few pages

    junk_img_bojids = [img_bojid for img_bojid, count in img_bojid_counter.items() if count >= junk_limit]

    # TODO add a check using only the first ten pages; junk images must both appear often enough and cover a large share of the page, with all of them roughly the same size
    # There are two kinds of scanned pdfs and one kind of text pdf, so misclassification is possible here:
    # scanned pdf 1: every page embeds all scanned page images; large coverage, one image displayed per page
    # scanned pdf 2: the number of stored scanned images grows page by page; large coverage, one image displayed per page; the junk list must be cleared and the first 50 pages rescanned for classification
    # text pdf 1: every page stores all images; images cover little of the page, and a page may display zero or more than one image. Such pdfs need a sample of the first 10 pages to check image size and count, and the junk list is cleared if they match
    imgs_len_list = [len(page.get_images()) for page in doc]

    special_limit_pages = 10

    # use the first ten pages for all the checks below
    result = []
    break_loop = False
    for i, page in enumerate(doc):
        if break_loop:
            break
        if i >= special_limit_pages:
            break
        page_result = process_image(page)  # junk_img_bojids is deliberately not passed; every image of the first ten pages is needed for the analysis below
        result.append(page_result)
    for item in result:
        if not any(item):  # some page has no image, so this is a text pdf; check whether it is the special text-pdf case
            if max(imgs_len_list) == min(imgs_len_list) and max(
                    imgs_len_list) >= junk_limit_min:  # the special text pdf: clear the junk list and break
                junk_img_bojids = []
            else:  # an ordinary text pdf that does contain junk images; keep the junk list
                pass
            break_loop = True
            break
    if not break_loop:
        # take the first 80% of the elements
        top_eighty_percent = get_top_percent_list(imgs_len_list, 0.8)
        # check whether that first 80% are all equal
        if len(set(top_eighty_percent)) == 1 and max(imgs_len_list) >= junk_limit_min:

            # # if all of the first 10 pages carry images, decide from the per-page image counts whether the junk list must be cleared
            # if max(imgs_len_list) == min(imgs_len_list) and max(imgs_len_list) >= junk_limit_min:

            # the first 10 pages all carry images with equal counts, so check how much of the page the images cover to decide whether to clear the junk list
            max_image_area_per_page = calculate_max_image_area_per_page(result, page_width_pts, page_height_pts)
            if len(max_image_area_per_page) < 0.8 * special_limit_pages:  # not all of the first 10 pages are dominated by large images, so this is probably a text pdf; clear the junk list
                junk_img_bojids = []
            else:  # the first 10 pages all carry images, 80% of them are large, and the per-page counts are equal and high: scanned pdf 1, keep the junk list
                pass
        else:  # per-page image counts differ; clear the junk list and rescan the first 50 pages in full
            junk_img_bojids = []

    # now collect the image info of the first 50 pages for real
    result = []
    for i, page in enumerate(doc):
        if i >= scan_max_page:
            break
        page_result = process_image(page, junk_img_bojids)
        # logger.info(f"page {i} img_len: {len(page_result)}")
        result.append(page_result)

    return result, junk_img_bojids

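
# Illustrative sketch of the junk-image rule above (not part of the pipeline; the xref
# numbers and the per-page lists are made up). An image xref counts as junk once it
# appears on at least max(0.5 * pages, junk_limit_min) pages:
def _example_junk_image_threshold():
    from collections import Counter
    junk_limit_min_demo = 10  # stand-in for the module-level junk_limit_min
    pages_imgs = [[7, 8]] * 12 + [[9]] * 3  # xrefs 7 and 8 appear on 12 of 15 pages
    counter = Counter(xref for page_imgs in pages_imgs for xref in page_imgs)
    junk_limit = max(len(pages_imgs) * 0.5, junk_limit_min_demo)
    junk = [xref for xref, cnt in counter.items() if cnt >= junk_limit]
    assert junk == [7, 8]  # 12 >= max(7.5, 10)
    return junk
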

def get_pdf_page_size_pts(doc: fitz.Document):
    page_cnt = len(doc)
    l: int = min(page_cnt, 50)
    # push all widths and heights into two lists and take the median of each
    # (one pdf mixed landscape pages into a portrait document, swapping width and height)
    page_width_list = []
    page_height_list = []
    for i in range(l):
        page = doc[i]
        page_rect = page.rect
        page_width_list.append(page_rect.width)
        page_height_list.append(page_rect.height)

    page_width_list.sort()
    page_height_list.sort()

    median_width = page_width_list[len(page_width_list) // 2]
    median_height = page_height_list[len(page_height_list) // 2]

    return median_width, median_height

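
# A minimal equivalence sketch for the median above (illustrative only): sorting and
# indexing at len // 2 picks the same value statistics.median_high returns, for both
# even and odd list lengths.
def _example_median_page_width():
    import statistics
    widths = [595.0, 595.0, 842.0, 595.0]  # one landscape page inside a portrait pdf
    widths.sort()
    assert widths[len(widths) // 2] == statistics.median_high(widths) == 595.0
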

def get_pdf_textlen_per_page(doc: fitz.Document):
    text_len_lst = []
    for page in doc:
        # all blocks, including img and text:
        # text_block = page.get_text("blocks")
        # text-only blocks:
        # text_block = page.get_text("words")
        # text_block_len = sum([len(t[4]) for t in text_block])
        # the full text as one str:
        text_block = page.get_text("text")
        text_block_len = len(text_block)
        # logger.info(f"page {page.number} text_block_len: {text_block_len}")
        text_len_lst.append(text_block_len)

    return text_len_lst


def get_pdf_text_layout_per_page(doc: fitz.Document):
    """
    Classify the text layout of every page of a PDF document as horizontal, vertical or unknown.

    Args:
        doc (fitz.Document): the PDF document object.

    Returns:
        List[str]: the text layout of each page ("horizontal", "vertical", "unknow").

    """
    text_layout_list = []

    for page_id, page in enumerate(doc):
        if page_id >= scan_max_page:
            break
        # per-page counters for vertical and horizontal text lines
        vertical_count = 0
        horizontal_count = 0
        text_dict = page.get_text("dict")
        if "blocks" in text_dict:
            for block in text_dict["blocks"]:
                if 'lines' in block:
                    for line in block["lines"]:
                        # corner coordinates of the line's bbox
                        x0, y0, x1, y1 = line['bbox']
                        # width and height of the bbox
                        width = x1 - x0
                        height = y1 - y0
                        # area of the bbox
                        area = width * height
                        font_sizes = []
                        for span in line['spans']:
                            if 'size' in span:
                                font_sizes.append(span['size'])
                        if len(font_sizes) > 0:
                            average_font_size = sum(font_sizes) / len(font_sizes)
                        else:
                            average_font_size = 10  # some lines carry no font_size; fall back to a default of 10
                        if area <= average_font_size ** 2:  # a bbox no larger than the square of the average font size is a single character, whose direction cannot be determined
                            continue
                        else:
                            if 'wmode' in line:  # infer the text direction from wmode
                                if line['wmode'] == 1:  # vertical text
                                    vertical_count += 1
                                elif line['wmode'] == 0:  # horizontal text
                                    horizontal_count += 1
                            # if 'dir' in line:  # infer the direction from the rotation angle instead
                            #     # read the "dir" value of the line
                            #     dir_value = line['dir']
                            #     cosine, sine = dir_value
                            #     # compute the angle
                            #     angle = math.degrees(math.acos(cosine))
                            #
                            #     # horizontal text
                            #     if abs(angle - 0) < 0.01 or abs(angle - 180) < 0.01:
                            #         # line_text = ' '.join(span['text'] for span in line['spans'])
                            #         # print('This line is horizontal:', line_text)
                            #         horizontal_count += 1
                            #     # vertical text
                            #     elif abs(angle - 90) < 0.01 or abs(angle - 270) < 0.01:
                            #         # line_text = ' '.join(span['text'] for span in line['spans'])
                            #         # print('This line is vertical:', line_text)
                            #         vertical_count += 1
        # print(f"page_id: {page_id}, vertical_count: {vertical_count}, horizontal_count: {horizontal_count}")
        # classify the layout of this page
        if vertical_count == 0 and horizontal_count == 0:  # no text on the page, nothing to classify
            text_layout_list.append("unknow")
            continue
        else:
            if vertical_count > horizontal_count:  # more vertical lines than horizontal ones
                text_layout_list.append("vertical")
            else:  # more horizontal lines than vertical ones
                text_layout_list.append("horizontal")
        # logger.info(f"page_id: {page_id}, vertical_count: {vertical_count}, horizontal_count: {horizontal_count}")
    return text_layout_list

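
# Illustrative sketch of the wmode vote above on a hand-built page dict (the structure
# mirrors what page.get_text("dict") returns in PyMuPDF; the values are made up):
def _example_wmode_vote():
    text_dict = {"blocks": [{"lines": [
        {"bbox": (0, 0, 200, 20), "wmode": 0, "spans": [{"size": 10}]},
        {"bbox": (0, 30, 200, 50), "wmode": 0, "spans": [{"size": 10}]},
        {"bbox": (300, 0, 320, 200), "wmode": 1, "spans": [{"size": 10}]},
    ]}]}
    vertical = sum(1 for b in text_dict["blocks"] for ln in b["lines"] if ln["wmode"] == 1)
    horizontal = sum(1 for b in text_dict["blocks"] for ln in b["lines"] if ln["wmode"] == 0)
    return "vertical" if vertical > horizontal else "horizontal"  # -> "horizontal"
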

'''a custom exception raised for pdfs with too many svgs on a single page'''


class PageSvgsTooManyError(Exception):
    def __init__(self, message="Page SVGs are too many"):
        self.message = message
        super().__init__(self.message)


def get_svgs_per_page(doc: fitz.Document):
    svgs_len_list = []
    for page_id, page in enumerate(doc):
        # svgs = page.get_drawings()
        svgs = page.get_cdrawings()  # switched to get_cdrawings, which is faster
        len_svgs = len(svgs)
        if len_svgs >= 3000:
            raise PageSvgsTooManyError()
        else:
            svgs_len_list.append(len_svgs)
        # logger.info(f"page_id: {page_id}, svgs_len: {len(svgs)}")
    return svgs_len_list

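
def _example_svg_guard(doc: fitz.Document):
    # Usage sketch (illustrative, not called by the pipeline): catch the custom
    # exception above so a drawing-heavy pdf becomes a classification signal
    # rather than a crash.
    try:
        return get_svgs_per_page(doc)
    except PageSvgsTooManyError:
        return None  # the caller may then drop or reroute the pdf
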

def get_imgs_per_page(doc: fitz.Document):
    imgs_len_list = []
    for page_id, page in enumerate(doc):
        imgs = page.get_images()
        imgs_len_list.append(len(imgs))
        # logger.info(f"page_id: {page}, imgs_len: {len(imgs)}")

    return imgs_len_list


def get_language(doc: fitz.Document):
    """
    Detect the language of a PDF document.
    Args:
        doc (fitz.Document): the PDF document object.
    Returns:
        str: the document language, e.g. "en-US".
    """
    language_lst = []
    for page_id, page in enumerate(doc):
        if page_id >= scan_max_page:
            break
        # the full text of the page as one str
        text_block = page.get_text("text")
        page_language = detect_lang(text_block)
        language_lst.append(page_language)

        # logger.info(f"page_id: {page_id}, page_language: {page_language}")

    # count how often each language occurs in language_lst
    count_dict = Counter(language_lst)
    # return the most frequent language in language_lst
    language = max(count_dict, key=count_dict.get)
    return language


def check_invalid_chars(pdf_bytes):
    """
    Detect garbled characters.
    """
    return detect_invalid_chars(pdf_bytes)


def pdf_meta_scan(pdf_bytes: bytes):
    """
    :param pdf_bytes: the raw bytes of the pdf file
    Evaluated dimensions: encryption, password protection, paper size, total page count, and whether the text is extractable.
    """
    doc = fitz.open("pdf", pdf_bytes)
    is_needs_password = doc.needs_pass
    is_encrypted = doc.is_encrypted
    total_page = len(doc)
    if total_page == 0:
        logger.warning(f"drop this pdf, drop_reason: {DropReason.EMPTY_PDF}")
        result = {"_need_drop": True, "_drop_reason": DropReason.EMPTY_PDF}
        return result
    else:
        page_width_pts, page_height_pts = get_pdf_page_size_pts(doc)
        # logger.info(f"page_width_pts: {page_width_pts}, page_height_pts: {page_height_pts}")

        # svgs_per_page = get_svgs_per_page(doc)
        # logger.info(f"svgs_per_page: {svgs_per_page}")
        imgs_per_page = get_imgs_per_page(doc)
        # logger.info(f"imgs_per_page: {imgs_per_page}")

        image_info_per_page, junk_img_bojids = get_image_info(doc, page_width_pts, page_height_pts)
        # logger.info(f"image_info_per_page: {image_info_per_page}, junk_img_bojids: {junk_img_bojids}")
        text_len_per_page = get_pdf_textlen_per_page(doc)
        # logger.info(f"text_len_per_page: {text_len_per_page}")
        text_layout_per_page = get_pdf_text_layout_per_page(doc)
        # logger.info(f"text_layout_per_page: {text_layout_per_page}")
        text_language = get_language(doc)
        # logger.info(f"text_language: {text_language}")
        invalid_chars = check_invalid_chars(pdf_bytes)
        # logger.info(f"invalid_chars: {invalid_chars}")

        # finally emit one json record
        res = {
            "is_needs_password": is_needs_password,
            "is_encrypted": is_encrypted,
            "total_page": total_page,
            "page_width_pts": int(page_width_pts),
            "page_height_pts": int(page_height_pts),
            "image_info_per_page": image_info_per_page,
            "text_len_per_page": text_len_per_page,
            "text_layout_per_page": text_layout_per_page,
            "text_language": text_language,
            # "svgs_per_page": svgs_per_page,
            "imgs_per_page": imgs_per_page,  # per-page image-count list
            "junk_img_bojids": junk_img_bojids,  # list of junk-image bojids
            "invalid_chars": invalid_chars,
            "metadata": doc.metadata
        }
        # logger.info(json.dumps(res, ensure_ascii=False))
        return res

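
def _example_local_meta_scan():
    # Minimal local usage sketch for pdf_meta_scan (illustrative; "example.pdf" is a
    # hypothetical path, and only a few of the returned keys are printed):
    from pathlib import Path
    pdf_bytes = Path("example.pdf").read_bytes()
    meta = pdf_meta_scan(pdf_bytes)
    if not meta.get("_need_drop"):
        print(meta["total_page"], meta["text_language"], meta["text_layout_per_page"][:5])
    return meta
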

@click.command()
@click.option('--s3-pdf-path', help='path of the pdf file on s3')
@click.option('--s3-profile', help='s3 profile to use')
def main(s3_pdf_path: str, s3_profile: str):
    """
    Read a pdf from s3 and run the meta scan on it.
    """
    try:
        file_content = read_file(s3_pdf_path, s3_profile)
        pdf_meta_scan(file_content)
    except Exception as e:
        print(f"ERROR: {s3_pdf_path}, {e}", file=sys.stderr)
        logger.exception(e)


if __name__ == '__main__':
    main()
    # "D:\project/20231108code-clean\pdf_cost_time\竖排例子\净空法师-大乘无量寿.pdf"
    # "D:\project/20231108code-clean\pdf_cost_time\竖排例子\三国演义_繁体竖排版.pdf"
    # "D:\project/20231108code-clean\pdf_cost_time\scihub\scihub_86800000\libgen.scimag86880000-86880999.zip_10.1021/acsami.1c03109.s002.pdf"
    # "D:/project/20231108code-clean/pdf_cost_time/scihub/scihub_18600000/libgen.scimag18645000-18645999.zip_10.1021/om3006239.pdf"
    # file_content = read_file("D:/project/20231108code-clean/pdf_cost_time/scihub/scihub_31000000/libgen.scimag31098000-31098999.zip_10.1109/isit.2006.261791.pdf","")
    # file_content = read_file("D:\project/20231108code-clean\pdf_cost_time\竖排例子\净空法师_大乘无量寿.pdf","")
    # doc = fitz.open("pdf", file_content)
    # text_layout_lst = get_pdf_text_layout_per_page(doc)
    # print(text_layout_lst)
magic_pdf/layout/__init__.py ADDED
File without changes
magic_pdf/layout/bbox_sort.py ADDED
@@ -0,0 +1,681 @@
# A bbox here is a list [x0, y0, x1, y1, block_content, idx_x, idx_y, content_type, ext_x0, ext_y0, ext_x1, ext_y1]; idx_x and idx_y start out as None.
# x0, y0 is the top-left corner and x1, y1 the bottom-right corner; the origin is at the top left.


from magic_pdf.layout.layout_spiler_recog import get_spilter_of_page
from magic_pdf.libs.boxbase import _is_in, _is_in_or_part_overlap, _is_vertical_full_overlap
from magic_pdf.libs.commons import mymax

X0_IDX = 0
Y0_IDX = 1
X1_IDX = 2
Y1_IDX = 3
CONTENT_IDX = 4
IDX_X = 5
IDX_Y = 6
CONTENT_TYPE_IDX = 7

X0_EXT_IDX = 8
Y0_EXT_IDX = 9
X1_EXT_IDX = 10
Y1_EXT_IDX = 11

+ def prepare_bboxes_for_layout_split(image_info, image_backup_info, table_info, inline_eq_info, interline_eq_info, text_raw_blocks: dict, page_boundry, page):
26
+ """
27
+ text_raw_blocks:结构参考test/assets/papre/pymu_textblocks.json
28
+ 把bbox重新组装成一个list,每个元素[x0, y0, x1, y1, block_content, idx_x, idx_y, content_type, ext_x0, ext_y0, ext_x1, ext_y1], 初始时候idx_x, idx_y都是None. 对于图片、公式来说,block_content是图片的地址, 对于段落来说,block_content是pymupdf里的block结构
29
+ """
30
+ all_bboxes = []
31
+
32
+ for image in image_info:
33
+ box = image['bbox']
34
+ # 由于没有实现横向的栏切分,因此在这里先过滤掉一些小的图片。这些图片有可能影响layout,造成没有横向栏切分的情况下,layout切分不准确。例如 scihub_76500000/libgen.scimag76570000-76570999.zip_10.1186/s13287-019-1355-1
35
+ # 把长宽都小于50的去掉
36
+ if abs(box[0]-box[2]) < 50 and abs(box[1]-box[3]) < 50:
37
+ continue
38
+ all_bboxes.append([box[0], box[1], box[2], box[3], None, None, None, 'image', None, None, None, None])
39
+
40
+ for table in table_info:
41
+ box = table['bbox']
42
+ all_bboxes.append([box[0], box[1], box[2], box[3], None, None, None, 'table', None, None, None, None])
43
+
44
+ """由于公式与段落混合,因此公式不再参与layout划分,无需加入all_bboxes"""
45
+ # 加入文本block
46
+ text_block_temp = []
47
+ for block in text_raw_blocks:
48
+ bbox = block['bbox']
49
+ text_block_temp.append([bbox[0], bbox[1], bbox[2], bbox[3], None, None, None, 'text', None, None, None, None])
50
+
51
+ text_block_new = resolve_bbox_overlap_for_layout_det(text_block_temp)
52
+ text_block_new = filter_lines_bbox(text_block_new) # 去掉线条bbox,有可能让layout探测陷入无限循环
53
+
54
+
55
+ """找出会影响layout的色块、横向分割线"""
56
+ spilter_bboxes = get_spilter_of_page(page, [b['bbox'] for b in image_info]+[b['bbox'] for b in image_backup_info], [b['bbox'] for b in table_info], )
57
+ # 还要去掉存在于spilter_bboxes里的text_block
58
+ if len(spilter_bboxes) > 0:
59
+ text_block_new = [box for box in text_block_new if not any([_is_in_or_part_overlap(box[:4], spilter_bbox) for spilter_bbox in spilter_bboxes])]
60
+
61
+ for bbox in text_block_new:
62
+ all_bboxes.append([bbox[0], bbox[1], bbox[2], bbox[3], None, None, None, 'text', None, None, None, None])
63
+
64
+ for bbox in spilter_bboxes:
65
+ all_bboxes.append([bbox[0], bbox[1], bbox[2], bbox[3], None, None, None, 'spilter', None, None, None, None])
66
+
67
+
68
+ return all_bboxes
69
+
70
+ def resolve_bbox_overlap_for_layout_det(bboxes:list):
71
+ """
72
+ 1. 去掉bbox互相包含的,去掉被包含的
73
+ 2. 上下方向上如果有重叠,就扩大大box范围,直到覆盖小box
74
+ """
75
+ def _is_in_other_bbox(i:int):
76
+ """
77
+ 判断i个box是否被其他box有所包含
78
+ """
79
+ for j in range(0, len(bboxes)):
80
+ if j!=i and _is_in(bboxes[i][:4], bboxes[j][:4]):
81
+ return True
82
+ # elif j!=i and _is_bottom_full_overlap(bboxes[i][:4], bboxes[j][:4]):
83
+ # return True
84
+
85
+ return False
86
+
87
+ # 首先去掉被包含的bbox
88
+ new_bbox_1 = []
89
+ for i in range(0, len(bboxes)):
90
+ if not _is_in_other_bbox(i):
91
+ new_bbox_1.append(bboxes[i])
92
+
93
+ # 其次扩展大的box
94
+ new_box = []
95
+ new_bbox_2 = []
96
+ len_1 = len(new_bbox_2)
97
+ while True:
98
+ merged_idx = []
99
+ for i in range(0, len(new_bbox_1)):
100
+ if i in merged_idx:
101
+ continue
102
+ for j in range(i+1, len(new_bbox_1)):
103
+ if j in merged_idx:
104
+ continue
105
+ bx1 = new_bbox_1[i]
106
+ bx2 = new_bbox_1[j]
107
+ if i!=j and _is_vertical_full_overlap(bx1[:4], bx2[:4]):
108
+ merged_box = min([bx1[0], bx2[0]]), min([bx1[1], bx2[1]]), max([bx1[2], bx2[2]]), max([bx1[3], bx2[3]])
109
+ new_bbox_2.append(merged_box)
110
+ merged_idx.append(i)
111
+ merged_idx.append(j)
112
+
113
+ for i in range(0, len(new_bbox_1)): # 没有合并的加入进来
114
+ if i not in merged_idx:
115
+ new_bbox_2.append(new_bbox_1[i])
116
+
117
+ if len(new_bbox_2)==0 or len_1==len(new_bbox_2):
118
+ break
119
+ else:
120
+ len_1 = len(new_bbox_2)
121
+ new_box = new_bbox_2
122
+ new_bbox_1, new_bbox_2 = new_bbox_2, []
123
+
124
+ return new_box
125
+
126
+
127
+ def filter_lines_bbox(bboxes: list):
128
+ """
129
+ 过滤掉bbox为空的行
130
+ """
131
+ new_box = []
132
+ for box in bboxes:
133
+ x0, y0, x1, y1 = box[0], box[1], box[2], box[3]
134
+ if abs(x0-x1)<=1 or abs(y0-y1)<=1:
135
+ continue
136
+ else:
137
+ new_box.append(box)
138
+ return new_box
139
+
140
+
141
+ ################################################################################
142
+ # 第一种排序算法
143
+ # 以下是基于延长线遮挡做的一个算法
144
+ #
145
+ ################################################################################
146
+ def find_all_left_bbox(this_bbox, all_bboxes) -> list:
147
+ """
148
+ 寻找this_bbox左边的所有bbox
149
+ """
150
+ left_boxes = [box for box in all_bboxes if box[X1_IDX] <= this_bbox[X0_IDX]]
151
+ return left_boxes
152
+
153
+
154
+ def find_all_top_bbox(this_bbox, all_bboxes) -> list:
155
+ """
156
+ 寻找this_bbox上面的所有bbox
157
+ """
158
+ top_boxes = [box for box in all_bboxes if box[Y1_IDX] <= this_bbox[Y0_IDX]]
159
+ return top_boxes
160
+
161
+
162
+ def get_and_set_idx_x(this_bbox, all_bboxes) -> int:
163
+ """
164
+ 寻找this_bbox在all_bboxes中的遮挡深度 idx_x
165
+ """
166
+ if this_bbox[IDX_X] is not None:
167
+ return this_bbox[IDX_X]
168
+ else:
169
+ all_left_bboxes = find_all_left_bbox(this_bbox, all_bboxes)
170
+ if len(all_left_bboxes) == 0:
171
+ this_bbox[IDX_X] = 0
172
+ else:
173
+ all_left_bboxes_idx = [get_and_set_idx_x(bbox, all_bboxes) for bbox in all_left_bboxes]
174
+ max_idx_x = mymax(all_left_bboxes_idx)
175
+ this_bbox[IDX_X] = max_idx_x + 1
176
+ return this_bbox[IDX_X]
177
+
178
+
179
+ def get_and_set_idx_y(this_bbox, all_bboxes) -> int:
180
+ """
181
+ 寻找this_bbox在all_bboxes中y方向的遮挡深度 idx_y
182
+ """
183
+ if this_bbox[IDX_Y] is not None:
184
+ return this_bbox[IDX_Y]
185
+ else:
186
+ all_top_bboxes = find_all_top_bbox(this_bbox, all_bboxes)
187
+ if len(all_top_bboxes) == 0:
188
+ this_bbox[IDX_Y] = 0
189
+ else:
190
+ all_top_bboxes_idx = [get_and_set_idx_y(bbox, all_bboxes) for bbox in all_top_bboxes]
191
+ max_idx_y = mymax(all_top_bboxes_idx)
192
+ this_bbox[IDX_Y] = max_idx_y + 1
193
+ return this_bbox[IDX_Y]
194
+
195
+
196
+ def bbox_sort(all_bboxes: list):
197
+ """
198
+ 排序
199
+ """
200
+ all_bboxes_idx_x = [get_and_set_idx_x(bbox, all_bboxes) for bbox in all_bboxes]
201
+ all_bboxes_idx_y = [get_and_set_idx_y(bbox, all_bboxes) for bbox in all_bboxes]
202
+ all_bboxes_idx = [(idx_x, idx_y) for idx_x, idx_y in zip(all_bboxes_idx_x, all_bboxes_idx_y)]
203
+
204
+ all_bboxes_idx = [idx_x_y[0] * 100000 + idx_x_y[1] for idx_x_y in all_bboxes_idx] # 变换成一个点,保证能够先X,X相同时按Y排序
205
+ all_bboxes_idx = list(zip(all_bboxes_idx, all_bboxes))
206
+ all_bboxes_idx.sort(key=lambda x: x[0])
207
+ sorted_bboxes = [bbox for idx, bbox in all_bboxes_idx]
208
+ return sorted_bboxes
209
+
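
# Illustrative sketch of the composite key used by bbox_sort above (made-up indices):
# multiplying idx_x by 100000 makes (idx_x, idx_y) pairs sort column-first, assuming
# idx_y never reaches 100000.
def _example_composite_sort_key():
    idx_pairs = [(1, 0), (0, 2), (0, 1), (1, 1)]
    ordered = sorted(idx_pairs, key=lambda p: p[0] * 100000 + p[1])
    assert ordered == [(0, 1), (0, 2), (1, 0), (1, 1)]
    return ordered
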

################################################################################
# The second sorting algorithm.
# When computing idx_x and idx_y it ignores extension lines and only counts
# occlusion by the actual width or height of a box.
################################################################################

def find_left_nearest_bbox(this_bbox, all_bboxes) -> list:
    """
    find the bboxes in all_bboxes whose right side overlaps this_bbox in height
    """
    left_boxes = [box for box in all_bboxes if box[X1_IDX] <= this_bbox[X0_IDX] and any([
        box[Y0_IDX] < this_bbox[Y0_IDX] < box[Y1_IDX], box[Y0_IDX] < this_bbox[Y1_IDX] < box[Y1_IDX],
        this_bbox[Y0_IDX] < box[Y0_IDX] < this_bbox[Y1_IDX], this_bbox[Y0_IDX] < box[Y1_IDX] < this_bbox[Y1_IDX],
        box[Y0_IDX]==this_bbox[Y0_IDX] and box[Y1_IDX]==this_bbox[Y1_IDX]])]

    # then keep only the one horizontally closest to this_bbox
    if len(left_boxes) > 0:
        left_boxes.sort(key=lambda x: x[X1_IDX], reverse=True)
        left_boxes = [left_boxes[0]]
    else:
        left_boxes = []
    return left_boxes


def get_and_set_idx_x_2(this_bbox, all_bboxes):
    """
    compute the direct-occlusion depth idx_x of this_bbox within all_bboxes
    this depth ignores extension lines; only occlusion by the actual width or height counts
    """
    if this_bbox[IDX_X] is not None:
        return this_bbox[IDX_X]
    else:
        left_nearest_bbox = find_left_nearest_bbox(this_bbox, all_bboxes)
        if len(left_nearest_bbox) == 0:
            this_bbox[IDX_X] = 0
        else:
            left_idx_x = get_and_set_idx_x_2(left_nearest_bbox[0], all_bboxes)
            this_bbox[IDX_X] = left_idx_x + 1
        return this_bbox[IDX_X]


def find_top_nearest_bbox(this_bbox, all_bboxes) -> list:
    """
    find the bboxes in all_bboxes whose bottom side overlaps this_bbox in width
    """
    top_boxes = [box for box in all_bboxes if box[Y1_IDX] <= this_bbox[Y0_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]
    # then keep only the one vertically closest to this_bbox
    if len(top_boxes) > 0:
        top_boxes.sort(key=lambda x: x[Y1_IDX], reverse=True)
        top_boxes = [top_boxes[0]]
    else:
        top_boxes = []
    return top_boxes


def get_and_set_idx_y_2(this_bbox, all_bboxes):
    """
    compute the direct-occlusion depth idx_y of this_bbox within all_bboxes
    this depth ignores extension lines; only occlusion by the actual width or height counts
    """
    if this_bbox[IDX_Y] is not None:
        return this_bbox[IDX_Y]
    else:
        top_nearest_bbox = find_top_nearest_bbox(this_bbox, all_bboxes)
        if len(top_nearest_bbox) == 0:
            this_bbox[IDX_Y] = 0
        else:
            top_idx_y = get_and_set_idx_y_2(top_nearest_bbox[0], all_bboxes)
            this_bbox[IDX_Y] = top_idx_y + 1
        return this_bbox[IDX_Y]


def paper_bbox_sort(all_bboxes: list, page_width, page_height):
    all_bboxes_idx_x = [get_and_set_idx_x_2(bbox, all_bboxes) for bbox in all_bboxes]
    all_bboxes_idx_y = [get_and_set_idx_y_2(bbox, all_bboxes) for bbox in all_bboxes]
    all_bboxes_idx = [(idx_x, idx_y) for idx_x, idx_y in zip(all_bboxes_idx_x, all_bboxes_idx_y)]

    all_bboxes_idx = [idx_x_y[0] * 100000 + idx_x_y[1] for idx_x_y in all_bboxes_idx]  # collapse into one key so boxes sort by X first and by Y when X is equal
    all_bboxes_idx = list(zip(all_bboxes_idx, all_bboxes))
    all_bboxes_idx.sort(key=lambda x: x[0])
    sorted_bboxes = [bbox for idx, bbox in all_bboxes_idx]
    return sorted_bboxes

################################################################################
"""
The third sorting algorithm. Let X0 be the leftmost edge of the page, X1 the rightmost, Y0 the top and Y1 the bottom.
It adds a bbox preprocessing step on top of the second algorithm. The preprocessing idea:
1. First extend the bboxes horizontally:
    - For each bbox, find the nearest bbox on its left (overlapping in y) and extend its left edge to that neighbour's right edge (x1+1); the +1 avoids overlap. Without a left neighbour, extend the left edge to the page's left edge X0.
    - For each bbox, find the nearest bbox on its right (overlapping in y) and extend its right edge to that neighbour's left edge (x0-1); the -1 avoids overlap. Without a right neighbour, extend the right edge to the page's right edge X1.
    - After these two steps every bbox spans its maximal horizontal range [left neighbour.x1+1, right neighbour.x0-1].

2. Merge all consecutive horizontal bboxes:
    - Sort the bboxes by y, traverse them top to bottom, and merge the current bbox with the next one whenever both have x0, x1 equal to X0, X1.

3. Then extend the bboxes vertically:
    - First cut the merged horizontal bboxes out of the page, producing several new blocks.
    For each block:
    - x0: find all bboxes left of the extension line x=x0, take the largest x1 and set x0=x1+1. Without such a bbox, x0=X0.
    - x1: find all bboxes right of the extension line x=x1, take the smallest x0 and set x1=x0-1. Without such a bbox, x1=X1.
    Then merge all consecutive blocks vertically:
    - Sort the blocks by x, traverse them left to right, and merge the current block with the next one whenever their x0, x1 are equal.
    If the vertical split assigns every small bbox to a block, the split is complete and the merged blocks are labeled 'GOOD_LAYOUT'.
    If some vertical strip cannot be fully split into one block, that block is labeled 'BAD_LAYOUT'.
    This finishes the preprocessing of a page; every natural block is either 'GOOD_LAYOUT' or 'BAD_LAYOUT'. Pages containing 'BAD_LAYOUT' can either be naturally ordered top-to-bottom, left-to-right, or such books can simply be filtered out.
    (To strengthen later: split horizontally as well, so the chaotic layout parts are cut away as far as possible.)
"""
################################################################################
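
# Numeric sketch of preprocessing step 1 above (made-up coordinates): a box whose left
# neighbour ends at x1=100 and whose right neighbour starts at x0=300 is extended to
# [101, 299]; without neighbours it would span the page bounds [X0, X1] instead.
def _example_horizontal_extension(left_neighbor_x1=100, right_neighbor_x0=300):
    ext_x0 = left_neighbor_x1 + 1   # +1 keeps the boxes from overlapping
    ext_x1 = right_neighbor_x0 - 1  # -1 keeps the boxes from overlapping
    assert (ext_x0, ext_x1) == (101, 299)
    return ext_x0, ext_x1
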
def find_left_neighbor_bboxes(this_bbox, all_bboxes) -> list:
    """
    find the bboxes in all_bboxes whose right side overlaps this_bbox in height
    the extended bboxes are used here
    """
    left_boxes = [box for box in all_bboxes if box[X1_EXT_IDX] <= this_bbox[X0_EXT_IDX] and any([
        box[Y0_EXT_IDX] < this_bbox[Y0_EXT_IDX] < box[Y1_EXT_IDX], box[Y0_EXT_IDX] < this_bbox[Y1_EXT_IDX] < box[Y1_EXT_IDX],
        this_bbox[Y0_EXT_IDX] < box[Y0_EXT_IDX] < this_bbox[Y1_EXT_IDX], this_bbox[Y0_EXT_IDX] < box[Y1_EXT_IDX] < this_bbox[Y1_EXT_IDX],
        box[Y0_EXT_IDX]==this_bbox[Y0_EXT_IDX] and box[Y1_EXT_IDX]==this_bbox[Y1_EXT_IDX]])]

    # sort them so the horizontally closest one comes first; all neighbours are returned
    left_boxes.sort(key=lambda x: x[X1_EXT_IDX], reverse=True)
    return left_boxes

def find_top_neighbor_bboxes(this_bbox, all_bboxes) -> list:
    """
    find the bboxes in all_bboxes whose bottom side overlaps this_bbox in width
    the extended bboxes are used here
    """
    top_boxes = [box for box in all_bboxes if box[Y1_EXT_IDX] <= this_bbox[Y0_EXT_IDX] and any([
        box[X0_EXT_IDX] < this_bbox[X0_EXT_IDX] < box[X1_EXT_IDX], box[X0_EXT_IDX] < this_bbox[X1_EXT_IDX] < box[X1_EXT_IDX],
        this_bbox[X0_EXT_IDX] < box[X0_EXT_IDX] < this_bbox[X1_EXT_IDX], this_bbox[X0_EXT_IDX] < box[X1_EXT_IDX] < this_bbox[X1_EXT_IDX],
        box[X0_EXT_IDX]==this_bbox[X0_EXT_IDX] and box[X1_EXT_IDX]==this_bbox[X1_EXT_IDX]])]
    # sort them so the vertically closest one comes first; all neighbours are returned
    top_boxes.sort(key=lambda x: x[Y1_EXT_IDX], reverse=True)
    return top_boxes

def get_and_set_idx_x_2_ext(this_bbox, all_bboxes):
    """
    compute the direct-occlusion depth idx_x of this_bbox within all_bboxes
    this depth ignores extension lines; only occlusion by the actual width or height counts
    """
    if this_bbox[IDX_X] is not None:
        return this_bbox[IDX_X]
    else:
        left_nearest_bbox = find_left_neighbor_bboxes(this_bbox, all_bboxes)
        if len(left_nearest_bbox) == 0:
            this_bbox[IDX_X] = 0
        else:
            left_idx_x = [get_and_set_idx_x_2_ext(b, all_bboxes) for b in left_nearest_bbox]
            this_bbox[IDX_X] = mymax(left_idx_x) + 1
        return this_bbox[IDX_X]

def get_and_set_idx_y_2_ext(this_bbox, all_bboxes):
    """
    compute the direct-occlusion depth idx_y of this_bbox within all_bboxes
    this depth ignores extension lines; only occlusion by the actual width or height counts
    """
    if this_bbox[IDX_Y] is not None:
        return this_bbox[IDX_Y]
    else:
        top_nearest_bbox = find_top_neighbor_bboxes(this_bbox, all_bboxes)
        if len(top_nearest_bbox) == 0:
            this_bbox[IDX_Y] = 0
        else:
            top_idx_y = [get_and_set_idx_y_2_ext(b, all_bboxes) for b in top_nearest_bbox]
            this_bbox[IDX_Y] = mymax(top_idx_y) + 1
        return this_bbox[IDX_Y]

def _paper_bbox_sort_ext(all_bboxes: list):
    all_bboxes_idx_x = [get_and_set_idx_x_2_ext(bbox, all_bboxes) for bbox in all_bboxes]
    all_bboxes_idx_y = [get_and_set_idx_y_2_ext(bbox, all_bboxes) for bbox in all_bboxes]
    all_bboxes_idx = [(idx_x, idx_y) for idx_x, idx_y in zip(all_bboxes_idx_x, all_bboxes_idx_y)]

    all_bboxes_idx = [idx_x_y[0] * 100000 + idx_x_y[1] for idx_x_y in all_bboxes_idx]  # collapse into one key so boxes sort by X first and by Y when X is equal
    all_bboxes_idx = list(zip(all_bboxes_idx, all_bboxes))
    all_bboxes_idx.sort(key=lambda x: x[0])
    sorted_bboxes = [bbox for idx, bbox in all_bboxes_idx]
    return sorted_bboxes

# ===============================================================================================
def find_left_bbox_ext_line(this_bbox, all_bboxes) -> list:
    """
    find the nearest bbox left of this_bbox, using extension lines
    """
    left_boxes = [box for box in all_bboxes if box[X1_IDX] <= this_bbox[X0_IDX]]
    if len(left_boxes):
        left_boxes.sort(key=lambda x: x[X1_IDX], reverse=True)
        left_boxes = left_boxes[0]
    else:
        left_boxes = None

    return left_boxes

def find_right_bbox_ext_line(this_bbox, all_bboxes) -> list:
    """
    find the nearest bbox right of this_bbox, using extension lines
    """
    right_boxes = [box for box in all_bboxes if box[X0_IDX] >= this_bbox[X1_IDX]]
    if len(right_boxes):
        right_boxes.sort(key=lambda x: x[X0_IDX])
        right_boxes = right_boxes[0]
    else:
        right_boxes = None
    return right_boxes

# =============================================================================================

def find_left_nearest_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    find the bbox in all_bboxes whose right side overlaps this_bbox in height, without using extension lines
    """
    left_boxes = [box for box in all_bboxes if box[X1_IDX] <= this_bbox[X0_IDX] and any([
        box[Y0_IDX] < this_bbox[Y0_IDX] < box[Y1_IDX], box[Y0_IDX] < this_bbox[Y1_IDX] < box[Y1_IDX],
        this_bbox[Y0_IDX] < box[Y0_IDX] < this_bbox[Y1_IDX], this_bbox[Y0_IDX] < box[Y1_IDX] < this_bbox[Y1_IDX],
        box[Y0_IDX]==this_bbox[Y0_IDX] and box[Y1_IDX]==this_bbox[Y1_IDX]])]

    # then keep only the horizontally closest one, i.e. the one with the largest x1
    if len(left_boxes) > 0:
        left_boxes.sort(key=lambda x: x[X1_EXT_IDX] if x[X1_EXT_IDX] else x[X1_IDX], reverse=True)
        left_boxes = left_boxes[0]
    else:
        left_boxes = None
    return left_boxes

def find_right_nearst_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest to the right of this_bbox; it must be a direct occluder
    """
    right_bboxes = [box for box in all_bboxes if box[X0_IDX] >= this_bbox[X1_IDX] and any([
        this_bbox[Y0_IDX] < box[Y0_IDX] < this_bbox[Y1_IDX], this_bbox[Y0_IDX] < box[Y1_IDX] < this_bbox[Y1_IDX],
        box[Y0_IDX] < this_bbox[Y0_IDX] < box[Y1_IDX], box[Y0_IDX] < this_bbox[Y1_IDX] < box[Y1_IDX],
        box[Y0_IDX]==this_bbox[Y0_IDX] and box[Y1_IDX]==this_bbox[Y1_IDX]])]

    if len(right_bboxes)>0:
        right_bboxes.sort(key=lambda x: x[X0_EXT_IDX] if x[X0_EXT_IDX] else x[X0_IDX])
        right_bboxes = right_bboxes[0]
    else:
        right_bboxes = None
    return right_bboxes

def reset_idx_x_y(all_boxes: list) -> list:
    for box in all_boxes:
        box[IDX_X] = None
        box[IDX_Y] = None

    return all_boxes

# ===================================================================================================
def find_top_nearest_bbox_direct(this_bbox, bboxes_collection) -> list:
    """
    find the bbox closest above this_bbox; it must be a direct occluder
    """
    top_bboxes = [box for box in bboxes_collection if box[Y1_IDX] <= this_bbox[Y0_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]
    # then keep only the one closest above this_bbox
    if len(top_bboxes) > 0:
        top_bboxes.sort(key=lambda x: x[Y1_IDX], reverse=True)
        top_bboxes = top_bboxes[0]
    else:
        top_bboxes = None
    return top_bboxes

def find_bottom_nearest_bbox_direct(this_bbox, bboxes_collection) -> list:
    """
    find the bbox closest below this_bbox; it must be a direct occluder
    """
    bottom_bboxes = [box for box in bboxes_collection if box[Y0_IDX] >= this_bbox[Y1_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]
    # then keep only the one closest below this_bbox
    if len(bottom_bboxes) > 0:
        bottom_bboxes.sort(key=lambda x: x[Y0_IDX])
        bottom_bboxes = bottom_bboxes[0]
    else:
        bottom_bboxes = None
    return bottom_bboxes

def find_boundry_bboxes(bboxes: list) -> tuple:
    """
    find the boundary of the bboxes: the smallest (x0, y0) and the largest (x1, y1) over all of them
    """
    x0, y0, x1, y1 = bboxes[0][X0_IDX], bboxes[0][Y0_IDX], bboxes[0][X1_IDX], bboxes[0][Y1_IDX]
    for box in bboxes:
        x0 = min(box[X0_IDX], x0)
        y0 = min(box[Y0_IDX], y0)
        x1 = max(box[X1_IDX], x1)
        y1 = max(box[Y1_IDX], y1)

    return x0, y0, x1, y1


def extend_bbox_vertical(bboxes: list, boundry_x0, boundry_y0, boundry_x1, boundry_y1) -> list:
    """
    vertically extend the bboxes that can be cut straight through, i.e. those with no other box above or below
    """
    for box in bboxes:
        top_nearest_bbox = find_top_nearest_bbox_direct(box, bboxes)
        bottom_nearest_bbox = find_bottom_nearest_bbox_direct(box, bboxes)
        if top_nearest_bbox is None and bottom_nearest_bbox is None:  # the box owns a full column
            box[X0_EXT_IDX] = box[X0_IDX]
            box[Y0_EXT_IDX] = boundry_y0
            box[X1_EXT_IDX] = box[X1_IDX]
            box[Y1_EXT_IDX] = boundry_y1
        # else:
        #     if top_nearest_bbox is None:
        #         box[Y0_EXT_IDX] = boundry_y0
        #     else:
        #         box[Y0_EXT_IDX] = top_nearest_bbox[Y1_IDX] + 1
        #     if bottom_nearest_bbox is None:
        #         box[Y1_EXT_IDX] = boundry_y1
        #     else:
        #         box[Y1_EXT_IDX] = bottom_nearest_bbox[Y0_IDX] - 1
        #     box[X0_EXT_IDX] = box[X0_IDX]
        #     box[X1_EXT_IDX] = box[X1_IDX]
    return bboxes


# ===================================================================================================

def paper_bbox_sort_v2(all_bboxes: list, page_width: int, page_height: int):
    """
    Sorting with an added preprocessing step.
    return:
        [
            {
                "layout_bbox": [x0, y0, x1, y1],
                "layout_label": "GOOD_LAYOUT/BAD_LAYOUT",
                "content_bboxes": []  # each element is [x0, y0, x1, y1, block_content, idx_x, idx_y, content_type, ext_x0, ext_y0, ext_x1, ext_y1], already in reading order
            }
        ]
    """
    sorted_layouts = []  # the final return value
    page_x0, page_y0, page_x1, page_y1 = 1, 1, page_width-1, page_height-1

    all_bboxes = paper_bbox_sort(all_bboxes, page_width, page_height)  # rough initial ordering
    # first, horizontally extend the bboxes that own a full row
    for bbox in all_bboxes:
        left_nearest_bbox = find_left_nearest_bbox_direct(bbox, all_bboxes)  # no extension lines
        right_nearest_bbox = find_right_nearst_bbox_direct(bbox, all_bboxes)
        if left_nearest_bbox is None and right_nearest_bbox is None:  # the bbox owns a full row
            bbox[X0_EXT_IDX] = page_x0
            bbox[Y0_EXT_IDX] = bbox[Y0_IDX]
            bbox[X1_EXT_IDX] = page_x1
            bbox[Y1_EXT_IDX] = bbox[Y1_IDX]

    # the row-exclusive bboxes now reach the page boundary; use that boundary condition to merge consecutive bboxes into a group
    if len(all_bboxes) == 1:
        return [{"layout_bbox": [page_x0, page_y0, page_x1, page_y1], "layout_label": "GOOD_LAYOUT", "content_bboxes": all_bboxes}]
    if len(all_bboxes) == 0:
        return []

    """
    then merge all consecutive horizontal bboxes

    """
    all_bboxes.sort(key=lambda x: x[Y0_IDX])
    h_bboxes = []
    h_bbox_group = []
    v_boxes = []

    for bbox in all_bboxes:
        if bbox[X0_EXT_IDX] == page_x0 and bbox[X1_EXT_IDX] == page_x1:
            h_bbox_group.append(bbox)
        else:
            if len(h_bbox_group) > 0:
                h_bboxes.append(h_bbox_group)
                h_bbox_group = []
    # the last group
    if len(h_bbox_group) > 0:
        h_bboxes.append(h_bbox_group)

    """
    h_bboxes now holds all the groups; each group is a list
    compute each group in h_bboxes and put it back into sorted_layouts
    """
    for gp in h_bboxes:
        gp.sort(key=lambda x: x[Y0_IDX])
        block_info = {"layout_label": "GOOD_LAYOUT", "content_bboxes": gp}
        # compute the layout_bbox of this group: the smallest x0, y0 and the largest x1, y1
        x0, y0, x1, y1 = gp[0][X0_EXT_IDX], gp[0][Y0_EXT_IDX], gp[-1][X1_EXT_IDX], gp[-1][Y1_EXT_IDX]
        block_info["layout_bbox"] = [x0, y0, x1, y1]
        sorted_layouts.append(block_info)

    # next, use the y0, y1 of these consecutive horizontal groups to split the remaining boxes into several horizontal bands
    h_split_lines = [page_y0]
    for gp in h_bboxes:  # each gp is a list of bboxes, so read its vertical extent from the first and last box
        y0, y1 = gp[0][Y0_IDX], gp[-1][Y1_IDX]
        h_split_lines.append(y0)
        h_split_lines.append(y1)
    h_split_lines.append(page_y1)

    unsplited_bboxes = []
    for i in range(0, len(h_split_lines), 2):
        start_y0, start_y1 = h_split_lines[i:i+2]
        # collect the other bboxes between [start_y0, start_y1]; they form one unsplit block
        bboxes_in_block = [bbox for bbox in all_bboxes if bbox[Y0_IDX] >= start_y0 and bbox[Y1_IDX] <= start_y1]
        unsplited_bboxes.append(bboxes_in_block)
    # ================== the horizontal direction is now split and sorted ====================================
    """
    Next split every non-horizontal part vertically.
    Only the bboxes that cannot be cut through horizontally remain. Extend them vertically first, then split vertically, in these steps:
    1. isolate the boxes that can be cut straight through vertically as their own layout
    2. split the rest vertically first
    3. try to split the vertically split parts horizontally again
    4. whatever still cannot be split becomes one layout each
    """
    # split each part vertically
    for bboxes_in_block in unsplited_bboxes:
        if len(bboxes_in_block) == 0:
            continue
        # first extend the bboxes of this block vertically
        boundry_x0, boundry_y0, boundry_x1, boundry_y1 = find_boundry_bboxes(bboxes_in_block)
        # do the vertical extension
        extended_vertical_bboxes = extend_bbox_vertical(bboxes_in_block, boundry_x0, boundry_y0, boundry_x1, boundry_y1)
        # then split this block vertically
        extended_vertical_bboxes.sort(key=lambda x: x[X0_IDX])  # ascending x, i.e. reading from left to right
        v_boxes_group = []
        for bbox in extended_vertical_bboxes:
            if bbox[Y0_EXT_IDX] == boundry_y0 and bbox[Y1_EXT_IDX] == boundry_y1:
                v_boxes_group.append(bbox)
            else:
                if len(v_boxes_group) > 0:
                    v_boxes.append(v_boxes_group)
                    v_boxes_group = []

        if len(v_boxes_group) > 0:
            v_boxes.append(v_boxes_group)

    # put the consecutive vertical parts into sorted_layouts; at this point they are already consecutive because of the step above
    for gp in v_boxes:
        gp.sort(key=lambda x: x[X0_IDX])
        block_info = {"layout_label": "GOOD_LAYOUT", "content_bboxes": gp}
        # compute the layout_bbox of this group: the smallest x0, y0 and the largest x1, y1
        x0, y0, x1, y1 = gp[0][X0_EXT_IDX], gp[0][Y0_EXT_IDX], gp[-1][X1_EXT_IDX], gp[-1][Y1_EXT_IDX]
        block_info["layout_bbox"] = [x0, y0, x1, y1]
        sorted_layouts.append(block_info)

    # split sub-blocks vertically, i.e. cut along the through-going vertical lines. The resulting blocks can very likely be split vertically; if not fully, try horizontally; if neither works, treat each as one layout
    v_split_lines = [boundry_x0]
    for gp in v_boxes:  # each gp is a list of bboxes, so read its horizontal extent from the first and last box
        x0, x1 = gp[0][X0_IDX], gp[-1][X1_IDX]
        v_split_lines.append(x0)
        v_split_lines.append(x1)
    v_split_lines.append(boundry_x1)

    reset_idx_x_y(all_bboxes)
    all_boxes = _paper_bbox_sort_ext(all_bboxes)
    return all_boxes

magic_pdf/layout/layout_det_utils.py ADDED
@@ -0,0 +1,182 @@
from magic_pdf.layout.bbox_sort import X0_EXT_IDX, X0_IDX, X1_EXT_IDX, X1_IDX, Y0_IDX, Y1_EXT_IDX, Y1_IDX
from magic_pdf.libs.boxbase import _is_bottom_full_overlap, _left_intersect, _right_intersect


def find_all_left_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    Find the bbox in all_bboxes whose right side overlaps this_bbox vertically, without extension lines.
    Boxes that intersect this_bbox on the left are also considered: an intersecting box does not count as the leftmost one.
    """
    left_boxes = [box for box in all_bboxes if box[X1_IDX] <= this_bbox[X0_IDX]
                  and any([
        box[Y0_IDX] < this_bbox[Y0_IDX] < box[Y1_IDX], box[Y0_IDX] < this_bbox[Y1_IDX] < box[Y1_IDX],
        this_bbox[Y0_IDX] < box[Y0_IDX] < this_bbox[Y1_IDX], this_bbox[Y0_IDX] < box[Y1_IDX] < this_bbox[Y1_IDX],
        box[Y0_IDX]==this_bbox[Y0_IDX] and box[Y1_IDX]==this_bbox[Y1_IDX]]) or _left_intersect(box[:4], this_bbox[:4])]

    # then keep only the horizontally closest one, i.e. the one with the largest x1
    if len(left_boxes) > 0:
        left_boxes.sort(key=lambda x: x[X1_EXT_IDX] if x[X1_EXT_IDX] else x[X1_IDX], reverse=True)
        left_boxes = left_boxes[0]
    else:
        left_boxes = None
    return left_boxes

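def _y_overlap(a, b):
    """
    A minimal sketch (hypothetical helper, not used by this module) of an interval test
    equivalent to the repeated y-overlap predicate above for non-degenerate boxes:
    a and b are [x0, y0, x1, y1]; they overlap vertically iff neither one ends before
    the other starts.
    """
    return not (a[3] <= b[1] or b[3] <= a[1])
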
def find_all_right_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest to the right of this_bbox; it must be a direct occluder
    """
    right_bboxes = [box for box in all_bboxes if box[X0_IDX] >= this_bbox[X1_IDX]
                    and any([
        this_bbox[Y0_IDX] < box[Y0_IDX] < this_bbox[Y1_IDX], this_bbox[Y0_IDX] < box[Y1_IDX] < this_bbox[Y1_IDX],
        box[Y0_IDX] < this_bbox[Y0_IDX] < box[Y1_IDX], box[Y0_IDX] < this_bbox[Y1_IDX] < box[Y1_IDX],
        box[Y0_IDX]==this_bbox[Y0_IDX] and box[Y1_IDX]==this_bbox[Y1_IDX]]) or _right_intersect(this_bbox[:4], box[:4])]

    if len(right_bboxes) > 0:
        right_bboxes.sort(key=lambda x: x[X0_EXT_IDX] if x[X0_EXT_IDX] else x[X0_IDX])
        right_bboxes = right_bboxes[0]
    else:
        right_bboxes = None
    return right_bboxes

def find_all_top_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest above this_bbox; it must be a direct occluder
    """
    top_bboxes = [box for box in all_bboxes if box[Y1_IDX] <= this_bbox[Y0_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(top_bboxes) > 0:
        top_bboxes.sort(key=lambda x: x[Y1_EXT_IDX] if x[Y1_EXT_IDX] else x[Y1_IDX], reverse=True)
        top_bboxes = top_bboxes[0]
    else:
        top_bboxes = None
    return top_bboxes

def find_all_bottom_bbox_direct(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest below this_bbox; it must be a direct occluder
    """
    bottom_bboxes = [box for box in all_bboxes if box[Y0_IDX] >= this_bbox[Y1_IDX] and any([
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(bottom_bboxes) > 0:
        bottom_bboxes.sort(key=lambda x: x[Y0_IDX])
        bottom_bboxes = bottom_bboxes[0]
    else:
        bottom_bboxes = None
    return bottom_bboxes

# ===================================================================================================================
def find_bottom_bbox_direct_from_right_edge(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest below this_bbox; it must be a direct occluder
    """
    bottom_bboxes = [box for box in all_bboxes if box[Y0_IDX] >= this_bbox[Y1_IDX] and any([
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(bottom_bboxes) > 0:
        # the one with the smallest y0 and the largest x1: its top edge is closest to this_bbox and it is the rightmost
        bottom_bboxes.sort(key=lambda x: x[Y0_IDX])
        bottom_bboxes = [box for box in bottom_bboxes if box[Y0_IDX]==bottom_bboxes[0][Y0_IDX]]
        # then, among equal y0, take the one with the largest x1
        bottom_bboxes.sort(key=lambda x: x[X1_IDX], reverse=True)
        bottom_bboxes = bottom_bboxes[0]
    else:
        bottom_bboxes = None
    return bottom_bboxes

def find_bottom_bbox_direct_from_left_edge(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest below this_bbox; it must be a direct occluder
    """
    bottom_bboxes = [box for box in all_bboxes if box[Y0_IDX] >= this_bbox[Y1_IDX] and any([
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(bottom_bboxes) > 0:
        # the one with the smallest y0 and the smallest x0
        bottom_bboxes.sort(key=lambda x: x[Y0_IDX])
        bottom_bboxes = [box for box in bottom_bboxes if box[Y0_IDX]==bottom_bboxes[0][Y0_IDX]]
        # then, among equal y0, take the one with the smallest x0
        bottom_bboxes.sort(key=lambda x: x[X0_IDX])
        bottom_bboxes = bottom_bboxes[0]
    else:
        bottom_bboxes = None
    return bottom_bboxes

def find_top_bbox_direct_from_left_edge(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest above this_bbox; it must be a direct occluder
    """
    top_bboxes = [box for box in all_bboxes if box[Y1_IDX] <= this_bbox[Y0_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(top_bboxes) > 0:
        # the one with the largest y1 and the smallest x0
        top_bboxes.sort(key=lambda x: x[Y1_IDX], reverse=True)
        top_bboxes = [box for box in top_bboxes if box[Y1_IDX]==top_bboxes[0][Y1_IDX]]
        # then, among equal y1, take the one with the smallest x0
        top_bboxes.sort(key=lambda x: x[X0_IDX])
        top_bboxes = top_bboxes[0]
    else:
        top_bboxes = None
    return top_bboxes

def find_top_bbox_direct_from_right_edge(this_bbox, all_bboxes) -> list:
    """
    find the bbox closest above this_bbox; it must be a direct occluder
    """
    top_bboxes = [box for box in all_bboxes if box[Y1_IDX] <= this_bbox[Y0_IDX] and any([
        box[X0_IDX] < this_bbox[X0_IDX] < box[X1_IDX], box[X0_IDX] < this_bbox[X1_IDX] < box[X1_IDX],
        this_bbox[X0_IDX] < box[X0_IDX] < this_bbox[X1_IDX], this_bbox[X0_IDX] < box[X1_IDX] < this_bbox[X1_IDX],
        box[X0_IDX]==this_bbox[X0_IDX] and box[X1_IDX]==this_bbox[X1_IDX]])]

    if len(top_bboxes) > 0:
        # the one with the largest y1 and the largest x1
        top_bboxes.sort(key=lambda x: x[Y1_IDX], reverse=True)
        top_bboxes = [box for box in top_bboxes if box[Y1_IDX]==top_bboxes[0][Y1_IDX]]
        # then, among equal y1, take the one with the largest x1
        top_bboxes.sort(key=lambda x: x[X1_IDX], reverse=True)
        top_bboxes = top_bboxes[0]
    else:
        top_bboxes = None
    return top_bboxes

# ===================================================================================================================

def get_left_edge_bboxes(all_bboxes) -> list:
    """
    return the bboxes on the left edge
    """
    left_bboxes = [box for box in all_bboxes if find_all_left_bbox_direct(box, all_bboxes) is None]
    return left_bboxes

def get_right_edge_bboxes(all_bboxes) -> list:
    """
    return the bboxes on the right edge
    """
    right_bboxes = [box for box in all_bboxes if find_all_right_bbox_direct(box, all_bboxes) is None]
    return right_bboxes

def fix_vertical_bbox_pos(bboxes: list):
    """
    Check whether these bboxes overlap slightly in the vertical direction and, if so, push the overlapped bbox down a little.
    In the x direction one box must contain the other, or they must overlap completely; partial overlap is not allowed.
    """
    bboxes.sort(key=lambda x: x[Y0_IDX])  # arrange from top to bottom
    for i in range(0, len(bboxes)):
        for j in range(i+1, len(bboxes)):
            if _is_bottom_full_overlap(bboxes[i][:4], bboxes[j][:4]):
                # the two bboxes overlap, so push the lower one down a little
                bboxes[j][Y0_IDX] = bboxes[i][Y1_IDX] + 2  # 2 is an empirical value
                break
    return bboxes
magic_pdf/layout/layout_sort.py ADDED
@@ -0,0 +1,732 @@
"""
Detect the layout of the boxes on a pdf page and sort the boxes inside each layout.
"""

from loguru import logger
from magic_pdf.layout.bbox_sort import CONTENT_IDX, CONTENT_TYPE_IDX, X0_EXT_IDX, X0_IDX, X1_EXT_IDX, X1_IDX, Y0_EXT_IDX, Y0_IDX, Y1_EXT_IDX, Y1_IDX, paper_bbox_sort
from magic_pdf.layout.layout_det_utils import find_all_left_bbox_direct, find_all_right_bbox_direct, find_bottom_bbox_direct_from_left_edge, find_bottom_bbox_direct_from_right_edge, find_top_bbox_direct_from_left_edge, find_top_bbox_direct_from_right_edge, find_all_top_bbox_direct, find_all_bottom_bbox_direct, get_left_edge_bboxes, get_right_edge_bboxes
from magic_pdf.libs.boxbase import get_bbox_in_boundry


LAYOUT_V = "V"
LAYOUT_H = "H"
LAYOUT_UNPROC = "U"
LAYOUT_BAD = "B"

def _is_single_line_text(bbox):
    """
    check whether the text inside the bbox is a single line
    """
    return True  # TODO the code below is currently unreachable
    box_type = bbox[CONTENT_TYPE_IDX]
    if box_type != 'text':
        return False
    paras = bbox[CONTENT_IDX]["paras"]
    text_content = ""
    for para_id, para in paras.items():  # assemble the paragraph text inside the box
        is_title = para['is_title']
        if is_title != 0:
            text_content += f"## {para['text']}"
        else:
            text_content += para["text"]
        text_content += "\n\n"

    return bbox[CONTENT_TYPE_IDX] == 'text' and len(text_content.split("\n\n")) <= 1


def _horizontal_split(bboxes: list, boundry: tuple, avg_font_size=20) -> list:
    """
    Split the bboxes horizontally.
    Method: find the boxes with no direct occluder on either the left or the right, extend them, then cut.
    return:
        several large layout regions [[x0, y0, x1, y1, "h|u|v"], ]; h means horizontal, u undetected, v vertical layout
    """
    sorted_layout_blocks = []  # the value returned at the end

    bound_x0, bound_y0, bound_x1, bound_y1 = boundry
    all_bboxes = get_bbox_in_boundry(bboxes, boundry)
    #all_bboxes = paper_bbox_sort(all_bboxes, abs(bound_x1-bound_x0), abs(bound_y1-bound_x0)) # rough initial ordering, based on direct occlusion
    """
    first, horizontally extend the bboxes that own a full row

    """
    last_h_split_line_y1 = bound_y0  # remember the previous horizontal split line
    for i, bbox in enumerate(all_bboxes):
        left_nearest_bbox = find_all_left_bbox_direct(bbox, all_bboxes)  # no extension lines
        right_nearest_bbox = find_all_right_bbox_direct(bbox, all_bboxes)
        if left_nearest_bbox is None and right_nearest_bbox is None:  # owns a full row
            """
            However, an isolated single line of text only qualifies if it also satisfies one of:
            1. the bbox crosses the center line, or
            2. another row-exclusive horizontal bbox of the same kind exists above or below it, or
            3. TODO stronger condition: if the boxes above and below this bbox belong to the same column, it must not count as row-exclusive
            """
            # first check whether the bbox holds only a single line of text
            is_single_line = _is_single_line_text(bbox)
            """
            One caveat: when the page content is not centered, the first call passes the page boundry, and mid_x is then not the center line.
            So compute the tightest boundry here first and derive mid_x from it.
            """
            boundry_real_x0, boundry_real_x1 = min([bbox[X0_IDX] for bbox in all_bboxes]), max([bbox[X1_IDX] for bbox in all_bboxes])
            mid_x = (boundry_real_x0+boundry_real_x1)/2
            # check whether the content of this box crosses the center line
            # it must reach at least two character widths past it
            is_cross_boundry_mid_line = min(mid_x-bbox[X0_IDX], bbox[X1_IDX]-mid_x) > avg_font_size*2
            """
            check condition 2
            """
            is_belong_to_col = False
            """
            Check whether the column above can absorb this bbox. Method:
            1. the region above is non-empty and not row-exclusive, and
            2. from the previous horizontal split (max y=y1) down to this bbox, the [min_x0, max_x1] of the leftmost bboxes covers this box's [x0, x1]
            """
            """
            search upward iteratively within [bound_x0, last_h_sp, bound_x1, bbox[Y0_IDX]]
            """
            # first fix the y range above
            b_y0, b_y1 = last_h_split_line_y1, bbox[Y0_IDX]
            # then, starting from this box, walk upward collecting every box that intersects it in x
            box_to_check = [bound_x0, b_y0, bound_x1, b_y1]
            bbox_in_bound_check = get_bbox_in_boundry(all_bboxes, box_to_check)

            bboxes_on_top = []
            virtual_box = bbox
            while True:
                b_on_top = find_all_top_bbox_direct(virtual_box, bbox_in_bound_check)
                if b_on_top is not None:
                    bboxes_on_top.append(b_on_top)
                    virtual_box = [min([virtual_box[X0_IDX], b_on_top[X0_IDX]]), min(virtual_box[Y0_IDX], b_on_top[Y0_IDX]), max([virtual_box[X1_IDX], b_on_top[X1_IDX]]), b_y1]
                else:
                    break

            # then take the min x0 and max x1 of those boxes
            if len(bboxes_on_top) > 0 and len(bboxes_on_top) != len(bbox_in_bound_check):  # virtual_box may have grown to fill the whole region, in which case it can no longer belong to a single column
                min_x0, max_x1 = virtual_box[X0_IDX], virtual_box[X1_IDX]
                # then, somewhat coarsely, check whether min_x0/max_x1 intersect any box inside [bound_x0, last_h_sp, bound_x1, bbox[Y0_IDX]]

                if not any([b[X0_IDX] <= min_x0-1 <= b[X1_IDX] or b[X0_IDX] <= max_x1+1 <= b[X1_IDX] for b in bbox_in_bound_check]):
                    # neither side can be extended into a row; for now only the top is checked TODO
                    top_nearest_bbox = find_all_top_bbox_direct(bbox, bboxes)
                    bottom_nearest_bbox = find_all_bottom_bbox_direct(bbox, bboxes)
                    if not any([
                        top_nearest_bbox is not None and (find_all_left_bbox_direct(top_nearest_bbox, bboxes) is None and find_all_right_bbox_direct(top_nearest_bbox, bboxes) is None),
                        bottom_nearest_bbox is not None and (find_all_left_bbox_direct(bottom_nearest_bbox, bboxes) is None and find_all_right_bbox_direct(bottom_nearest_bbox, bboxes) is None),
                        top_nearest_bbox is None or bottom_nearest_bbox is None
                    ]):
                        is_belong_to_col = True

            # check whether the column below can absorb it TODO

            """
            Why is there no is_cross_boundry_mid_line condition here?
            Some journals really do have two columns of unequal width.
            """
            if not is_belong_to_col or is_cross_boundry_mid_line:
                bbox[X0_EXT_IDX] = bound_x0
                bbox[Y0_EXT_IDX] = bbox[Y0_IDX]
                bbox[X1_EXT_IDX] = bound_x1
                bbox[Y1_EXT_IDX] = bbox[Y1_IDX]
                last_h_split_line_y1 = bbox[Y1_IDX]  # move the split line down
            else:
                continue
    """
    The row-exclusive bboxes now reach the given boundary; use that boundary condition to merge consecutive bboxes into a group,
    then merge all consecutive horizontal bboxes.
    """
    all_bboxes.sort(key=lambda x: x[Y0_IDX])
    h_bboxes = []
    h_bbox_group = []

    for bbox in all_bboxes:
        if bbox[X0_EXT_IDX] == bound_x0 and bbox[X1_EXT_IDX] == bound_x1:
            h_bbox_group.append(bbox)
        else:
            if len(h_bbox_group) > 0:
                h_bboxes.append(h_bbox_group)
                h_bbox_group = []
    # the last group
    if len(h_bbox_group) > 0:
        h_bboxes.append(h_bbox_group)

    """
    h_bboxes now holds all the groups; each group is a list
    compute each group in h_bboxes and put it back into sorted_layouts
    """
    h_layouts = []
    for gp in h_bboxes:
        gp.sort(key=lambda x: x[Y0_IDX])
        # compute the layout_bbox of this group: the smallest x0, y0 and the largest x1, y1
        x0, y0, x1, y1 = gp[0][X0_EXT_IDX], gp[0][Y0_EXT_IDX], gp[-1][X1_EXT_IDX], gp[-1][Y1_EXT_IDX]
        h_layouts.append([x0, y0, x1, y1, LAYOUT_H])  # a horizontal layout

    """
    next, use the y0, y1 of these consecutive horizontal layout_bboxes to split the remaining boxes into several horizontal bands
    """
    h_split_lines = [bound_y0]
    for gp in h_bboxes:  # gp is a list[bbox_list]
        y0, y1 = gp[0][1], gp[-1][3]
        h_split_lines.append(y0)
        h_split_lines.append(y1)
    h_split_lines.append(bound_y1)

    unsplited_bboxes = []
    for i in range(0, len(h_split_lines), 2):
        start_y0, start_y1 = h_split_lines[i:i+2]
        # collect the other bboxes between [start_y0, start_y1]; they form one unsplit block
        bboxes_in_block = [bbox for bbox in all_bboxes if bbox[Y0_IDX] >= start_y0 and bbox[Y1_IDX] <= start_y1]
        unsplited_bboxes.append(bboxes_in_block)
    # then add the unprocessed parts to h_layouts
    for bboxes_in_block in unsplited_bboxes:
        if len(bboxes_in_block) == 0:
            continue
        x0, y0, x1, y1 = bound_x0, min([bbox[Y0_IDX] for bbox in bboxes_in_block]), bound_x1, max([bbox[Y1_IDX] for bbox in bboxes_in_block])
        h_layouts.append([x0, y0, x1, y1, LAYOUT_UNPROC])

    h_layouts.sort(key=lambda x: x[1])  # sort by y0, i.e. from top to bottom

    """
    convert into the following format and return
    """
    for layout in h_layouts:
        sorted_layout_blocks.append({
            "layout_bbox": layout[:4],
            "layout_label": layout[4],
            "sub_layout": [],
        })
    return sorted_layout_blocks

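# Illustrative sketch of the structure _horizontal_split returns (made-up numbers):
# one horizontal band followed by a region left for further, vertical, processing.
_EXAMPLE_SPLIT_RESULT = [
    {"layout_bbox": [0, 0, 612, 80], "layout_label": LAYOUT_H, "sub_layout": []},
    {"layout_bbox": [0, 80, 612, 792], "layout_label": LAYOUT_UNPROC, "sub_layout": []},
]
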
###############################################################################################
#
# processing in the vertical direction
#
#
###############################################################################################
def _vertical_align_split_v1(bboxes: list, boundry: tuple) -> list:
    """
    Compute vertical alignment and split the bboxes into layouts; handles the column-wise split of a multi-row column.
    Whatever cannot be fully split is returned as a layout with layout_label "u".
    -----------------------
    |        |       |
    |        |       |
    |        |       |
    |        |       |
    -------------------------
    this function splits the layout above into 2 columns
    """
    sorted_layout_blocks = []  # the value returned at the end
    new_boundry = [boundry[0], boundry[1], boundry[2], boundry[3]]

    v_blocks = []
    """
    first cut from left to right
    """
    while True:
        all_bboxes = get_bbox_in_boundry(bboxes, new_boundry)
        left_edge_bboxes = get_left_edge_bboxes(all_bboxes)
        if len(left_edge_bboxes) == 0:
            break
        right_split_line_x1 = max([bbox[X1_IDX] for bbox in left_edge_bboxes])+1
        # then check whether this line intersects or touches the left edge of any other bbox
        if any([bbox[X0_IDX] <= right_split_line_x1 <= bbox[X1_IDX] for bbox in all_bboxes]):
            # the vertical split line cuts through some boxes, so a full vertical split is impossible
            break
        else:  # one column split off successfully
            # take the leftmost bbox edge as the layout's x0
            layout_x0 = min([bbox[X0_IDX] for bbox in left_edge_bboxes])  # mainly so the drawing keeps some spacing
            v_blocks.append([layout_x0, new_boundry[1], right_split_line_x1, new_boundry[3], LAYOUT_V])
            new_boundry[0] = right_split_line_x1  # move the boundary

    """
    then cut from right to left; whatever still cannot be fully split is returned as a layout with layout_label "u"
    """
    unsplited_block = []
    while True:
        all_bboxes = get_bbox_in_boundry(bboxes, new_boundry)
        right_edge_bboxes = get_right_edge_bboxes(all_bboxes)
        if len(right_edge_bboxes) == 0:
            break
        left_split_line_x0 = min([bbox[X0_IDX] for bbox in right_edge_bboxes])-1
        # then check whether this line intersects or touches the left edge of any other bbox
        if any([bbox[X0_IDX] <= left_split_line_x0 <= bbox[X1_IDX] for bbox in all_bboxes]):
            # this is what remains
            unsplited_block.append([new_boundry[0], new_boundry[1], new_boundry[2], new_boundry[3], LAYOUT_UNPROC])
            break
        else:
            # take the rightmost bbox edge as the layout's x1
            layout_x1 = max([bbox[X1_IDX] for bbox in right_edge_bboxes])
            v_blocks.append([left_split_line_x0, new_boundry[1], layout_x1, new_boundry[3], LAYOUT_V])
            new_boundry[2] = left_split_line_x0  # move the right boundary

    """
    finally assemble everything into the layout format and return
    """
    for block in v_blocks:
        sorted_layout_blocks.append({
            "layout_bbox": block[:4],
            "layout_label": block[4],
            "sub_layout": [],
        })
    for block in unsplited_block:
        sorted_layout_blocks.append({
            "layout_bbox": block[:4],
            "layout_label": block[4],
            "sub_layout": [],
        })

    # sort by x0
    sorted_layout_blocks.sort(key=lambda x: x['layout_bbox'][0])
    return sorted_layout_blocks

+ 
+ def _vertical_align_split_v2(bboxes: list, boundry: tuple) -> list:
+     """
+     Improved version of _vertical_align_split. The original algorithm treats a box
+     in the second column as part of the left column whenever nothing blocks it on
+     the left, so an entire multi-column layout can be recognized as one column.
+     This version starts from the top-left box and looks downwards, expanding
+     w_x0/w_x1 until the window can grow no further or the bottom boundary is reached.
+     """
+     sorted_layout_blocks = []  # the final return value
+     new_boundry = [boundry[0], boundry[1], boundry[2], boundry[3]]
+     bad_boxes = []  # boxes that a split line cut through
+     v_blocks = []
+     while True:
+         all_bboxes = get_bbox_in_boundry(bboxes, new_boundry)
+         if len(all_bboxes) == 0:
+             break
+         left_top_box = min(all_bboxes, key=lambda x: (x[X0_IDX], x[Y0_IDX]))  # TODO: should be hardened to verify the box really sits in the first column
+         start_box = [left_top_box[X0_IDX], left_top_box[Y0_IDX], left_top_box[X1_IDX], left_top_box[Y1_IDX]]
+         w_x0, w_x1 = left_top_box[X0_IDX], left_top_box[X1_IDX]
+         """
+         Walk down from this box to the nearest box below it, expanding w_x0, w_x1.
+         After an expansion the window gets wider; the line x = w_x1 is then used to
+         test whether any box inside the boundary intersects it. If one does, the
+         window cannot grow any further. Once it stops growing, check whether the
+         bottom boundary was reached:
+         1. reached: update the left boundary and split off the next column;
+         2. not reached: start splitting from the right side in the loop below.
+         """
+         while left_top_box is not None:  # search downwards
+             virtual_box = [w_x0, left_top_box[Y0_IDX], w_x1, left_top_box[Y1_IDX]]
+             left_top_box = find_bottom_bbox_direct_from_left_edge(virtual_box, all_bboxes)
+             if left_top_box:
+                 w_x0, w_x1 = min(virtual_box[X0_IDX], left_top_box[X0_IDX]), max([virtual_box[X1_IDX], left_top_box[X1_IDX]])
+         # In case the initial box sits in the middle of the column, also look upwards.
+         start_box = [w_x0, start_box[Y0_IDX], w_x1, start_box[Y1_IDX]]  # widen it first for robustness
+         left_top_box = find_top_bbox_direct_from_left_edge(start_box, all_bboxes)
+         while left_top_box is not None:  # search upwards
+             virtual_box = [w_x0, left_top_box[Y0_IDX], w_x1, left_top_box[Y1_IDX]]
+             left_top_box = find_top_bbox_direct_from_left_edge(virtual_box, all_bboxes)
+             if left_top_box:
+                 w_x0, w_x1 = min(virtual_box[X0_IDX], left_top_box[X0_IDX]), max([virtual_box[X1_IDX], left_top_box[X1_IDX]])
+ 
+         # check for intersections
+         if any([bbox[X0_IDX] <= w_x1 + 1 <= bbox[X1_IDX] for bbox in all_bboxes]):
+             for b in all_bboxes:
+                 if b[X0_IDX] <= w_x1 + 1 <= b[X1_IDX]:
+                     bad_boxes.append([b[X0_IDX], b[Y0_IDX], b[X1_IDX], b[Y1_IDX]])
+             break
+         else:  # one column was split off successfully
+             v_blocks.append([w_x0, new_boundry[1], w_x1, new_boundry[3], LAYOUT_V])
+             new_boundry[0] = w_x1  # update the boundary
+ 
+     """
+     Next, scan starting from the top-right box.
+     """
+     w_x0, w_x1 = 0, 0
+     unsplited_block = []
+     while True:
+         all_bboxes = get_bbox_in_boundry(bboxes, new_boundry)
+         if len(all_bboxes) == 0:
+             break
+         # First find the boxes with the largest x1,
+         bbox_list_sorted = sorted(all_bboxes, key=lambda bbox: bbox[X1_IDX], reverse=True)
+         # then among those pick the one with the smallest y0.
+         bigest_x1 = bbox_list_sorted[0][X1_IDX]
+         boxes_with_bigest_x1 = [bbox for bbox in bbox_list_sorted if bbox[X1_IDX] == bigest_x1]  # i.e. the rightmost ones
+         right_top_box = min(boxes_with_bigest_x1, key=lambda bbox: bbox[Y0_IDX])  # the one with the smallest y0
+         start_box = [right_top_box[X0_IDX], right_top_box[Y0_IDX], right_top_box[X1_IDX], right_top_box[Y1_IDX]]
+         w_x0, w_x1 = right_top_box[X0_IDX], right_top_box[X1_IDX]
+ 
+         while right_top_box is not None:
+             virtual_box = [w_x0, right_top_box[Y0_IDX], w_x1, right_top_box[Y1_IDX]]
+             right_top_box = find_bottom_bbox_direct_from_right_edge(virtual_box, all_bboxes)
+             if right_top_box:
+                 w_x0, w_x1 = min([w_x0, right_top_box[X0_IDX]]), max([w_x1, right_top_box[X1_IDX]])
+         # then scan upwards
+         start_box = [w_x0, start_box[Y0_IDX], w_x1, start_box[Y1_IDX]]  # widen it first for robustness
+         right_top_box = find_top_bbox_direct_from_right_edge(start_box, all_bboxes)
+         while right_top_box is not None:
+             virtual_box = [w_x0, right_top_box[Y0_IDX], w_x1, right_top_box[Y1_IDX]]
+             right_top_box = find_top_bbox_direct_from_right_edge(virtual_box, all_bboxes)
+             if right_top_box:
+                 w_x0, w_x1 = min([w_x0, right_top_box[X0_IDX]]), max([w_x1, right_top_box[X1_IDX]])
+ 
+         # Check for intersections with other boxes: if the vertical split line crosses
+         # some boxes, a clean vertical split is impossible.
+         if any([bbox[X0_IDX] <= w_x0 - 1 <= bbox[X1_IDX] for bbox in all_bboxes]):
+             unsplited_block.append([new_boundry[0], new_boundry[1], new_boundry[2], new_boundry[3], LAYOUT_UNPROC])
+             for b in all_bboxes:
+                 if b[X0_IDX] <= w_x0 - 1 <= b[X1_IDX]:
+                     bad_boxes.append([b[X0_IDX], b[Y0_IDX], b[X1_IDX], b[Y1_IDX]])
+             break
+         else:  # one column was split off successfully
+             v_blocks.append([w_x0, new_boundry[1], w_x1, new_boundry[3], LAYOUT_V])
+             new_boundry[2] = w_x0
+ 
+     """Convert the data structure."""
+     for block in v_blocks:
+         sorted_layout_blocks.append({
+             "layout_bbox": block[:4],
+             "layout_label": block[4],
+             "sub_layout": [],
+         })
+ 
+     for block in unsplited_block:
+         sorted_layout_blocks.append({
+             "layout_bbox": block[:4],
+             "layout_label": block[4],
+             "sub_layout": [],
+             "bad_boxes": bad_boxes  # record the boxes a split line cut through
+         })
+ 
+     # sort by x0
+     sorted_layout_blocks.sort(key=lambda x: x['layout_bbox'][0])
+     return sorted_layout_blocks
+ 
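
The core of the v2 algorithm is growing a virtual column window [w_x0, w_x1] while walking down box by box. A self-contained toy sketch of that loop, where the boxes and the next_below() helper are hypothetical stand-ins for the module's find_bottom_bbox_direct_from_left_edge():

    boxes = [(50, 0, 180, 20), (40, 30, 170, 50), (60, 60, 200, 80)]  # x0, y0, x1, y1

    def next_below(virtual, candidates):
        """Return the topmost box strictly below `virtual` that overlaps it in x."""
        below = [b for b in candidates
                 if b[1] >= virtual[3] and b[0] < virtual[2] and b[2] > virtual[0]]
        return min(below, key=lambda b: b[1]) if below else None

    cur = boxes[0]
    w_x0, w_x1 = cur[0], cur[2]
    while cur is not None:
        virtual = (w_x0, cur[1], w_x1, cur[3])
        cur = next_below(virtual, boxes)
        if cur:
            w_x0, w_x1 = min(w_x0, cur[0]), max(w_x1, cur[2])

    print(w_x0, w_x1)  # 40 200 -- the column window now covers all three boxes
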
+ 
+ def _try_horizontal_mult_column_split(bboxes: list, boundry: tuple) -> list:
+     """
+     Try a horizontal split; if nothing can be split off, return the area as a BAD_LAYOUT.
+     ------------------
+     |        |       |
+     ------------------
+     |    |      |    |   <- the scenario this function should split
+     ------------------
+     |        |       |
+     |        |       |
+     """
+     pass  # TODO: not implemented yet
+ 
+ 
+ def _vertical_split(bboxes: list, boundry: tuple) -> list:
+     """
+     Split into blocks along the vertical direction.
+     In this version, if a vertical split is impossible the area is returned as a BAD_LAYOUT.
+ 
+     --------------------------
+     |     |    |       |
+     |     |    |       |
+     |     |            |
+     |     |            |   <- a column like this is what this function splits off
+     |     |            |
+     |     |    |       |
+     |     |    |       |
+     -------------------------
+     """
+     sorted_layout_blocks = []  # the final return value
+ 
+     bound_x0, bound_y0, bound_x1, bound_y1 = boundry
+     all_bboxes = get_bbox_in_boundry(bboxes, boundry)
+     """
+     all_bboxes = fix_vertical_bbox_pos(all_bboxes)  # resolve vertical overlaps
+     all_bboxes = fix_hor_bbox_pos(all_bboxes)       # resolve horizontal overlaps
+ 
+     These two lines are disabled for now: formula and table detection is not yet
+     mature, so far too many text blocks would take part in the computation and the
+     time cost would be too high. What they do: when bboxes overlap each other, the
+     smaller box is shrunk to remove the overlap, which gives positive feedback to
+     layout splitting.
+     """
+ 
+     # all_bboxes = paper_bbox_sort(all_bboxes, abs(bound_x1 - bound_x0), abs(bound_y1 - bound_y0))  # rough pre-sort, based on direct occlusion
+     """
+     First extend, in the vertical direction, every bbox that occupies a column exclusively.
+     """
+     for bbox in all_bboxes:
+         top_nearest_bbox = find_all_top_bbox_direct(bbox, all_bboxes)  # non-extended line
+         bottom_nearest_bbox = find_all_bottom_bbox_direct(bbox, all_bboxes)
+         if top_nearest_bbox is None and bottom_nearest_bbox is None and not any([b[X0_IDX] < bbox[X1_IDX] < b[X1_IDX] or b[X0_IDX] < bbox[X0_IDX] < b[X1_IDX] for b in all_bboxes]):  # occupies a column exclusively and overlaps nothing else
+             bbox[X0_EXT_IDX] = bbox[X0_IDX]
+             bbox[Y0_EXT_IDX] = bound_y0
+             bbox[X1_EXT_IDX] = bbox[X1_IDX]
+             bbox[Y1_EXT_IDX] = bound_y1
+ 
+     """
+     The exclusive-column boxes have now been extended to the boundary. Use that
+     boundary condition to merge consecutive bboxes into groups, then merge all
+     vertically consecutive bboxes.
+     """
+     all_bboxes.sort(key=lambda x: x[X0_IDX])
+     # fix: do not merge the vertical columns into one row here, because the minimal
+     # blocks handed downstream must always be readable top-to-bottom without extra logic.
+     v_bboxes = []
+     for box in all_bboxes:
+         if box[Y0_EXT_IDX] == bound_y0 and box[Y1_EXT_IDX] == bound_y1:
+             v_bboxes.append(box)
+ 
+     """
+     Now v_bboxes holds all the candidates; compute a layout for each one and put
+     it back into sorted_layouts.
+     """
+     v_layouts = []
+     for vbox in v_bboxes:
+         # The layout_bbox is the smallest x0, y0 and the largest x1, y1.
+         x0, y0, x1, y1 = vbox[X0_EXT_IDX], vbox[Y0_EXT_IDX], vbox[X1_EXT_IDX], vbox[Y1_EXT_IDX]
+         v_layouts.append([x0, y0, x1, y1, LAYOUT_V])  # vertical layout
+ 
+     """
+     Next, use the x0/x1 of these consecutive vertical layout_bboxes to cut the
+     remaining area vertically into several parts.
+     """
+     v_split_lines = [bound_x0]
+     for gp in v_bboxes:
+         x0, x1 = gp[X0_IDX], gp[X1_IDX]
+         v_split_lines.append(x0)
+         v_split_lines.append(x1)
+     v_split_lines.append(bound_x1)
+ 
+     unsplited_bboxes = []
+     for i in range(0, len(v_split_lines), 2):
+         start_x0, start_x1 = v_split_lines[i:i + 2]
+         # Find the other bboxes inside [start_x0, start_x1]; together they form one unsplit block.
+         bboxes_in_block = [bbox for bbox in all_bboxes if bbox[X0_IDX] >= start_x0 and bbox[X1_IDX] <= start_x1]
+         unsplited_bboxes.append(bboxes_in_block)
+     # Then add the unprocessed parts to v_layouts.
+     for bboxes_in_block in unsplited_bboxes:
+         if len(bboxes_in_block) == 0:
+             continue
+         x0, y0, x1, y1 = min([bbox[X0_IDX] for bbox in bboxes_in_block]), bound_y0, max([bbox[X1_IDX] for bbox in bboxes_in_block]), bound_y1
+         v_layouts.append([x0, y0, x1, y1, LAYOUT_UNPROC])  # no reliable layout could be derived for this area
+ 
+     v_layouts.sort(key=lambda x: x[0])  # sort by x0, i.e. left-to-right order
+ 
+     for layout in v_layouts:
+         sorted_layout_blocks.append({
+             "layout_bbox": layout[:4],
+             "layout_label": layout[4],
+             "sub_layout": [],
+         })
+ 
+     """
+     At this point the vertical pass produced two kinds of layouts: exclusive single
+     columns and unprocessed areas. Now split the unprocessed ones vertically; this
+     split should yield stacked layouts shaped like the character "吕".
+     """
+     for i, layout in enumerate(sorted_layout_blocks):
+         if layout['layout_label'] == LAYOUT_UNPROC:
+             x0, y0, x1, y1 = layout['layout_bbox']
+             v_split_layouts = _vertical_align_split_v2(bboxes, [x0, y0, x1, y1])
+             sorted_layout_blocks[i] = {
+                 "layout_bbox": [x0, y0, x1, y1],
+                 "layout_label": LAYOUT_H,
+                 "sub_layout": v_split_layouts
+             }
+             layout['layout_label'] = LAYOUT_H  # split by vertical lines into a horizontal layout
+ 
+     return sorted_layout_blocks
+ 
+ 
+ def split_layout(bboxes: list, boundry: tuple, page_num: int) -> list:
+     """
+     Split the bboxes into layouts.
+     return:
+     [
+         {
+             "layout_bbox": [x0, y0, x1, y1],
+             "layout_label": "u|v|h|b",  # unprocessed | vertical | horizontal | BAD_LAYOUT
+             "sub_layout": []  # each element is [x0, y0, x1, y1, block_content, idx_x, idx_y, content_type, ext_x0, ext_y0, ext_x1, ext_y1], in reading order
+         }
+     ]
+     example:
+     [
+         {
+             "layout_bbox": [0, 0, 100, 100],
+             "layout_label": "u|v|h|b",
+             "sub_layout": []
+         },
+         {
+             "layout_bbox": [0, 0, 100, 100],
+             "layout_label": "u|v|h|b",
+             "sub_layout": [
+                 {
+                     "layout_bbox": [0, 0, 100, 100],
+                     "layout_label": "u|v|h|b",
+                     "content_bboxes": [
+                         [],
+                         [],
+                         []
+                     ]
+                 },
+                 {
+                     "layout_bbox": [0, 0, 100, 100],
+                     "layout_label": "u|v|h|b",
+                     "sub_layout": []
+                 }
+             ]
+         }
+     ]
+     """
+     sorted_layouts = []  # the final result
+ 
+     boundry_x0, boundry_y0, boundry_x1, boundry_y1 = boundry
+     if len(bboxes) <= 1:
+         return [
+             {
+                 "layout_bbox": [boundry_x0, boundry_y0, boundry_x1, boundry_y1],
+                 "layout_label": LAYOUT_V,
+                 "sub_layout": []
+             }
+         ]
+ 
+     """
+     Split horizontally first, then vertically.
+     """
+     bboxes = paper_bbox_sort(bboxes, boundry_x1 - boundry_x0, boundry_y1 - boundry_y0)
+     sorted_layouts = _horizontal_split(bboxes, boundry)  # layouts produced by the horizontal split
+     for i, layout in enumerate(sorted_layouts):
+         x0, y0, x1, y1 = layout['layout_bbox']
+         layout_type = layout['layout_label']
+         if layout_type == LAYOUT_UNPROC:  # not an exclusive single row, so a vertical split is needed
+             v_split_layouts = _vertical_split(bboxes, [x0, y0, x1, y1])
+ 
+             """
+             One logical caveat: if this call produced only a single column layout, the
+             split is beyond what the algorithm can handle, because we assume the incoming
+             boxes already had all full rows stripped off, so more than one column must
+             come back here. If only one layout is returned and it holds multiple boxes,
+             that layout cannot be split and is marked LAYOUT_UNPROC.
+             """
+             layout_label = LAYOUT_V
+             if len(v_split_layouts) == 1:
+                 if len(v_split_layouts[0]['sub_layout']) == 0:
+                     layout_label = LAYOUT_UNPROC
+                     # logger.warning(f"WARNING: pageno={page_num}, unsplittable layout: ", v_split_layouts)
+ 
+             """
+             Assemble the final layout.
+             """
+             sorted_layouts[i] = {
+                 "layout_bbox": [x0, y0, x1, y1],
+                 "layout_label": layout_label,
+                 "sub_layout": v_split_layouts
+             }
+             layout['layout_label'] = LAYOUT_H
+ 
+     """
+     The horizontal and vertical passes are both done, yet some areas may remain
+     unprocessed because neither direction could split them. As a last step they
+     should go through _try_horizontal_mult_column_split, a joint horizontal split
+     across multiple blocks; if that cannot split them either, they are finally
+     returned as BAD_LAYOUT.
+     """
+     # TODO
+ 
+     return sorted_layouts
+ 
+ 
+ def get_bboxes_layout(all_boxes: list, boundry: tuple, page_id: int):
+     """
+     Sort the boxes according to the layout ordering.
+     return:
+     [
+         {
+             "layout_bbox": [x0, y0, x1, y1],
+             "layout_label": "u|v|h|b",  # unprocessed | vertical | horizontal | BAD_LAYOUT
+         },
+     ]
+     """
+     def _preorder_traversal(layout):
+         """
+         Collect the leaf nodes of sorted_layouts (the nodes with len(sub_layout) == 0)
+         in preorder, i.e. top-to-bottom, left-to-right.
+         """
+         sorted_layout_blocks = []
+         for node in layout:
+             sub_layout = node['sub_layout']
+             if len(sub_layout) == 0:
+                 sorted_layout_blocks.append(node)
+             else:
+                 s = _preorder_traversal(sub_layout)
+                 sorted_layout_blocks.extend(s)
+         return sorted_layout_blocks
+     # -------------------------------------------------------------------------------------------------------------------------
+     sorted_layouts = split_layout(all_boxes, boundry, page_id)  # first split into layouts, producing a tree
+     total_sorted_layout_blocks = _preorder_traversal(sorted_layouts)
+     return total_sorted_layout_blocks, sorted_layouts
+ 
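
The tree returned by split_layout is flattened by a plain preorder walk, which is what puts the leaf layouts into reading order. A minimal sketch on a hand-built tree (labels follow the "u|v|h|b" convention from the docstrings above; the bbox values are made up):

    tree = [
        {"layout_bbox": [0, 0, 100, 30], "layout_label": "v", "sub_layout": []},
        {"layout_bbox": [0, 30, 100, 90], "layout_label": "h", "sub_layout": [
            {"layout_bbox": [0, 30, 50, 90], "layout_label": "v", "sub_layout": []},
            {"layout_bbox": [50, 30, 100, 90], "layout_label": "v", "sub_layout": []},
        ]},
    ]

    def preorder(nodes):
        leaves = []
        for node in nodes:
            if node["sub_layout"]:
                leaves.extend(preorder(node["sub_layout"]))
            else:
                leaves.append(node)
        return leaves

    print([leaf["layout_bbox"] for leaf in preorder(tree)])
    # [[0, 0, 100, 30], [0, 30, 50, 90], [50, 30, 100, 90]]
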
+ 
+ def get_columns_cnt_of_layout(layout_tree):
+     """
+     Get the column count of a layout.
+     """
+     max_width_list = [0]  # seed with one element so max()/min() cannot fail on an empty list
+ 
+     for items in layout_tree:  # count columns per (horizontal) band; a horizontal band counts as one column
+         layout_type = items['layout_label']
+         sub_layouts = items['sub_layout']
+         if len(sub_layouts) == 0:
+             max_width_list.append(1)
+         else:
+             if layout_type == LAYOUT_H:
+                 max_width_list.append(1)
+             else:
+                 width = 0
+                 for l in sub_layouts:
+                     if len(l['sub_layout']) == 0:
+                         width += 1
+                     else:
+                         for lay in l['sub_layout']:
+                             width += get_columns_cnt_of_layout([lay])
+                 max_width_list.append(width)
+ 
+     return max(max_width_list)
+ 
+ 
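
A quick hedged check of the column counter on a two-column tree. The import path comes from this commit; the bare "v" string stands in for the module's LAYOUT_V constant (the docstrings above use the "u|v|h|b" convention), and the package plus its dependencies are assumed to be importable:

    from magic_pdf.layout.layout_sort import get_columns_cnt_of_layout

    two_col_tree = [
        {"layout_label": "v", "sub_layout": [
            {"layout_label": "v", "sub_layout": []},
            {"layout_label": "v", "sub_layout": []},
        ]},
    ]
    print(get_columns_cnt_of_layout(two_col_tree))  # expected: 2
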
+ def sort_with_layout(bboxes: list, page_width, page_height) -> tuple:
+     """
+     Input: a list of bboxes.
+     First split the page into layouts, then sort the bboxes within them.
+     Returns the sorted bboxes.
+     """
+ 
+     new_bboxes = []
+     for box in bboxes:
+         # new_bboxes.append([box[0], box[1], box[2], box[3], None, None, None, 'text', None, None, None, None])
+         new_bboxes.append([box[0], box[1], box[2], box[3], None, None, None, 'text', None, None, None, None, box[4]])
+ 
+     layout_bboxes, _ = get_bboxes_layout(new_bboxes, [0, 0, page_width, page_height], 0)
+     if any([lay['layout_label'] == LAYOUT_UNPROC for lay in layout_bboxes]):
+         logger.warning("drop this pdf, reason: complex layout")
+         return None, None
+ 
+     sorted_bboxes = []
+     # Use each layout bbox to frame a set of boxes, then sort them.
+     for layout in layout_bboxes:
+         lbox = layout['layout_bbox']
+         bbox_in_layout = get_bbox_in_boundry(new_bboxes, lbox)
+         sorted_bbox = paper_bbox_sort(bbox_in_layout, lbox[2] - lbox[0], lbox[3] - lbox[1])
+         sorted_bboxes.extend(sorted_bbox)
+ 
+     return sorted_bboxes, layout_bboxes
+ 
+ 
+ def sort_text_block(text_block, layout_bboxes):
+     """
+     Sort the text blocks of one page.
+     """
+     sorted_text_bbox = []
+     all_text_bbox = []
+     # build a box => text mapping
+     box_to_text = {}
+     for blk in text_block:
+         box = blk['bbox']
+         box_to_text[(box[0], box[1], box[2], box[3])] = blk
+         all_text_bbox.append(box)
+ 
+     # text_blocks_to_sort = []
+     # for box in box_to_text.keys():
+     #     text_blocks_to_sort.append([box[0], box[1], box[2], box[3], None, None, None, 'text', None, None, None, None])
+ 
+     # Sort the text blocks following the order of layout_bboxes.
+     for layout in layout_bboxes:
+         layout_box = layout['layout_bbox']
+         text_bbox_in_layout = get_bbox_in_boundry(all_text_bbox, [layout_box[0] - 1, layout_box[1] - 1, layout_box[2] + 1, layout_box[3] + 1])
+         # sorted_bbox = paper_bbox_sort(text_bbox_in_layout, layout_box[2] - layout_box[0], layout_box[3] - layout_box[1])
+         text_bbox_in_layout.sort(key=lambda x: x[1])  # boxes inside one layout are sorted top-to-bottom by y0
+         # sorted_bbox = [[b] for b in text_blocks_to_sort]
+         for sb in text_bbox_in_layout:
+             sorted_text_bbox.append(box_to_text[(sb[0], sb[1], sb[2], sb[3])])
+ 
+     return sorted_text_bbox
magic_pdf/layout/layout_spiler_recog.py ADDED
@@ -0,0 +1,101 @@
+ """
+ Find the horizontal rules and color blocks that can split a layout.
+ """
+ 
+ import os
+ from magic_pdf.libs.commons import fitz
+ from magic_pdf.libs.boxbase import _is_in_or_part_overlap
+ 
+ 
+ def __rect_filter_by_width(rect, page_w, page_h):
+     # keep only rectangles that straddle the vertical midline of the page
+     mid_x = page_w / 2
+     if rect[0] < mid_x < rect[2]:
+         return True
+     return False
+ 
+ 
+ def __rect_filter_by_pos(rect, image_bboxes, table_bboxes):
+     """
+     The rectangle must not sit on a table or an image.
+     """
+     for box in image_bboxes:
+         if _is_in_or_part_overlap(rect, box):
+             return False
+ 
+     for box in table_bboxes:
+         if _is_in_or_part_overlap(rect, box):
+             return False
+ 
+     return True
+ 
+ 
+ def __debug_show_page(page, bboxes1: list, bboxes2: list, bboxes3: list):
+     save_path = "./tmp/debug.pdf"
+     if os.path.exists(save_path):
+         # delete the existing file
+         os.remove(save_path)
+     # create a new, empty PDF file
+     doc = fitz.open('')
+ 
+     width = page.rect.width
+     height = page.rect.height
+     new_page = doc.new_page(width=width, height=height)
+ 
+     for bbox in bboxes1:
+         # draw the original box
+         rect = fitz.Rect(*bbox[0:4])
+         shape = new_page.new_shape()
+         shape.draw_rect(rect)
+         shape.finish(color=fitz.pdfcolor['red'], fill=fitz.pdfcolor['blue'], fill_opacity=0.2)
+         shape.commit()
+ 
+     for bbox in bboxes2:
+         # draw the original box
+         rect = fitz.Rect(*bbox[0:4])
+         shape = new_page.new_shape()
+         shape.draw_rect(rect)
+         shape.finish(color=None, fill=fitz.pdfcolor['yellow'], fill_opacity=0.2)
+         shape.commit()
+ 
+     for bbox in bboxes3:
+         # draw the original box
+         rect = fitz.Rect(*bbox[0:4])
+         shape = new_page.new_shape()
+         shape.draw_rect(rect)
+         shape.finish(color=fitz.pdfcolor['red'], fill=None)
+         shape.commit()
+ 
+     parent_dir = os.path.dirname(save_path)
+     if not os.path.exists(parent_dir):
+         os.makedirs(parent_dir)
+ 
+     doc.save(save_path)
+     doc.close()
+ 
+ 
+ def get_spilter_of_page(page, image_bboxes, table_bboxes):
+     """
+     Collect the color blocks and horizontal rules of a page.
+     """
+     cdrawings = page.get_cdrawings()
+ 
+     spilter_bbox = []
+     for block in cdrawings:
+         fill = block.get('fill')
+         if fill and fill != (1.0, 1.0, 1.0):  # ignore missing and pure-white fills
+             rect = block['rect']
+             if __rect_filter_by_width(rect, page.rect.width, page.rect.height) and __rect_filter_by_pos(rect, image_bboxes, table_bboxes):
+                 spilter_bbox.append(list(rect))
+ 
+     """Filter and repair these boxes: some rectangles have zero or negative height,
+     which sends the layout computation into an infinite loop. Any non-positive
+     height is normalized to 1."""
+     for box in spilter_bbox:
+         if box[3] - box[1] <= 0:
+             box[3] = box[1] + 1
+ 
+     # __debug_show_page(page, spilter_bbox, [], [])
+ 
+     return spilter_bbox
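
The width filter only keeps fills that straddle the page's vertical midline, which is what a full-width divider does. A tiny hedged check with made-up rectangles:

    page_w = 600
    mid_x = page_w / 2
    rects = [
        (50, 100, 550, 102),   # full-width rule: crosses the midline -> kept
        (40, 200, 200, 220),   # sidebar block: entirely left of midline -> dropped
    ]
    kept = [r for r in rects if r[0] < mid_x < r[2]]
    print(kept)  # [(50, 100, 550, 102)]
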
magic_pdf/layout/mcol_sort.py ADDED
@@ -0,0 +1,336 @@
+ """
+ This is an advanced PyMuPDF utility for detecting multi-column pages.
+ It can be used in a shell script, or its main function can be imported and
+ invoked as described below.
+ 
+ Features
+ ---------
+ - Identify text belonging to (a variable number of) columns on the page.
+ - Text with different background color is handled separately, allowing for
+   easier treatment of side remarks, comment boxes, etc.
+ - Uses text block detection capability to identify text blocks and
+   uses the block bboxes as primary structuring principle.
+ - Supports ignoring footers via a footer margin parameter.
+ - Returns re-created text boundary boxes (integer coordinates), sorted ascending
+   by the top, then by the left coordinates.
+ 
+ Restrictions
+ -------------
+ - Only supporting horizontal, left-to-right text
+ - Returns a list of text boundary boxes - not the text itself. The caller is
+   expected to extract text from within the returned boxes.
+ - Text written above images is ignored altogether (option).
+ - This utility works as expected in most cases. The following situations cannot
+   be handled correctly:
+     * overlapping (non-disjoint) text blocks
+     * image captions are not recognized and are handled like normal text
+ 
+ Usage
+ ------
+ - As a CLI shell command use
+ 
+     python multi_column.py input.pdf footer_margin
+ 
+   Where footer_margin is the height of the bottom stripe to ignore on each page.
+   This code is intended to be modified according to your need.
+ 
+ - Use in a Python script as follows:
+ 
+   ----------------------------------------------------------------------------------
+   from multi_column import column_boxes
+ 
+   # for each page execute
+   bboxes = column_boxes(page, footer_margin=50, no_image_text=True)
+ 
+   # bboxes is a list of fitz.IRect objects, sorted ascending by their y0,
+   # then x0 coordinates. Their text content can be extracted by all PyMuPDF
+   # get_text() variants, for instance the following:
+   for rect in bboxes:
+       print(page.get_text(clip=rect, sort=True))
+   ----------------------------------------------------------------------------------
+ """
+ import sys
+ from magic_pdf.libs.commons import fitz
+ 
+ 
+ def column_boxes(page, footer_margin=50, header_margin=50, no_image_text=True):
+     """Determine bboxes which wrap a column."""
+     paths = page.get_drawings()
+     bboxes = []
+ 
+     # path rectangles
+     path_rects = []
+ 
+     # image bboxes
+     img_bboxes = []
+ 
+     # bboxes of non-horizontal text
+     # avoid when expanding horizontal text boxes
+     vert_bboxes = []
+ 
+     # compute relevant page area
+     clip = +page.rect
+     clip.y1 -= footer_margin  # Remove footer area
+     clip.y0 += header_margin  # Remove header area
+ 
+     def can_extend(temp, bb, bboxlist):
+         """Determines whether rectangle 'temp' can be extended by 'bb'
+         without intersecting any of the rectangles contained in 'bboxlist'.
+ 
+         Items of bboxlist may be None if they have been removed.
+ 
+         Returns:
+             True if 'temp' has no intersections with items of 'bboxlist'.
+         """
+         for b in bboxlist:
+             if not intersects_bboxes(temp, vert_bboxes) and (
+                 b is None or b == bb or (temp & b).is_empty
+             ):
+                 continue
+             return False
+ 
+         return True
+ 
+     def in_bbox(bb, bboxes):
+         """Return 1-based number if a bbox contains bb, else return 0."""
+         for i, bbox in enumerate(bboxes):
+             if bb in bbox:
+                 return i + 1
+         return 0
+ 
+     def intersects_bboxes(bb, bboxes):
+         """Return True if a bbox intersects bb, else return False."""
+         for bbox in bboxes:
+             if not (bb & bbox).is_empty:
+                 return True
+         return False
+ 
+     def extend_right(bboxes, width, path_bboxes, vert_bboxes, img_bboxes):
+         """Extend a bbox to the right page border.
+ 
+         Whenever there is no text to the right of a bbox, enlarge it up
+         to the right page border.
+ 
+         Args:
+             bboxes: (list[IRect]) bboxes to check
+             width: (int) page width
+             path_bboxes: (list[IRect]) bboxes with a background color
+             vert_bboxes: (list[IRect]) bboxes with vertical text
+             img_bboxes: (list[IRect]) bboxes of images
+         Returns:
+             Potentially modified bboxes.
+         """
+         for i, bb in enumerate(bboxes):
+             # do not extend text with background color
+             if in_bbox(bb, path_bboxes):
+                 continue
+ 
+             # do not extend text in images
+             if in_bbox(bb, img_bboxes):
+                 continue
+ 
+             # temp extends bb to the right page border
+             temp = +bb
+             temp.x1 = width
+ 
+             # do not cut through colored background or images
+             if intersects_bboxes(temp, path_bboxes + vert_bboxes + img_bboxes):
+                 continue
+ 
+             # also, do not intersect other text bboxes
+             check = can_extend(temp, bb, bboxes)
+             if check:
+                 bboxes[i] = temp  # replace with enlarged bbox
+ 
+         return [b for b in bboxes if b is not None]
+ 
+     def clean_nblocks(nblocks):
+         """Do some elementary cleaning."""
+ 
+         # 1. remove any duplicate blocks.
+         blen = len(nblocks)
+         if blen < 2:
+             return nblocks
+         start = blen - 1
+         for i in range(start, -1, -1):
+             bb1 = nblocks[i]
+             bb0 = nblocks[i - 1]
+             if bb0 == bb1:
+                 del nblocks[i]
+ 
+         # 2. repair sequence in special cases:
+         # consecutive bboxes with almost same bottom value are sorted ascending
+         # by x-coordinate.
+         y1 = nblocks[0].y1  # first bottom coordinate
+         i0 = 0  # its index
+         i1 = -1  # index of last bbox with same bottom
+ 
+         # Iterate over bboxes, identifying segments with approx. same bottom value.
+         # Replace every segment by its sorted version.
+         for i in range(1, len(nblocks)):
+             b1 = nblocks[i]
+             if abs(b1.y1 - y1) > 10:  # different bottom
+                 if i1 > i0:  # segment length > 1? Sort it!
+                     nblocks[i0 : i1 + 1] = sorted(
+                         nblocks[i0 : i1 + 1], key=lambda b: b.x0
+                     )
+                 y1 = b1.y1  # store new bottom value
+                 i0 = i  # store its start index
+             i1 = i  # store current index
+         if i1 > i0:  # segment waiting to be sorted
+             nblocks[i0 : i1 + 1] = sorted(nblocks[i0 : i1 + 1], key=lambda b: b.x0)
+         return nblocks
+ 
+     # extract vector graphics
+     for p in paths:
+         path_rects.append(p["rect"].irect)
+     path_bboxes = path_rects
+ 
+     # sort path bboxes by ascending top, then left coordinates
+     path_bboxes.sort(key=lambda b: (b.y0, b.x0))
+ 
+     # bboxes of images on page, no need to sort them
+     for item in page.get_images():
+         img_bboxes.extend(page.get_image_rects(item[0]))
+ 
+     # blocks of text on page
+     blocks = page.get_text(
+         "dict",
+         flags=fitz.TEXTFLAGS_TEXT,
+         clip=clip,
+     )["blocks"]
+ 
+     # Make block rectangles, ignoring non-horizontal text
+     for b in blocks:
+         bbox = fitz.IRect(b["bbox"])  # bbox of the block
+ 
+         # ignore text written upon images
+         if no_image_text and in_bbox(bbox, img_bboxes):
+             continue
+ 
+         # confirm first line to be horizontal
+         line0 = b["lines"][0]  # get first line
+         if line0["dir"] != (1, 0):  # only accept horizontal text
+             vert_bboxes.append(bbox)
+             continue
+ 
+         srect = fitz.EMPTY_IRECT()
+         for line in b["lines"]:
+             lbbox = fitz.IRect(line["bbox"])
+             text = "".join([s["text"].strip() for s in line["spans"]])
+             if len(text) > 1:
+                 srect |= lbbox
+         bbox = +srect
+ 
+         if not bbox.is_empty:
+             bboxes.append(bbox)
+ 
+     # Sort text bboxes by ascending background, top, then left coordinates
+     bboxes.sort(key=lambda k: (in_bbox(k, path_bboxes), k.y0, k.x0))
+ 
+     # Extend bboxes to the right where possible
+     bboxes = extend_right(
+         bboxes, int(page.rect.width), path_bboxes, vert_bboxes, img_bboxes
+     )
+ 
+     # immediately return if no text found
+     if bboxes == []:
+         return []
+ 
+     # --------------------------------------------------------------------
+     # Join bboxes to establish some column structure
+     # --------------------------------------------------------------------
+     # the final block bboxes on page
+     nblocks = [bboxes[0]]  # pre-fill with first bbox
+     bboxes = bboxes[1:]  # remaining old bboxes
+ 
+     for i, bb in enumerate(bboxes):  # iterate old bboxes
+         check = False  # indicates unwanted joins
+ 
+         # check if bb can extend one of the new blocks
+         for j in range(len(nblocks)):
+             nbb = nblocks[j]  # a new block
+ 
+             # never join across columns
+             if bb is None or nbb.x1 < bb.x0 or bb.x1 < nbb.x0:
+                 continue
+ 
+             # never join across different background colors
+             if in_bbox(nbb, path_bboxes) != in_bbox(bb, path_bboxes):
+                 continue
+ 
+             temp = bb | nbb  # temporary extension of new block
+             check = can_extend(temp, nbb, nblocks)
+             if check:
+                 break
+ 
+         if not check:  # bb cannot be used to extend any of the new bboxes
+             nblocks.append(bb)  # so add it to the list
+             j = len(nblocks) - 1  # index of it
+             temp = nblocks[j]  # new bbox added
+ 
+         # check if some remaining bbox is contained in temp
+         check = can_extend(temp, bb, bboxes)
+         if not check:
+             nblocks.append(bb)
+         else:
+             nblocks[j] = temp
+         bboxes[i] = None
+ 
+     # do some elementary cleaning
+     nblocks = clean_nblocks(nblocks)
+ 
+     # return identified text bboxes
+     return nblocks
+ 
+ if __name__ == "__main__":
288
+ """Only for debugging purposes, currently.
289
+
290
+ Draw red borders around the returned text bboxes and insert
291
+ the bbox number.
292
+ Then save the file under the name "input-blocks.pdf".
293
+ """
294
+
295
+ # get the file name
296
+ filename = sys.argv[1]
297
+
298
+ # check if footer margin is given
299
+ if len(sys.argv) > 2:
300
+ footer_margin = int(sys.argv[2])
301
+ else: # use default vaue
302
+ footer_margin = 50
303
+
304
+ # check if header margin is given
305
+ if len(sys.argv) > 3:
306
+ header_margin = int(sys.argv[3])
307
+ else: # use default vaue
308
+ header_margin = 50
309
+
310
+ # open document
311
+ doc = fitz.open(filename)
312
+
313
+ # iterate over the pages
314
+ for page in doc:
315
+ # remove any geometry issues
316
+ page.wrap_contents()
317
+
318
+ # get the text bboxes
319
+ bboxes = column_boxes(page, footer_margin=footer_margin, header_margin=header_margin)
320
+
321
+ # prepare a canvas to draw rectangles and text
322
+ shape = page.new_shape()
323
+
324
+ # iterate over the bboxes
325
+ for i, rect in enumerate(bboxes):
326
+ shape.draw_rect(rect) # draw a border
327
+
328
+ # write sequence number
329
+ shape.insert_text(rect.tl + (5, 15), str(i), color=fitz.pdfcolor["red"])
330
+
331
+ # finish drawing / text with color red
332
+ shape.finish(color=fitz.pdfcolor["red"])
333
+ shape.commit() # store to the page
334
+
335
+ # save document with text bboxes
336
+ doc.ez_save(filename.replace(".pdf", "-blocks.pdf"))
magic_pdf/libs/Constants.py ADDED
@@ -0,0 +1,11 @@
+ """
+ Custom fields at the span level.
+ """
+ # whether the span is merged across pages
+ CROSS_PAGE = "cross_page"
+ 
+ """
+ Custom fields at the block level.
+ """
+ # whether lines in the block were deleted
+ LINES_DELETED = "lines_deleted"
magic_pdf/libs/MakeContentConfig.py ADDED
@@ -0,0 +1,10 @@
+ class MakeMode:
+     MM_MD = "mm_markdown"
+     NLP_MD = "nlp_markdown"
+     STANDARD_FORMAT = "standard_format"
+ 
+ 
+ class DropMode:
+     WHOLE_PDF = "whole_pdf"
+     SINGLE_PAGE = "single_page"
+     NONE = "none"
magic_pdf/libs/ModelBlockTypeEnum.py ADDED
@@ -0,0 +1,9 @@
+ from enum import Enum
+ 
+ 
+ class ModelBlockTypeEnum(Enum):
+     TITLE = 0
+     PLAIN_TEXT = 1
+     ABANDON = 2
+     ISOLATE_FORMULA = 8
+     EMBEDDING = 13
+     ISOLATED = 14
magic_pdf/libs/__init__.py ADDED
File without changes
magic_pdf/libs/boxbase.py ADDED
@@ -0,0 +1,408 @@
+ import math
+ 
+ 
+ def _is_in_or_part_overlap(box1, box2) -> bool:
+     """
+     Whether two bboxes partially overlap or one contains the other.
+     """
+     if box1 is None or box2 is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = box1
+     x0_2, y0_2, x1_2, y1_2 = box2
+ 
+     return not (x1_1 < x0_2 or  # box1 is left of box2
+                 x0_1 > x1_2 or  # box1 is right of box2
+                 y1_1 < y0_2 or  # box1 is above box2
+                 y0_1 > y1_2)    # box1 is below box2
+ 
+ 
+ def _is_in_or_part_overlap_with_area_ratio(box1, box2, area_ratio_threshold=0.6):
+     """
+     Whether box1 is inside box2, or box1 and box2 partially overlap with the
+     overlap covering more than area_ratio_threshold of box1's area.
+     """
+     if box1 is None or box2 is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = box1
+     x0_2, y0_2, x1_2, y1_2 = box2
+ 
+     if not _is_in_or_part_overlap(box1, box2):
+         return False
+ 
+     # compute the overlap area
+     x_left = max(x0_1, x0_2)
+     y_top = max(y0_1, y0_2)
+     x_right = min(x1_1, x1_2)
+     y_bottom = min(y1_1, y1_2)
+     overlap_area = (x_right - x_left) * (y_bottom - y_top)
+ 
+     # compute box1's area
+     box1_area = (x1_1 - x0_1) * (y1_1 - y0_1)
+ 
+     return overlap_area / box1_area > area_ratio_threshold
+ 
+ 
+ def _is_in(box1, box2) -> bool:
+     """
+     Whether box1 is fully inside box2.
+     """
+     x0_1, y0_1, x1_1, y1_1 = box1
+     x0_2, y0_2, x1_2, y1_2 = box2
+ 
+     return (x0_1 >= x0_2 and  # box1's left edge is not outside box2's left edge
+             y0_1 >= y0_2 and  # box1's top edge is not outside box2's top edge
+             x1_1 <= x1_2 and  # box1's right edge is not outside box2's right edge
+             y1_1 <= y1_2)     # box1's bottom edge is not outside box2's bottom edge
+ 
+ 
+ def _is_part_overlap(box1, box2) -> bool:
+     """
+     Whether two bboxes partially overlap without full containment.
+     """
+     if box1 is None or box2 is None:
+         return False
+ 
+     return _is_in_or_part_overlap(box1, box2) and not _is_in(box1, box2)
+ 
+ 
+ def _left_intersect(left_box, right_box):
+     """Check whether the boxes intersect at the left edge, i.e. whether left_box's
+     right edge falls inside right_box."""
+     if left_box is None or right_box is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = left_box
+     x0_2, y0_2, x1_2, y1_2 = right_box
+ 
+     return x1_1 > x0_2 and x0_1 < x0_2 and (y0_1 <= y0_2 <= y1_1 or y0_1 <= y1_2 <= y1_1)
+ 
+ 
+ def _right_intersect(left_box, right_box):
+     """
+     Check whether the boxes intersect at the right edge, i.e. whether left_box's
+     left edge falls inside right_box.
+     """
+     if left_box is None or right_box is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = left_box
+     x0_2, y0_2, x1_2, y1_2 = right_box
+ 
+     return x0_1 < x1_2 and x1_1 > x1_2 and (y0_1 <= y0_2 <= y1_1 or y0_1 <= y1_2 <= y1_1)
+ 
+ 
+ def _is_vertical_full_overlap(box1, box2, x_torlence=2):
+     """
+     In the x direction: either box1 contains box2 or box2 contains box1; partial containment does not count.
+     In the y direction: box1 and box2 overlap.
+     """
+     # unpack box coordinates
+     x11, y11, x12, y12 = box1  # top-left and bottom-right corners (x1, y1, x2, y2)
+     x21, y21, x22, y22 = box2
+ 
+     # along the x axis: does box1 contain box2, or box2 contain box1?
+     contains_in_x = (x11 - x_torlence <= x21 and x12 + x_torlence >= x22) or (x21 - x_torlence <= x11 and x22 + x_torlence >= x12)
+ 
+     # along the y axis: do box1 and box2 overlap?
+     overlap_in_y = not (y12 < y21 or y11 > y22)
+ 
+     return contains_in_x and overlap_in_y
+ 
+ 
+ def _is_bottom_full_overlap(box1, box2, y_tolerance=2):
+     """
+     Check whether box1's bottom slightly overlaps box2's top, with the amount of
+     overlap bounded by y_tolerance.
+     Unlike _is_vertical_full_overlap, this function allows box1 and box2 to overlap
+     only loosely in the x direction, i.e. it tolerates some fuzziness.
+     """
+     if box1 is None or box2 is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = box1
+     x0_2, y0_2, x1_2, y1_2 = box2
+     tolerance_margin = 2
+     is_xdir_full_overlap = ((x0_1 - tolerance_margin <= x0_2 <= x1_1 + tolerance_margin and x0_1 - tolerance_margin <= x1_2 <= x1_1 + tolerance_margin) or (x0_2 - tolerance_margin <= x0_1 <= x1_2 + tolerance_margin and x0_2 - tolerance_margin <= x1_1 <= x1_2 + tolerance_margin))
+ 
+     return y0_2 < y1_1 and 0 < (y1_1 - y0_2) < y_tolerance and is_xdir_full_overlap
+ 
+ 
+ def _is_left_overlap(box1, box2):
+     """
+     Check whether box1's left side overlaps box2.
+     In the y direction the overlap may be partial or complete, regardless of which
+     box sits above the other. In the x direction, box2's left edge must fall inside box1.
+     """
+     def __overlap_y(Ay1, Ay2, By1, By2):
+         return max(0, min(Ay2, By2) - max(Ay1, By1))
+ 
+     if box1 is None or box2 is None:
+         return False
+ 
+     x0_1, y0_1, x1_1, y1_1 = box1
+     x0_2, y0_2, x1_2, y1_2 = box2
+ 
+     y_overlap_len = __overlap_y(y0_1, y1_1, y0_2, y1_2)
+     ratio_1 = 1.0 * y_overlap_len / (y1_1 - y0_1) if y1_1 - y0_1 != 0 else 0
+     ratio_2 = 1.0 * y_overlap_len / (y1_2 - y0_2) if y1_2 - y0_2 != 0 else 0
+     vertical_overlap_cond = ratio_1 >= 0.5 or ratio_2 >= 0.5
+ 
+     # vertical_overlap_cond = y0_1 <= y0_2 <= y1_1 or y0_1 <= y1_2 <= y1_1 or y0_2 <= y0_1 <= y1_2 or y0_2 <= y1_1 <= y1_2
+     return x0_1 <= x0_2 <= x1_1 and vertical_overlap_cond
+ 
+ 
+ def __is_overlaps_y_exceeds_threshold(bbox1, bbox2, overlap_ratio_threshold=0.8):
+     """Check whether two bboxes overlap on the y axis and the overlap height exceeds
+     overlap_ratio_threshold (80% by default) of the shorter bbox's height."""
+     _, y0_1, _, y1_1 = bbox1
+     _, y0_2, _, y1_2 = bbox2
+ 
+     overlap = max(0, min(y1_1, y1_2) - max(y0_1, y0_2))
+     height1, height2 = y1_1 - y0_1, y1_2 - y0_2
+     min_height = min(height1, height2)
+ 
+     return (overlap / min_height) > overlap_ratio_threshold
+ 
+ 
+ def calculate_iou(bbox1, bbox2):
+     """
+     Compute the intersection over union (IoU) of two bounding boxes.
+ 
+     Args:
+         bbox1 (list[float]): coordinates of the first box as [x1, y1, x2, y2],
+             where (x1, y1) is the top-left and (x2, y2) the bottom-right corner.
+         bbox2 (list[float]): coordinates of the second box, same format as `bbox1`.
+ 
+     Returns:
+         float: the IoU of the two boxes, in the range [0, 1].
+     """
+     # Determine the coordinates of the intersection rectangle
+     x_left = max(bbox1[0], bbox2[0])
+     y_top = max(bbox1[1], bbox2[1])
+     x_right = min(bbox1[2], bbox2[2])
+     y_bottom = min(bbox1[3], bbox2[3])
+ 
+     if x_right < x_left or y_bottom < y_top:
+         return 0.0
+ 
+     # The area of overlap area
+     intersection_area = (x_right - x_left) * (y_bottom - y_top)
+ 
+     # The area of both rectangles
+     bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
+     bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
+ 
+     # Compute the intersection over union by taking the intersection area
+     # and dividing it by the sum of both areas minus the intersection area
+     iou = intersection_area / float(bbox1_area + bbox2_area - intersection_area)
+     return iou
+ 
+ 
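
A quick numeric sanity check of calculate_iou; the import path comes from this commit, and the box values are made up:

    from magic_pdf.libs.boxbase import calculate_iou

    # Two 2x2 squares overlapping on a 1x1 patch:
    # intersection = 1, union = 4 + 4 - 1 = 7, so IoU = 1/7.
    print(calculate_iou([0, 0, 2, 2], [1, 1, 3, 3]))  # ~0.142857
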
+ def calculate_overlap_area_2_minbox_area_ratio(bbox1, bbox2):
+     """
+     Compute the ratio of the overlap area of box1 and box2 to the area of the smaller box.
+     """
+     # Determine the coordinates of the intersection rectangle
+     x_left = max(bbox1[0], bbox2[0])
+     y_top = max(bbox1[1], bbox2[1])
+     x_right = min(bbox1[2], bbox2[2])
+     y_bottom = min(bbox1[3], bbox2[3])
+ 
+     if x_right < x_left or y_bottom < y_top:
+         return 0.0
+ 
+     # The area of overlap area
+     intersection_area = (x_right - x_left) * (y_bottom - y_top)
+     min_box_area = min([(bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1]), (bbox2[3] - bbox2[1]) * (bbox2[2] - bbox2[0])])
+     if min_box_area == 0:
+         return 0
+     else:
+         return intersection_area / min_box_area
+ 
+ 
+ def calculate_overlap_area_in_bbox1_area_ratio(bbox1, bbox2):
+     """
+     Compute the ratio of the overlap area of box1 and box2 to bbox1's area.
+     """
+     # Determine the coordinates of the intersection rectangle
+     x_left = max(bbox1[0], bbox2[0])
+     y_top = max(bbox1[1], bbox2[1])
+     x_right = min(bbox1[2], bbox2[2])
+     y_bottom = min(bbox1[3], bbox2[3])
+ 
+     if x_right < x_left or y_bottom < y_top:
+         return 0.0
+ 
+     # The area of overlap area
+     intersection_area = (x_right - x_left) * (y_bottom - y_top)
+     bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
+     if bbox1_area == 0:
+         return 0
+     else:
+         return intersection_area / bbox1_area
+ 
+ 
+ def get_minbox_if_overlap_by_ratio(bbox1, bbox2, ratio):
+     """
+     Use calculate_overlap_area_2_minbox_area_ratio to compute the share of the
+     smaller box covered by the overlap of the two bboxes.
+     If the share exceeds ratio, return the smaller bbox, otherwise return None.
+     """
+     x1_min, y1_min, x1_max, y1_max = bbox1
+     x2_min, y2_min, x2_max, y2_max = bbox2
+     area1 = (x1_max - x1_min) * (y1_max - y1_min)
+     area2 = (x2_max - x2_min) * (y2_max - y2_min)
+     overlap_ratio = calculate_overlap_area_2_minbox_area_ratio(bbox1, bbox2)
+     if overlap_ratio > ratio:
+         if area1 <= area2:
+             return bbox1
+         else:
+             return bbox2
+     else:
+         return None
+ 
+ 
+ def get_bbox_in_boundry(bboxes: list, boundry: tuple) -> list:
+     x0, y0, x1, y1 = boundry
+     new_boxes = [box for box in bboxes if box[0] >= x0 and box[1] >= y0 and box[2] <= x1 and box[3] <= y1]
+     return new_boxes
+ 
+ 
+ def is_vbox_on_side(bbox, width, height, side_threshold=0.2):
+     """
+     Whether a bbox sits at the edge of the pdf page.
+     """
+     x0, x1 = bbox[0], bbox[2]
+     if x1 <= width * side_threshold or x0 >= width * (1 - side_threshold):
+         return True
+     return False
+ 
+ 
+ def find_top_nearest_text_bbox(pymu_blocks, obj_bbox):
+     tolerance_margin = 4
+     top_boxes = [box for box in pymu_blocks if obj_bbox[1] - box['bbox'][3] >= -tolerance_margin and not _is_in(box['bbox'], obj_bbox)]
+     # keep only the ones overlapping in the x direction
+     top_boxes = [box for box in top_boxes if any([obj_bbox[0] - tolerance_margin <= box['bbox'][0] <= obj_bbox[2] + tolerance_margin,
+                                                   obj_bbox[0] - tolerance_margin <= box['bbox'][2] <= obj_bbox[2] + tolerance_margin,
+                                                   box['bbox'][0] - tolerance_margin <= obj_bbox[0] <= box['bbox'][2] + tolerance_margin,
+                                                   box['bbox'][0] - tolerance_margin <= obj_bbox[2] <= box['bbox'][2] + tolerance_margin
+                                                   ])]
+ 
+     # then pick the one with the largest y1
+     if len(top_boxes) > 0:
+         top_boxes.sort(key=lambda x: x['bbox'][3], reverse=True)
+         return top_boxes[0]
+     else:
+         return None
+ 
+ 
+ def find_bottom_nearest_text_bbox(pymu_blocks, obj_bbox):
+     bottom_boxes = [box for box in pymu_blocks if box['bbox'][1] - obj_bbox[3] >= -2 and not _is_in(box['bbox'], obj_bbox)]
+     # keep only the ones overlapping in the x direction
+     bottom_boxes = [box for box in bottom_boxes if any([obj_bbox[0] - 2 <= box['bbox'][0] <= obj_bbox[2] + 2,
+                                                         obj_bbox[0] - 2 <= box['bbox'][2] <= obj_bbox[2] + 2,
+                                                         box['bbox'][0] - 2 <= obj_bbox[0] <= box['bbox'][2] + 2,
+                                                         box['bbox'][0] - 2 <= obj_bbox[2] <= box['bbox'][2] + 2
+                                                         ])]
+ 
+     # then pick the one with the smallest y0
+     if len(bottom_boxes) > 0:
+         bottom_boxes.sort(key=lambda x: x['bbox'][1], reverse=False)
+         return bottom_boxes[0]
+     else:
+         return None
+ 
+ 
+ def find_left_nearest_text_bbox(pymu_blocks, obj_bbox):
+     """
+     Find the nearest text block on the left.
+     """
+     left_boxes = [box for box in pymu_blocks if obj_bbox[0] - box['bbox'][2] >= -2 and not _is_in(box['bbox'], obj_bbox)]
+     # keep only the ones overlapping in the y direction
+     left_boxes = [box for box in left_boxes if any([obj_bbox[1] - 2 <= box['bbox'][1] <= obj_bbox[3] + 2,
+                                                     obj_bbox[1] - 2 <= box['bbox'][3] <= obj_bbox[3] + 2,
+                                                     box['bbox'][1] - 2 <= obj_bbox[1] <= box['bbox'][3] + 2,
+                                                     box['bbox'][1] - 2 <= obj_bbox[3] <= box['bbox'][3] + 2
+                                                     ])]
+ 
+     # then pick the one with the largest x1
+     if len(left_boxes) > 0:
+         left_boxes.sort(key=lambda x: x['bbox'][2], reverse=True)
+         return left_boxes[0]
+     else:
+         return None
+ 
+ 
+ def find_right_nearest_text_bbox(pymu_blocks, obj_bbox):
+     """
+     Find the nearest text block on the right.
+     """
+     right_boxes = [box for box in pymu_blocks if box['bbox'][0] - obj_bbox[2] >= -2 and not _is_in(box['bbox'], obj_bbox)]
+     # keep only the ones overlapping in the y direction
+     right_boxes = [box for box in right_boxes if any([obj_bbox[1] - 2 <= box['bbox'][1] <= obj_bbox[3] + 2,
+                                                       obj_bbox[1] - 2 <= box['bbox'][3] <= obj_bbox[3] + 2,
+                                                       box['bbox'][1] - 2 <= obj_bbox[1] <= box['bbox'][3] + 2,
+                                                       box['bbox'][1] - 2 <= obj_bbox[3] <= box['bbox'][3] + 2
+                                                       ])]
+ 
+     # then pick the one with the smallest x0
+     if len(right_boxes) > 0:
+         right_boxes.sort(key=lambda x: x['bbox'][0], reverse=False)
+         return right_boxes[0]
+     else:
+         return None
+ 
+ 
+ def bbox_relative_pos(bbox1, bbox2):
+     """
+     Determine the relative position of two rectangles.
+ 
+     Args:
+         bbox1: a 4-tuple (x1, y1, x1b, y1b) with the top-left and bottom-right corners of the first rectangle
+         bbox2: a 4-tuple (x2, y2, x2b, y2b) with the top-left and bottom-right corners of the second rectangle
+ 
+     Returns:
+         A 4-tuple of booleans (left, right, bottom, top), where (in top-origin
+         coordinates) left means bbox2's right edge lies left of bbox1's left edge,
+         right means bbox1's right edge lies left of bbox2's left edge, bottom means
+         bbox2's bottom edge lies above bbox1's top edge, and top means bbox1's
+         bottom edge lies above bbox2's top edge.
+     """
+     x1, y1, x1b, y1b = bbox1
+     x2, y2, x2b, y2b = bbox2
+ 
+     left = x2b < x1
+     right = x1b < x2
+     bottom = y2b < y1
+     top = y1b < y2
+     return left, right, bottom, top
+ 
+ 
+ def bbox_distance(bbox1, bbox2):
+     """
+     Compute the distance between two rectangles.
+ 
+     Args:
+         bbox1 (tuple): coordinates of the first rectangle as (x1, y1, x2, y2), top-left and bottom-right corners.
+         bbox2 (tuple): coordinates of the second rectangle, same format.
+ 
+     Returns:
+         float: the distance between the rectangles.
+     """
+     def dist(point1, point2):
+         return math.sqrt((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2)
+ 
+     x1, y1, x1b, y1b = bbox1
+     x2, y2, x2b, y2b = bbox2
+ 
+     left, right, bottom, top = bbox_relative_pos(bbox1, bbox2)
+ 
+     if top and left:
+         return dist((x1, y1b), (x2b, y2))
+     elif left and bottom:
+         return dist((x1, y1), (x2b, y2b))
+     elif bottom and right:
+         return dist((x1b, y1), (x2, y2b))
+     elif right and top:
+         return dist((x1b, y1b), (x2, y2))
+     elif left:
+         return x1 - x2b
+     elif right:
+         return x2 - x1b
+     elif bottom:
+         return y1 - y2b
+     elif top:
+         return y2 - y1b
+     else:  # rectangles intersect
+         return 0
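
A quick hedged check of bbox_distance on two horizontally disjoint boxes; the import path comes from this commit and the values are made up:

    from magic_pdf.libs.boxbase import bbox_distance

    # bbox2 starts 2 units to the right of bbox1's right edge, same y-range,
    # so the distance is the pure horizontal gap.
    print(bbox_distance((0, 0, 1, 1), (3, 0, 4, 1)))  # 2
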
magic_pdf/libs/calc_span_stats.py ADDED
@@ -0,0 +1,239 @@
+ import os
+ import csv
+ import json
+ import pandas as pd
+ from matplotlib import pyplot as plt
+ from termcolor import cprint
+ 
+ """
+ Execute this script in the following way:
+ 
+ 1. Make sure there are pdf_dic.json files under the directory code-clean/tmp/unittest/md/, such as the following:
+ 
+    code-clean/tmp/unittest/md/scihub/scihub_00500000/libgen.scimag00527000-00527999.zip_10.1002/app.25178/pdf_dic.json
+ 
+ 2. Under the directory code-clean, execute the following command:
+ 
+    $ python -m libs.calc_span_stats
+ 
+ """
+ 
+ 
+ def print_green_on_red(text):
+     cprint(text, "green", "on_red", attrs=["bold"], end="\n\n")
+ 
+ 
+ def print_green(text):
+     print()
+     cprint(text, "green", attrs=["bold"], end="\n\n")
+ 
+ 
+ def print_red(text):
+     print()
+     cprint(text, "red", attrs=["bold"], end="\n\n")
+ 
+ 
+ def safe_get(dict_obj, key, default):
+     val = dict_obj.get(key)
+     if val is None:
+         return default
+     else:
+         return val
+ 
+ 
+ class SpanStatsCalc:
+     """Calculate statistics of spans."""
+ 
+     def draw_charts(self, span_stats: pd.DataFrame, fig_num: int, save_path: str):
+         """Draw multiple figures in one figure."""
+         # make a canvas
+         fig = plt.figure(fig_num, figsize=(20, 20))
+ 
+         pass  # TODO: not implemented yet
+ 
+     def calc_stats_per_dict(self, pdf_dict) -> pd.DataFrame:
+         """Calculate statistics per pdf_dict."""
+         span_stats = []
+         span_id = 0
+         for page_id, blocks in pdf_dict.items():
+             if page_id.startswith("page_"):
+                 if "para_blocks" in blocks.keys():
+                     for para_block in blocks["para_blocks"]:
+                         for line in para_block["lines"]:
+                             for span in line["spans"]:
+                                 span_text = safe_get(span, "text", "")
+                                 span_font_name = safe_get(span, "font", "")
+                                 span_font_size = safe_get(span, "size", 0)
+                                 span_font_color = safe_get(span, "color", "")
+                                 span_font_flags = safe_get(span, "flags", 0)
+ 
+                                 span_font_flags_decoded = safe_get(span, "decomposed_flags", {})
+                                 span_is_super_script = safe_get(span_font_flags_decoded, "is_superscript", False)
+                                 span_is_italic = safe_get(span_font_flags_decoded, "is_italic", False)
+                                 span_is_serifed = safe_get(span_font_flags_decoded, "is_serifed", False)
+                                 span_is_sans_serifed = safe_get(span_font_flags_decoded, "is_sans_serifed", False)
+                                 span_is_monospaced = safe_get(span_font_flags_decoded, "is_monospaced", False)
+                                 span_is_proportional = safe_get(span_font_flags_decoded, "is_proportional", False)
+                                 span_is_bold = safe_get(span_font_flags_decoded, "is_bold", False)
+ 
+                                 span_stats.append(
+                                     {
+                                         "span_id": span_id,  # id of span
+                                         "page_id": page_id,  # page number of pdf
+                                         "span_text": span_text,  # text of span
+                                         "span_font_name": span_font_name,  # font name of span
+                                         "span_font_size": span_font_size,  # font size of span
+                                         "span_font_color": span_font_color,  # font color of span
+                                         "span_font_flags": span_font_flags,  # font flags of span
+                                         "span_is_superscript": int(span_is_super_script),  # whether the span is superscript
+                                         "span_is_italic": int(span_is_italic),  # whether the span is italic
+                                         "span_is_serifed": int(span_is_serifed),  # whether the span is serifed
+                                         "span_is_sans_serifed": int(span_is_sans_serifed),  # whether the span is sans-serifed
+                                         "span_is_monospaced": int(span_is_monospaced),  # whether the span is monospaced
+                                         "span_is_proportional": int(span_is_proportional),  # whether the span is proportional
+                                         "span_is_bold": int(span_is_bold),  # whether the span is bold
+                                     }
+                                 )
+ 
+                                 span_id += 1
+ 
+         span_stats = pd.DataFrame(span_stats)
+         # print(span_stats)
+ 
+         return span_stats
+ 
+ 
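
A hedged sketch of the pdf_dict shape that calc_stats_per_dict expects, reconstructed from the keys it reads; all field values below are hypothetical:

    from magic_pdf.libs.calc_span_stats import SpanStatsCalc

    pdf_dict = {
        "page_0": {
            "para_blocks": [
                {"lines": [
                    {"spans": [
                        {"text": "Hello", "font": "Times", "size": 10.5,
                         "color": "#000000", "flags": 4,
                         "decomposed_flags": {"is_superscript": False, "is_bold": True}},
                    ]},
                ]},
            ],
        },
    }
    df = SpanStatsCalc().calc_stats_per_dict(pdf_dict)
    print(df[["span_id", "span_text", "span_is_bold"]])
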
+ def __find_pdf_dic_files(
+     jf_name="pdf_dic.json",
+     base_code_name="code-clean",
+     tgt_base_dir_name="tmp",
+     unittest_dir_name="unittest",
+     md_dir_name="md",
+     book_names=[
+         "scihub",
+     ],  # other possible values: "zlib", "arxiv" and so on
+ ):
+     pdf_dict_files = []
+ 
+     curr_dir = os.path.dirname(__file__)
+ 
+     for i in range(len(curr_dir)):
+         if curr_dir[i : i + len(base_code_name)] == base_code_name:
+             base_code_dir_name = curr_dir[: i + len(base_code_name)]
+             for book_name in book_names:
+                 search_dir_relative_name = os.path.join(tgt_base_dir_name, unittest_dir_name, md_dir_name, book_name)
+                 if os.path.exists(base_code_dir_name):
+                     search_dir_name = os.path.join(base_code_dir_name, search_dir_relative_name)
+                     for root, dirs, files in os.walk(search_dir_name):
+                         for file in files:
+                             if file == jf_name:
+                                 pdf_dict_files.append(os.path.join(root, file))
+             break
+ 
+     return pdf_dict_files
+ 
+ 
+ def combine_span_texts(group_df, span_stats):
+     combined_span_texts = []
+     for _, row in group_df.iterrows():
+         curr_span_id = row.name
+         curr_span_text = row["span_text"]
+ 
+         pre_span_id = curr_span_id - 1
+         pre_span_text = span_stats.at[pre_span_id, "span_text"] if pre_span_id in span_stats.index else ""
+ 
+         next_span_id = curr_span_id + 1
+         next_span_text = span_stats.at[next_span_id, "span_text"] if next_span_id in span_stats.index else ""
+ 
+         # pointer_sign marks the previous, current and next span text of each match
+         pointer_sign = "→ → → "
+         combined_text = "\n".join([pointer_sign + pre_span_text, pointer_sign + curr_span_text, pointer_sign + next_span_text])
+         combined_span_texts.append(combined_text)
+ 
+     return "\n\n".join(combined_span_texts)
+ 
+ 
+ # pd.set_option("display.max_colwidth", None)  # set to None to show the full text
+ pd.set_option("display.max_rows", None)  # set to None to show more rows
+ 
+ 
+ def main():
+     pdf_dict_files = __find_pdf_dic_files()
+     # print(pdf_dict_files)
+ 
+     span_stats_calc = SpanStatsCalc()
+ 
+     for pdf_dict_file in pdf_dict_files:
+         print("-" * 100)
+         print_green_on_red(f"Processing {pdf_dict_file}")
+ 
+         with open(pdf_dict_file, "r", encoding="utf-8") as f:
+             pdf_dict = json.load(f)
+ 
+         raw_df = span_stats_calc.calc_stats_per_dict(pdf_dict)
+         save_path = pdf_dict_file.replace("pdf_dic.json", "span_stats_raw.csv")
+         raw_df.to_csv(save_path, index=False)
+ 
+         filtered_df = raw_df[raw_df["span_is_superscript"] == 1]
+         if filtered_df.empty:
+             print("No superscript span found!")
+             continue
+ 
+         filtered_grouped_df = filtered_df.groupby(["span_font_name", "span_font_size", "span_font_color"])
+ 
+         combined_span_texts = filtered_grouped_df.apply(combine_span_texts, span_stats=raw_df)  # type: ignore
+ 
+         final_df = filtered_grouped_df.size().reset_index(name="count")
+         final_df["span_texts"] = combined_span_texts.reset_index(level=[0, 1, 2], drop=True)
+ 
+         print(final_df)
+ 
+         final_df["span_texts"] = final_df["span_texts"].apply(lambda x: x.replace("\n", "\r\n"))
+ 
+         save_path = pdf_dict_file.replace("pdf_dic.json", "span_stats_final.csv")
+         # Use UTF-8 with BOM and quote every field.
+         final_df.to_csv(save_path, index=False, encoding="utf-8-sig", quoting=csv.QUOTE_ALL)
+ 
+         # create a 2x2 chart layout
+         fig, axs = plt.subplots(2, 2, figsize=(15, 10))
+ 
+         # plot grouped by span_font_name
+         final_df.groupby("span_font_name")["count"].sum().plot(kind="bar", ax=axs[0, 0], title="By Font Name")
+ 
+         # plot grouped by span_font_size
+         final_df.groupby("span_font_size")["count"].sum().plot(kind="bar", ax=axs[0, 1], title="By Font Size")
+ 
+         # plot grouped by span_font_color
+         final_df.groupby("span_font_color")["count"].sum().plot(kind="bar", ax=axs[1, 0], title="By Font Color")
+ 
+         # plot grouped by span_font_name, span_font_size and span_font_color together
+         grouped = final_df.groupby(["span_font_name", "span_font_size", "span_font_color"])
+         grouped["count"].sum().unstack().plot(kind="bar", ax=axs[1, 1], title="Combined Grouping")
+ 
+         # adjust the layout
+         plt.tight_layout()
+ 
+         # show the charts
+         # plt.show()
+ 
+         # save the charts to a PNG file
+         save_path = pdf_dict_file.replace("pdf_dic.json", "span_stats_combined.png")
+         plt.savefig(save_path)
+ 
+         # clear the canvas
+         plt.clf()
+ 
+ 
+ if __name__ == "__main__":
+     main()
magic_pdf/libs/commons.py ADDED
@@ -0,0 +1,204 @@
+ import datetime
+ import json
+ import os, re, configparser
+ import subprocess
+ import time
+ 
+ import boto3
+ from loguru import logger
+ from boto3.s3.transfer import TransferConfig
+ from botocore.config import Config
+ 
+ import fitz  # switched to the rebased implementation as of 1.23.9
+ # import fitz_old as fitz  # use the pymupdf library from before 1.23.9
+ 
+ 
+ def get_delta_time(input_time):
+     return round(time.time() - input_time, 2)
+ 
+ 
+ def join_path(*args):
+     return '/'.join(str(s).rstrip('/') for s in args)
+ 
+ 
+ # Configure a global errlog_path so the demos can reference it consistently.
+ error_log_path = "s3://llm-pdf-text/err_logs/"
+ # json_dump_path = "s3://pdf_books_temp/json_dump/"  # only for temporary local testing; must not be committed to main
+ json_dump_path = "s3://llm-pdf-text/json_dump/"
+ 
+ # s3_image_save_path = "s3://mllm-raw-media/pdf2md_img/"  # a base library should not hard-code such paths; define them in business code
+ 
+ 
+ def get_top_percent_list(num_list, percent):
+     """
+     Get the top `percent` share of elements of a list.
+     :param num_list:
+     :param percent:
+     :return:
+     """
+     if len(num_list) == 0:
+         top_percent_list = []
+     else:
+         # sort num_list in descending order
+         sorted_imgs_len_list = sorted(num_list, reverse=True)
+         # compute the cut-off index for `percent`
+         top_percent_index = int(len(sorted_imgs_len_list) * percent)
+         # take the top `percent` share of elements
+         top_percent_list = sorted_imgs_len_list[:top_percent_index]
+     return top_percent_list
+ 
+ 
+ def formatted_time(time_stamp):
+     dt_object = datetime.datetime.fromtimestamp(time_stamp)
+     output_time = dt_object.strftime("%Y-%m-%d-%H:%M:%S")
+     return output_time
+ 
+ 
+ def mymax(alist: list):
+     if len(alist) == 0:
+         return 0  # an empty list counts as size 0
+     else:
+         return max(alist)
+ 
+ 
+ def parse_aws_param(profile):
+     if isinstance(profile, str):
+         # parse the config files
+         config_file = join_path(os.path.expanduser("~"), ".aws", "config")
+         credentials_file = join_path(os.path.expanduser("~"), ".aws", "credentials")
+         config = configparser.ConfigParser()
+         config.read(credentials_file)
+         config.read(config_file)
+         # read the AWS account information
+         ak = config.get(profile, "aws_access_key_id")
+         sk = config.get(profile, "aws_secret_access_key")
+         if profile == "default":
+             s3_str = config.get(f"{profile}", "s3")
+         else:
+             s3_str = config.get(f"profile {profile}", "s3")
+         end_match = re.search(r"endpoint_url[\s]*=[\s]*([^\s\n]+)[\s\n]*$", s3_str, re.MULTILINE)
+         if end_match:
+             endpoint = end_match.group(1)
+         else:
+             raise ValueError("endpoint_url not found in the aws config file")
+         style_match = re.search(r"addressing_style[\s]*=[\s]*([^\s\n]+)[\s\n]*$", s3_str, re.MULTILINE)
+         if style_match:
+             addressing_style = style_match.group(1)
+         else:
+             addressing_style = "path"
+     elif isinstance(profile, dict):
+         ak = profile["ak"]
+         sk = profile["sk"]
+         endpoint = profile["endpoint"]
+         addressing_style = "auto"
+ 
+     return ak, sk, endpoint, addressing_style
+ 
+ 
+ def parse_bucket_key(s3_full_path: str):
+     """
+     input:  s3://bucket/path/to/my/file.txt
+     output: bucket, path/to/my/file.txt
+     """
+     s3_full_path = s3_full_path.strip()
+     if s3_full_path.startswith("s3://"):
+         s3_full_path = s3_full_path[5:]
+     if s3_full_path.startswith("/"):
+         s3_full_path = s3_full_path[1:]
+     bucket, key = s3_full_path.split("/", 1)
+     return bucket, key
+ 
+ 
111
+ def read_file(pdf_path: str, s3_profile):
112
+ if pdf_path.startswith("s3://"):
113
+ ak, sk, end_point, addressing_style = parse_aws_param(s3_profile)
114
+ cli = boto3.client(service_name="s3", aws_access_key_id=ak, aws_secret_access_key=sk, endpoint_url=end_point,
115
+ config=Config(s3={'addressing_style': addressing_style}, retries={'max_attempts': 10, 'mode': 'standard'}))
116
+ bucket_name, bucket_key = parse_bucket_key(pdf_path)
117
+ res = cli.get_object(Bucket=bucket_name, Key=bucket_key)
118
+ file_content = res["Body"].read()
119
+ return file_content
120
+ else:
121
+ with open(pdf_path, "rb") as f:
122
+ return f.read()
123
+
124
+
125
+ def get_docx_model_output(pdf_model_output, page_id):
126
+
127
+ model_output_json = pdf_model_output[page_id]
128
+
129
+ return model_output_json
130
+
131
+
132
+ def list_dir(dir_path:str, s3_profile:str):
133
+ """
134
+ 列出dir_path下的所有文件
135
+ """
136
+ ret = []
137
+
138
+ if dir_path.startswith("s3"):
139
+ ak, sk, end_point, addressing_style = parse_aws_param(s3_profile)
140
+ s3info = re.findall(r"s3:\/\/([^\/]+)\/(.*)", dir_path)
141
+ bucket, path = s3info[0][0], s3info[0][1]
142
+ try:
143
+ cli = boto3.client(service_name="s3", aws_access_key_id=ak, aws_secret_access_key=sk, endpoint_url=end_point,
144
+ config=Config(s3={'addressing_style': addressing_style}))
145
+ def list_obj_scluster():
146
+ marker = None
147
+ while True:
148
+ list_kwargs = dict(MaxKeys=1000, Bucket=bucket, Prefix=path)
149
+ if marker:
150
+ list_kwargs['Marker'] = marker
151
+ response = cli.list_objects(**list_kwargs)
152
+ contents = response.get("Contents", [])
153
+ yield from contents
154
+ if not response.get("IsTruncated") or len(contents)==0:
155
+ break
156
+ marker = contents[-1]['Key']
157
+
158
+
159
+ for info in list_obj_scluster():
160
+ file_path = info['Key']
161
+ #size = info['Size']
162
+
163
+ if path!="":
164
+ afile = file_path[len(path):]
165
+ if afile.endswith(".json"):
166
+ ret.append(f"s3://{bucket}/{file_path}")
167
+
168
+ return ret
169
+
170
+ except Exception as e:
171
+ logger.exception(e)
172
+ exit(-1)
173
+ else: #本地的目录,那么扫描本地目录并返会这个目录里的所有jsonl文件
174
+
175
+ for root, dirs, files in os.walk(dir_path):
176
+ for file in files:
177
+ if file.endswith(".json"):
178
+ ret.append(join_path(root, file))
179
+ ret.sort()
180
+ return ret
181
+
182
+ def get_img_s3_client(save_path:str, image_s3_config:str):
183
+ """
184
+ """
185
+ if save_path.startswith("s3://"): # 放这里是为了最少创建一个s3 client
186
+ ak, sk, end_point, addressing_style = parse_aws_param(image_s3_config)
187
+ img_s3_client = boto3.client(
188
+ service_name="s3",
189
+ aws_access_key_id=ak,
190
+ aws_secret_access_key=sk,
191
+ endpoint_url=end_point,
192
+ config=Config(s3={"addressing_style": addressing_style}, retries={'max_attempts': 5, 'mode': 'standard'}),
193
+ )
194
+ else:
195
+ img_s3_client = None
196
+
197
+ return img_s3_client
198
+
199
+ if __name__=="__main__":
200
+ s3_path = "s3://llm-pdf-text/layout_det/scihub/scimag07865000-07865999/10.1007/s10729-011-9175-6.pdf/"
201
+ s3_profile = "langchao"
202
+ ret = list_dir(s3_path, s3_profile)
203
+ print(ret)
204
+
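A quick usage sketch for the pure-Python helpers above, assuming the package is installed so `magic_pdf.libs.commons` is importable (the paths are placeholders; the expected outputs follow directly from the implementations):

```python
from magic_pdf.libs.commons import join_path, parse_bucket_key, get_top_percent_list

print(join_path("s3://llm-pdf-text/", "err_logs/", "run1.log"))
# -> s3://llm-pdf-text/err_logs/run1.log  (rstrip('/') prevents doubled slashes)

print(parse_bucket_key("s3://llm-pdf-text/err_logs/run1.log"))
# -> ('llm-pdf-text', 'err_logs/run1.log')

print(get_top_percent_list([5, 1, 9, 3, 7], 0.4))
# -> [9, 7]  (the largest 40% of the values)
```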
magic_pdf/libs/config_reader.py ADDED
@@ -0,0 +1,73 @@
+ """
+ Return the S3 (AK, SK, endpoint) triple for a given bucket name.
+ """
+
+ import json
+ import os
+
+ from loguru import logger
+
+ from magic_pdf.libs.commons import parse_bucket_key
+
+
+ def read_config():
+     home_dir = os.path.expanduser("~")
+
+     config_file = os.path.join(home_dir, "magic-pdf.json")
+
+     if not os.path.exists(config_file):
+         raise Exception(f"{config_file} not found")
+
+     with open(config_file, "r") as f:
+         config = json.load(f)
+     return config
+
+
+ def get_s3_config(bucket_name: str):
+     """
+     Read the S3 credentials for bucket_name from ~/magic-pdf.json.
+     """
+     config = read_config()
+
+     bucket_info = config.get("bucket_info")
+     if bucket_name not in bucket_info:
+         access_key, secret_key, storage_endpoint = bucket_info["[default]"]
+     else:
+         access_key, secret_key, storage_endpoint = bucket_info[bucket_name]
+
+     if access_key is None or secret_key is None or storage_endpoint is None:
+         raise Exception("ak, sk or endpoint not found in magic-pdf.json")
+
+     # logger.info(f"get_s3_config: ak={access_key}, sk={secret_key}, endpoint={storage_endpoint}")
+
+     return access_key, secret_key, storage_endpoint
+
+
+ def get_s3_config_dict(path: str):
+     access_key, secret_key, storage_endpoint = get_s3_config(get_bucket_name(path))
+     return {"ak": access_key, "sk": secret_key, "endpoint": storage_endpoint}
+
+
+ def get_bucket_name(path):
+     bucket, key = parse_bucket_key(path)
+     return bucket
+
+
+ def get_local_dir():
+     config = read_config()
+     return config.get("temp-output-dir", "/tmp")
+
+
+ def get_local_models_dir():
+     config = read_config()
+     return config.get("models-dir", "/tmp/models")
+
+
+ def get_device():
+     config = read_config()
+     return config.get("device-mode", "cpu")
+
+
+ if __name__ == "__main__":
+     ak, sk, endpoint = get_s3_config("llm-raw")
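All of these getters read a single `~/magic-pdf.json`. Pieced together from the keys accessed above, a minimal config could be generated like this; the credentials and endpoint are placeholders, not real values, and the shipped `magic-pdf.template.json` remains the authoritative reference:

```python
import json
import os

# Hypothetical minimal config matching the keys read by config_reader above.
config = {
    "bucket_info": {
        # each bucket maps to an [access_key, secret_key, endpoint] triple;
        # "[default]" is the fallback used for unknown bucket names
        "[default]": ["<access_key>", "<secret_key>", "<endpoint_url>"],
        "llm-raw": ["<access_key>", "<secret_key>", "<endpoint_url>"],
    },
    "temp-output-dir": "/tmp",
    "models-dir": "/tmp/models",
    "device-mode": "cpu",
}

with open(os.path.join(os.path.expanduser("~"), "magic-pdf.json"), "w") as f:
    json.dump(config, f, indent=4)
```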
magic_pdf/libs/convert_utils.py ADDED
@@ -0,0 +1,5 @@
+ def dict_to_list(input_dict):
+     items_list = []
+     for _, item in input_dict.items():
+         items_list.append(item)
+     return items_list
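For reference, `dict_to_list` collects the values in insertion order, so on Python 3.7+ it behaves like `list(input_dict.values())`:

```python
# Tiny sanity check of the equivalence.
assert dict_to_list({"a": 1, "b": 2}) == list({"a": 1, "b": 2}.values()) == [1, 2]
```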