yujunhuinlp committed on
Commit
52e6eb8
1 Parent(s): cf23001

Upload 4 files

360LayoutAnalysis开源模型许可证.txt ADDED
@@ -0,0 +1,35 @@
1
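+ 360LayoutAnalysis Open-Source Model License
2
+
3
+ I. Definitions
4
+ 1.1 "This License": the terms and conditions for the use, reproduction, modification, and distribution of the 360LayoutAnalysis open-source model, as defined in Articles I through V of this document.
5
+ 1.2 "Model": any accompanying parameters based on machine learning techniques, including but not limited to weights, biases, checkpoints, and final optimizer states (where applicable).
6
+ 1.3 "Derivative Model": any modification of the Model, any model based on the Model, or any other machine learning model created or initialized by transferring the Model's parameters, activations, operations, or output patterns to another model, including but not limited to fine-tuning, quantization, and model distillation methods that use intermediate data representations.
7
+ 1.4 "Data": the collection of information and/or content extracted from the datasets used with the Model (including training, pre-training, or other evaluation datasets) for the purpose of training, pre-training, or evaluating the Model.
8
+ 1.5 "Personal Information": any information, recorded electronically or otherwise, that relates to an identified or identifiable natural person, excluding information that has been anonymized.
9
+ 1.6 "Output": any informational content, in any form, produced by operating or using the Model or its Derivative Models.
10
+ 1.7 "Distribution": transmitting, publishing, or otherwise sharing the Model or its Derivative Models with a third party through any medium, including but not limited to services that make the Model or its functionality available to users via API, network access, or any other electronic or remote means ("Hosted Services").
11
+
12
+ 1.8 "Licensor": 三六零科技集团有限公司 and its affiliated entities, which hold the intellectual property rights in the 360 model and its Derivative Models.
13
+ 1.9 "Licensee": a natural person or legal entity that is granted a license under this License.
14
+ 1.10 "Commercial Use": using the Model to generate revenue, directly or indirectly, for an entity or individual, or for any other for-profit purpose.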
15
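+ II. License Grants
16
+ 2.1 Copyright license: Subject to the terms and conditions of this License, the Licensor grants the Licensee a perpetual, worldwide, royalty-free, non-exclusive, and irrevocable copyright license to use, reproduce, and create Derivative Models of the Model. However, if the Licensee initiates copyright infringement litigation or an enforcement action against any party, alleging that the Model or its Derivative Models constitute copyright infringement, the copyright license above terminates as of the date the Licensee files the litigation or initiates the action.
17
+ 2.2 Patent license: Subject to the terms and conditions of this License, the Licensor grants the Licensee a perpetual, worldwide, royalty-free, non-exclusive, and irrevocable patent license to make, use, sell, offer to sell, and import the Model or its Derivative Models. This patent license covers only those patent claims, now or hereafter owned or controlled by the Licensor, that would necessarily be infringed by use of the Model. However, if the Licensee initiates patent infringement litigation or an enforcement action against any party, alleging that the Model or its Derivative Models constitute patent infringement, the patent license above terminates as of the date the Licensee files the litigation or initiates the action.
18
+ 2.3 Other intellectual property licenses: In addition to the copyright and patent licenses above, and subject to the terms and conditions of this License, the Licensor grants the Licensee a perpetual, worldwide, royalty-free, non-exclusive, and irrevocable license under any other intellectual property rights that the Licensor owns or controls in the Model and its Derivative Models and that would necessarily be infringed by using, reproducing, or distributing them (this License expressly grants no trademark license). However, if the Licensee initiates litigation or an enforcement action against any party for infringement of the relevant intellectual property rights, the license above terminates as of the date the Licensee files the litigation or initiates the action.
19
+ 2.4 The licenses above are granted for non-commercial use of the Model and its Derivative Models. To use the Model or its Derivative Models for Commercial Use, contact the Licensor at the email address given in Article V of this License to register.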
20
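+ III. Conditions of Use
21
+ 3.1 To reproduce or use the Model or its Derivative Models, the Licensee must meet the following conditions:
22
+ (1) provide a copy of this License to recipients of the Model or Derivative Models;
23
+ (2) when modifying the Model or Derivative Models, prominently inform recipients of what was modified;
24
+ (3) retain all copyright, patent, trademark, and attribution notices relating to the Model or Derivative Models;
25
+ (4) comply with all applicable laws and regulations, and not use the Model or its Derivative Models for any unlawful or improper purpose, including but not limited to military purposes.
26
+ 3.2 For creative contributions made in a Derivative Model, the Licensee may claim the corresponding intellectual property rights and may provide additional or different license terms for reproducing or distributing the modified version or the Derivative Model as a whole.
27
+ 3.3 When the Model or its Derivative Models are provided to users through Hosted Services, items (1) and (2) of Section 3.1 do not apply, but items (3) and (4) of Section 3.1 must still be observed.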
28
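+ IV. Disclaimer
29
+ 4.1 To the maximum extent permitted by applicable law, the Licensor provides the Model "as is", without warranties of any kind, express or implied, including but not limited to warranties of title, non-infringement, merchantability, fitness for a particular purpose, or otherwise. The Licensee is solely responsible for determining the appropriateness of using the Model and its Derivative Models and assumes all risks of such use.
30
+ 4.2 The Licensee shall handle any Personal Information that the Model may contain in compliance with applicable laws and regulations, and bears the related risks alone.
31
+ 4.3 In no event shall the Licensor be liable for any direct, indirect, incidental, special, punitive, or consequential damages arising from the Licensee's use of the Model or its Derivative Models, including but not limited to loss of data, business interruption, or any other commercial damages or losses, even if the Licensor has been advised of the possibility of such damages.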
32
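+ V. Miscellaneous
33
+ 5.1 Without the Licensor's prior written consent, the Licensee may not use any of the Licensor's trademarks, brands, or logos in products or services.
34
+ 5.2 This License constitutes the entire agreement regarding the Model between the Licensor and the Licensee exercising this License.
35
+ 5.3 To use the Model or its Derivative Models for Commercial Use, contact the Licensor by email (360ailab-nlp@360.cn) to apply, and provide: the applicant's name, the agent's name (if any), the applicant's contact information and address, the agent's contact information (if any), a description of any derivative work created from the Model, and the specific commercial use intended.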
LICENSE.txt ADDED
@@ -0,0 +1,51 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6
+
7
+ 1. Definitions.
8
+
9
+ "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
10
+
11
+ "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
12
+
13
+ "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
14
+
15
+ "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
16
+
17
+ "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
18
+
19
+ "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
20
+
21
+ "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
22
+
23
+ "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
24
+
25
+ "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
26
+
27
+ "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
28
+
29
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
30
+
31
+ 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
32
+
33
+ 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
34
+
35
+ You must give any other recipients of the Work or Derivative Works a copy of this License; and
36
+ You must cause any modified files to carry prominent notices stating that You changed the files; and
37
+ You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
38
+ If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
39
+ You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
40
+
41
+ 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
42
+
43
+ 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
44
+
45
+ 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
46
+
47
+ 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
48
+
49
+ 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
50
+
51
+ END OF TERMS AND CONDITIONS
README.md CHANGED
@@ -1,3 +1,96 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # 360LayoutAnalysis
2
+
3
+ [English](./README_EN.md)
4
+
5
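+ ## I. Background
6
+
7
+ In today's digital era, **document layout analysis** is a key step in information extraction and document understanding. Document layout analysis, also known as document image analysis, is the process of identifying and extracting text, images, tables, and other elements from scanned document images. The technique is widely used in automated document processing, electronic data interchange, the digitization of historical documents, and related fields. Traditional document layout analysis models often struggle to separate paragraphs from other layout elements accurately, which limits further processing and use of the document's information. Advances in deep learning and pattern recognition have opened new opportunities for document layout analysis: training on suitable datasets improves a model's understanding of document structure, and high-quality annotated datasets are the foundation of an effective model. Fine-grained annotation is essential in document layout analysis, and the annotation of **paragraphs** is especially important because it directly affects semantic understanding of the text and information extraction.
8
+
9
+ Our team built several Chinese document datasets with paragraph annotations for different scenarios, covering **different types of documents** so that the models generalize well. For example, in the **academic paper** scenario, earlier open-source datasets such as CDLA (A Chinese document layout analysis) lack paragraph annotations; in the **research report** scenario, our dataset fills the gap for that scenario. Using these annotated datasets, we trained several new Chinese document layout analysis models. The models are designed to identify paragraph boundaries in documents and to accurately distinguish text, images, tables, formulas, and other elements.
10
+
11
+ In this release, we open-source the layout analysis model weights and the corresponding label sets for the academic paper and research report scenarios.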
12
+
13
+ ## II. Usage
14
+
15
+ - Weights download link: [🤗LINK](https://huggingface.co/qihoo360)
16
+
17
+ - Usage:
18
+
19
+ The open-source weights were trained with `yolov8`. Run prediction as follows:
20
+
21
+ ```python
22
+ from ultralytics import YOLO
23
+
24
+ image_path = '' # Path to the image to run prediction on
25
+ model_path = '' # Path to the model weights
26
+ model = YOLO(model_path)
27
+
28
+ result = model(image_path, save=True, conf=0.5, save_crop=False, line_width=2)
29
+ print(result)
30
+ ```
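The prediction call above prints raw result objects. The following is a minimal sketch, not part of the original repository, of turning one image's detections into plain Python records. It assumes the standard `ultralytics` result attributes (`result.names`, `result.boxes.cls`, `result.boxes.conf`, `result.boxes.xyxy`); the helper names `boxes_to_records` and `predict_records` are hypothetical, and the class-index mapping in the demo is made up for illustration.

```python
from typing import Any, Dict, List


def boxes_to_records(
    names: Dict[int, str],
    cls_ids: List[float],
    confs: List[float],
    xyxy: List[List[float]],
) -> List[Dict[str, Any]]:
    """Pair each detection's class name, confidence, and corner coordinates."""
    records = []
    for cls_id, conf, box in zip(cls_ids, confs, xyxy):
        records.append(
            {
                "label": names[int(cls_id)],
                "conf": float(conf),
                "box": [float(v) for v in box],  # x1, y1, x2, y2 in pixels
            }
        )
    return records


def predict_records(model_path: str, image_path: str, conf: float = 0.5) -> List[Dict[str, Any]]:
    """Run the model on one image and return plain-Python records."""
    # Imported lazily so boxes_to_records can be used without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(model_path)
    result = model(image_path, conf=conf)[0]  # one result object per input image
    boxes = result.boxes
    return boxes_to_records(
        result.names, boxes.cls.tolist(), boxes.conf.tolist(), boxes.xyxy.tolist()
    )


if __name__ == "__main__":
    # Hypothetical class-index mapping and boxes, for illustration only.
    demo = boxes_to_records(
        {0: "Text", 1: "Title"},
        [1.0, 0.0],
        [0.9, 0.75],
        [[0, 0, 10, 5], [0, 6, 10, 20]],
    )
    for rec in demo:
        print(rec["label"], rec["conf"], rec["box"])
```

Reading the class names from `result.names` rather than hard-coding indices avoids guessing the index order, which this README does not state.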
31
+
32
+
33
+
34
+ ## III. Layout Analysis
35
+ ### 3.1 Academic Paper Scenario
36
+
37
+ - Label categories
38
+
39
+ | Element | Name |
40
+ | -------------- | --------------------- |
41
+ | Text | Main Text (Paragraph) |
42
+ | Title | Title |
43
+ | Figure | Image |
44
+ | Figure caption | Image Caption |
45
+ | Table | Table |
46
+ | Table caption | Table Caption |
47
+ | Header | Header |
48
+ | Footer | Footer |
49
+ | Reference | Reference (note) |
50
+ | Equation | Equation |
51
+
52
+ - Examples
53
+ <div align="center">
54
+ <img src="./case/paper/1.jpg" width="50%" height="50%">
55
+ <img src="./case/paper/2.jpg" width="50%" height="50%">
56
+ </div>
57
+
58
+
59
+
60
+ ### 3.2 Research Report Scenario
61
+ - Label categories
62
+
63
+ | Element | Name |
64
+ | -------------- | --------------------- |
65
+ | Text | Main Text (Paragraph) |
66
+ | Title | Title |
67
+ | Figure | Image |
68
+ | Figure caption | Image Caption |
69
+ | Table | Table |
70
+ | Table caption | Table Caption |
71
+ | Header | Header |
72
+ | Footer | Footer |
73
+ | Toc | Table of Contents |
74
+
75
+
76
+
77
+ - Examples
78
+
79
+ <div align="center">
80
+ <img src="./case/report/1.jpg" width="50%" height="50%">
81
+ <img src="./case/report/2.jpg" width="50%" height="50%">
82
+ </div>
83
+
84
+
85
+
86
+
87
+
88
+ ## License
89
+
90
+ This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the [Apache license 2.0](./LICENSE.txt).
91
+
92
+
93
+
94
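+ ## Code and Model License
95
+
96
+ The source code in this repository is licensed under Apache 2.0. The open-source 360LayoutAnalysis model supports commercial use. To use the model or its derivative models for commercial purposes, apply by email ([360ailab-nlp@360.cn](mailto:360ailab-nlp@360.cn)). For the full terms, see the [360LayoutAnalysis Open-Source Model License](./360LayoutAnalysis开源模型许可证.txt).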
README_EN.md ADDED
@@ -0,0 +1,87 @@
1
+ # 360LayoutAnalysis
2
+
3
+ [Chinese](./README.md)
4
+
5
+ ## I. Background
6
+
7
+ In today's digital era, **Document Layout Analysis** is one of the key steps in information extraction and document understanding. Also known as document image analysis, it is the process of identifying and extracting text, images, tables, and other elements from scanned document images. This technology has a broad range of applications in automated document processing, electronic data interchange, historical document digitization, and other fields. Traditional document layout analysis models often struggle to accurately distinguish between paragraphs and other layout elements within documents, which limits further processing and utilization of document information. The advancement of deep learning and pattern recognition technologies has brought new opportunities for document layout analysis. Training on suitable datasets improves a model's understanding of document structure, and high-quality annotated datasets are fundamental to training effective models. In document layout analysis, detailed annotation is essential, particularly the annotation of **paragraphs**, as it directly affects semantic understanding of the text and information extraction.
8
+
9
+ Our team has constructed multiple Chinese document datasets with paragraph annotations for various scenarios to ensure the model's generalization capability. For example, in the **academic paper** scenario, previous open-source datasets such as CDLA (A Chinese document layout analysis) lacked annotations for paragraph information; in the **research report** scenario, we have filled the gap for this particular area. Using these annotated datasets, we have trained several new Chinese document layout analysis models. These models are designed to identify paragraph boundaries in documents and accurately distinguish between text, images, tables, formulas, and other elements.
10
+
11
+ This time, we have open-sourced the layout analysis model weights and corresponding label systems for both the academic paper and research report scenarios.
12
+
13
+ ## II. Usage
14
+
15
+ - Weights download link: [🤗LINK](https://huggingface.co/qihoo360)
16
+
17
+ - Usage:
18
+
19
+ The open-source weights are trained with `yolov8`, and the prediction method is as follows:
20
+
21
+ ```python
22
+ from ultralytics import YOLO
23
+
24
+ image_path = '' # Path to the image to be predicted
25
+ model_path = '' # Path to the weights
26
+ model = YOLO(model_path)
27
+
28
+ result = model(image_path, save=True, conf=0.5, save_crop=False, line_width=2)
29
+ print(result)
30
+ ```
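Once detections are available as labelled boxes, a common next step is to keep only the textual regions and put them in reading order. The sketch below is one simple way to do that under stated assumptions: a single-column page, boxes given as `[x1, y1, x2, y2]` in pixels, and a hypothetical record layout of `{"label": ..., "box": ...}`. The function name `reading_order` and the `row_tol` parameter are illustrative and not part of this repository; multi-column pages would need a more careful ordering.

```python
from typing import Any, Dict, List, Sequence


def reading_order(
    records: List[Dict[str, Any]],
    keep: Sequence[str] = ("Title", "Text"),
    row_tol: float = 10.0,
) -> List[Dict[str, Any]]:
    """Keep selected labels and sort top-to-bottom, then left-to-right.

    Boxes whose top edges fall in the same `row_tol`-pixel band are treated
    as one row, so small vertical jitter does not reorder neighbours.
    """
    kept = [r for r in records if r["label"] in keep]
    return sorted(kept, key=lambda r: (int(r["box"][1] // row_tol), r["box"][0]))


if __name__ == "__main__":
    # Made-up detections for illustration only.
    page = [
        {"label": "Text", "box": [300.0, 100.0, 550.0, 180.0]},
        {"label": "Figure", "box": [50.0, 200.0, 550.0, 400.0]},
        {"label": "Text", "box": [50.0, 102.0, 290.0, 180.0]},
        {"label": "Title", "box": [50.0, 20.0, 550.0, 60.0]},
    ]
    for rec in reading_order(page):
        print(rec["label"], rec["box"])
```

Banding by `row_tol` is a deliberate simplification: it trades exactness for robustness against boxes on the same line whose top edges differ by a few pixels.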
31
+
32
+ ## III. Layout Analysis
33
+
34
+ ### 3.1 Academic Paper Scenario
35
+
36
+ - Label Categories
37
+
38
+ | Element | Name |
39
+ | -------------- | --------------------- |
40
+ | Text | Main Text (Paragraph) |
41
+ | Title | Title |
42
+ | Figure | Image |
43
+ | Figure caption | Image Caption |
44
+ | Table | Table |
45
+ | Table caption | Table Caption |
46
+ | Header | Header |
47
+ | Footer | Footer |
48
+ | Reference | Reference |
49
+ | Equation | Equation |
50
+
51
+ - Example
52
+
53
+ <div align="center">
54
+ <img src="./case/paper/1.jpg" width="50%" height="50%">
55
+ <img src="./case/paper/2.jpg" width="50%" height="50%">
56
+ </div>
57
+
58
+ ### 3.2 Research Report Scenario
59
+
60
+ - Label Categories
61
+
62
+ | Element | Name |
63
+ | -------------- | --------------------- |
64
+ | Text | Main Text (Paragraph) |
65
+ | Title | Title |
66
+ | Figure | Image |
67
+ | Figure caption | Image Caption |
68
+ | Table | Table |
69
+ | Table caption | Table Caption |
70
+ | Header | Header |
71
+ | Footer | Footer |
72
+ | Toc | Table of Contents |
73
+
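The two label tables above differ only in a few classes. As a small sketch, the label sets can be written down as constants and compared, and a loaded checkpoint's class names can be checked against the expected set. The label strings come from the tables; the constant and function names (`PAPER_LABELS`, `REPORT_LABELS`, `only_in`, `missing_labels`) are hypothetical, and the tuple order is not a claim about the models' class-index order.

```python
from typing import Dict, List, Sequence

# Label names as listed in the two tables; order here is NOT the class-index order.
PAPER_LABELS = (
    "Text", "Title", "Figure", "Figure caption", "Table",
    "Table caption", "Header", "Footer", "Reference", "Equation",
)
REPORT_LABELS = (
    "Text", "Title", "Figure", "Figure caption", "Table",
    "Table caption", "Header", "Footer", "Toc",
)


def only_in(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Labels present in `a` but absent from `b`, sorted alphabetically."""
    return sorted(set(a) - set(b))


def missing_labels(model_names: Dict[int, str], expected: Sequence[str]) -> List[str]:
    """Expected labels that a checkpoint's index-to-name mapping does not contain."""
    return only_in(expected, list(model_names.values()))


if __name__ == "__main__":
    print("paper only: ", only_in(PAPER_LABELS, REPORT_LABELS))
    print("report only:", only_in(REPORT_LABELS, PAPER_LABELS))
```

With `ultralytics`, the mapping passed to `missing_labels` would typically be the loaded model's `names` attribute; an empty result suggests the checkpoint matches the chosen scenario.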
74
+ - Example
75
+
76
+ <div align="center">
77
+ <img src="./case/report/1.jpg" width="50%" height="50%">
78
+ <img src="./case/report/2.jpg" width="50%" height="50%">
79
+ </div>
80
+
81
+ ## License
82
+
83
+ This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the [Apache license 2.0](./LICENSE.txt).
84
+
85
+ ## Code and Model License
86
+
87
+ The source code of this repository is licensed under Apache 2.0. The open-source 360LayoutAnalysis model supports commercial use. To use the model or its derivative models for commercial purposes, apply by email ([360ailab-nlp@360.cn](mailto:360ailab-nlp@360.cn)). For the full terms, see the ["360LayoutAnalysis Open-Source Model License"](./360LayoutAnalysis开源模型许可证.txt).