Upload README.md with huggingface_hub
README.md
CHANGED
@@ -1,70 +1,110 @@
- ---
- base_model: facebook/bart-large
tags:
model-index:
- - name:
-   results:
- ---
-
- <!-- … should probably proofread and complete it, then remove this comment. -->
-
- - learning_rate: 5e-05
- - train_batch_size: 16
- - eval_batch_size: 16
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 40
-
- | Training Loss | Epoch | Step | Validation Loss |
- |:-------------:|:-----:|:----:|:---------------:|
- | 3.704 | 1.0 | 69 | 2.3491 |
- | 2.3906 | 2.0 | 138 | 2.1590 |
- | 2.1402 | 3.0 | 207 | 2.0831 |
- | 1.991 | 4.0 | 276 | 2.0632 |
- | 1.8691 | 5.0 | 345 | 2.0446 |
- | 1.7619 | 6.0 | 414 | 2.0210 |
- | 1.6715 | 7.0 | 483 | 2.0236 |
- | 1.5936 | 8.0 | 552 | 2.0394 |
- | 1.5216 | 9.0 | 621 | 2.0337 |
- | 1.4501 | 10.0 | 690 | 2.0614 |
- | 1.389 | 11.0 | 759 | 2.0609 |
-
- ### Framework versions
---
language: en
tags:
- summarization
- abstractive
- hybrid
- multistep
datasets: dennlinger/eur-lex-sum
pipeline_tag: summarization
base_model: BART
model-index:
- name: BART
  results:
  - task:
      type: summarization
      name: Long, Legal Document Summarization
    dataset:
      name: eur-lex-sum
      type: dennlinger/eur-lex-sum
    metrics:
    - type: ROUGE-1
      value: 0.44691280129794786
    - type: ROUGE-2
      value: 0.1774386577381308
    - type: ROUGE-L
      value: 0.21368587545058315
    - type: BERTScore
      value: 0.8664783139468571
    - type: BARTScore
      value: -3.565346535357683
    - type: BLANC
      value: 0.14228194404852756
---

# Model Card for LegalBERT_BART_hybrid_V1

## Model Details
---

### Model Description

This model is a fine-tuned version of BART, developed as part of research into multi-step summarization of long legal documents. Many decisions in the renewable energy space depend heavily on regulations, but these regulations are often long and complicated. The proposed architecture therefore first applies one or more extractive summarization steps to compress the source text, before the final summary is produced by an abstractive summarization model. This abstractive model was fine-tuned on a dataset pre-processed through extractive summarization with LegalBERT using a hybrid ratio. The research covers multiple extractive-abstractive model combinations, which can be found at https://huggingface.co/MikaSie. For optimal results, feed the model an extractive summary as input, as it was designed for this setup.
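The multi-step idea can be sketched with placeholder components. This is a toy illustration only: in the actual pipeline the extractive steps are LegalBERT-based and the abstractive step is this fine-tuned BART model.

```python
def multi_step_summarize(text, extractive_steps, abstractive_model):
    """Apply one or more extractive compression steps, then an abstractive model."""
    compressed = text
    for step in extractive_steps:
        compressed = step(compressed)  # each step shortens the input further
    return abstractive_model(compressed)

# Toy stand-ins for the real models, for illustration only:
def keep_first_sentences(n):
    # crude extractive step: keep the first n sentences
    return lambda t: ". ".join(t.split(". ")[:n])

toy_abstractive = lambda t: " ".join(t.split()[:5])  # stand-in for BART

doc = "First sentence. Second sentence. Third sentence. Fourth sentence."
print(multi_step_summarize(doc, [keep_first_sentences(2)], toy_abstractive))
```

The composition is the point: the abstractive model only ever sees the already-compressed text, which is why feeding this model raw full-length documents is discouraged below.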

The dataset used by this model is the [EUR-lex-sum](https://huggingface.co/datasets/dennlinger/eur-lex-sum) dataset. The evaluation metrics can be found in the metadata of this model card.

This model was introduced in the master's thesis of Mika Sie at Utrecht University, in collaboration with Power2X. More information can be found in PAPER_LINK.
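As a reading aid for the metadata, the ROUGE-1 value is an F1 score over unigram overlap between the generated and reference summaries. A minimal illustrative version is below; the reported numbers come from a full ROUGE implementation, which adds its own tokenization and stemming.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the idea behind ROUGE-1 (illustrative only)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Candidate matches 4 of 5 reference unigrams: precision 1.0, recall 0.8
print(rouge1_f1("the regulation was summarized",
                "the regulation was summarized briefly"))
```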

- **Developed by:** Mika Sie
- **Funded by:** Utrecht University & Power2X
- **Language (NLP):** English
- **Finetuned from model:** BART

### Model Sources

- **Repository**: https://github.com/MikaSie/Thesis
- **Paper**: PAPER_LINK
- **Streamlit demo**: STREAMLIT_LINK

## Uses
---

### Direct Use

This model can be directly used for summarizing long, legal documents. However, it is recommended to first use an extractive summarization tool, such as LegalBERT, to compress the source text before feeding it to this model, as it has been specifically designed to work with extractive summaries.
An example using the Hugging Face pipeline could be:

```python
# pip install bert-extractive-summarizer

from summarizer import Summarizer
from transformers import pipeline

extractive_model = Summarizer()

text = 'Original document text to be summarized'

# Compress the source text with the extractive model first
extractive_summary = extractive_model(text)

abstractive_model = pipeline('summarization',
                             model='MikaSie/LegalBERT_BART_hybrid_V1',
                             tokenizer='MikaSie/LegalBERT_BART_hybrid_V1')

result = abstractive_model(extractive_summary)
```

More implementation details can be found in the thesis report.

### Out-of-Scope Use

Using this model without an extractive summarization step may not yield optimal results. It is recommended to follow the proposed multi-step summarization approach outlined in the model description for best performance.

## Bias, Risks, and Limitations
---

### Bias

As with any language model, this model may inherit biases present in the training data. It is important to be aware of potential biases in the source text and to critically evaluate the generated summaries.

### Risks

- The model may not always generate accurate or comprehensive summaries, especially for complex legal documents.
- The model may not generate truthful information.

### Limitations

- The model may produce summaries that are overly abstractive or fail to capture important details.
- The model's performance may vary depending on the quality and relevance of the extractive summaries used as input.

### Recommendations

- Carefully review and validate the generated summaries before relying on them for critical tasks.
- Consider using the model in conjunction with human review or other validation mechanisms to ensure the accuracy and completeness of the summaries.
- Experiment with different extractive summarization models or techniques to find the most suitable input for the abstractive model.
- Provide feedback and contribute to the ongoing research and development of the model to help improve its performance and address its limitations.
- Any actions taken based on this content are at your own risk.