Update README.md
- recall
---

# Legal document section extraction
I'm interested in encoder-based extraction of named sections from legal documents. The problem is challenging because documents are often published in different formats (PDF, HTML, etc.) and are not structured identically across instances (the same section may appear under different names, for example).

## Thoughts around two approaches, pros and cons

- I wanted to create a section classifier for legal documents. I found a fairly up-to-date paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619). This paper addresses segmentation of legal documents when the specific sections _are not known_, and proposes a Siamese RoBERTa architecture for pairwise classification of sentence pairs as either "same topic" or "not same topic". With this approach, I would still need to determine the classification of each section.

- In comparison, I am interested in the extraction of _known_ sections, so I may not need to perform any pairwise operations. I realized I could use a plain RoBERTa model with a binary classifier head, but the downside is that I would have to come up with some heuristic to compensate for noisy per-sentence predictions.

- A key hypothesis: since some sections may contain "neutral" sentences, those sentences won't be classified properly on their own, but in the Siamese architecture they would not trigger a change of section. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.

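As a rough illustration of the kind of noise-compensation heuristic the plain-classifier approach would need (my own hypothetical sketch, not from the paper): majority-vote smoothing of per-sentence binary labels over a sliding window, so isolated misclassified sentences inside a section get flipped back.

```python
def smooth_labels(labels, window=2):
    """Majority-vote smoothing over per-sentence binary labels.

    Hypothetical noise heuristic: each sentence's label is replaced by the
    majority label among its +/- `window` neighbors (ties resolve to 1).
    """
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - window), min(len(labels), i + window + 1)
        neighborhood = labels[lo:hi]
        smoothed.append(1 if sum(neighborhood) * 2 >= len(neighborhood) else 0)
    return smoothed

# A lone 0 inside a run of 1s gets smoothed away
print(smooth_labels([1, 1, 0, 1, 1]))  # [1, 1, 1, 1, 1]
```

The window size would need tuning; too large a window also smooths away genuine short sections.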
## About Siamese-architecture RoBERTa for clustering

The Siamese architecture runs two identical RoBERTa models, pools each model's output token embeddings across the sequence dimension into a single vector, concatenates the two pooled vectors, then runs binary classification on the result.

- RoBERTa-base output shape: (batch size, seq length, hidden size)
- RoBERTa-base hidden size = 768
- RoBERTa-base max input seq length = 512

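The shapes above can be checked with a minimal pooling sketch (assuming mean pooling over the sequence dimension; the pooling operation itself is an assumption here, and random tensors stand in for the encoder output):

```python
import torch

# RoBERTa-base dimensions listed above
batch_size, seq_length, hidden_size = 2, 512, 768

# Stand-in for RoBERTa's output token embeddings
token_embeddings = torch.randn(batch_size, seq_length, hidden_size)

# Pool across the sequence dimension -> one vector per input
sentence_vectors = token_embeddings.mean(dim=1)
print(sentence_vectors.shape)  # torch.Size([2, 768])
```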
Using RoBERTa for segmentation involves combining sentences A and B into a single input to RoBERTa:

`[cls] A [sep] B`

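A tiny helper sketch of that pair layout (`build_pair_input` is my own illustrative name; in practice the Hugging Face tokenizer builds the pair for you, and RoBERTa's actual special tokens are `<s>` and `</s></s>` rather than literal `[cls]`/`[sep]`):

```python
def build_pair_input(sentence_a: str, sentence_b: str) -> str:
    # Illustrative only: mirrors the [cls] A [sep] B layout above.
    # With transformers, tokenizer(sentence_a, sentence_b) inserts the
    # model's real special tokens (<s> A </s></s> B </s> for RoBERTa).
    return f"[cls] {sentence_a} [sep] {sentence_b}"

print(build_pair_input("Section 1 text.", "Section 2 text."))
```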
```python
# ...
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
```
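Putting the pieces together, a minimal sketch of the Siamese pair classifier (assuming mean pooling and the dense → tanh → dropout → out_proj head shown above; the class name, layer names, and random stand-in tensors are my own, not the paper's):

```python
import torch
import torch.nn as nn

class SiamesePairClassifier(nn.Module):
    """Sketch: pool two encoders' outputs, concatenate, binary-classify."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        # 2 * hidden_size because the two pooled vectors are concatenated
        self.dense = nn.Linear(2 * hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, tokens_a, tokens_b):
        # tokens_*: (batch, seq, hidden) stand-ins for RoBERTa outputs
        pooled_a = tokens_a.mean(dim=1)
        pooled_b = tokens_b.mean(dim=1)
        x = torch.cat([pooled_a, pooled_b], dim=-1)
        x = torch.tanh(self.dense(x))
        x = self.dropout(x)
        return self.out_proj(x)

head = SiamesePairClassifier()
a = torch.randn(4, 512, 768)
b = torch.randn(4, 512, 768)
logits = head(a, b)
print(logits.shape)  # torch.Size([4, 2])
```

In the real model the two encoder towers share weights, which is what makes the architecture Siamese; only the pair head is sketched here.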
## Results