rkotcher committed
Commit
c18310f
1 Parent(s): 755a05f

Update README.md

Files changed (1)
  1. README.md +9 -11
README.md CHANGED
@@ -9,17 +9,14 @@ metrics:
 - recall
 ---
 
-# Legal document section extraction: motivation
+# Legal document section extraction
 
-Legal documents are often dense and filled with information that is
+I'm interested in encoder-based extraction of named sections from legal documents. The problem is challenging because documents are often published in different formats (PDF, HTML, etc.) and are not structured identically across instances (the same section may appear under a different name, for example).
 
-## Thoughts around each approach, pros and cons
-
-I wanted to create a section classifier for legal documents. I found a fairly recent paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619), which addresses segmentation of legal documents when the specific sections _are not known_ and proposes a Siamese RoBERTa architecture for pairwise classification as either "same topic" or "not same topic". With this approach, I would still need to determine each section's classification.
-
-In comparison, I am interested in extracting _known_ sections, so I may not need any pairwise operations. I could just use a plain RoBERTa model with a binary classifier on top, but the downside is that I'd have to come up with some heuristic to compensate for noisy predictions.
-
-A key hypothesis: some sections contain "neutral" sentences that won't be classified properly on their own, but in the Siamese architecture they would not trigger a change of section classification. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.
+## Thoughts around two approaches, pros and cons
+- I wanted to create a section classifier for legal documents. I found a fairly recent paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619), which addresses segmentation of legal documents when the specific sections _are not known_ and proposes a Siamese RoBERTa architecture for pairwise classification as either "same topic" or "not same topic". With this approach, I would still need to determine each section's classification.
+- In comparison, I am interested in extracting _known_ sections, so I may not need any pairwise operations. I could just use a plain RoBERTa model with a binary classifier on top (see the first sketch after the diff), but the downside is that I'd have to come up with some heuristic to compensate for noisy predictions.
+- A key hypothesis: some sections contain "neutral" sentences that won't be classified properly on their own, but in the Siamese architecture they would not trigger a change of section classification. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.
 
 ## About Siamese-architecture RoBERTa for clustering
 - The Siamese architecture uses two identical RoBERTa models, performs a pooling operation over corresponding indices of the output token embeddings, concatenates the two pooled outputs, then runs binary classification on this single vector (sketched after the diff).
@@ -30,8 +27,7 @@ A key hypothesis: some sections contain "neutral" sentences
 - RoBERTa base output shape: (batch size, sequence length, hidden size)
 - RoBERTa base hidden size = 768
 - RoBERTa base maximum input sequence length = 512
-
-- Using RoBERTa for segmentation involves combining sentences A and B into a single input to RoBERTa:
+- Using RoBERTa for segmentation involves combining sentences A and B into a single input to RoBERTa (see the encoding example after the diff):
 
 [cls] A [sep] B
 
@@ -50,3 +46,5 @@ x = torch.tanh(x)
 x = self.dropout(x)
 x = self.out_proj(x)
 ```
+
+## Results
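The sketches below illustrate the pieces discussed in the README. First, the simpler of the two approaches: a plain RoBERTa encoder with a binary per-sentence classifier and no pairwise step. This is a minimal sketch under my own assumptions; the class name `SectionClassifier`, the use of the `<s>` embedding, and the single linear head are illustrative, not taken from the README or the paper.

```python
import torch.nn as nn
from transformers import RobertaModel

class SectionClassifier(nn.Module):
    """Hypothetical sketch: label each sentence as in / out of a known section."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.head = nn.Linear(hidden_size, num_labels)  # in-section vs. not-in-section

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]  # <s> token embedding, shape (batch, 768)
        return self.head(cls)                 # per-sentence logits
```

Per-sentence decisions are noisy, which is why the README notes that some smoothing heuristic over consecutive sentences would still be needed.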
 
 
 
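Next, a minimal sketch of the Siamese setup: one shared roberta-base encoder applied to both sentences ("two identical models" with tied weights), a pooling operation over token positions, concatenation, and a binary same-topic classifier. Mean pooling and all names here are my assumptions; the paper may pool differently.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class SiameseRoberta(nn.Module):
    """Hypothetical sketch of the pairwise same-topic / not-same-topic classifier."""

    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(2 * 768, 2)  # concatenated pooled vectors -> 2 logits

    def pool(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()          # zero out padding positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over tokens, per hidden index

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        a = self.pool(ids_a, mask_a)
        b = self.pool(ids_b, mask_b)
        return self.classifier(torch.cat([a, b], dim=-1))    # (batch, 2) logits
```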
 
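The `[cls] A [sep] B` line uses BERT-style token names. RoBERTa's tokenizer produces the equivalent pair encoding with `<s>`/`</s>` markers, including a doubled separator between the two segments:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Encode a sentence pair; truncation guards the 512-token input limit noted above.
enc = tokenizer("Sentence A.", "Sentence B.", truncation=True, max_length=512)
print(tokenizer.decode(enc["input_ids"]))
# -> <s>Sentence A.</s></s>Sentence B.</s>
```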
 
 
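Finally, the code fragment in the last hunk (`tanh` → `dropout` → `out_proj`) reads like the tail of a RoBERTa-style classification head. For context, a self-contained version might look like the following; everything outside the three quoted lines is my reconstruction, with layer sizes following roberta-base:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Assumed reconstruction of the head whose last lines appear in the diff."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = features[:, 0, :]   # embedding of the first (<s>) token
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)       # the fragment quoted in the diff starts here
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```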