# My notes
## Thoughts around each approach, pros and cons
I wanted to create a section classifier for legal documents. I found a fairly recent paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619). This paper addresses segmentation of legal documents when the specific sections _are not known_ in advance, and proposes a Siamese RoBERTa architecture for pairwise classification of sentence pairs as either "same topic" or "not same topic". With this approach, I would still need a separate step to determine the label of each section.
In comparison, I was interested in extracting _known_ sections, so I may not need any pairwise operations. I realized I could just use a plain RoBERTa model with a binary classifier on top, but the downside is that I'd have to come up with some heuristic to compensate for noise.
A key hypothesis is: since some sections may contain "neutral" sentences, a per-sentence classifier won't label those sentences correctly, whereas in the Siamese architecture they would not trigger a change of section label. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.
## About the Siamese RoBERTa architecture for clustering
- The Siamese architecture uses two identical RoBERTa models (with shared weights), performs a pooling operation over corresponding indices of each model's output token embeddings, concatenates the two pooled vectors, then runs a binary classifier on this single vector.
- During the backward pass, either both encoders are updated via backprop, or only the classification network.
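The two bullets above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the class name is mine, `nn.Identity` stands in for the shared RoBERTa encoder, and mean pooling is one of several pooling choices.

```python
import torch
import torch.nn as nn

class SiamesePairClassifier(nn.Module):
    """Siamese pairwise classifier sketch: one shared encoder (a stand-in
    for RoBERTa), mean pooling, concatenation, then binary classification."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder  # shared weights: the same module runs on both inputs
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 2),  # "same topic" vs "not same topic"
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, seq_len, hidden) token embeddings for the two sentences
        pooled_a = self.encoder(a).mean(dim=1)  # mean-pool over the sequence axis
        pooled_b = self.encoder(b).mean(dim=1)
        # Concatenate the two pooled vectors and classify the pair
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))

# Usage with a stand-in encoder (nn.Identity in place of RoBERTa):
model = SiamesePairClassifier(nn.Identity())
logits = model(torch.randn(4, 16, 768), torch.randn(4, 16, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Freezing the encoder and training only the classifier corresponds to `param.requires_grad = False` on the encoder's parameters; updating both is just ordinary end-to-end backprop.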
## (THIS IS IMPLEMENTED IN THIS COLAB DOCUMENT) About RoBERTa
- RoBERTa-base input shape: (batch size, seq length)
- RoBERTa-base output shape: (batch size, seq length, hidden size)
- RoBERTa-base hidden size = 768
- RoBERTa-base max input seq length = 512
- Using RoBERTa for segmentation involves combining sentences A and B into a single input. In BERT notation this is
`[CLS] A [SEP] B`
(RoBERTa's equivalents of `[CLS]` and `[SEP]` are `<s>` and `</s>`), and the embedding of the leading `<s>`/`[CLS]` token can be fed to a binary classifier.
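A small sketch of that input layout and the CLS-embedding extraction, using RoBERTa's special-token ids (`<s>` = 0, `</s>` = 2 in the standard roberta-base vocabulary); the content token ids here are made up, and a random tensor stands in for the model's output:

```python
import torch

BOS, EOS = 0, 2                              # <s> and </s> ids in roberta-base
a_ids, b_ids = [100, 101], [200, 201, 202]   # hypothetical token ids for A and B

# Pair input: <s> A </s> </s> B </s>
input_ids = torch.tensor([[BOS, *a_ids, EOS, EOS, *b_ids, EOS]])
print(input_ids.shape)  # torch.Size([1, 9])

# A RoBERTa forward pass maps this to (batch, seq length, 768);
# here a random tensor stands in for the real model output.
hidden = torch.randn(1, input_ids.shape[1], 768)
cls_embedding = hidden[:, 0, :]  # the <s> (CLS-equivalent) embedding, (1, 768)
```

The `(1, 768)` vector `cls_embedding` is what the binary classifier below consumes.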
Specifically, the architecture used here (implemented via ) is:
1. a standard RoBERTa model
2. a classification head:
```python
x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
```
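Wrapped into a self-contained module, the head above looks like the following. This is a sketch: the class name and the default sizes (768 hidden units, 2 labels, 0.1 dropout) are my assumptions, chosen to match the RoBERTa-base dimensions noted earlier.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Classification head over RoBERTa outputs:
    take <s> embedding -> dropout -> dense -> tanh -> dropout -> projection."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2,
                 dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq length, hidden size) from the RoBERTa encoder
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        return self.out_proj(x)

# Usage on stand-in encoder outputs:
head = ClassificationHead()
head.eval()  # disable dropout for a deterministic forward pass
logits = head(torch.randn(4, 512, 768))
print(logits.shape)  # torch.Size([4, 2])
```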