# My notes
## Thoughts around each approach, pros and cons
I wanted to create a section classifier for legal documents. I found a fairly recent paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619). This paper addresses segmentation of legal documents when the specific sections _are not known_ in advance, and proposes a Siamese RoBERTa architecture for pairwise classification of sentence pairs as either "same topic" or "not same topic". With this approach, I would still need a separate step to determine the label of each section.
In comparison, I was interested in extracting _known_ sections, so I may not need any pairwise operations. I realized I could just use a plain RoBERTa model with a binary classifier on top, but the downside is that I'd have to come up with some heuristic to compensate for noise.
A key hypothesis is: since some sections may contain "neutral" sentences, a per-sentence classifier won't label those sentences correctly, whereas in the Siamese architecture they would not trigger a change of section label. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.
## About the Siamese RoBERTa architecture for clustering
- The Siamese architecture uses two identical RoBERTa models (with shared weights), performs a pooling operation over corresponding indices of each model's output token embeddings, concatenates the two pooled vectors, then runs a binary classifier on this single vector.
- During the backward pass, either both encoders are updated via backprop, or only the classification network.
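The two bullets above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the class name is mine, `nn.Identity` stands in for the shared RoBERTa encoder, and mean pooling is one of several pooling choices.

```python
import torch
import torch.nn as nn

class SiamesePairClassifier(nn.Module):
    """Siamese pairwise classifier sketch: one shared encoder (a stand-in
    for RoBERTa), mean pooling, concatenation, then binary classification."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder  # shared weights: the same module runs on both inputs
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 2),  # "same topic" vs "not same topic"
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, seq_len, hidden) token embeddings for the two sentences
        pooled_a = self.encoder(a).mean(dim=1)  # mean-pool over the sequence axis
        pooled_b = self.encoder(b).mean(dim=1)
        # Concatenate the two pooled vectors and classify the pair
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))

# Usage with a stand-in encoder (nn.Identity in place of RoBERTa):
model = SiamesePairClassifier(nn.Identity())
logits = model(torch.randn(4, 16, 768), torch.randn(4, 16, 768))
print(logits.shape)  # torch.Size([4, 2])
```

Freezing the encoder and training only the classifier corresponds to `param.requires_grad = False` on the encoder's parameters; updating both is just ordinary end-to-end backprop.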
## (THIS IS IMPLEMENTED IN THIS COLAB DOCUMENT) About RoBERTa
- RoBERTa-base input shape: (batch size, seq length)
- RoBERTa-base output shape: (batch size, seq length, hidden size)
- RoBERTa-base hidden size = 768
- RoBERTa-base max input seq length = 512
- Using RoBERTa for segmentation involves combining sentences A and B into a single input. In BERT notation this is
`[CLS] A [SEP] B`
(RoBERTa's equivalents of `[CLS]` and `[SEP]` are `<s>` and `</s>`), and the embedding of the leading `<s>`/`[CLS]` token can be fed to a binary classifier.
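A small sketch of that input layout and the CLS-embedding extraction, using RoBERTa's special-token ids (`<s>` = 0, `</s>` = 2 in the standard roberta-base vocabulary); the content token ids here are made up, and a random tensor stands in for the model's output:

```python
import torch

BOS, EOS = 0, 2                              # <s> and </s> ids in roberta-base
a_ids, b_ids = [100, 101], [200, 201, 202]   # hypothetical token ids for A and B

# Pair input: <s> A </s> </s> B </s>
input_ids = torch.tensor([[BOS, *a_ids, EOS, EOS, *b_ids, EOS]])
print(input_ids.shape)  # torch.Size([1, 9])

# A RoBERTa forward pass maps this to (batch, seq length, 768);
# here a random tensor stands in for the real model output.
hidden = torch.randn(1, input_ids.shape[1], 768)
cls_embedding = hidden[:, 0, :]  # the <s> (CLS-equivalent) embedding, (1, 768)
```

The `(1, 768)` vector `cls_embedding` is what the binary classifier below consumes.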
Specifically, the architecture used here (implemented via ) is:
1. a standard RoBERTa model
2. a classification head:
```python
x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
x = self.dropout(x)
x = self.dense(x)
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)
```
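Wrapped into a self-contained module, the head above looks like the following. This is a sketch: the class name and the default sizes (768 hidden units, 2 labels, 0.1 dropout) are my assumptions, chosen to match the RoBERTa-base dimensions noted earlier.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Classification head over RoBERTa outputs:
    take <s> embedding -> dropout -> dense -> tanh -> dropout -> projection."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2,
                 dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, seq length, hidden size) from the RoBERTa encoder
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        return self.out_proj(x)

# Usage on stand-in encoder outputs:
head = ClassificationHead()
head.eval()  # disable dropout for a deterministic forward pass
logits = head(torch.randn(4, 512, 768))
print(logits.shape)  # torch.Size([4, 2])
```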