
Legal document section extraction

I'm interested in encoder-based extraction of named legal document sections. The problem is challenging because documents are often published in different formats (PDF, HTML, etc.) and are not structured identically across instances (the same section may appear under a different name, for example).

Thoughts on two approaches, pros and cons

  • I wanted to create a section classifier for legal documents. I found a fairly recent paper, Structural Segmentation of Legal Documents (https://arxiv.org/pdf/2012.03619), which addresses segmentation of legal documents when the specific sections are not known in advance, and proposes a Siamese RoBERTa architecture for pairwise classification of sentences as either "same topic" or "not same topic". With this approach, I would still need to determine the class of each section.
  • In comparison, I am interested in extracting known sections, so I may not need any pairwise operations. I realized I could just use a plain RoBERTa model with a binary classifier on top, but the downside is that I would have to come up with some heuristic to compensate for noise.
  • A key hypothesis: some sections may contain "neutral" sentences that would not be classified correctly on their own, but in the Siamese architecture they would not trigger a change of section. This works in favor of the Siamese architecture, except when a neutral sentence falls at a section boundary.

About the Siamese RoBERTa architecture for clustering

  • The Siamese architecture uses two identical RoBERTa models, performs a pooling operation over the output token embeddings of each, concatenates the two pooled vectors, and then runs a binary classifier on the resulting single vector (see the sketch after this list).
  • During the backward pass, either both encoders and the classification network can be updated via backpropagation, or the classification network alone.
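
A minimal sketch of how such a pairwise classifier could be wired up, assuming mean pooling over non-padding tokens and a single linear head; the class name, the weight-shared encoder, and the roberta-base checkpoint are illustrative assumptions, not details taken from the paper or the Colab notebook:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class SiamesePairClassifier(nn.Module):
    """Illustrative Siamese pairwise classifier: a shared RoBERTa encoder,
    mean pooling over token embeddings, concatenation, binary classification."""

    def __init__(self, freeze_encoder: bool = False):
        super().__init__()
        # "Two identical models" realized here as one weight-shared encoder.
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        if freeze_encoder:
            # Update only the classification head during training.
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, 2)

    def _encode(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state  # (batch, seq, 768)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean pooling over non-padding token embeddings.
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)                     # (batch, 768)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        pooled = torch.cat([self._encode(ids_a, mask_a),
                            self._encode(ids_b, mask_b)], dim=-1)               # (batch, 1536)
        return self.classifier(pooled)  # logits for "same topic" / "not same topic"
```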

About RoBERTa (this part is implemented in the accompanying Colab notebook)

  • RoBERTa-base input shape: (batch size, seq length)
  • RoBERTa-base output shape: (batch size, seq length, hidden size)
  • RoBERTa-base hidden size = 768
  • RoBERTa-base max input seq length = 512
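
These shapes are easy to verify with the Hugging Face transformers API; the example sentence below is illustrative:

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Input shape: (batch size, seq length)
batch = tokenizer(["This Agreement shall be governed by the laws of New York."],
                  return_tensors="pt", padding=True, truncation=True, max_length=512)
print(batch["input_ids"].shape)          # torch.Size([1, seq_length])

# Output shape: (batch size, seq length, hidden size), hidden size = 768
with torch.no_grad():
    outputs = model(**batch)
print(outputs.last_hidden_state.shape)   # torch.Size([1, seq_length, 768])
print(model.config.hidden_size)          # 768
```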

In the pairwise setting, using RoBERTa for segmentation involves combining sentences A and B into a single input. See below:

  1. standard RoBERTa model on sentence pairs (at most (512 / 2) - 3 tokens per sentence)

[CLS] A [SEP] B [SEP]

and the embedding of the [CLS] token can be used in a binary classifier.
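
For the pairwise variant, the tokenizer builds the joint input automatically. Note that RoBERTa's actual special tokens are <s> and </s> (a sentence pair is encoded as <s> A </s></s> B </s>); [CLS]/[SEP] above is the BERT-style shorthand. The sentences below are illustrative:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Passing two texts encodes them as one sequence: <s> A </s></s> B </s>
enc = tokenizer("The parties agree as follows.",
                "This Agreement may be terminated by either party.",
                truncation=True, max_length=512)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['<s>', 'The', 'Ġparties', ..., '</s>', '</s>', 'This', ..., '</s>']
```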

But the architecture used here is:

  1. standard RoBERTa model, but the input is just a single sentence

[CLS] A [SEP]

  2. classification of the [CLS] token embedding:
x = features[:, 0, :]  # take the <s> token (equivalent to [CLS])
x = self.dropout(x)
x = self.dense(x)      # hidden_size -> hidden_size projection
x = torch.tanh(x)
x = self.dropout(x)
x = self.out_proj(x)   # hidden_size -> num_labels logits
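
This is the classification head used by RobertaForSequenceClassification in transformers, so the whole single-sentence pipeline can be sketched with that class directly; num_labels=2 and the example sentence are assumptions for the binary "belongs to the target section or not" setup:

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Encode a single sentence as [CLS] A [SEP] (i.e. <s> A </s> for RoBERTa).
batch = tokenizer(["This Agreement shall be governed by the laws of the State of New York."],
                  return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**batch).logits   # (batch size, num_labels)
pred = logits.argmax(dim=-1)         # 0/1 prediction per sentence
```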

Results

