Hierarchical Attention Transformer (HAT) / hierarchical-transformer-base-4096

Model description

This is a Hierarchical Attention Transformer (HAT) model as presented in An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification (Chalkidis et al., 2022).

The model was warm-started from the weights of RoBERTa (Liu et al., 2019) and then further pre-trained with masked language modeling (MLM) on long sequences, following the paradigm of Longformer (Beltagy et al., 2020). It supports sequences of up to 4,096 subword tokens.

HAT uses hierarchical attention, which is a combination of segment-wise and cross-segment attention operations. You can think of segments as paragraphs or sentences.
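As a rough illustration of what a segment is, the sketch below chunks a flat list of token ids into fixed-length segments and pads the last one. It covers only the segmentation step, not the attention itself. The function name `split_into_segments`, the segment length of 128, and the padding id of 0 are assumptions made for this example and are not taken from the model's configuration.

```python
from typing import List


def split_into_segments(
    token_ids: List[int],
    segment_length: int = 128,  # assumed value, for illustration only
    pad_id: int = 0,            # assumed padding id, for illustration only
) -> List[List[int]]:
    """Chunk a flat token sequence into equal-length segments, padding the last one."""
    segments = []
    for start in range(0, len(token_ids), segment_length):
        segment = token_ids[start:start + segment_length]
        # Pad a short final segment so that every segment has the same length.
        segment = segment + [pad_id] * (segment_length - len(segment))
        segments.append(segment)
    return segments


# 300 dummy token ids stand in for a tokenized document.
document = list(range(1, 301))
segments = split_into_segments(document)
print(len(segments), len(segments[0]))
```

In a hierarchical model, segment-wise attention would operate within each of these segments, while cross-segment attention would exchange information between them.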

Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for other versions of HAT or fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole document to make decisions, such as document classification, sequential sentence classification, or question answering.

How to use

You can use this model directly for masked language modeling:

from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)

You can also fine-tune it for SequenceClassification, SequentialSentenceClassification, and MultipleChoice downstream tasks:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)

Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions.

Training procedure

Training and evaluation data

The model was warm-started from the roberta-base checkpoint and further pre-trained for an additional 50k steps on long sequences (> 1024 subwords) from C4 (Raffel et al., 2020).

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 2
eval_batch_size: 2
seed: 42
distributed_type: tpu
num_devices: 8
gradient_accumulation_steps: 8
total_train_batch_size: 128
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
training_steps: 50000
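
The hyperparameters above are related by simple arithmetic, which the following sketch reproduces. It derives the total training batch size from the per-device batch size, device count, and gradient accumulation steps, and it traces a linear schedule with warmup. The helper names `effective_batch_size` and `linear_lr` are hypothetical, and the schedule shape (linear ramp to the peak rate, then linear decay to zero) is an assumption about what `lr_scheduler_type: linear` means here, not a copy of the training code.

```python
def effective_batch_size(per_device: int, num_devices: int, accumulation_steps: int) -> int:
    """Number of examples contributing to one optimizer update."""
    return per_device * num_devices * accumulation_steps


def linear_lr(step: int, peak_lr: float, total_steps: int, warmup_ratio: float) -> float:
    """Learning rate at a given step, assuming linear warmup followed by linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr to 0 over the remaining steps.
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Values taken from the hyperparameter list above.
total_train = effective_batch_size(per_device=2, num_devices=8, accumulation_steps=8)
total_eval = effective_batch_size(per_device=2, num_devices=8, accumulation_steps=1)
print(total_train, total_eval)

for step in (0, 5000, 27500, 50000):
    print(step, linear_lr(step, peak_lr=1e-4, total_steps=50000, warmup_ratio=0.1))
```

Under these assumptions the warmup phase lasts 5,000 steps, after which the rate decays from 0.0001 to zero at step 50,000.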

Training results

Training Loss	Epoch	Step	Validation Loss
1.7437	0.2	10000	1.6370
1.6994	0.4	20000	1.6054
1.6726	0.6	30000	1.5718
1.644	0.8	40000	1.5526
1.6299	1.0	50000	1.5368
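
If the reported losses are mean cross-entropy values in nats (an assumption, since the table does not state the unit), they can be converted to perplexity by exponentiation. The sketch below does this for the validation losses in the table; the `perplexity` helper is a hypothetical name used only for this example.

```python
import math

# Validation losses copied from the table above, keyed by training step.
validation_loss = {
    10000: 1.6370,
    20000: 1.6054,
    30000: 1.5718,
    40000: 1.5526,
    50000: 1.5368,
}


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss, assuming natural-log units."""
    return math.exp(loss)


for step, loss in validation_loss.items():
    print(f"step {step}: loss {loss:.4f}, perplexity {perplexity(loss):.2f}")
```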

Framework versions

Transformers 4.19.0.dev0
PyTorch 1.11.0+cu102
Datasets 2.0.0
Tokenizers 0.11.6

Citing

If you use HAT in your research, please cite:

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification. Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022. arXiv:2210.05529 (Preprint).

@misc{chalkidis-etal-2022-hat,
  url = {https://arxiv.org/abs/2210.05529},
  author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
  title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
  publisher = {arXiv},
  year = {2022},
}
