jackstanley committed
Commit 7140f4d • 1 Parent(s): 5643fe7
Delete README.md

README.md DELETED
@@ -1,116 +0,0 @@
---
license: cc-by-sa-4.0
pipeline_tag: fill-mask
arxiv: 2210.05529
language: en
thumbnail: https://github.com/coastalcph/hierarchical-transformers/raw/main/data/figures/hat_encoder.png
tags:
- long-documents
datasets:
- c4
model-index:
- name: kiddothe2b/hierarchical-transformer-base-4096
  results: []
---

# Hierarchical Attention Transformer (HAT) / hierarchical-transformer-base-4096

## Model description

This is a Hierarchical Attention Transformer (HAT) model as presented in [An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification (Chalkidis et al., 2022)](https://arxiv.org/abs/2210.05529).

The model has been warm-started with the weights of RoBERTa (Liu et al., 2019) and further pre-trained for MLM on long sequences, following the paradigm of Longformer (Beltagy et al., 2020). It supports sequences of up to 4,096 tokens.

HAT uses hierarchical attention, which is a combination of segment-wise and cross-segment attention operations. You can think of segments as paragraphs or sentences.
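
The sketch below only illustrates this two-level pattern; it is not the model's actual implementation, which ships with the checkpoint via `trust_remote_code`. The segment length of 128 and the use of the first token of each segment as its representative are illustrative assumptions.

```python
# Illustrative sketch of hierarchical (segment-wise + cross-segment) attention.
# Assumes seq_len is a multiple of segment_len; the real HAT layers are defined
# in the model's remote code, not here.
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, segment_len=128):
        super().__init__()
        self.segment_len = segment_len
        self.segment_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cross_segment_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states):  # (batch, seq_len, hidden)
        batch, seq_len, hidden = hidden_states.shape
        num_segments = seq_len // self.segment_len

        # 1) Segment-wise attention: tokens attend only within their own segment.
        segs = hidden_states.reshape(batch * num_segments, self.segment_len, hidden)
        segs, _ = self.segment_attn(segs, segs, segs)

        # 2) Cross-segment attention: one representative token per segment
        #    exchanges information across segments.
        segs = segs.reshape(batch, num_segments, self.segment_len, hidden)
        reps, _ = self.cross_segment_attn(segs[:, :, 0], segs[:, :, 0], segs[:, :, 0])
        segs = torch.cat([reps.unsqueeze(2), segs[:, :, 1:]], dim=2)
        return segs.reshape(batch, seq_len, hidden)
```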

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=hierarchical-transformer) to look for other versions of HAT or fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole document to make decisions, such as document classification, sequential sentence classification, or question answering.

## How to use

You can use this model directly for masked language modeling:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
```
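
Once loaded, the MLM head can be queried directly. The snippet below is a minimal sketch: the example sentence is made up, and it assumes the tokenizer exposes a RoBERTa-style mask token.

```python
import torch

# Hypothetical example sentence; any input containing the mask token works.
text = f"Hierarchical attention lets the model read {tokenizer.mask_token} documents."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = mlm_model(**inputs).logits  # (batch, seq_len, vocab_size)

# Take the highest-scoring token at each masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```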

You can also fine-tune it for SequenceClassification, SequentialSentenceClassification, and MultipleChoice downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
```
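
A minimal fine-tuning sketch with the `Trainer` API is shown below. `train_dataset` and `eval_dataset` are hypothetical tokenized datasets (documents of up to 4,096 tokens with integer `labels`), and the hyperparameters are placeholders; in practice you would also pass `num_labels` to `from_pretrained` to match your task.

```python
from transformers import Trainer, TrainingArguments

# Placeholder settings for a document-classification fine-tuning run.
training_args = TrainingArguments(
    output_dir="hat-doc-classifier",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=doc_classifier,
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized training split
    eval_dataset=eval_dataset,    # hypothetical tokenized validation split
)
trainer.train()
```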

## Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. Therefore, the model can have biased predictions.

## Training procedure

### Training and evaluation data

The model has been warm-started from the [roberta-base](https://huggingface.co/roberta-base) checkpoint and pre-trained for an additional 50k steps on long sequences (> 1,024 subwords) of [C4](https://huggingface.co/datasets/c4) (Raffel et al., 2020).

### Training hyperparameters

The following hyperparameters were used during training (a rough mapping onto `TrainingArguments` is sketched after the list):
- learning_rate: 0.0001
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 128
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 50000
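
For orientation, these values map roughly onto `TrainingArguments` as sketched below. This is only an illustration of the configuration; the actual pre-training run used the authors' TPU setup (8 devices), which is why the effective batch size is 2 x 8 x 8 = 128.

```python
from transformers import TrainingArguments

# Rough mapping of the hyperparameters above; Adam's betas=(0.9, 0.999) and
# epsilon=1e-08 are the Transformers defaults, so they are not set explicitly.
pretraining_args = TrainingArguments(
    output_dir="hat-mlm-pretraining",
    learning_rate=1e-4,
    per_device_train_batch_size=2,   # x 8 devices x 8 accumulation steps = 128
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    max_steps=50_000,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    seed=42,
)
```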

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.7437        | 0.2   | 10000 | 1.6370          |
| 1.6994        | 0.4   | 20000 | 1.6054          |
| 1.6726        | 0.6   | 30000 | 1.5718          |
| 1.644         | 0.8   | 40000 | 1.5526          |
| 1.6299        | 1.0   | 50000 | 1.5368          |

### Framework versions

- Transformers 4.19.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.0.0
- Tokenizers 0.11.6

## Citing

If you use HAT in your research, please cite:

[An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification](https://arxiv.org/abs/2210.05529). Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022. arXiv:2210.05529 (Preprint).

```
@misc{chalkidis-etal-2022-hat,
  url = {https://arxiv.org/abs/2210.05529},
  author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
  title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
  publisher = {arXiv},
  year = {2022},
}
```