---
license: apache-2.0
language: en
datasets:
- wikipedia
- bookcorpus
model-index:
- name: asi/albert-act-base
  results:
  - task:
      type: text-classification
      name: CoLA
    dataset:
      type: glue
      name: CoLA # General Language Understanding Evaluation benchmark (GLUE)
      split: cola
    metrics:
      - type: matthews_correlation
        value: 36.7
        name: Matthews Corr
  - task:
      type: text-classification
      name: SST-2
    dataset:
      type: glue
      name: SST-2 # The Stanford Sentiment Treebank
      split: sst2
    metrics:
      - type: accuracy
        value: 87.8
        name: Accuracy
  - task:
      type: text-classification
      name: MRPC
    dataset:
      type: glue
      name: MRPC # Microsoft Research Paraphrase Corpus
      split: mrpc
    metrics:
      - type: accuracy
        value: 81.4
        name: Accuracy
      - type: f1
        value: 86.5
        name: F1
  - task:
      type: text-similarity
      name: STS-B
    dataset:
      type: glue
      name: STS-B # Semantic Textual Similarity Benchmark
      split: stsb
    metrics:
      - type: spearmanr
        value: 83.0
        name: Spearman Corr
      - type: pearsonr
        value: 84.2
        name: Pearson Corr
  - task:
      type: text-classification
      name: QQP
    dataset:
      type: glue
      name: QQP # Quora Question Pairs
      split: qqp
    metrics:
      - type: f1
        value: 68.5
        name: F1
      - type: accuracy
        value: 87.7
        name: Accuracy
  - task:
      type: text-classification
      name: MNLI-m
    dataset:
      type: glue
      name: MNLI-m # MultiNLI Matched
      split: mnli_matched
    metrics:
      - type: accuracy
        value: 79.9
        name: Accuracy
  - task:
      type: text-classification
      name: MNLI-mm
    dataset:
      type: glue
      name: MNLI-mm # MultiNLI Mismatched
      split: mnli_mismatched
    metrics:
      - type: accuracy
        value: 79.2
        name: Accuracy
  - task:
      type: text-classification
      name: QNLI
    dataset:
      type: glue
      name: QNLI # Question NLI
      split: qnli
    metrics:
      - type: accuracy
        value: 89.0
        name: Accuracy  
  - task:
      type: text-classification
      name: RTE
    dataset:
      type: glue
      name: RTE # Recognizing Textual Entailment
      split: rte
    metrics:
      - type: accuracy
        value: 63.0
        name: Accuracy  
  - task:
      type: text-classification
      name: WNLI
    dataset:
      type: glue
      name: WNLI # Winograd NLI
      split: wnli
    metrics:
      - type: accuracy
        value: 65.1
        name: Accuracy
---



# Adaptive Depth Transformers

Implementation of the paper "How Many Layers and Why? An Analysis of the Model Depth in Transformers". In this study, we investigate the role of multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers applied to each token of the input.

## Model architecture

We augment a multi-layer transformer encoder with a halting mechanism, which dynamically adjusts the number of layers for each token.
We adapted this mechanism directly from Graves ([2016](#graves-2016)). At each iteration, we compute, for each token, the probability that it stops updating its state.
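
The halting rule can be illustrated with a minimal sketch in plain Python. It follows the general scheme described by Graves (2016), where a token halts once its accumulated halting probability reaches `1 - epsilon`; the function name `act_updates`, the `epsilon` value and the input layout are assumptions made for illustration, not the code of this repository.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def act_updates(halting_logits, epsilon=0.01):
    """Count how many layers update each token under an ACT-style rule.

    halting_logits[layer][token] is a hypothetical halting logit; in a real
    model it would be computed from the token's hidden state at that layer.
    Returns (updates, remainders), one entry per token.
    """
    n_layers = len(halting_logits)
    n_tokens = len(halting_logits[0])
    cumulative = [0.0] * n_tokens   # halting probability accumulated so far
    remainders = [0.0] * n_tokens   # weight given to the last update
    updates = [0] * n_tokens        # number of layers applied to each token
    halted = [False] * n_tokens

    for layer in range(n_layers):
        last_layer = layer == n_layers - 1
        for t in range(n_tokens):
            if halted[t]:
                continue  # a halted token keeps its state unchanged
            p = sigmoid(halting_logits[layer][t])
            updates[t] += 1
            if cumulative[t] + p >= 1.0 - epsilon or last_layer:
                # The token stops here; the remainder completes the weights to 1.
                remainders[t] = 1.0 - cumulative[t]
                halted[t] = True
            else:
                cumulative[t] += p
    return updates, remainders


# Token 0 halts after two layers (0.5 + 0.5), token 1 after one (p > 0.99).
updates, remainders = act_updates([[0.0, 5.0], [0.0, 5.0], [0.0, 5.0]])
print(updates)
```

In this sketch, tokens with a high halting probability stop early, while the others keep being updated until the budget of layers is exhausted.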

## Model use

The architecture is not yet included in the Transformers library. The code used for pre-training is available in the following [GitHub repository](https://github.com/AntoineSimoulin/adaptive-depth-transformers), so install it first:

```bash
pip install git+https://github.com/AntoineSimoulin/adaptive-depth-transformers
```

Then you can use the model directly.

```python
import torch
from act import AlbertActConfig, AlbertActModel, TFAlbertActModel
from transformers import AlbertTokenizer

# Load the pre-trained tokenizer and the PyTorch model from the Hub.
tokenizer = AlbertTokenizer.from_pretrained('asi/albert-act-base')
model = AlbertActModel.from_pretrained('asi/albert-act-base')
model.eval()  # inference mode (disables dropout)

inputs = tokenizer("a lump in the middle of the monkeys stirred and then fell quiet .", return_tensors="pt")
with torch.no_grad():  # no gradients are needed for inference
    outputs = model(**inputs)

# One value per input token.
outputs.updates
# tensor([[[[15.,  9., 10.,  7.,  3.,  8.,  5.,  7., 12., 10.,  6.,  8.,  8.,  9., 5.,  8.]]]])
```
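
To read these values next to the tokens they belong to, a small helper along the following lines may be convenient. It is a sketch only: `summarize_updates` is a hypothetical name, and it assumes that the tensor holds one value per token, in the same order as the tokenizer output.

```python
def summarize_updates(tokens, updates):
    """Pair each token with its update count and compute the mean count.

    tokens:  list of token strings
    updates: flat list of per-token values, in the same order as tokens
    """
    if len(tokens) != len(updates):
        raise ValueError("tokens and updates must have the same length")
    pairs = [(token, int(count)) for token, count in zip(tokens, updates)]
    mean = sum(count for _, count in pairs) / len(pairs)
    return pairs, mean


# With the objects from the snippet above, the inputs could be obtained as:
#   tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
#   counts = outputs.updates.flatten().tolist()
# A toy call with made-up tokens:
pairs, mean = summarize_updates(["a", "lump", "."], [15.0, 9.0, 8.0])
```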

## Citations

### BibTeX entry and citation info

If you use our iterative transformer model in a scientific publication or an industrial application, please cite the following [paper](https://aclanthology.org/2021.acl-srw.23/):

```bibtex
@inproceedings{simoulin-crabbe-2021-many,
    title = "How Many Layers and Why? {A}n Analysis of the Model Depth in Transformers",
    author = "Simoulin, Antoine  and
      Crabb{\'e}, Benoit",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-srw.23",
    doi = "10.18653/v1/2021.acl-srw.23",
    pages = "221--228",
}
```

### References

><div id="graves-2016">Alex Graves. 2016. <a href="https://arxiv.org/abs/1603.08983" target="_blank">Adaptive computation time for recurrent neural networks.</a> CoRR, abs/1603.08983.</div>