---
license: apache-2.0
language: en
datasets:
- wikipedia
- bookcorpus
model-index:
- name: asi/albert-act-base
  results:
  - task:
      type: text-classification
      name: CoLA
    dataset:
      type: glue
      name: CoLA  # The Corpus of Linguistic Acceptability
      split: cola
    metrics:
    - type: matthews_correlation
      value: 33.8
      name: Matthew's Corr
  - task:
      type: text-classification
      name: SST-2
    dataset:
      type: glue
      name: SST-2  # The Stanford Sentiment Treebank
      split: sst2
    metrics:
    - type: accuracy
      value: 88.6
      name: Accuracy
  - task:
      type: text-classification
      name: MRPC
    dataset:
      type: glue
      name: MRPC  # Microsoft Research Paraphrase Corpus
      split: mrpc
    metrics:
    - type: accuracy
      value: 79.4
      name: Accuracy
    - type: f1
      value: 85.2
      name: F1
  - task:
      type: text-similarity
      name: STS-B
    dataset:
      type: glue
      name: STS-B  # Semantic Textual Similarity Benchmark
      split: stsb
    metrics:
    - type: spearmanr
      value: 81.2
      name: Spearman Corr
    - type: pearsonr
      value: 82.7
      name: Pearson Corr
  - task:
      type: text-classification
      name: QQP
    dataset:
      type: glue
      name: QQP  # Quora Question Pairs
      split: qqp
    metrics:
    - type: f1
      value: 67.8
      name: F1
    - type: accuracy
      value: 87.4
      name: Accuracy
  - task:
      type: text-classification
      name: MNLI-m
    dataset:
      type: glue
      name: MNLI-m  # MultiNLI Matched
      split: mnli_matched
    metrics:
    - type: accuracy
      value: 79.5
      name: Accuracy
  - task:
      type: text-classification
      name: MNLI-mm
    dataset:
      type: glue
      name: MNLI-mm  # MultiNLI Mismatched
      split: mnli_mismatched
    metrics:
    - type: accuracy
      value: 78.5
      name: Accuracy
  - task:
      type: text-classification
      name: QNLI
    dataset:
      type: glue
      name: QNLI  # Question NLI
      split: qnli
    metrics:
    - type: accuracy
      value: 88.3
      name: Accuracy
  - task:
      type: text-classification
      name: RTE
    dataset:
      type: glue
      name: RTE  # Recognizing Textual Entailment
      split: rte
    metrics:
    - type: accuracy
      value: 61.9
      name: Accuracy
  - task:
      type: text-classification
      name: WNLI
    dataset:
      type: glue
      name: WNLI  # Winograd NLI
      split: wnli
    metrics:
    - type: accuracy
      value: 65.1
      name: Accuracy
---

# Adaptive Depth Transformers

Implementation of the paper "How Many Layers and Why? An Analysis of the Model Depth in Transformers". In this study, we investigate the role of multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers for each token of the input.

## Model architecture

We augment a multi-layer transformer encoder with a halting mechanism, which dynamically adjusts the number of layers for each token. We directly adapted this mechanism from Graves ([2016](#graves-2016)). At each iteration, we compute a probability for each token to stop updating its state. A simplified sketch of this halting loop is given just before the usage example below.

## Model use

The architecture is not yet directly included in the Transformers library. The code used for pre-training is available in the following [github repository](https://github.com/AntoineSimoulin/adaptive-depth-transformers), so you should install the code implementation first:

```bash
pip install git+https://github.com/AntoineSimoulin/adaptive-depth-transformers
```

Then you can use the model directly, as in the example further below.
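To make the halting mechanism concrete, here is a minimal, self-contained sketch of per-token adaptive computation time in the spirit of Graves (2016): a single shared layer is applied repeatedly, each token accumulates a halting probability, and a token stops being updated once that probability crosses a threshold. All names (`TokenHaltingEncoder`, `max_layers`, `eps`, the linear halting unit) are illustrative assumptions for this sketch, not the implementation shipped in the `act` package.

```python
# Illustrative sketch only: per-token adaptive computation time (ACT)
# over a shared transformer layer. Not the `act` package internals.
import torch
import torch.nn as nn


class TokenHaltingEncoder(nn.Module):
    """Toy ALBERT-style encoder: one shared layer applied until each token halts."""

    def __init__(self, hidden_size=768, num_heads=12, max_layers=24, eps=0.01):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.halting = nn.Linear(hidden_size, 1)  # per-token halting unit
        self.max_layers = max_layers
        self.eps = eps

    def forward(self, hidden_states):
        batch, seq_len, _ = hidden_states.shape
        cum_halt = hidden_states.new_zeros(batch, seq_len)  # accumulated halting probability
        updates = hidden_states.new_zeros(batch, seq_len)   # number of layers applied per token
        output = torch.zeros_like(hidden_states)

        for _ in range(self.max_layers):
            still_running = (cum_halt < 1.0 - self.eps).float()
            if still_running.sum() == 0:
                break  # every token has halted
            p = torch.sigmoid(self.halting(hidden_states)).squeeze(-1) * still_running
            # Tokens that would cross the threshold receive the remainder instead,
            # so each token's halting weights sum to one.
            newly_halted = ((cum_halt + p) >= 1.0 - self.eps).float() * still_running
            p = p * (1.0 - newly_halted) + (1.0 - cum_halt) * newly_halted
            cum_halt = cum_halt + p
            updates = updates + still_running

            hidden_states = self.layer(hidden_states)
            # The final representation is a halting-weighted mixture of the
            # intermediate states, as in Graves (2016).
            output = output + p.unsqueeze(-1) * hidden_states

        return output, updates


encoder = TokenHaltingEncoder()
states = torch.randn(1, 16, 768)  # e.g. a 16-token input
_, updates = encoder(states)
print(updates)  # per-token depth, analogous to `outputs.updates` below
```

The released checkpoint itself is loaded through the `act` package: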
```python
from act import AlbertActConfig, AlbertActModel, TFAlbertActModel
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('asi/albert-act-base')
model = AlbertActModel.from_pretrained('asi/albert-act-base')
_ = model.eval()

inputs = tokenizer("a lump in the middle of the monkeys stirred and then fell quiet .", return_tensors="pt")
outputs = model(**inputs)
# `updates` reports how many layer iterations were used for each input token.
outputs.updates
# tensor([[[[15.,  9., 10.,  7.,  3.,  8.,  5.,  7., 12., 10.,  6.,  8.,  8.,  9.,  5.,  8.]]]])
```

## Citations

### BibTeX entry and citation info

If you use our iterative transformer model in a scientific publication or an industrial application, please cite the following [paper](https://aclanthology.org/2021.acl-srw.23/):

```bibtex
@inproceedings{simoulin-crabbe-2021-many,
    title = "How Many Layers and Why? {A}n Analysis of the Model Depth in Transformers",
    author = "Simoulin, Antoine  and
      Crabb{\'e}, Benoit",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-srw.23",
    doi = "10.18653/v1/2021.acl-srw.23",
    pages = "221--228",
}
```

### References
> <a id="graves-2016"></a>Alex Graves. 2016. Adaptive computation time for recurrent neural networks. *CoRR*, abs/1603.08983.