metadata

license: apache-2.0
language: en
datasets:
  - wikipedia
  - bookcorpus
model-index:
  - name: asi/albert-act-base
    results:
      - task:
          type: text-classification
          name: CoLA
        dataset:
          type: glue
          name: CoLA
          split: cola
        metrics:
          - type: matthews_correlation
            value: 27.5
            name: Matthews Corr
      - task:
          type: text-classification
          name: SST-2
        dataset:
          type: glue
          name: SST-2
          split: sst2
        metrics:
          - type: accuracy
            value: 87.6
            name: Accuracy
      - task:
          type: text-classification
          name: MRPC
        dataset:
          type: glue
          name: MRPC
          split: mrpc
        metrics:
          - type: accuracy
            value: 78.7
            name: Accuracy
          - type: f1
            value: 84.7
            name: F1
      - task:
          type: text-similarity
          name: STS-B
        dataset:
          type: glue
          name: STS-B
          split: stsb
        metrics:
          - type: spearmanr
            value: 79.7
            name: Spearman Corr
          - type: pearsonr
            value: 81.8
            name: Pearson Corr
      - task:
          type: text-classification
          name: QQP
        dataset:
          type: glue
          name: QQP
          split: qqp
        metrics:
          - type: f1
            value: 67.8
            name: F1
          - type: accuracy
            value: 87.5
            name: Accuracy
      - task:
          type: text-classification
          name: MNLI-m
        dataset:
          type: glue
          name: MNLI-m
          split: mnli_matched
        metrics:
          - type: accuracy
            value: 77
            name: Accuracy
      - task:
          type: text-classification
          name: MNLI-mm
        dataset:
          type: glue
          name: MNLI-mm
          split: mnli_mismatched
        metrics:
          - type: accuracy
            value: 76.8
            name: Accuracy
      - task:
          type: text-classification
          name: QNLI
        dataset:
          type: glue
          name: QNLI
          split: qnli
        metrics:
          - type: accuracy
            value: 86.4
            name: Accuracy
      - task:
          type: text-classification
          name: RTE
        dataset:
          type: glue
          name: RTE
          split: rte
        metrics:
          - type: accuracy
            value: 62
            name: Accuracy
      - task:
          type: text-classification
          name: WNLI
        dataset:
          type: glue
          name: WNLI
          split: wnli
        metrics:
          - type: accuracy
            value: 65.1
            name: Accuracy

Adaptive Depth Transformers

Implementation of the paper "How Many Layers and Why? An Analysis of the Model Depth in Transformers". In this study, we investigate the role of the multiple layers in deep transformer models. We design a variant of ALBERT that dynamically adapts the number of layers for each token of the input.

Model architecture

We augment a multi-layer transformer encoder with a halting mechanism, which dynamically adjusts the number of layers for each token. We directly adapted this mechanism from Graves (2016). At each iteration, we compute a probability for each token to stop updating its state.
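The stopping rule can be illustrated for a single token with a minimal sketch of the adaptive computation time rule described by Graves (2016). This is not the code from the repository: the function name `act_halting`, the threshold `eps=0.01`, and the list of per-iteration halting probabilities are assumptions made for illustration. A token stops updating once its accumulated halting probability reaches `1 - eps`, or when the iteration budget runs out, and the last iteration receives the remaining probability mass.

```python
def act_halting(halting_probs, eps=0.01):
    """Apply the ACT stopping rule of Graves (2016) to one token.

    halting_probs: per-iteration halting probabilities in [0, 1],
    one entry per available iteration (hypothetical input).
    Returns (n_updates, weights), where the weights sum to 1.
    """
    if not halting_probs:
        raise ValueError("need at least one iteration")
    weights = []
    cumulative = 0.0
    last = len(halting_probs)
    for step, h in enumerate(halting_probs, start=1):
        # Stop once the accumulated probability reaches 1 - eps,
        # or when the iteration budget is exhausted.
        if cumulative + h >= 1.0 - eps or step == last:
            weights.append(1.0 - cumulative)  # remainder goes to the last step
            return step, weights
        weights.append(h)
        cumulative += h


# Illustrative probabilities only, not values produced by the model.
n_updates, weights = act_halting([0.1, 0.3, 0.7, 0.9])
print(n_updates, weights)
```

In Graves (2016), these weights are used to average the intermediate states into the final state, so a token that halts early contributes only its first few iterations.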

Model use

The architecture is not yet included in the Transformers library. The code used for pre-training is available in a GitHub repository, so install that implementation first:

!pip install git+https://github.com/AntoineSimoulin/adaptive-depth-transformers

Then you can use the model directly.

from act import AlbertActConfig, AlbertActModel, TFAlbertActModel
from transformers import AlbertTokenizer

# Load the pre-trained tokenizer and the PyTorch model.
tokenizer = AlbertTokenizer.from_pretrained('asi/albert-act-base')
model = AlbertActModel.from_pretrained('asi/albert-act-base')
_ = model.eval()  # switch to inference mode

inputs = tokenizer("a lump in the middle of the monkeys stirred and then fell quiet .", return_tensors="pt")
outputs = model(**inputs)
# Number of layer updates applied to each input token.
outputs.updates
# tensor([[[[15.,  9., 10.,  7.,  3.,  8.,  5.,  7., 12., 10.,  6.,  8.,  8.,  9.,  5.,  8.]]]])
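
To read the update counts, it can help to pair each one with its token. The helper below is a minimal sketch: `pair_updates` is a hypothetical name, not part of the `act` package, and it assumes the tensor holds one count per input token, in order, as the shape of the output above suggests. The token labels are placeholders so that the snippet runs without downloading the model.

```python
def pair_updates(tokens, updates):
    """Pair each token with its number of layer updates (hypothetical helper)."""
    counts = [int(u) for u in updates]
    if len(tokens) != len(counts):
        raise ValueError("expected one update count per token")
    return list(zip(tokens, counts))


# Update counts copied from the example output above.
updates = [15., 9., 10., 7., 3., 8., 5., 7., 12., 10., 6., 8., 8., 9., 5., 8.]

# Placeholder labels. With the real model, one would instead use
# tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
# and outputs.updates.flatten().tolist().
tokens = [f"tok{i}" for i in range(len(updates))]

pairs = pair_updates(tokens, updates)
mean_depth = sum(count for _, count in pairs) / len(pairs)
print(pairs)
print(mean_depth)
```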

Citations

BibTeX entry and citation info

If you use our iterative transformer model for your scientific publication or your industrial applications, please cite the following paper:

@inproceedings{simoulin-crabbe-2021-many,
    title = "How Many Layers and Why? {A}n Analysis of the Model Depth in Transformers",
    author = "Simoulin, Antoine  and
      Crabb{\'e}, Benoit",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-srw.23",
    doi = "10.18653/v1/2021.acl-srw.23",
    pages = "221--228",
}

References

Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983.