language: en
license: cc-by-4.0
tags:
  - question-answering
datasets:
  - squad_v2
metrics:
  - f1
  - exact
widget:
  - context: >-
      DeBERTa improves the BERT and RoBERTa models using disentangled attention
      and enhanced mask decoder. With those two improvements, DeBERTa
      outperforms RoBERTa on a majority of NLU tasks with 80GB training data. In
      DeBERTa V3, we further improved the efficiency of DeBERTa using
      ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing.
      Compared to DeBERTa, our V3 version significantly improves the model
      performance on downstream tasks. You can find more technical details about
      the new model from our paper. Please check the official repository for
      more implementation details and updates.
    example_title: DeBERTa v3 Q1
    text: How is DeBERTa version 3 different than previous ones?
  - context: >-
      DeBERTa improves the BERT and RoBERTa models using disentangled attention
      and enhanced mask decoder. With those two improvements, DeBERTa
      outperforms RoBERTa on a majority of NLU tasks with 80GB training data. In
      DeBERTa V3, we further improved the efficiency of DeBERTa using
      ELECTRA-Style pre-training with Gradient Disentangled Embedding Sharing.
      Compared to DeBERTa, our V3 version significantly improves the model
      performance on downstream tasks. You can find more technical details about
      the new model from our paper. Please check the official repository for
      more implementation details and updates.
    example_title: DeBERTa v3 Q2
    text: Where do I go to see new info about DeBERTa?
model-index:
  - name: DeBERTa v3 xsmall squad2
    results:
      - task:
          type: question-answering
          name: Question Answering
        dataset:
          name: SQuAD2.0
          type: question-answering
        metrics:
          - type: f1
            value: 81.5
            name: f1
          - type: exact
            value: 78.3
            name: exact
      - task:
          type: question-answering
          name: Question Answering
        dataset:
          name: squad_v2
          type: squad_v2
          config: squad_v2
          split: validation
        metrics:
          - type: exact_match
            value: 78.5341
            name: Exact Match
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZTk0ZGQ1YjU1YmQ5NTc2M2RmNjg2OGViYjcyODZkOTc1MDBkNmI5MDc0MzEyMzZmNDg3Yzc4ZTA3ZjAwM2M5ZiIsInZlcnNpb24iOjF9.ewKF-UetUoxKDeXgnM6vqy8nBC9c3qh7dLZhdQlgSxPut3LjAhpCh2fJGir-OVcfzWzxsPhcZQEpdnxR8oZnAA
          - type: f1
            value: 81.6408
            name: F1
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOTQwZDdjY2ZlOGVhM2E5NGM3OGNkNTk2NWFkYTg1Y2Q0YWFlYWJmMGIyZWM5ZjMyYTYyODUzMDA0NWU0ZGVkZCIsInZlcnNpb24iOjF9.BHJNhS1YisUIkjcpIMdwXurTewak9dkkpGXC2vHvUB4qUEuk_p3V-orhmeFyTxzLaWRwrZVGVz-NSfqFr4n1Ag
          - type: total
            value: 11870
            name: total
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzNiZDQ3MDAyNzljMDI4NTRlYzZiZjE4ODJhZDhmZWE2ZjcwNjg2ZWJmNjUyMTUzZDk4ODNjNDExYTk1YWNlOCIsInZlcnNpb24iOjF9.3BlfmMvbV86Ua39ToqnMmgpGS0ZTew0UFFYWGyTkS3u7jaAXCfYkFkNJXw806f2uFFkKr1hqlzzKfivV0wUjCg
      - task:
          type: question-answering
          name: Question Answering
        dataset:
          name: squad
          type: squad
          config: plain_text
          split: validation
        metrics:
          - type: exact_match
            value: 84.1741
            name: Exact Match
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYTA0MDVlYWI5NzdiNjllM2NmZTYwYmQ5YzE0ODgwOTA3MWZjZDkxNDFmZDM1OTQzMzgwNWI4NDc5NThhM2VhZSIsInZlcnNpb24iOjF9.lc2nUBxSu2_0_a5lyVsV51UAmkE8WHDTwGHvt3n9zvCbcJ1ylOg2xovF0_j0hZS16lv1DEw5XV8EW_ZS7mfvBg
          - type: f1
            value: 91.0771
            name: F1
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODQxMjkxOWJlZTc2MmE5YzVmMjNhOTkwNDdiMDBhNWUwMDU3MDI1MmJiNDY4MjczYjIwM2U1NDhlYmZlZWQwMSIsInZlcnNpb24iOjF9.x_axHiBX5d3UIi1UbJT3kVbdX4kX9XFLQSg-l16-AAK9tiyutT-yaYJOi8LSb2lR4677tJpf3itu4eriJRU2Cg

DeBERTa v3 xsmall SQuAD 2.0

Microsoft reports that this model reaches 84.8/82.0 F1/EM on the SQuAD 2.0 dev set.

I got 81.5/78.3 F1/EM, but I only did one run and did not use the official SQuAD 2.0 evaluation script. I will do more runs and report results from the official script soon.
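
For readers unsure what the two numbers measure, here is a minimal Python sketch of SQuAD-style exact match (EM) and token-level F1 for a single prediction. It is written for illustration only and is not the official evaluation script. The function names are hypothetical, and the normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace) is assumed to follow the usual SQuAD convention.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Assumed SQuAD 2.0 convention: an unanswerable question has an empty
    # reference, and the prediction only scores if it is empty as well.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))  # 1.0
print(f1_score("in Paris France", "Paris"))              # 0.5
```

A full evaluation would take the best score over all reference answers for each question and average over the dataset, so small differences in normalization or in how unanswerable questions are handled can shift the final numbers.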