metadata
language: en
license: apache-2.0
datasets:
  - squad
metrics:
  - squad
model-index:
  - name: autoevaluate/distilbert-base-cased-distilled-squad
    results:
      - task:
          type: question-answering
          name: Question Answering
        dataset:
          name: autoevaluate/squad-sample
          type: autoevaluate/squad-sample
          config: autoevaluate--squad-sample
          split: test
        metrics:
          - type: f1
            value: 87.8248
            name: F1
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNTBiYzkyNGNmMWRhNDIwMDhhOTBlMzU2OTQyZTMxMzNiODM3YzA0Mjk3NzY0YTI0ZWNiZjRlMDIzM2EzY2E5OCIsInZlcnNpb24iOjF9.xEOZCY6PgAuoh1I9zox8ZNPv3uT-Nx1c9I1GCAYOtY0QJPMw47ph0f-dagKStf9tbLLtcE6XUz-72EZtK6mzAw
          - type: exact_match
            value: 84
            name: Exact Match
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYTdlODFkNDgyNDU5YjU1NDE0ZjVjYWU0Mzk5ZTA2ODRjYWRhMTg1MjFiNTg5ODA0NjE5OGVkZDc3ZjQ5N2E0NyIsInZlcnNpb24iOjF9.OPqpfiA6kRUzYEwqiYz-WVUMSBSlNn0T9v4uJUUKeQ4k0L7SKAfkqbR0LAg9xm6rh0-KadiUznFn3zDH2XUpDQ
          - type: loss
            value: 0.9980762600898743
            name: loss
            verified: true
            verifyToken: >-
              eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiM2Q0NDVkN2JlNmRmMmU1YzZhYWJiMDJhYmE0ZTUxOTA3MGEzMGMwYjI4M2EzNmYwY2FiNzMwOGIyYmUxYTIxZCIsInZlcnNpb24iOjF9.MQXuMC8iVbhMxSjgdMMOa9f1a_0ej7PoH2F_JbsOcYrfpt0c2q0KZmlasIaQFkjV-To62NhLts2oLWKgRQpwCw

DistilBERT base cased distilled SQuAD

Note: This model is a clone of distilbert-base-cased-distilled-squad for internal testing.

This model is a fine-tuned checkpoint of DistilBERT-base-cased, trained with (a second step of) knowledge distillation on SQuAD v1.1. It reaches an F1 score of 87.1 on the dev set (for comparison, the BERT bert-base-cased version reaches an F1 score of 88.7).
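For readers unfamiliar with the SQuAD metrics, the following is a minimal sketch of how exact match and F1 are conventionally computed for a single prediction. It follows the usual SQuAD v1.1 normalization (lowercase, strip punctuation, drop articles, collapse whitespace). The function names are illustrative and are not taken from this model card; reported scores are these per-example values averaged over the dataset and multiplied by 100.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_score(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def f1_score(prediction, references):
    """Best token-level F1 between the prediction and any reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize_answer(ref).split()
        # Multiset intersection counts tokens shared by both answers.
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

A prediction such as "in Paris France" against the reference "Paris" scores 0 on exact match but 0.5 on F1, which is why F1 is normally the higher of the two numbers.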

Running the question-answering evaluator from the evaluate library gives:

{'exact_match': 79.54588457899716,
 'f1': 86.81181300991533,
 'latency_in_seconds': 0.008683730778997168,
 'samples_per_second': 115.15787689073015,
 'total_time_in_seconds': 91.78703433400005}

which is roughly consistent with the official F1 score of 87.1.
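A run like the one above can be sketched as follows. This is an assumption-laden example, not the exact command used for this card: the dataset name `squad`, the `validation` split, and the helper names `run_qa_evaluation` and `implied_sample_count` are all chosen for illustration. The second helper only does arithmetic on the reported timing figures to estimate how many examples were evaluated.

```python
# Figures copied from the evaluator output reported above.
REPORTED = {
    "exact_match": 79.54588457899716,
    "f1": 86.81181300991533,
    "latency_in_seconds": 0.008683730778997168,
    "samples_per_second": 115.15787689073015,
    "total_time_in_seconds": 91.78703433400005,
}


def run_qa_evaluation(
    model_id="autoevaluate/distilbert-base-cased-distilled-squad",
    dataset_name="squad",   # assumption: the card does not name the dataset used
    split="validation",     # assumption: the card only says "dev set"
):
    """Evaluate a question-answering model with the `evaluate` library."""
    # Imported lazily so the arithmetic helper below works without these packages.
    from datasets import load_dataset
    from evaluate import evaluator

    data = load_dataset(dataset_name, split=split)
    qa_evaluator = evaluator("question-answering")
    return qa_evaluator.compute(
        model_or_pipeline=model_id,
        data=data,
        metric="squad",
    )


def implied_sample_count(results):
    """Estimate the number of evaluated examples as throughput times total time."""
    return round(results["samples_per_second"] * results["total_time_in_seconds"])
```

Calling `implied_sample_count(REPORTED)` gives 10570, which matches the size of the SQuAD v1.1 validation set, so the reported run most likely covered that full split. Calling `run_qa_evaluation()` downloads the model and dataset, and its timing fields will vary with hardware.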