bconsolvo committed 2c74822 (1 parent: 8f10f2b)

Update README.md

Files changed (1): README.md (+104 −5)
---
tags:
- question-answering
- bert
language: en
license: apache-2.0
model-index:
- name: bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa
  results:
  - task:
      type: question-answering
      name: question-answering
    metrics:
    - type: exact_match
      value: 81.2867
    - type: f1
      value: 88.4735
---
## Model Details: 80% 1x4 Block Sparse BERT-Base (uncased) Fine-Tuned on SQuADv1.1
This model has been fine-tuned for the NLP task of question answering on the SQuADv1.1 dataset. It was produced by fine-tuning a Prune OFA 80% 1x4 block sparse pre-trained BERT-Base model, combined with knowledge distillation. From the paper:
> We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8bit we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss.

| Model Detail | Description |
| ----------- | ----------- |
| Model Authors - Company | Intel |
| Model Card Authors | Intel |
| Date | February 27, 2022 |
| Version | 1 |
| Type | NLP - Question Answering |
| Architecture | "The method consists of two steps, teacher preparation and student pruning. The sparse pre-trained model we trained is the model we use for transfer learning while maintaining its sparsity pattern. We call the method Prune Once for All since we show how to fine-tune the sparse pre-trained models for several language tasks while we prune the pre-trained model only once." [(Zafrir et al., 2021)](https://arxiv.org/abs/2111.05754) |
| Paper or Other Resources | [Paper: Zafrir et al. (2021)](https://arxiv.org/abs/2111.05754); [GitHub Repo](https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all) |
| License | Apache 2.0 |
| Questions or Comments | [Community Tab](https://huggingface.co/Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa/discussions) and [Intel Developers Discord](https://discord.gg/rv2Gp55UJQ) |

Visualization of the Prune Once for All method from [Zafrir et al. (2021)](https://arxiv.org/abs/2111.05754); more details can be found in their paper. A simplified illustration of the 1x4 block-sparsity pattern appears after the figure.
![Zafrir2021_Fig1.png](https://s3.amazonaws.com/moonup/production/uploads/6297f0e30bd2f58c647abb1d/nSDP62H9NHC1FA0C429Xo.png)
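
To make the "80% 1x4 block sparse" structure concrete, here is a simplified, illustrative sketch of 1x4 block magnitude pruning. This is not the Prune OFA training procedure (which prunes gradually during pre-training together with distillation); the `block_prune_1x4` function and the toy tensor below are assumptions made only for illustration.

```python
import torch

def block_prune_1x4(weight: torch.Tensor, sparsity: float = 0.8) -> torch.Tensor:
    """Zero out the `sparsity` fraction of contiguous 1x4 blocks (along the
    input dimension) with the smallest L1 norm. Illustrative only."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "input dimension must be divisible by the block size"
    blocks = weight.reshape(rows, cols // 4, 4)      # group columns into 1x4 blocks
    scores = blocks.abs().sum(dim=-1)                # L1 norm of each block
    k = int(sparsity * scores.numel())               # number of blocks to drop
    threshold = scores.flatten().kthvalue(k).values  # k-th smallest block score
    mask = (scores > threshold).unsqueeze(-1).to(weight.dtype)
    return (blocks * mask).reshape(rows, cols)

# A BERT-Base-sized projection matrix: roughly 80% of its 1x4 blocks are zeroed.
w = torch.randn(768, 768)
w_sparse = block_prune_1x4(w, sparsity=0.8)
print(1.0 - w_sparse.count_nonzero().item() / w_sparse.numel())  # ~0.8
```

Removing whole 1x4 blocks rather than individual weights keeps the zeros in a regular layout, which is generally easier to accelerate than unstructured sparsity.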
| Intended Use | Description |
| ----------- | ----------- |
| Primary intended uses | You can use the model for the NLP task of question answering: given a passage of text, you can ask it a question about that text, and it will find the answer in the passage. |
| Primary intended users | Anyone doing question answering |
| Out-of-scope uses | The model should not be used to intentionally create hostile or alienating environments for people. |

### How to use

Here is how to load this model in Python and lock its sparsity pattern for further training:

```python
import transformers
import model_compression_research as model_comp

# Load the sparse, fine-tuned question-answering model
model = transformers.AutoModelForQuestionAnswering.from_pretrained('Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa')

# Attach a pruning scheduler that keeps the existing sparsity pattern fixed during training
scheduler = model_comp.pruning_scheduler_factory(model, '../../examples/transformers/question-answering/config/lock_config.json')

# Train your model...

# Strip the pruning hooks once training is done
scheduler.remove_pruning()
```
For more code examples, refer to the [GitHub Repo](https://github.com/IntelLabs/Model-Compression-Research-Package/tree/main/research/prune-once-for-all).
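
For inference only, the checkpoint can also be loaded like any other Hugging Face question-answering model, without the model-compression package. A minimal sketch (the question and context strings are just illustrative):

```python
from transformers import pipeline

# Build an extractive question-answering pipeline from this checkpoint
qa = pipeline(
    "question-answering",
    model="Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa",
    tokenizer="Intel/bert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa",
)

result = qa(
    question="What does Prune Once for All combine?",
    context="Prune Once for All integrates weight pruning and model distillation "
            "to train sparse pre-trained Transformer language models.",
)
print(result["answer"], result["score"])
```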

| Factors | Description |
| ----------- | ----------- |
| Groups | The training data contains many Wikipedia articles with question-and-answer labels. |
| Instrumentation | - |
| Environment | - |
| Card Prompts | - |

## Metrics (Model Performance)
| Model | Model Size | SQuADv1.1 (EM/F1) | MNLI-m (Acc) | MNLI-mm (Acc) | QQP (Acc/F1) | QNLI (Acc) | SST-2 (Acc) |
|-------------------------------|:----------:|:-----------------:|:------------:|:-------------:|:------------:|:----------:|:-----------:|
| [85% Sparse BERT-Base uncased](https://huggingface.co/Intel/bert-base-uncased-sparse-85-unstructured-pruneofa) | Medium | 81.10/88.42 | 82.71 | 83.67 | 91.15/88.00 | 90.34 | 91.46 |
| [90% Sparse BERT-Base uncased](https://huggingface.co/Intel/bert-base-uncased-sparse-90-unstructured-pruneofa) | Medium | 79.83/87.25 | 81.45 | 82.43 | 90.93/87.72 | 89.07 | 90.88 |
| [90% Sparse BERT-Large uncased](https://huggingface.co/Intel/bert-large-uncased-sparse-90-unstructured-pruneofa) | Large | 83.35/90.20 | 83.74 | 84.20 | 91.48/88.43 | 91.39 | 92.95 |
| [85% Sparse DistilBERT uncased](https://huggingface.co/Intel/distilbert-base-uncased-sparse-85-unstructured-pruneofa) | Small | 78.10/85.82 | 81.35 | 82.03 | 90.29/86.97 | 88.31 | 90.60 |
| [90% Sparse DistilBERT uncased](https://huggingface.co/Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa) | Small | 76.91/84.82 | 80.68 | 81.47 | 90.05/86.67 | 87.66 | 90.02 |

All results are the mean of two separate experiments with the same hyper-parameters and different seeds. This model itself achieves an exact match of 81.2867 and an F1 of 88.4735 on the SQuADv1.1 development set.
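
One way to reproduce SQuAD-style EM/F1 numbers is the `squad` metric from the `evaluate` library; a small sketch in which the example ID, prediction, and gold answers are purely illustrative:

```python
import evaluate

squad_metric = evaluate.load("squad")

# Toy example: one prediction compared against its gold answers
predictions = [{"id": "qid-0001", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "qid-0001",
    "answers": {"text": ["Denver Broncos", "The Broncos"], "answer_start": [177, 177]},
}]

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}
```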

| Training and Evaluation Data | Description |
| ----------- | ----------- |
| Datasets | SQuAD1.1: "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable." (https://huggingface.co/datasets/squad) A brief loading sketch follows this table. |
| Motivation | To build an efficient and accurate model for the question answering task. |
| Preprocessing | "We use the English Wikipedia dataset (2500M words) for training the models on the pre-training task. We split the data into train (95%) and validation (5%) sets. Both sets are preprocessed as described in the models’ original papers ([Devlin et al., 2019](https://arxiv.org/abs/1810.04805), [Sanh et al., 2019](https://arxiv.org/abs/1910.01108)). We process the data to use the maximum sequence length allowed by the models, however, we allow shorter sequences at a probability of 0.1." |
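
If you want to inspect the fine-tuning data, the SQuAD1.1 splits can be pulled directly from the Hugging Face Hub; a minimal sketch:

```python
from datasets import load_dataset

# Download SQuAD1.1 (train and validation splits) and peek at one example
squad = load_dataset("squad")
print(squad)                             # split names and sizes
print(squad["train"][0]["question"])
print(squad["train"][0]["answers"])
```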

| Ethical Considerations | Description |
| ----------- | ----------- |
| Data | The training data come from Wikipedia articles. |
| Human life | The model is not intended to inform decisions central to human life or flourishing. It is an aggregated set of labelled Wikipedia articles. |
| Mitigations | No additional risk mitigation strategies were considered during model development. |
| Risks and harms | Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al., 2021](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al., 2021](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups. Beyond this, the extent of the risks involved in using the model remains unknown. |
| Use cases | - |

| Caveats and Recommendations |
| ----------- |
| Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. There are no additional caveats or recommendations for this model. |

### BibTeX entry and citation info
```bibtex
@article{zafrir2021prune,
  title={Prune Once for All: Sparse Pre-Trained Language Models},
  author={Zafrir, Ofir and Larey, Ariel and Boudoukh, Guy and Shen, Haihao and Wasserblat, Moshe},
  journal={arXiv preprint arXiv:2111.05754},
  year={2021}
}
```