	---
	language: en
	thumbnail:
	license: mit
	tags:
	- question-answering
	- bert
	- bert-base
	datasets:
	- squad
	metrics:
	- squad
	widget:
	- text: "Where is the Eiffel Tower located?"
	  context: "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower."
	- text: "Who is Frederic Chopin?"
	  context: "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano."
	---

	## BERT-base uncased model fine-tuned on SQuAD v1

	This model was created using the [nn_pruning](https://github.com/huggingface/nn_pruning) Python library: the linear layers contain 27.0% of the original weights.

	This model CANNOT be used without the nn_pruning `optimize_model` function, because it uses NoNorms instead of LayerNorms, which the Transformers library does not currently support.


	It uses ReLUs instead of the GeLUs of the original BERT network, to speed up inference.
	This needs no special handling: the Transformers library supports it, and the model config flags it with the `"hidden_act": "relu"` entry.


	The model contains 43.0% of the original weights overall (the embeddings account for a significant part of the model, and they are not pruned by this method).
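The relation between the 27.0% and 43.0% figures can be sanity-checked with a rough parameter count. The sketch below assumes the standard BERT-base dimensions (12 layers, hidden size 768, intermediate size 3072, vocabulary of 30522 tokens) and ignores biases, normalization parameters and the QA head, so it is only an estimate, not the exact accounting used by nn_pruning.

```python
# Standard BERT-base dimensions (assumed here, not read from this checkpoint).
HIDDEN, INTERMEDIATE, LAYERS = 768, 3072, 12
VOCAB, MAX_POSITIONS, TOKEN_TYPES = 30522, 512, 2


def linear_weights():
    """Weights in the encoder's linear layers (the part that gets pruned)."""
    attention = 4 * HIDDEN * HIDDEN           # query, key, value, output
    feed_forward = 2 * HIDDEN * INTERMEDIATE  # up and down projections
    return LAYERS * (attention + feed_forward)


def embedding_weights():
    """Weights in the embedding tables (left untouched by the pruning)."""
    return (VOCAB + MAX_POSITIONS + TOKEN_TYPES) * HIDDEN


def estimated_overall_density(linear_density):
    """Fraction of all weights kept when only the linear layers are pruned."""
    kept = linear_density * linear_weights() + embedding_weights()
    total = linear_weights() + embedding_weights()
    return kept / total


if __name__ == "__main__":
    # With 27% of the linear weights kept, the estimate lands close to 0.43.
    print(f"{estimated_overall_density(0.27):.3f}")
```

Under these assumptions the embeddings account for a bit more than a fifth of the dense model, which is why the overall density stays well above the density of the linear layers.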

	With a simple resizing of the linear matrices, it ran 1.96x as fast as BERT-base on the evaluation.
	This is possible because the pruning method leads to structured matrices: to visualize them, hover over the plot below to see the non-zero/zero parts of each matrix.

	<div class="graph"><script src="/madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1/raw/main/model_card/density_info.js" id="a069faa9-ad3b-4c4d-b3eb-aac9a32aa6dc"></script></div>
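To illustrate why structured sparsity allows a plain resizing, here is a minimal pure-Python sketch of packing two consecutive linear layers with a ReLU in between. It is not the nn_pruning implementation: the helper names are hypothetical, biases are omitted, and the tiny matrices are made up for the example.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(w1, w2, x):
    """Two bias-free linear layers with a ReLU in between."""
    return matvec(w2, relu(matvec(w1, x)))


def pack(w1, w2):
    """Drop hidden units that cannot contribute to the output.

    A hidden unit is useless if its row in w1 is all zeros (it always
    outputs 0) or if its column in w2 is all zeros (its output is ignored).
    """
    keep = [
        i for i, row in enumerate(w1)
        if any(v != 0.0 for v in row) and any(out[i] != 0.0 for out in w2)
    ]
    small_w1 = [w1[i] for i in keep]
    small_w2 = [[out[i] for i in keep] for out in w2]
    return small_w1, small_w2


if __name__ == "__main__":
    # 2 inputs -> 4 hidden units -> 2 outputs; units 1 and 3 are prunable.
    w1 = [[1.0, -2.0], [0.0, 0.0], [0.5, 1.0], [3.0, 1.0]]
    w2 = [[1.0, 5.0, -1.0, 0.0], [2.0, 7.0, 0.5, 0.0]]
    x = [1.0, 2.0]
    small_w1, small_w2 = pack(w1, w2)
    print(len(w1), "->", len(small_w1), "hidden units")
    print(feed_forward(w1, w2, x) == feed_forward(small_w1, small_w2, x))
```

The packed layers are ordinary dense matrices, only smaller, so they run on standard kernels without any sparse-matrix support.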

	In terms of accuracy, its F1 is 88.33, compared with 88.5 for BERT-base, an F1 drop of 0.17.

	## Fine-Pruning details
	This model was fine-tuned from the HuggingFace [BERT](https://www.aclweb.org/anthology/N19-1423/) base uncased checkpoint on [SQuAD1.1](https://rajpurkar.github.io/SQuAD-explorer), and distilled from the model [bert-large-uncased-whole-word-masking-finetuned-squad](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad).
	This model is case-insensitive: it does not distinguish between english and English.

	A side effect of the block pruning is that some of the attention heads are completely removed: 55 heads out of a total of 144 (38.2%) were removed.
	Here is a detailed view of how the remaining heads are distributed in the network after pruning.
	<div class="graph"><script src="/madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1/raw/main/model_card/pruning_info.js" id="2df293b0-bd3d-4a5c-8f0c-a35f8fdaef04"></script></div>
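The head counts quoted above follow from BERT-base having 12 layers of 12 heads each. The sketch below summarizes a layer-to-heads mapping in the style of the `pruned_heads` entry that Transformers configs can carry. Whether this checkpoint records its removed heads that way is an assumption, and the mapping in the example is purely illustrative, not the real distribution.

```python
NUM_LAYERS = 12       # BERT-base encoder layers
HEADS_PER_LAYER = 12  # attention heads per layer


def head_summary(pruned_heads):
    """Return (total heads, removed heads, remaining heads per layer).

    `pruned_heads` maps a layer index to the list of head indices removed
    from that layer; layers missing from the mapping keep all their heads.
    """
    total = NUM_LAYERS * HEADS_PER_LAYER
    removed = sum(len(heads) for heads in pruned_heads.values())
    remaining = [
        HEADS_PER_LAYER - len(pruned_heads.get(layer, []))
        for layer in range(NUM_LAYERS)
    ]
    return total, removed, remaining


if __name__ == "__main__":
    # Illustrative mapping only: NOT the actual heads removed from this model.
    example = {0: [1, 4], 11: [0, 2, 3, 7]}
    total, removed, remaining = head_summary(example)
    print(total, removed, remaining)
    # Ratio reported in this card: 55 heads removed out of 144.
    print(f"{55 / total:.1%}")
```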

	## Details of the SQuAD1.1 dataset

	| Dataset  | Split | # samples |
	| -------- | ----- | --------- |
	| SQuAD1.1 | train | 90.6K     |
	| SQuAD1.1 | eval  | 11.1K     |

	### Fine-tuning
	- Python: `3.8.5`

	- Machine specs:

	```
	CPU: Intel(R) Core(TM) i7-6700K CPU
	Memory: 64 GiB
	GPUs: 1 GeForce RTX 3090, with 24GiB memory
	GPU driver: 455.23.05, CUDA: 11.1
	```

	### Results

	PyTorch model file size: `374M` (original BERT: `438M`)

	| Metric | Value | Original ([Table 2](https://www.aclweb.org/anthology/N19-1423.pdf)) | Variation |
	| ------ | ----- | -------- | --------- |
	| EM     | 81.31 | 80.8     | +0.51     |
	| F1     | 88.33 | 88.5     | -0.17     |
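For readers unfamiliar with the two metrics, the following is a simplified sketch of how SQuAD-style exact match (EM) and token-level F1 are usually computed for a single prediction against a single reference answer. It follows the commonly used normalization (lowercasing, stripping punctuation and English articles). The official evaluation additionally takes the best score over all reference answers for a question and averages over the dataset, which is left out here.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower!"))
    print(f1_score("in Paris, France", "Paris"))
```

EM rewards only answers that match exactly after normalization, while F1 gives partial credit for overlapping tokens, which is why F1 is the higher of the two scores in the table.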

	## Example Usage
	Install nn_pruning: it contains the optimization script, which simply packs the linear layers into smaller ones by removing empty rows/columns.

	`pip install git+https://github.com/huggingface/nn_pruning`

	Then you can use the `transformers` library almost as usual: you just have to call `optimize_model` once the pipeline has loaded.

	```python
	from transformers import pipeline
	from nn_pruning.inference_model_patcher import optimize_model

	qa_pipeline = pipeline(
	    "question-answering",
	    model="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1",
	    tokenizer="madlag/bert-base-uncased-squadv1-x1.96-f88.3-d27-hybrid-filled-opt-v1"
	)

	print("BERT-base parameters: 110M")
	print(f"Parameters count (includes head pruning)={int(qa_pipeline.model.num_parameters() / 1E6)}M")
	qa_pipeline.model = optimize_model(qa_pipeline.model, "dense")

	print(f"Parameters count after optimization={int(qa_pipeline.model.num_parameters() / 1E6)}M")
	predictions = qa_pipeline({
	    'context': "Frédéric François Chopin, born Fryderyk Franciszek Chopin (1 March 1810 – 17 October 1849), was a Polish composer and virtuoso pianist of the Romantic era who wrote primarily for solo piano.",
	    'question': "Who is Frederic Chopin?",
	})
	print("Predictions", predictions)
	```