bert-large-uncased-wwm-squadv2-optimized-f16

This is an optimized model that uses madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1 as its base. That base model was created with the nn_pruning Python library and is a pruned version of madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2.

Feel free to read our blog post about how we optimized this model (link).

Our final optimized model weighs 579 MB, has an inference latency of 18.184 ms on a Tesla T4, and reaches a best F1 score of 82.68%. Below is a comparison against each base model:

| Model | Weight | Latency on Tesla T4 | Best F1 |
|---|---|---|---|
| madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2 | 1275 MB | 140.529 ms | 86.08% |
| madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1 | 1085 MB | 90.801 ms | 82.67% |
| Our optimized model | 579 MB | 18.184 ms | 82.68% |
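
For context on how latency numbers like these can be measured, here is a minimal timing sketch. This is not the exact benchmarking setup behind the table: the warm-up and run counts are illustrative choices of ours, and you would swap CPUExecutionProvider for CUDAExecutionProvider (with onnxruntime-gpu installed) to reproduce GPU figures such as the Tesla T4 ones.

import time
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

REPO_ID = "tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16"

model_path = hf_hub_download(repo_id=REPO_ID, filename="model.onnx")
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

# Use CUDAExecutionProvider here instead to time on a GPU
sess = InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Fixed-size dummy input so every run measures the same workload
inputs = dict(
    tokenizer("a question", "a context", return_tensors="np",
              padding="max_length", max_length=512, truncation=True)
)

# Warm-up runs so one-time initialization cost is not measured
for _ in range(5):
    sess.run(None, input_feed=inputs)

# Average wall-clock latency over repeated runs
n_runs = 50
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, input_feed=inputs)
print(f"Mean latency: {(time.perf_counter() - start) / n_runs * 1000:.3f} ms")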

You can test inference with these models in the tryolabs/transformers-optimization Space.

Example Usage

import torch
from huggingface_hub import hf_hub_download
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

MAX_SEQUENCE_LENGTH = 512

# Download the ONNX model (hf_hub_download returns a local file path)
model = hf_hub_download(
    repo_id="tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16", filename="model.onnx"
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16")

question = "Who worked a little bit harder?"
context = "The first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day."

# Tokenize the question and context into NumPy inputs for ONNX Runtime
inputs = dict(
    tokenizer(
        question, context, return_tensors="np",
        max_length=MAX_SEQUENCE_LENGTH, truncation=True
    )
)

# Create the ONNX Runtime inference session
sess = InferenceSession(model, providers=["CPUExecutionProvider"])

# Run predictions
output = sess.run(None, input_feed=inputs)

# Wrap the start/end logits in tensors for post-processing with torch
answer_start_scores = torch.tensor(output[0])
answer_end_scores = torch.tensor(output[1])

# Post process predictions
input_ids = inputs["input_ids"].tolist()[0]
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
)

# Output the prediction
print("Answer:", answer)