DistilBERT-Base-Uncased for Duplicate Question Detection

This model is a fine-tuned version of distilbert-base-uncased, originally released in "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". It was trained on the Quora Question Pairs (QQP) dataset, which is part of the General Language Understanding Evaluation (GLUE) benchmark. The model was fine-tuned by the team at AssemblyAI and is released with the corresponding blog post.

Usage

To download the model and use it for duplicate question detection, run the following:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier.
tokenizer = AutoTokenizer.from_pretrained("assemblyai/distilbert-base-uncased-qqp")
model = AutoModelForSequenceClassification.from_pretrained("assemblyai/distilbert-base-uncased-qqp")

# Encode the two questions together as one paired input.
tokenized_segments = tokenizer(["How many hours does it take to fly from California to New York?"], ["What is the flight time from New York to Seattle?"], return_tensors="pt", padding=True, truncation=True)

# Run inference without tracking gradients, then convert the logits to probabilities.
with torch.no_grad():
    logits = model(input_ids=tokenized_segments.input_ids, attention_mask=tokenized_segments.attention_mask).logits
model_predictions = F.softmax(logits, dim=1)

# Index 1 is the duplicate class; index 0 is the non-duplicate class.
print(f"Duplicate probability: {model_predictions[0][1].item() * 100}%")
print(f"Non-duplicate probability: {model_predictions[0][0].item() * 100}%")
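The snippet above prints both class probabilities. If you need a yes/no decision instead, one option is to threshold the duplicate probability. The following is a minimal sketch in plain Python, not part of the released model: the helper names `duplicate_probability` and `is_duplicate`, the 0.5 default threshold, and the example logits are all illustrative assumptions. It only relies on the class order used above (index 0 is non-duplicate, index 1 is duplicate).

```python
import math


def duplicate_probability(logits):
    """Return P(duplicate) from a [non_duplicate_logit, duplicate_logit] pair.

    Hypothetical helper: applies a numerically stable softmax over the two
    logits and returns the probability at index 1 (the duplicate class).
    """
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    return exps[1] / sum(exps)


def is_duplicate(logits, threshold=0.5):
    """Flag a pair as duplicate when P(duplicate) reaches the threshold.

    The 0.5 default is an assumption; tune it on held-out data if false
    positives and false negatives have different costs for your use case.
    """
    return duplicate_probability(logits) >= threshold


# Made-up logits for illustration only; real values come from the model output.
example_logits = [-1.2, 2.3]
probability = duplicate_probability(example_logits)
print(f"Duplicate probability: {probability:.3f}, flagged: {is_duplicate(example_logits)}")
```

With the model loaded as in the usage snippet, the logits for the first pair could be passed in as a Python list, for example via `logits[0].tolist()`.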

For questions about how to use this model, feel free to contact the team at AssemblyAI!
