# Question Difficulty Classification Model

## Introduction

This project classifies question-answer pairs by difficulty as Easy, Medium, or Hard. You can pass either a single question-answer pair (with the question and answer separated by a comma) or a list of such pairs to the model.

I fine-tuned the pre-trained [bert-base-cased](https://huggingface.co/bert-base-cased) model on the [Question-Answer Dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) by [Carnegie Mellon University](https://www.cmu.edu/) for this task.

## Table of Contents

- [Model Details](#model-details)
- [Dependencies](#dependencies)
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)

## Model Details

**Model Description:** This model is a fine-tuned checkpoint of [bert-base-cased](https://huggingface.co/bert-base-cased), which was pretrained on a large corpus of English data in a self-supervised fashion.

This model reaches an accuracy of 95% on the dev set (for comparison, the bert-base-uncased version reaches an accuracy of 97%).

- **Developed by:** Hugging Face
- **Model Type:** Text Classification
- **Language(s):** English
- **License:** Apache-2.0
- **Parent Model:** For more details about BERT, we encourage users to check out [this model card](https://huggingface.co/bert-base-cased).
- **Resources for more information:**
  - [Model Documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert)

## Dependencies

The following packages are required (a quick version check is sketched below):

- Python 3.7.13
- Transformers
- TensorFlow
- NumPy

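If you want to confirm that your environment matches these dependencies, this minimal check prints the relevant versions (a convenience sketch only; the exact versions required depend on how the model checkpoint was saved):

```python
import sys

import numpy
import tensorflow as tf
import transformers

# Print the interpreter and library versions available in this environment
print("Python:", sys.version.split()[0])          # expected: 3.7.x
print("Transformers:", transformers.__version__)
print("TensorFlow:", tf.__version__)
print("NumPy:", numpy.__version__)
```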

## How to Get Started With the Model

1. Import Essential Libraries

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
```

2. Load the Model and Tokenizer

```python
questionclassification_model = tf.keras.models.load_model(<path to the model>)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```
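
Depending on how the checkpoint was saved, Keras may need the `TFBertModel` layer registered explicitly when loading (this is why it is imported in step 1). The variant below is only a fallback sketch for the case where `load_model` raises an unknown-layer error:

```python
# Fallback sketch: register TFBertModel as a custom object if the plain load_model() call fails
questionclassification_model = tf.keras.models.load_model(
    <path to the model>,
    custom_objects={'TFBertModel': TFBertModel}
)
```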

3. Essential Functions

```python
def prepare_data(input_text):
    # Tokenize a batch of "question,answer" strings into fixed-length tensors
    token = tokenizer.batch_encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    return {
        'input_ids': tf.cast(token['input_ids'], tf.float64),
        'attention_mask': tf.cast(token['attention_mask'], tf.float64)
    }


def make_prediction(model, processed_data, classes=['Easy', 'Medium', 'Hard']):
    # Predict class probabilities and map each highest-probability index to its label
    outcls = []
    probs = model.predict(processed_data)
    s = probs.argmax(axis=1)
    for i in range(len(probs)):
        outcls.append(classes[s[i]])
    return outcls, probs
```

4. Make predictions on a list of question-answer pairs

```python
input_text = ["What is Gandhi commonly considered to be?,Father of the nation in India",
              "What is the long-term warming of the planet's overall temperature called?,Global Warming"]
processed_data = prepare_data(input_text)
result, prob = make_prediction(questionclassification_model, processed_data=processed_data)
for i in range(len(result)):
    print(f"{result[i]} : {max(prob[i])}")
```
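
You can also score a single question-answer pair by wrapping it in a one-element list (a small usage sketch reusing the functions above; the example pair is made up):

```python
single_pair = ["Who wrote the play Hamlet?,William Shakespeare"]
processed_single = prepare_data(single_pair)
labels, probabilities = make_prediction(questionclassification_model, processed_data=processed_single)
print(f"{labels[0]} : {max(probabilities[0])}")
```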

## Risks, Limitations and Biases

- The model rarely predicts the Easy category.
- 90% of the Easy questions in the dataset are yes/no type questions.
- Very few public datasets are available for question difficulty classification.
- Only subject-matter experts can create a reliable dataset for this task; otherwise, the model will generate wrong results.

## Training

#### Training Data

I used the [Question-Answer Dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) by [Carnegie Mellon University](https://www.cmu.edu/) for this task.

#### Training Procedure

###### Fine-tuning hyper-parameters

- learning_rate = 1e-5
- decay = 1e-6
- optimizer = Adam
- loss function = categorical cross-entropy
- max_length = 256
- num_train_epochs = 10
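
The snippet below is a minimal sketch of how these hyper-parameters could be wired into the Keras training setup; `train_features` and `train_labels` are placeholder names, and the `decay` argument assumes an older `tf.keras.optimizers.Adam` that still accepts it:

```python
import tensorflow as tf

# Assumed placeholders: `questionclassification_model` is the Keras classification model,
# `train_features` is a dict with 'input_ids' and 'attention_mask' (e.g. from prepare_data()),
# and `train_labels` holds one-hot encoded Easy/Medium/Hard labels.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)

questionclassification_model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=['accuracy']
)

questionclassification_model.fit(
    train_features,
    train_labels,
    epochs=10
)
```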
|