
Question Difficulty Classification Model

Introduction

This project classifies question-answer pairs by difficulty as Easy, Medium, or Hard. You can pass a single question-answer pair separated by a comma, or a list of such pairs, to the model. For this task I fine-tuned the pre-trained bert-base-cased model on the Question-Answer Dataset by Carnegie Mellon University.
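
For example, two pairs in the expected "question,answer" format can be passed as a Python list (the strings below are only illustrative):

input_text = ["Who wrote Hamlet?,William Shakespeare",
              "Is water wet?,Yes"]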

Table of Contents

  • Model Details
  • Dependencies
  • How to use the model
  • Risks, Limitations and Biases
  • Training

Model Details

Model Description: This model is a fine-tuned checkpoint of bert-base-cased, which was pretrained on a large corpus of English data in a self-supervised fashion. The fine-tuned model reaches an accuracy of 95% on the dev set (for comparison, the bert-base-uncased version reaches an accuracy of 97%).

  • Developed by: Hugging Face
  • Model Type: Text Classification
  • Language(s): English
  • License: Apache-2.0
  • Parent Model: For more details about BERT, we encourage users to check out this model card.
  • Resources for more information:

Dependencies

  • Transformers
  • Python 3.7.13
  • Numpy
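
If these are not already installed, a typical setup looks like the following (package names only; exact versions are not pinned here):

pip install transformers tensorflow numpy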

How to use the model

  1. Import Essential Libraries
from transformers import TFBertModel
from transformers import BertTokenizer
import tensorflow as tf
  2. Load the Model and Tokenizer
questionclassification_model = tf.keras.models.load_model(<path to the model>)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
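
Depending on how the model was saved, Keras may not recognize the TFBertModel layer on its own. If load_model fails, passing it through custom_objects usually helps (the path is still a placeholder):

questionclassification_model = tf.keras.models.load_model(
    '<path to the model>',
    custom_objects={'TFBertModel': TFBertModel}
)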
  3. Essential Functions
def prepare_data(input_text):
    # Tokenize a list of "question,answer" strings into fixed-length (256) tensors
    token = tokenizer.batch_encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    # Cast the token ids and attention mask to float64, as expected by the saved model
    return {
        'input_ids': tf.cast(token['input_ids'], tf.float64),
        'attention_mask': tf.cast(token['attention_mask'], tf.float64)
    }

def make_prediction(model, processed_data, classes=['Easy', 'Medium', 'Hard']):
    # Run the model and map each row's highest-probability index to its class label
    probs = model.predict(processed_data)
    preds = probs.argmax(axis=1)
    outcls = [classes[i] for i in preds]
    return outcls, probs
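
Before making predictions, prepare_data can be sanity-checked on a single pair; it returns tensors of shape (batch_size, 256). The string below is illustrative:

sample = prepare_data(["Who wrote Hamlet?,William Shakespeare"])
print(sample['input_ids'].shape)       # (1, 256)
print(sample['attention_mask'].shape)  # (1, 256)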

  4. Make predictions on a list of question-answer pairs

input_text = ["What is gandhi commonly considered to be?,Father of the nation in india","What is the long-term warming of the planets overall temperature called?, Global Warming"]
processed_data = prepare_data(input_text)
result,prob = make_prediction(questionclassification_model, processed_data=processed_data)
for i in range(len(result)):
    print(f"{result[i]} : {max(prob[i])}")

Risks, Limitations and Biases

  • The model rarely predicts questions as Easy.
  • About 90% of the Easy questions in the dataset are yes/no questions.
  • Very few public datasets exist for question difficulty classification.
  • Reliable labels for this task can only come from subject-matter experts; without them, the model learns from noisy labels and produces wrong results.

Training

Training Data

I used the Question-Answer Dataset by Carnegie Mellon University for this task.

Training Procedure

Fine-tuning hyper-parameters
  • learning_rate = 1e-5
  • decay = 1e-6
  • optimizer = adam
  • loss function = categorical cross entropy
  • max_length = 256
  • num_train_epochs = 10
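
The original training script is not part of this card; the sketch below only illustrates how a 3-way classification head on top of bert-base-cased could be compiled with the hyper-parameters listed above. The layer names, head size, and input dtypes are assumptions, not the exported model's exact architecture.

from transformers import TFBertModel
import tensorflow as tf

# Assumed architecture: BERT pooled output followed by a 3-way softmax head
bert = TFBertModel.from_pretrained('bert-base-cased')

input_ids = tf.keras.layers.Input(shape=(256,), dtype='int32', name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(256,), dtype='int32', name='attention_mask')

pooled = bert(input_ids, attention_mask=attention_mask)[1]  # pooled [CLS] representation
outputs = tf.keras.layers.Dense(3, activation='softmax', name='difficulty')(pooled)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)

# Hyper-parameters from the list above; newer TF releases may require
# tf.keras.optimizers.legacy.Adam for the `decay` argument.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# model.fit(train_features, train_labels, epochs=10)  # labels one-hot encoded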