
Mistral 7B MCQ Generator

Model Description

This model, Mistral 7B MCQ Generator, is a fine-tuned version of mistralai/Mistral-7B-v0.1 that generates multiple-choice questions (MCQs) together with their correct answers. Developed to aid educational content creation, it is well suited to educators, e-learning content creators, and students preparing for exams. The model was fine-tuned on a combination of medical MCQs and the RACE dataset, covering a diverse range of topics and question complexities.

Intended Use

This model is intended for educational purposes, particularly generating MCQs for studying, teaching, or content creation. It is designed to help prepare quizzes, tests, and other learning materials.

Training Data

The model was trained on a custom dataset derived from the ardneebwar/medmcqa-and-race dataset available on Hugging Face. This dataset combines medical MCQs and reading comprehension questions from various educational levels, cleaned and preprocessed for MCQ generation.
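To inspect the training data yourself, it can be loaded with the datasets library. A minimal sketch (the split name and column layout are whatever the dataset defines, so check them before relying on specific fields):

from datasets import load_dataset

# Load the combined MedMCQA + RACE dataset from the Hugging Face Hub.
dataset = load_dataset("ardneebwar/medmcqa-and-race")

print(dataset)              # lists the available splits and columns
print(dataset["train"][0])  # "train" split assumed; verify against the output above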

Training Procedure

Training was performed in an environment with enough GPU memory for the Mistral 7B model. Key settings were a learning rate of 2e-4, a batch size of 4, and 3 epochs, with evaluation every 700 steps. Gradient accumulation and LoRA were used to keep training memory-efficient.
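For reference, here is a minimal sketch of a configuration matching those settings, built with transformers and peft. The LoRA rank, alpha, dropout, target modules, and gradient-accumulation factor are assumptions, as the card does not state them:

from transformers import TrainingArguments
from peft import LoraConfig

# LoRA setup; rank, alpha, dropout, and target modules are assumed values.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Hyperparameters stated in the card: lr 2e-4, batch size 4, 3 epochs,
# evaluation every 700 steps. The accumulation factor is an assumption.
training_args = TrainingArguments(
    output_dir="mistral_7b_mcq_generator",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    eval_strategy="steps",   # called evaluation_strategy in older transformers releases
    eval_steps=700,
    gradient_accumulation_steps=4,
    bf16=True,               # matches the bfloat16 compute dtype used at inference
)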

How to Use

You can use this model directly in a Kaggle or Jupyter notebook as follows:

import re
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    logging,
    pipeline,
)
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "mistralai/Mistral-7B-v0.1"
adapters_name = "ardneebwar/mistral_7b_mcq_generator"

m = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place the 4-bit quantized weights on the available GPU
)

m = PeftModel.from_pretrained(m, adapters_name)
m = m.merge_and_unload()  # merge the LoRA adapters into the base weights
tok = AutoTokenizer.from_pretrained(model_name)

tok.bos_token_id = 1  # ensure BOS matches Mistral's <s> token
logging.set_verbosity(logging.CRITICAL)  # silence generation warnings

def extract_mcqs(generated_text):
    # Match segments in the model's MCQ output format:
    # "question: ... | options: ... | answer: X" where X is A, B, C, or D.
    pattern = re.compile(r"question: (.*?) \| options: (.*?) \| answer: ([ABCD])", re.DOTALL)
    # Find all matches in the generated text
    matches = pattern.findall(generated_text)
    unique_mcqs = set()  # Using a set to avoid duplicates
    mcqs = []

    for match in matches:
        question, options, answer = match
        # Construct the MCQ string
        mcq_text = f"question: {question.strip()} | options: {options.strip()} | answer: {answer.strip()}"
        
        # Check for uniqueness before adding
        if mcq_text not in unique_mcqs:
            unique_mcqs.add(mcq_text)
            mcqs.append(mcq_text)

    return mcqs

# Replace the context below with your own.
prompt = "context: The Robot Operating System (ROS) is a set of software libraries and tools that help you build robot applications."
pipe = pipeline(task="text-generation", model=m, tokenizer=tok, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
generated_text = result[0]['generated_text']
mcqs = extract_mcqs(generated_text)
print(mcqs)
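The model emits MCQs in a pipe-delimited format (question: ... | options: ... | answer: X), which is exactly what the regular expression in extract_mcqs matches. If generated questions come out truncated, raise max_length (or pass max_new_tokens instead) when constructing the pipeline.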

Limitations and Biases

The model's performance is subject to the quality and diversity of the training data. While it has been trained on a dataset that includes a range of topics, it may exhibit biases present in the training material. Users are advised to review the generated questions and answers for potential biases before use.

References and Acknowledgments

This model was built using resources from the Hugging Face and PyTorch communities. Special thanks to the authors and contributors of the ardneebwar/medmcqa-and-race dataset and the Mistral 7B model.

License

This model is open-sourced under the Apache 2.0 license.
