Model name: STEMerald-2b

Model description: STEMerald-2b is a fine-tuned version of the Gemma-2b model, designed specifically for answering university-level STEM multiple-choice questions. It was trained with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to improve the accuracy and reliability of its answers in educational settings.

Model Details

Base Model: Gemma-2b

Architecture: Decoder-only Language Model (Causal)

Parameters: 2.51 billion

Quantized Version: STEMerald-2b-4bit (with 4-bit NormalFloat)

Training Framework: PyTorch with Hugging Face Transformers


Training Data

The model was fine-tuned on a variety of datasets tailored for STEM education, including:

  • EPFL Preference Pairs Dataset: 1522 university-level STEM questions with 26k preference pairs, annotated by students using ChatGPT-3.5 with Chain-of-Thought (CoT).
  • Stack Exchange Dataset: Questions and answers from various topics such as math, computer science, and engineering.
  • Orca-Math: 200k grade-school math word problems to enhance reasoning capabilities.
  • EPFL MCQA Dataset: Dataset of multiple-choice questions with explanation (for CoT) extracted from the winning pairs of EPFL preference pairs.
  • ScienceQA: Multiple-choice questions on biology, physics, chemistry, economics, earth science, and engineering practices.
  • AI2 Reasoning Challenge (ARC): Grade-school level multiple-choice science questions.

Training Process

The training process for STEMerald-2b involved multiple steps:

  1. Supervised Fine-Tuning (SFT): Initial training on datasets like Orca-Math to improve reasoning abilities.
  2. Direct Preference Optimization (DPO): Training on preference pairs from EPFL and Stack Exchange datasets to align model outputs with preferred answers.
  3. MCQA Fine-Tuning: Specialization for multiple-choice question answering using datasets like ScienceQA and ARC.
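As a rough illustration of step 2, the standard DPO objective for a single preference pair fits in a few lines. This is a minimal sketch of the published DPO loss, not the training code used for STEMerald-2b. The helper name `dpo_loss`, the `beta` value, and all log-probabilities below are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full answer under
    the policy being trained or under the frozen reference model.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))


# Illustrative numbers only, not measurements from the model.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy equals reference
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy prefers the chosen answer more
```

When the policy and the reference agree, the loss equals log 2. It decreases as the policy assigns relatively more probability to the preferred answer than the reference does.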


Evaluation

The performance of STEMerald-2b was evaluated in two ways:

  • Accuracy: STEMerald-2b reached a micro-averaged accuracy of 0.750 (0.720 when quantized) over three MCQA test sets, compared with 0.518 for the Gemma-it baseline.
  • Qualitative Evaluation: The model's answers were evaluated for logical consistency, truthfulness, clarity, and coherence with the final answer.


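| Model Version                   | Accuracy (Non-Quantized) | Accuracy (Quantized) |
|---------------------------------|--------------------------|----------------------|
| it-ORCA-DPO-MCQA (STEMerald-2b) | 0.750                    | 0.720                |
| it-DPO-MCQA                     | 0.744                    | 0.720                |
| it-MCQA                         | 0.736                    | 0.700                |
| it-ORCA-MCQA                    | 0.722                    | 0.714                |
| MCQA                            | 0.702                    | 0.654                |
| DPO-MCQA                        | 0.694                    | 0.674                |
| Gemma-it-OneShot                | 0.546                    | 0.520                |
| Gemma-it                        | 0.518                    | 0.518                |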

Accuracy is micro-averaged over three MCQA test sets (EPFL MCQA, ScienceQA, and ARC).
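Micro-averaging pools all questions before dividing, so larger test sets weigh more than they would in a plain mean of per-set accuracies. The sketch below contrasts the two. The per-set counts are hypothetical, since the card does not report test-set sizes, and the function names are illustrative.

```python
def micro_accuracy(results):
    """results maps test-set name -> (num_correct, num_questions)."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total


def macro_accuracy(results):
    """Unweighted mean of the per-set accuracies, shown for contrast."""
    return sum(c / n for c, n in results.values()) / len(results)


# Hypothetical counts for illustration; not the real test-set sizes.
example = {
    "EPFL MCQA": (60, 100),
    "ScienceQA": (160, 200),
    "ARC": (140, 200),
}
micro = micro_accuracy(example)  # 360 correct out of 500 questions
macro = macro_accuracy(example)  # mean of 0.6, 0.8 and 0.7
```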

Use Cases

STEMerald-2b can be utilized as a STEM course assistant, providing support in areas such as:

  • Answering university-level multiple-choice STEM questions.
  • Offering detailed explanations and reasoning for answers.
  • Enhancing student engagement and learning efficiency during independent studies.

Ethical Considerations

While STEMerald-2b aims to provide accurate and helpful responses, it is important to consider potential ethical implications:

  • Over-Reliance: Students might become overly dependent on the model for answers, potentially affecting their independent learning and problem-solving skills.
  • Accuracy: Although efforts were made to ensure the truthfulness of responses, there is still a possibility of incorrect answers. Teacher supervision is crucial.


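Limitations

  • The model's performance may vary based on the specific context and nature of the questions.
  • Quantization reduces the memory footprint but lowers accuracy slightly: in the evaluation reported in this card, the 4-bit version of STEMerald-2b scored 0.720 versus 0.750 for the non-quantized model.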
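A back-of-the-envelope estimate shows the size of the memory saving. The sketch below assumes the non-quantized weights are stored in 16 bits, which the card does not state. It also ignores quantization constants, any layers kept in higher precision, activations, and the KV cache, so real usage will be higher than both figures.

```python
PARAMS = 2.51e9  # parameter count listed under Model Details


def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring all overheads."""
    return num_params * bits_per_param / 8 / 2**30


fp16_gib = weight_memory_gib(PARAMS, 16)  # assumed 16-bit storage
nf4_gib = weight_memory_gib(PARAMS, 4)    # 4-bit NormalFloat weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB, 4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions the weights shrink by a factor of four, from roughly 4.7 GiB to roughly 1.2 GiB.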


Conclusion

STEMerald-2b offers a promising solution for enhancing STEM education through advanced language model capabilities. By leveraging fine-tuning techniques and comprehensive datasets, it aims to provide accurate and accessible learning support for students.

How to Use

You can use the model directly with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("matsant01/STEMerald-2b")
model = AutoModelForCausalLM.from_pretrained("matsant01/STEMerald-2b")

input_text = "Question: What is the derivative of x^2? \nOptions: A. 4x B. 2*x^2 C. 2x D. 2\nAnswer:"
inputs = tokenizer(input_text, return_tensors="pt")
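# Set an explicit generation budget so the answer is not cut off by the default length limit.
outputs = model.generate(**inputs, max_new_tokens=64)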
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
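The prompt in the example above follows a fixed "Question / Options / Answer" layout. The two helpers below are a hypothetical convenience, not part of the model's API: one builds a prompt in that layout and the other pulls the first option letter out of the decoded output. The names `build_prompt` and `extract_choice` are assumptions, and the parsing rule is a simple heuristic.

```python
import re

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_prompt(question, options):
    """Format a question in the same layout as the example prompt."""
    labeled = " ".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))
    # The space before the first newline mirrors the example prompt.
    return f"Question: {question} \nOptions: {labeled}\nAnswer:"


def extract_choice(generated_text, prompt, num_options):
    """Return the first standalone option letter after the prompt, or None.

    Heuristic only: an answer that starts with the article "A" would be
    misread as option A.
    """
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    valid = LETTERS[:num_options]
    match = re.search(rf"\b([{valid}])\b", completion)
    return match.group(1) if match else None


prompt = build_prompt("What is the derivative of x^2?", ["4x", "2*x^2", "2x", "2"])
# Simulated model output; a real run would decode the result of model.generate.
choice = extract_choice(prompt + " C. 2x", prompt, num_options=4)
```

Passing `prompt` to the tokenizer in place of `input_text` reproduces the example, and `extract_choice` can then be applied to the decoded output.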

For the quantized version, use:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat, as listed under Model Details
)

tokenizer = AutoTokenizer.from_pretrained("matsant01/STEMerald-2b-4bit")
model = AutoModelForCausalLM.from_pretrained("matsant01/STEMerald-2b-4bit", quantization_config=quantization_config)


Acknowledgements

We acknowledge the contributions of the EPFL and Stack Exchange communities for their invaluable datasets, and the Hugging Face team for their support and tools that made this project possible.


For any questions or feedback, please contact:
