---
library_name: transformers
license: mit
language:
- en
metrics:
- rouge
base_model:
- microsoft/Phi-3-mini-4k-instruct
pipeline_tag: text-generation
---

# Model Card for **Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)**

A lightweight LoRA-adapter fine-tune of `microsoft/Phi-3-mini-4k-instruct` for **turning structured lab contexts and observations into executable Python code** that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). It was trained with QLoRA in 4-bit and is intended as an **assistive code generator** for STEM lab writeups and teaching demos, not as a certified calculator for safety-critical engineering.

---

## Model Details

### Model Description

- **Developed by:** Barghav777
- **Model type:** Causal decoder LM (instruction-tuned) + **LoRA adapter**  
- **Languages:** English  
- **License:** MIT  
- **Finetuned from:** `microsoft/Phi-3-mini-4k-instruct`  
- **Intended input format:** A structured prompt with:
  - `### CONTEXT:` (natural-language description of the experiment)
  - `### OBSERVATIONS:` (JSON-like dict with units, readings)
  - `### CODE:` (the model is trained to generate the Python solution after this tag)

### Model Sources

- **Base model:** `microsoft/Phi-3-mini-4k-instruct`  
- **Training data files:** `train.jsonl` (37 items), `eval.jsonl` (6 items)  
- **Demo/Colab basis:** Training notebook available at: https://github.com/Barghav777/AI-Lab-Report-Agent

---

## Uses

### Direct Use
- Generate **readable Python code** to compute derived quantities from lab observations (e.g., average \(g\) via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.

### Downstream Use
- Course assistants or lab-prep tools that auto-draft calculation code for **intro undergrad physics/mech/fluids/EE labs**.
- Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).

### Out-of-Scope Use
- Any **safety-critical** design decisions (structural, medical, chemical process control).
- High-stakes computation without human verification.
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).

---

## Bias, Risks, and Limitations

- **Small dataset (37 train / 6 eval)** → plausible overfitting; brittle generalization to unseen experiment formats.
- **Formula misuse risk:** The model may pick incorrect constants/units or silently use wrong equations.
- **Overconfidence:** Generated code may “look right” while being numerically off or unit-inconsistent.
- **JSON brittleness:** If `OBSERVATIONS` keys/units differ from training patterns, the code may break.

### Recommendations
- Always **review formulas and units**; add assertions/unit conversions in downstream systems.
- Run generated code with **test observations** and compare against hand calculations.
- For deployment, wrap outputs with **explanations and references** to the formulas used.
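
As one illustration of the hand-calculation check recommended above, the sketch below recomputes \(g\) for the pendulum readings used in this card's sample prompt with the small-angle formula \(g = 4\pi^2 L / T^2\). The function name `pendulum_g` and the plausibility band are assumptions made for this example, not part of the training notebook.

```python
import math

# Readings taken from the sample prompt in this card (metres, seconds).
readings = [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}]

def pendulum_g(length_m: float, period_s: float) -> float:
    """Simple-pendulum estimate g = 4*pi^2*L / T^2 (small-angle approximation)."""
    if length_m <= 0 or period_s <= 0:
        raise ValueError("length and period must be positive")
    return 4 * math.pi**2 * length_m / period_s**2

g_values = [pendulum_g(r["L"], r["T"]) for r in readings]
g_mean = sum(g_values) / len(g_values)

# Hypothetical plausibility band, chosen only for illustration.
assert 9.0 < g_mean < 10.5, f"implausible g: {g_mean:.3f} m/s^2"
print(f"mean g = {g_mean:.2f} m/s^2")
```

If the generated code reports a value far from this independent estimate, the formula or the units it used deserve a closer look.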

---

## How to Get Started

**Prompt template used in training**
```text
### CONTEXT:
{context}

### OBSERVATIONS:
{observations}

### CODE:
```
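
A small helper can fill this template programmatically. The sketch below is an assumption about how the pieces fit together: the name `build_prompt` is hypothetical, and serialising the observations with Python's `repr` is inferred from the single-quoted dict in the sample prompt further down, not confirmed by the notebook.

```python
def build_prompt(context: str, observations: dict) -> str:
    """Format one example with the section tags used in the template above."""
    return (
        "### CONTEXT:\n"
        f"{context.strip()}\n\n"
        "### OBSERVATIONS:\n"
        f"{observations!r}\n\n"   # assumption: Python-repr style, as in the sample prompt
        "### CODE:\n"
    )

prompt = build_prompt(
    "Experiment to determine acceleration due to gravity using a simple pendulum.",
    {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"},
)
print(prompt)
```

Ending the prompt at the `### CODE:` tag matters, since the model is trained to continue from that point.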

**Load base + LoRA adapter (recommended)**
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
from peft import PeftModel
import torch

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH"  # e.g., ./phi3-lab-report-coder-final

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)

tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
tok.pad_token = tok.eos_token

base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
                                            trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = """### CONTEXT:
Experiment to determine acceleration due to gravity using a simple pendulum...

### OBSERVATIONS:
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}

### CODE:
"""

inputs = tok(prompt, return_tensors="pt").to(model.device)
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
# Greedy decoding: `temperature` has no effect when do_sample=False, so it is omitted.
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)
```

---

## Training Details

### Data
- **Files:** `train.jsonl` (list of objects), `eval.jsonl` (list of objects)  
- **Schema per example:**  
  - `context` *(str)*: experiment description  
  - `observations` *(dict)*: units + numeric readings (lists of dicts)  
  - `code` *(str)*: reference Python solution
- **Topical spread (non-exhaustive):** pendulum \(g\), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
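
The schema above can be checked before training or evaluation. The following is a minimal sketch, assuming the files follow the usual JSONL convention of one JSON object per line; `validate_record` and `load_jsonl` are hypothetical helper names, and the two sample records are invented for illustration.

```python
import json

# Expected top-level fields and their types, per the schema listed above.
REQUIRED = {"context": str, "observations": dict, "code": str}

def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems (empty if the record looks fine)."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems

def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]

# Invented sample lines; real data would come from open("train.jsonl").
sample_lines = [
    json.dumps({"context": "Ohm's law experiment",
                "observations": {"readings": [{"V": 2.0, "I": 0.1}]},
                "code": "print(2.0 / 0.1)"}),
    json.dumps({"context": "Record without a code field", "observations": {}}),
]

records = load_jsonl(sample_lines)
report = [validate_record(r) for r in records]
print(report)
```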

**Size & basic stats**
- Train: **37** items; Eval: **6** items  
- Formatted prompt (context+observations+code) length (train):
  - mean ≈ **222** words (≈ **1,739** chars); 95th pct ≈ **311** words
- Reference code length (train):
  - mean ≈ **34** lines (min **9**, max **71**)

### Training Procedure (from notebook)
- **Approach:** QLoRA (4-bit) SFT using `trl.SFTTrainer`  
- **Quantization:** `bitsandbytes` 4-bit `nf4`, compute dtype `bfloat16`  
- **LoRA config:** `r=16`, `alpha=32`, `dropout=0.05`, `bias="none"`, targets = `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`  
- **Tokenizer:** right padding; `eos_token` as `pad_token`  
- **Hyperparameters (TrainingArguments):**  
  - epochs: **10**  
  - per-device train batch size: **1**  
  - gradient_accumulation_steps: **4**  
  - optimizer: **paged_adamw_32bit**  
  - learning rate: **2e-4**, weight decay: **1e-3**  
  - warmup_ratio: **0.03**, scheduler: **constant** (the `constant` schedule in `transformers` does not apply warmup; `constant_with_warmup` does)  
  - bf16: **True** (fp16: False), group_by_length: True  
  - logging_steps: 10, save/eval every 50 steps  
  - report_to: tensorboard  
- **Saving:** `trainer.save_model("./phi3-lab-report-coder-final")` (adapter folder)
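
The hyperparameters above imply a very short run. The arithmetic below is a rough sketch, assuming a single GPU; how a trailing partial accumulation window is counted depends on the trainer implementation, so lower and upper bounds are given instead of one number.

```python
import math

n_train = 37            # training items, from this card
per_device_batch = 1
grad_accum = 4
epochs = 10
save_eval_every = 50    # steps

effective_batch = per_device_batch * grad_accum        # single GPU assumed
batches_per_epoch = math.ceil(n_train / per_device_batch)

# Bounds on optimizer steps, depending on whether the last partial
# accumulation window of each epoch counts as a step.
steps_low = (batches_per_epoch // grad_accum) * epochs
steps_high = math.ceil(batches_per_epoch / grad_accum) * epochs

print("effective batch size:", effective_batch)
print("optimizer steps:", steps_low, "to", steps_high)
print("save/eval points (upper bound):",
      list(range(save_eval_every, steps_high + 1, save_eval_every)))
```

With only about 90 to 100 optimizer steps in total, a 50-step save/eval interval yields at most two evaluation points, which is worth keeping in mind when reading the training curves.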

### Speeds, Sizes, Times
- **Hardware:** Google Colab **T4 GPU** (per notebook metadata)  
- **Adapter artifact:** LoRA weights only (load with the base model).  
- **Wall-clock time:** not logged in the notebook.

---

## Evaluation

### Testing Data, Factors & Metrics
- **Eval set:** `eval.jsonl` (**6** items) with same schema.  
- **Primary metric (planned):** ROUGE-L / ROUGE-1 against reference `code` (proxy for surface similarity).  
- **Recommended additional checks:** unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
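
To make the ROUGE-L proxy concrete, here is a simplified, dependency-free sketch based on the longest common subsequence over whitespace tokens. It is not the implementation used by any particular ROUGE package (no stemming or special tokenisation), and the two formula strings are invented for illustration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L style F1 over whitespace tokens (simplified)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "g = 4 * pi * pi * L / T ** 2"
candidate = "g = 4 * pi ** 2 * L / T ** 2"
score = rouge_l_f1(reference, candidate)
print(f"ROUGE-L F1 = {score:.3f}")
```

The two strings compute the same quantity yet score below 1.0, which is why surface similarity is paired here with numeric checks.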

### Results
- No automated score recorded in the notebook.  
- **Suggested protocol:**  
  1) Generate code for each eval item using the same prompt template.  
  2) Execute safely in a sandbox with provided observations.  
  3) Compare computed scalars (e.g., average \(g\), \(R\), Reynolds number) to ground-truth values within stated tolerances.  
  4) Report pass rate and ROUGE for readability/similarity.
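
Steps 2 and 3 of the protocol above could be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the convention that the last line of stdout holds the scalar answer is invented for this example, and a subprocess with a timeout gives process isolation only, not a security sandbox.

```python
import ast
import math
import subprocess
import sys
import tempfile
from pathlib import Path

def run_generated_code(code: str, timeout_s: float = 10.0) -> str:
    """Syntax-check the code, run it in a separate Python process, return its stdout."""
    ast.parse(code)  # raises SyntaxError before anything is executed
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
        )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout

def passes(stdout: str, expected: float, rel_tol: float = 0.01) -> bool:
    """Assumed convention: the last stdout line is the scalar to compare."""
    lines = stdout.strip().splitlines()
    if not lines:
        return False
    try:
        value = float(lines[-1])
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)

# Stand-in for model output: resistance from Ohm's law, R = V / I.
candidate = "V, I = 2.0, 0.1\nprint(V / I)\n"
out = run_generated_code(candidate)
print("pass:", passes(out, expected=20.0))
```

For untrusted model output, a container or a restricted execution service is a safer choice than a bare subprocess.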

---

## Model Examination (optional)
- Inspect token-by-token attention to `OBSERVATIONS` keys (ablation: shuffle keys to test robustness).  
- Add **unit-check helpers** (e.g., `pint`) in prompts to encourage explicit conversions.
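
One reading of the key-shuffle ablation is to reorder the keys of the `OBSERVATIONS` dict while leaving every value untouched, then compare the code generated for each variant. The sketch below takes that interpretation; `shuffle_keys` is a hypothetical helper.

```python
import random

def shuffle_keys(obj, rng: random.Random):
    """Recursively reorder dict keys; values and list order are left unchanged."""
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {k: shuffle_keys(obj[k], rng) for k in keys}
    if isinstance(obj, list):
        return [shuffle_keys(item, rng) for item in obj]
    return obj

observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}

# A seeded generator keeps the ablation reproducible.
variant = shuffle_keys(observations, random.Random(0))
print(variant)
```

Because the variant carries the same data, any change in the numeric result of the generated code points to sensitivity to key order and not to the physics.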

---

## Environmental Impact
- **Hardware Type:** NVIDIA T4 (Colab)  
- **Precision:** 4-bit QLoRA with `bfloat16` compute  
- **Hours used:** Not recorded (dataset is small; expected low)  
- **Cloud Provider/Region:** Colab (unspecified)  
- **Carbon Emitted:** Not estimated (see [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute))

---

## Technical Specifications

### Architecture & Objective
- **Backbone:** `Phi-3-mini-4k-instruct` (decoder-only causal LM)  
- **Objective:** Supervised fine-tuning to continue from `### CODE:` with correct, executable Python.

### Compute Infrastructure
- **Hardware:** Colab GPU (T4) + CPU RAM  
- **Software:**  
  - `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `accelerate`, `torch`

---

## Citation
@article{abdin2024phi3,
  title   = {Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author  = {Abdin, Marah and others},
  journal = {arXiv preprint arXiv:2404.14219},
  year    = {2024},
  doi     = {10.48550/arXiv.2404.14219},
  url     = {https://arxiv.org/abs/2404.14219}
}

---

## Glossary
- **QLoRA:** Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).  
- **LoRA (r, α):** Rank and scaling of low-rank update matrices.
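
For reference, the standard LoRA formulation (general background, not specific to this notebook) keeps the pretrained weight \(W_0\) frozen and learns a low-rank update scaled by \(\alpha / r\):

$$
W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

With the values listed in this card (`r=16`, `alpha=32`), the scaling factor is \(\alpha / r = 2\).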

---

## More Information
- For better robustness, consider augmenting data with **unit-perturbation** and **noise-in-readings** variants, and add examples across more domains (materials, thermo, optics).  
- Add **eval harness** with numeric tolerances and syntax checks.
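
The two augmentation ideas above might look like the sketch below. Everything here is an assumption for illustration: the helper names are hypothetical, and the field names (`readings`, `L`, `unit_L`) follow the sample prompt in this card.

```python
import copy
import random

def add_reading_noise(observations: dict, rel_sigma: float, rng: random.Random) -> dict:
    """Return a copy with Gaussian relative noise applied to each numeric reading."""
    noisy = copy.deepcopy(observations)
    for reading in noisy.get("readings", []):
        for key, value in reading.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                reading[key] = value * (1 + rng.gauss(0, rel_sigma))
    return noisy

def metres_to_centimetres(observations: dict, key: str = "L", unit_key: str = "unit_L") -> dict:
    """Unit perturbation: rescale one length field from m to cm and update its unit label."""
    if observations.get(unit_key) != "m":
        raise ValueError(f"expected {unit_key} == 'm'")
    converted = copy.deepcopy(observations)
    for reading in converted.get("readings", []):
        reading[key] = reading[key] * 100
    converted[unit_key] = "cm"
    return converted

observations = {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"}
noisy = add_reading_noise(observations, rel_sigma=0.01, rng=random.Random(0))
in_cm = metres_to_centimetres(observations)
print(noisy)
print(in_cm)
```

The reference `code` field of each augmented example has to be updated to match: a unit-perturbed variant needs a solution that converts centimetres back to metres before applying the formula.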

---

## Model Card Authors
- Barghav777
---