Barghav777
/

phi3-lab-report-coder

@@ -10,202 +10,227 @@ base_model:
 pipeline_tag: text-generation
 ---
-# Model Card for Model ID
-A lightweight LoRA-adapter fine-tune of microsoft/Phi-3-mini-4k-instruct for turning structured lab contexts + observations into executable Python code that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an assistive code generator for STEM lab writeups and teaching demos—not as a certified calculator for safety-critical engineering.
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [Barghav777]
-- **Model type:** Causal decoder LM (instruction-tuned) + LoRA adapter
-- **Language(s) (NLP):** English
-- **License:** MIT
-- **Finetuned from model [optional]:** microsoft/Phi-3-mini-4k-instruct
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
-- Generate readable Python code to compute derived quantities from lab observations (e.g., average g via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
 - Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
-### Downstream Use [optional]
-- Course assistants or lab-prep tools that auto-draft calculation code for intro undergrad physics/mech/fluids/EE labs.
 - Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
 ### Out-of-Scope Use
-- Any safety-critical design decisions (structural, medical, chemical process control).
 - High-stakes computation without human verification.
 - Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
-## Bias, Risks, and Limitations
-- Small dataset (37 train / 6 eval) → plausible overfitting; brittle generalization to unseen experiment formats.
-- Formula misuse risk: The model may pick incorrect constants/units or silently use wrong equations.
-- Overconfidence: Generated code may “look right” while being numerically off or unit-inconsistent.
-- JSON brittleness: If OBSERVATIONS keys/units differ from training patterns, the code may break.
 ### Recommendations
-- Always review formulas and units; add assertions/unit conversions in downstream systems.
-- Run generated code with test observations and compare against hand calculations.
-- For deployment, wrap outputs with explanations and references to the formulas used.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
 ### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
 ### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
 ### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
 ## Model Card Contact
-[More Information Needed]

 pipeline_tag: text-generation
 ---
+# Model Card for **Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)**
+A lightweight LoRA-adapter fine-tune of `microsoft/Phi-3-mini-4k-instruct` for **turning structured lab contexts + observations into executable Python code** that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an **assistive code generator** for STEM lab writeups and teaching demos—not as a certified calculator for safety-critical engineering.
+---
 ## Model Details
 ### Model Description
+- **Developed by:** Barghav777 (this repo/model card author)
+- **Model type:** Causal decoder LM (instruction-tuned) + **LoRA adapter**
+- **Languages:** English
+- **License:** MIT
+- **Finetuned from:** `microsoft/Phi-3-mini-4k-instruct`
+- **Intended input format:** A structured prompt with:
+  - `### CONTEXT:` (natural-language description of the experiment)
+  - `### OBSERVATIONS:` (JSON-like dict with units, readings)
+  - `### CODE:` (the model is trained to generate the Python solution after this tag)
+### Model Sources
+- **Base model:** `microsoft/Phi-3-mini-4k-instruct`
+- **Training data files:** `train.jsonl` (37 items), `eval.jsonl` (6 items)
+- **Demo/Colab basis:** Local notebook `Untitled64 (1).ipynb` (Colab, GPU=T4)
+---
 ## Uses
 ### Direct Use
+- Generate **readable Python code** to compute derived quantities from lab observations (e.g., average \(g\) via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
 - Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
+### Downstream Use
+- Course assistants or lab-prep tools that auto-draft calculation code for **intro undergrad physics/mech/fluids/EE labs**.
 - Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
 ### Out-of-Scope Use
+- Any **safety-critical** design decisions (structural, medical, chemical process control).
 - High-stakes computation without human verification.
 - Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
+---
+## Bias, Risks, and Limitations
+- **Small dataset (37 train / 6 eval)** → plausible overfitting; brittle generalization to unseen experiment formats.
+- **Formula misuse risk:** The model may pick incorrect constants/units or silently use wrong equations.
+- **Overconfidence:** Generated code may “look right” while being numerically off or unit-inconsistent.
+- **JSON brittleness:** If `OBSERVATIONS` keys/units differ from training patterns, the code may break.
 ### Recommendations
+- Always **review formulas and units**; add assertions/unit conversions in downstream systems.
+- Run generated code with **test observations** and compare against hand calculations.
+- For deployment, wrap outputs with **explanations and references** to the formulas used.
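One way to follow the "compare against hand calculations" recommendation is to keep a small independent reference for each experiment type. The sketch below does this for the simple pendulum, using the small-angle formula g = 4π²L/T². The function names and the 1% tolerance are illustrative assumptions and are not taken from the notebook.

```python
import math


def pendulum_g(length_m: float, period_s: float) -> float:
    """Hand-calculation reference: g = 4*pi^2*L / T^2 (small-angle pendulum)."""
    if length_m <= 0 or period_s <= 0:
        raise ValueError("length and period must be positive")
    return 4 * math.pi ** 2 * length_m / period_s ** 2


def check_generated(value: float, reference: float, rel_tol: float = 0.01) -> bool:
    """True if the value computed by generated code is within rel_tol of the reference."""
    return math.isclose(value, reference, rel_tol=rel_tol)


# Hypothetical readings: length in metres, period in seconds.
readings = [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}]
reference_g = sum(pendulum_g(r["L"], r["T"]) for r in readings) / len(readings)
print(f"reference average g = {reference_g:.3f} m/s^2")
```

A value printed by model-generated code can then be passed to `check_generated(value, reference_g)` before it is trusted.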
+---
+## How to Get Started
+**Prompt template used in training**
+```text
+### CONTEXT:
+{context}
+### OBSERVATIONS:
+{observations}
+### CODE:
+```
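The template can be filled programmatically. This is a minimal sketch: `build_prompt` is a hypothetical helper, and serializing the observations with Python's `repr` is an assumption. The card only says the observations are a "JSON-like dict", so match whatever serialization the training script used.

```python
def build_prompt(context: str, observations: dict) -> str:
    """Fill the training-time template; the model continues after '### CODE:'."""
    return (
        "### CONTEXT:\n"
        f"{context.strip()}\n"
        "### OBSERVATIONS:\n"
        f"{observations!r}\n"  # assumption: dict repr, not json.dumps
        "### CODE:\n"
    )


example_prompt = build_prompt(
    "Experiment to determine acceleration due to gravity using a simple pendulum.",
    {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"},
)
print(example_prompt)
```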
+**Load base + LoRA adapter (recommended)**
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
+from peft import PeftModel
+import torch
+base_id = "microsoft/Phi-3-mini-4k-instruct"
+adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH"  # e.g., ./phi3-lab-report-coder-final
+bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
+                         bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)
+tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
+tok.pad_token = tok.eos_token
+base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
+                                            trust_remote_code=True, device_map="auto")
+model = PeftModel.from_pretrained(base, adapter_id)
+model.eval()
+prompt = """### CONTEXT:
+Experiment to determine acceleration due to gravity using a simple pendulum...
+### OBSERVATIONS:
+{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}
+### CODE:
+"""
+inputs = tok(prompt, return_tensors="pt").to(model.device)
+streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
+# Greedy decoding: temperature only takes effect when do_sample=True, so it is omitted here.
+_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)
+```
+---
+## Training Details
+### Data
+- **Files:** `train.jsonl` (list of objects), `eval.jsonl` (list of objects)
+- **Schema per example:**
+  - `context` *(str)*: experiment description
+  - `observations` *(dict)*: units + numeric readings (lists of dicts)
+  - `code` *(str)*: reference Python solution
+- **Topical spread (non-exhaustive):** pendulum \(g\), Ohm’s law, titration, density via displacement, Coriolis acceleration, gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bifilar suspension.
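The schema above can be checked before training or evaluation. The following is a sketch under one assumption: that each `.jsonl` file stores one JSON object per line. If the files hold a single JSON array instead, replace the loader with `json.load`. The helper names are hypothetical.

```python
import json

# Expected top-level keys and their Python types, taken from the schema above.
REQUIRED = {"context": str, "observations": dict, "code": str}


def validate_example(example: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the example passes."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in example:
            problems.append(f"missing key: {key}")
        elif not isinstance(example[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Calling `validate_example` on every item returned by `load_jsonl("train.jsonl")` surfaces malformed rows early, before they turn into confusing prompt-formatting errors.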
+**Size & basic stats**
+- Train: **37** items; Eval: **6** items
+- Formatted prompt (context+observations+code) length (train):
+  - mean ≈ **222** words (≈ **1,739** chars); 95th pct ≈ **311** words
+- Reference code length (train):
+  - mean ≈ **34** lines (min **9**, max **71**)
+### Training Procedure (from notebook)
+- **Approach:** QLoRA (4-bit) SFT using `trl.SFTTrainer`
+- **Quantization:** `bitsandbytes` 4-bit `nf4`, compute dtype `bfloat16`
+- **LoRA config:** `r=16`, `alpha=32`, `dropout=0.05`, `bias="none"`, targets = `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
+- **Tokenizer:** right padding; `eos_token` as `pad_token`
+- **Hyperparameters (TrainingArguments):**
+  - epochs: **10**
+  - per-device train batch size: **1**
+  - gradient_accumulation_steps: **4**
+  - optimizer: **paged_adamw_32bit**
+  - learning rate: **2e-4**, weight decay: **1e-3**
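+  - warmup_ratio: **0.03**, scheduler: **constant** (the plain `constant` schedule in `transformers` applies no warmup, so `warmup_ratio` has no effect here)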
+  - bf16: **True** (fp16: False), group_by_length: True
+  - logging_steps: 10, save/eval every 50 steps
+  - report_to: tensorboard
+- **Saving:** `trainer.save_model("./phi3-lab-report-coder-final")` (adapter folder)
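The hyperparameters above imply a short run. The arithmetic below estimates the effective batch size and the number of optimizer steps, assuming a single GPU. Whether a trailing partial accumulation group counts as a step may depend on the trainer implementation and version, so the sketch reports both bounds.

```python
import math

train_examples = 37
per_device_batch = 1
grad_accum = 4
epochs = 10
save_eval_every = 50

# Examples contributing to each optimizer update (single-GPU assumption).
effective_batch = per_device_batch * grad_accum

batches_per_epoch = math.ceil(train_examples / per_device_batch)
# Lower bound: a trailing partial accumulation group is dropped.
steps_low = (batches_per_epoch // grad_accum) * epochs
# Upper bound: a trailing partial group still triggers an update.
steps_high = math.ceil(batches_per_epoch / grad_accum) * epochs

print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: between {steps_low} and {steps_high}")
print(f"save/eval events: between {steps_low // save_eval_every} and {steps_high // save_eval_every}")
```

With so few steps, a save/eval interval of 50 gives at most two evaluation points, so a smaller interval would produce a more informative loss curve.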
+### Speeds, Sizes, Times
+- **Hardware:** Google Colab **T4 GPU** (per notebook metadata)
+- **Adapter artifact:** LoRA weights only (load with the base model).
+- **Wall-clock time:** not logged in the notebook.
+---
 ## Evaluation
 ### Testing Data, Factors & Metrics
+- **Eval set:** `eval.jsonl` (**6** items) with same schema.
+- **Primary metric (planned):** ROUGE-L / ROUGE-1 against reference `code` (proxy for surface similarity).
+- **Recommended additional checks:** unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
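For reference, ROUGE-L is an F-measure built on the longest common subsequence (LCS) of candidate and reference tokens. The sketch below is a simplified pure-Python version that splits on whitespace. Packages such as `rouge_score` tokenize and normalize differently, so scores from this sketch are not directly comparable with theirs.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (simplified; no stemming or normalization)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score only measures token overlap, two snippets that compute the same quantity with different variable names can score poorly, which is why numeric checks are listed alongside it.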
 ### Results
+- No automated score recorded in the notebook.
+- **Suggested protocol:**
+  1) Generate code for each eval item using the same prompt template.
+  2) Execute safely in a sandbox with provided observations.
+  3) Compare computed scalars (e.g., average \(g\), \(R\), Reynolds number) to ground-truth values within stated tolerances.
+  4) Report pass rate and ROUGE for readability/similarity.
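Steps 2 to 4 of the protocol could be sketched as follows. All names are hypothetical. The sketch assumes the generated code prints its final scalar as the last numeric token on stdout. A subprocess with a timeout is not a real sandbox, so untrusted code should still run inside a container or similar isolation.

```python
import math
import subprocess
import sys


def run_snippet(code: str, timeout_s: float = 10.0) -> str:
    """Run code in a separate Python process and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


def last_number(stdout: str) -> float:
    """Return the last whitespace-separated token that parses as a float."""
    for token in reversed(stdout.split()):
        try:
            return float(token)
        except ValueError:
            continue
    raise ValueError("no numeric output found")


def pass_rate(cases: list[tuple[str, float]], rel_tol: float = 0.01) -> float:
    """Fraction of (code, expected) pairs whose printed scalar is within rel_tol."""
    passed = 0
    for code, expected in cases:
        try:
            value = last_number(run_snippet(code))
        except (RuntimeError, ValueError, subprocess.TimeoutExpired):
            continue  # crashes, timeouts, and missing output count as failures
        passed += math.isclose(value, expected, rel_tol=rel_tol)
    return passed / len(cases) if cases else 0.0
```

Each case pairs a generated snippet with the hand-computed expected value, for example `pass_rate([(generated_code, 9.82)])`.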
+---
+## Model Examination (optional)
+- Inspect token-by-token attention to `OBSERVATIONS` keys (ablation: shuffle keys to test robustness).
+- Add **unit-check helpers** (e.g., `pint`) in prompts to encourage explicit conversions.
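The key-shuffling ablation can be prototyped without touching the model. The sketch below reorders dictionary keys recursively while leaving values and list order unchanged. `shuffle_keys` and the sample observations are illustrative assumptions.

```python
import random


def shuffle_keys(obj, rng: random.Random):
    """Recursively reorder dict keys; values and list order are left unchanged."""
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {key: shuffle_keys(obj[key], rng) for key in keys}
    if isinstance(obj, list):
        return [shuffle_keys(item, rng) for item in obj]
    return obj


# Hypothetical observations in the "JSON-like dict" style used by the prompts.
observations = {
    "readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
    "unit_L": "m",
    "unit_T": "s",
}
shuffled = shuffle_keys(observations, random.Random(0))
print(shuffled)
```

Prompting the model with `observations` and with `shuffled`, then comparing the numeric results of the two generated programs, gives a first indication of how sensitive it is to key order.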
+---
 ## Environmental Impact
+- **Hardware Type:** NVIDIA T4 (Colab)
+- **Precision:** 4-bit QLoRA with `bfloat16` compute
+- **Hours used:** Not recorded (dataset is small; expected low)
+- **Cloud Provider/Region:** Colab (unspecified)
+- **Carbon Emitted:** Not estimated (see [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute))
+---
+## Technical Specifications
+### Architecture & Objective
+- **Backbone:** `Phi-3-mini-4k-instruct` (decoder-only causal LM)
+- **Objective:** Supervised fine-tuning to continue from `### CODE:` with correct, executable Python.
 ### Compute Infrastructure
+- **Hardware:** Colab GPU (T4) + CPU RAM
+- **Software:**
+  - `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `accelerate`, `torch`
+  - Notebook: `Untitled64 (1).ipynb`
+---
+## Citation
+If you write about this model, please cite the base model and this repository. (Add BibTeX here if/when available.)
+---
+## Glossary
+- **QLoRA:** Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
+- **LoRA (r, α):** Rank and scaling of low-rank update matrices.
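As a reminder of what the two LoRA parameters control, the standard formulation adds a scaled low-rank product to each frozen weight matrix. The symbols below follow the usual LoRA notation and are not defined elsewhere in this card:

$$
W' = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

With the reported `r=16` and `alpha=32`, the scaling factor is \(\alpha / r = 2\), assuming the default (non rank-stabilized) scaling.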
+---
+## More Information
+- For better robustness, consider augmenting data with **unit-perturbation** and **noise-in-readings** variants, and add examples across more domains (materials, thermo, optics).
+- Add **eval harness** with numeric tolerances and syntax checks.
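The two augmentation ideas above might look like this in practice. This is a sketch under assumptions: the observations use a `readings` list plus `unit_*` keys, and the helper names are hypothetical. The reference `code` for a unit-perturbed example must also be updated to convert back to SI, which this sketch does not do.

```python
import copy
import random


def add_reading_noise(observations: dict, rel_sigma: float, rng: random.Random) -> dict:
    """Noise-in-readings: scale each numeric reading by (1 + N(0, rel_sigma))."""
    noisy = copy.deepcopy(observations)
    for reading in noisy.get("readings", []):
        for key, value in reading.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                reading[key] = value * (1 + rng.gauss(0, rel_sigma))
    return noisy


def metres_to_centimetres(observations: dict, key: str = "L", unit_key: str = "unit_L") -> dict:
    """Unit perturbation: express a length in cm; the physical quantity is unchanged."""
    converted = copy.deepcopy(observations)
    if converted.get(unit_key) != "m":
        return converted  # leave examples with other units untouched
    for reading in converted.get("readings", []):
        reading[key] = reading[key] * 100
    converted[unit_key] = "cm"
    return converted


base = {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"}
print(metres_to_centimetres(base))
print(add_reading_noise(base, rel_sigma=0.01, rng=random.Random(0)))
```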
+---
+## Model Card Authors
+- Barghav777 (model author/maintainer)
 ## Model Card Contact
+- Add your preferred contact or HF discussion link.
+---
+### Notes on Assumptions & Gaps (for rigor)
+- **Assumption:** The adapter folder `./phi3-lab-report-coder-final` contains PEFT weights (not a merged full model). The notebook’s `save_model` call supports this, and the loading snippet in this card relies on it.
+- **Known gap:** No recorded objective metrics; this card avoids fabricating results. Add a small script to run eval and compute numeric accuracy + ROUGE.
+- **Risk callout:** Dataset size is modest (37/6); without stronger regularization and more variety, generalization is limited.