Update README.md
README.md
CHANGED
|
@@ -10,202 +10,227 @@ base_model:
|
|
| 10 |
pipeline_tag: text-generation
|
| 11 |
---
|
| 12 |
|
| 13 |
-
# Model Card for
|
| 14 |
-
|
| 15 |
-
A lightweight LoRA-adapter fine-tune of microsoft/Phi-3-mini-4k-instruct for turning structured lab contexts + observations into executable Python code that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an assistive code generator for STEM lab writeups and teaching demos—not as a certified calculator for safety-critical engineering.
|
| 16 |
|
|
|
|
| 17 |
|
|
|
|
| 18 |
|
| 19 |
## Model Details
|
| 20 |
|
| 21 |
### Model Description
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
- **
|
| 28 |
-
- **
|
| 29 |
-
-
|
| 30 |
-
-
|
| 31 |
-
-
|
| 32 |
|
| 33 |
-
### Model Sources
|
| 34 |
|
| 35 |
-
|
|
|
|
|
|
|
| 36 |
|
| 37 |
-
|
| 38 |
-
- **Paper [optional]:** [More Information Needed]
|
| 39 |
-
- **Demo [optional]:** [More Information Needed]
|
| 40 |
|
| 41 |
## Uses
|
| 42 |
|
| 43 |
-
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
| 44 |
-
|
| 45 |
### Direct Use
|
| 46 |
-
|
| 47 |
-
- Generate readable Python code to compute derived quantities from lab observations (e.g., average g via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
|
| 48 |
-
|
| 49 |
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
|
| 50 |
|
| 51 |
-
### Downstream Use
|
| 52 |
-
|
| 53 |
-
- Course assistants or lab-prep tools that auto-draft calculation code for intro undergrad physics/mech/fluids/EE labs.
|
| 54 |
-
|
| 55 |
- Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
|
| 56 |
|
| 57 |
### Out-of-Scope Use
|
| 58 |
-
|
| 59 |
-
- Any safety-critical design decisions (structural, medical, chemical process control).
|
| 60 |
-
|
| 61 |
- High-stakes computation without human verification.
|
| 62 |
-
|
| 63 |
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
- Small dataset (37 train / 6 eval) → plausible overfitting; brittle generalization to unseen experiment formats.
|
| 68 |
-
|
| 69 |
-
- Formula misuse risk: The model may pick incorrect constants/units or silently use wrong equations.
|
| 70 |
|
| 71 |
-
|
| 72 |
|
| 73 |
-
-
|
| 74 |
|
| 75 |
### Recommendations
|
| 76 |
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
- Run generated code with test observations and compare against hand calculations.
|
| 80 |
|
| 81 |
-
|
| 82 |
-
## How to Get Started with the Model
|
| 83 |
|
| 84 |
-
|
| 85 |
|
| 86 |
-
|
|
|
|
| 87 |
|
| 88 |
-
|
|
|
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
|
|
|
|
| 93 |
|
| 94 |
-
|
|
|
|
| 95 |
|
| 96 |
-
|
|
|
|
| 97 |
|
| 98 |
-
|
| 99 |
|
| 100 |
-
|
|
|
|
| 101 |
|
| 102 |
-
|
|
|
|
| 103 |
|
|
|
|
|
|
|
| 104 |
|
| 105 |
-
|
| 106 |
|
| 107 |
-
|
| 108 |
|
| 109 |
-
|
| 110 |
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
## Evaluation
|
| 116 |
|
| 117 |
-
<!-- This section describes the evaluation protocols and provides the results. -->
|
| 118 |
-
|
| 119 |
### Testing Data, Factors & Metrics
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
<!-- This should link to a Dataset Card if possible. -->
|
| 124 |
-
|
| 125 |
-
[More Information Needed]
|
| 126 |
-
|
| 127 |
-
#### Factors
|
| 128 |
-
|
| 129 |
-
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
|
| 130 |
-
|
| 131 |
-
[More Information Needed]
|
| 132 |
-
|
| 133 |
-
#### Metrics
|
| 134 |
-
|
| 135 |
-
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
|
| 136 |
-
|
| 137 |
-
[More Information Needed]
|
| 138 |
|
| 139 |
### Results
|
| 140 |
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
#### Summary
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
## Model Examination [optional]
|
| 148 |
|
| 149 |
-
|
|
|
|
|
|
|
| 150 |
|
| 151 |
-
|
| 152 |
|
| 153 |
## Environmental Impact
|
| 154 |
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
|
| 158 |
-
|
| 159 |
-
- **Hardware Type:** [More Information Needed]
|
| 160 |
-
- **Hours used:** [More Information Needed]
|
| 161 |
-
- **Cloud Provider:** [More Information Needed]
|
| 162 |
-
- **Compute Region:** [More Information Needed]
|
| 163 |
-
- **Carbon Emitted:** [More Information Needed]
|
| 164 |
-
|
| 165 |
-
## Technical Specifications [optional]
|
| 166 |
|
| 167 |
-
|
| 168 |
|
| 169 |
-
|
|
|
|
|
|
|
| 170 |
|
| 171 |
### Compute Infrastructure
|
| 172 |
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
#### Hardware
|
| 176 |
-
|
| 177 |
-
[More Information Needed]
|
| 178 |
-
|
| 179 |
-
#### Software
|
| 180 |
-
|
| 181 |
-
[More Information Needed]
|
| 182 |
-
|
| 183 |
-
## Citation [optional]
|
| 184 |
-
|
| 185 |
-
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
|
| 186 |
-
|
| 187 |
-
**BibTeX:**
|
| 188 |
-
|
| 189 |
-
[More Information Needed]
|
| 190 |
-
|
| 191 |
-
**APA:**
|
| 192 |
-
|
| 193 |
-
[More Information Needed]
|
| 194 |
|
| 195 |
-
##
|
|
|
|
| 196 |
|
| 197 |
-
|
| 198 |
|
| 199 |
-
|
|
|
|
|
|
|
| 200 |
|
| 201 |
-
|
| 202 |
|
| 203 |
-
|
|
|
|
|
|
|
| 204 |
|
| 205 |
-
|
| 206 |
|
| 207 |
-
|
|
|
|
| 208 |
|
| 209 |
## Model Card Contact
|
| 210 |
|
| 211 |
-
|
| 10 |
pipeline_tag: text-generation
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# Model Card for **Phi3-Lab-Report-Coder (LoRA on Phi-3 Mini 4k Instruct)**
|
|
|
|
|
|
|
| 14 |
|
| 15 |
+
A lightweight LoRA-adapter fine-tune of `microsoft/Phi-3-mini-4k-instruct` for **turning structured lab contexts + observations into executable Python code** that performs the target calculations (e.g., mechanics, fluids, vibrations, basic circuits, titrations). Trained with QLoRA in 4-bit, this model is intended as an **assistive code generator** for STEM lab writeups and teaching demos—not as a certified calculator for safety-critical engineering.
|
| 16 |
|
| 17 |
+
---
|
| 18 |
|
| 19 |
## Model Details
|
| 20 |
|
| 21 |
### Model Description
|
| 22 |
|
| 23 |
+
- **Developed by:** You (this repo/model card author)
|
| 24 |
+
- **Model type:** Causal decoder LM (instruction-tuned) + **LoRA adapter**
|
| 25 |
+
- **Languages:** English
|
| 26 |
+
- **License:** MIT
|
| 27 |
+
- **Finetuned from:** `microsoft/Phi-3-mini-4k-instruct`
|
| 28 |
+
- **Intended input format:** A structured prompt with:
|
| 29 |
+
- `### CONTEXT:` (natural-language description of the experiment)
|
| 30 |
+
- `### OBSERVATIONS:` (JSON-like dict with units, readings)
|
| 31 |
+
- `### CODE:` (the model is trained to generate the Python solution after this tag)
|
| 32 |
|
| 33 |
+
### Model Sources
|
| 34 |
|
| 35 |
+
- **Base model:** `microsoft/Phi-3-mini-4k-instruct`
|
| 36 |
+
- **Training data files:** `train.jsonl` (37 items), `eval.jsonl` (6 items)
|
| 37 |
+
- **Demo/Colab basis:** Local notebook `Untitled64 (1).ipynb` (Colab, GPU=T4)
|
| 38 |
|
| 39 |
+
---
|
|
|
|
|
|
|
| 40 |
|
| 41 |
## Uses
|
| 42 |
|
|
|
|
|
|
|
| 43 |
### Direct Use
|
| 44 |
+
- Generate **readable Python code** to compute derived quantities from lab observations (e.g., average \(g\) via pendulum, Coriolis acceleration, Ohm’s law resistances, radius of gyration, Reynolds number).
|
|
|
|
|
|
|
| 45 |
- Produce calculation pipelines with minimal plotting/printing that are easy to copy-paste and run in a notebook.
|
| 46 |
|
| 47 |
+
### Downstream Use
|
| 48 |
+
- Course assistants or lab-prep tools that auto-draft calculation code for **intro undergrad physics/mech/fluids/EE labs**.
|
|
|
|
|
|
|
| 49 |
- Auto-checkers that compare student code vs. a reference implementation (with appropriate guardrails).
|
| 50 |
|
| 51 |
### Out-of-Scope Use
|
| 52 |
+
- Any **safety-critical** design decisions (structural, medical, chemical process control).
|
|
|
|
|
|
|
| 53 |
- High-stakes computation without human verification.
|
|
|
|
| 54 |
- Domains far outside the training distribution (e.g., NLP preprocessing pipelines, advanced control systems, large-scale simulation frameworks).
|
| 55 |
|
| 56 |
+
---
|
| 57 |
|
| 58 |
+
## Bias, Risks, and Limitations
|
| 59 |
|
| 60 |
+
- **Small dataset (37 train / 6 eval)** → plausible overfitting; brittle generalization to unseen experiment formats.
|
| 61 |
+
- **Formula misuse risk:** The model may pick incorrect constants/units or silently use wrong equations.
|
| 62 |
+
- **Overconfidence:** Generated code may “look right” while being numerically off or unit-inconsistent.
|
| 63 |
+
- **JSON brittleness:** If `OBSERVATIONS` keys/units differ from training patterns, the code may break.
|
| 64 |
|
| 65 |
### Recommendations
|
| 66 |
+
- Always **review formulas and units**; add assertions/unit conversions in downstream systems.
|
| 67 |
+
- Run generated code with **test observations** and compare against hand calculations.
|
| 68 |
+
- For deployment, wrap outputs with **explanations and references** to the formulas used.
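One way to act on the second recommendation is to keep a small hand calculation next to the generated code. The sketch below uses the small-angle pendulum relation g = 4π²L/T² and the two readings from the pendulum example prompt in this card; the helper names and the 2% tolerance are illustrative assumptions, not part of the training notebook.

```python
import math


def pendulum_g(length_m: float, period_s: float) -> float:
    """Small-angle simple pendulum: g = 4 * pi^2 * L / T^2."""
    assert length_m > 0 and period_s > 0, "length and period must be positive"
    return 4 * math.pi**2 * length_m / period_s**2


# Readings copied from the pendulum example prompt in this card (L in metres, T in seconds).
readings = [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}]
hand_g = sum(pendulum_g(r["L"], r["T"]) for r in readings) / len(readings)


def check_against_hand_calc(generated_g: float, rel_tol: float = 0.02) -> bool:
    """True if the generated code's result is within rel_tol of the hand value."""
    return math.isclose(generated_g, hand_g, rel_tol=rel_tol)


print(f"hand-calculated g = {hand_g:.3f} m/s^2")
```

A result that fails `check_against_hand_calc` usually points to a wrong formula or a unit mismatch rather than a rounding difference.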
|
| 69 |
|
| 70 |
+
---
|
|
|
|
|
|
|
| 71 |
|
| 72 |
+
## How to Get Started
|
|
|
|
| 73 |
|
| 74 |
+
**Prompt template used in training**
|
| 75 |
+
```text
|
| 76 |
+
### CONTEXT:
|
| 77 |
+
{context}
|
| 78 |
|
| 79 |
+
### OBSERVATIONS:
|
| 80 |
+
{observations}
|
| 81 |
|
| 82 |
+
### CODE:
|
| 83 |
+
```
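A minimal sketch of filling this template programmatically. The helper name `build_prompt` and the use of `json.dumps` to serialize the observations are assumptions; the training notebook may have formatted the dict differently (the inference example in this card uses a Python-style dict literal).

```python
import json

PROMPT_TEMPLATE = "### CONTEXT:\n{context}\n\n### OBSERVATIONS:\n{observations}\n\n### CODE:\n"


def build_prompt(context: str, observations: dict) -> str:
    """Fill the CONTEXT/OBSERVATIONS/CODE template (hypothetical helper)."""
    # Assumption: observations are serialized as JSON; the notebook may have used str(dict).
    return PROMPT_TEMPLATE.format(
        context=context.strip(),
        observations=json.dumps(observations),
    )


example = build_prompt(
    "Experiment to determine g using a simple pendulum.",
    {"readings": [{"L": 0.50, "T": 1.42}], "unit_L": "m", "unit_T": "s"},
)
print(example)
```

Keeping the template in one constant makes it easier to guarantee that inference prompts match the training format exactly, including the trailing newline after `### CODE:`.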
|
| 84 |
|
| 85 |
+
**Load base + LoRA adapter (recommended)**
|
| 86 |
+
```python
|
| 87 |
+
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
|
| 88 |
+
from peft import PeftModel
|
| 89 |
+
import torch
|
| 90 |
|
| 91 |
+
base_id = "microsoft/Phi-3-mini-4k-instruct"
|
| 92 |
+
adapter_id = "YOUR_ADAPTER_REPO_OR_LOCAL_PATH" # e.g., ./phi3-lab-report-coder-final
|
| 93 |
|
| 94 |
+
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
|
| 95 |
+
bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=False)
|
| 96 |
|
| 97 |
+
tok = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
|
| 98 |
+
tok.pad_token = tok.eos_token
|
| 99 |
|
| 100 |
+
base = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb,
|
| 101 |
+
trust_remote_code=True, device_map="auto")
|
| 102 |
+
model = PeftModel.from_pretrained(base, adapter_id)
|
| 103 |
+
model.eval()
|
| 104 |
|
| 105 |
+
prompt = """### CONTEXT:
|
| 106 |
+
Experiment to determine acceleration due to gravity using a simple pendulum...
|
| 107 |
|
| 108 |
+
### OBSERVATIONS:
|
| 109 |
+
{'readings': [{'L':0.50,'T':1.42}, {'L':0.60,'T':1.55}], 'unit_L':'m', 'unit_T':'s'}
|
| 110 |
|
| 111 |
+
### CODE:
|
| 112 |
+
"""
|
| 113 |
|
| 114 |
+
inputs = tok(prompt, return_tensors="pt").to(model.device)
|
| 115 |
+
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)
|
| 116 |
+
_ = model.generate(**inputs, max_new_tokens=400, do_sample=False, streamer=streamer)  # greedy decoding; temperature is ignored when do_sample=False
|
| 117 |
+
```
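Small fine-tunes sometimes keep generating past the end of the solution and start a new section tag; whether this adapter does so is not documented, so the following is only a defensive sketch. It operates on already-decoded text, and the helper name `extract_code` and the marker list are assumptions.

```python
def extract_code(generated: str,
                 stop_markers=("### CONTEXT:", "### OBSERVATIONS:", "### CODE:")) -> str:
    """Keep only the text before the first repeated section tag (hypothetical helper)."""
    cut = len(generated)
    for marker in stop_markers:
        idx = generated.find(marker)
        if idx != -1:
            cut = min(cut, idx)  # truncate at the earliest marker found
    return generated[:cut].strip()


sample = "import math\nprint(math.pi)\n\n### CONTEXT:\nAnother experiment..."
print(extract_code(sample))
```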
|
| 118 |
|
| 119 |
+
---
|
| 120 |
|
| 121 |
+
## Training Details
|
| 122 |
|
| 123 |
+
### Data
|
| 124 |
+
- **Files:** `train.jsonl` (list of objects), `eval.jsonl` (list of objects)
|
| 125 |
+
- **Schema per example:**
|
| 126 |
+
- `context` *(str)*: experiment description
|
| 127 |
+
- `observations` *(dict)*: units + numeric readings (lists of dicts)
|
| 128 |
+
- `code` *(str)*: reference Python solution
|
| 129 |
+
- **Topical spread (non-exhaustive):** pendulum \(g\), Ohm’s law, titration, density via displacement, Coriolis accel., gyroscopic effect, Hartnell governor, rotating mass balancing, helical spring vibration, bi-filar suspension, etc.
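The schema above can be checked before training with a few lines of standard-library code. This sketch assumes the usual JSONL layout of one JSON object per line; if the files are actually a single JSON array, `json.load` would be the right call instead. The function name `load_examples` is illustrative.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"context": str, "observations": dict, "code": str}


def load_examples(path: str) -> list:
    """Read one JSON object per line and check the three documented fields."""
    examples = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(record.get(field), expected_type):
                raise ValueError(
                    f"{path}:{line_no}: '{field}' missing or not {expected_type.__name__}"
                )
        examples.append(record)
    return examples
```

Typical use would be `train = load_examples("train.jsonl")`, followed by a check that `len(train)` matches the documented item count.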
|
| 130 |
+
|
| 131 |
+
**Size & basic stats**
|
| 132 |
+
- Train: **37** items; Eval: **6** items
|
| 133 |
+
- Formatted prompt (context+observations+code) length (train):
|
| 134 |
+
- mean ≈ **222** words (≈ **1,739** chars); 95th pct ≈ **311** words
|
| 135 |
+
- Reference code length (train):
|
| 136 |
+
- mean ≈ **34** lines (min **9**, max **71**)
|
| 137 |
+
|
| 138 |
+
### Training Procedure (from notebook)
|
| 139 |
+
- **Approach:** QLoRA (4-bit) SFT using `trl.SFTTrainer`
|
| 140 |
+
- **Quantization:** `bitsandbytes` 4-bit `nf4`, compute dtype `bfloat16`
|
| 141 |
+
- **LoRA config:** `r=16`, `alpha=32`, `dropout=0.05`, `bias="none"`, targets = `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
|
| 142 |
+
- **Tokenizer:** right padding; `eos_token` as `pad_token`
|
| 143 |
+
- **Hyperparameters (TrainingArguments):**
|
| 144 |
+
- epochs: **10**
|
| 145 |
+
- per-device train batch size: **1**
|
| 146 |
+
- gradient_accumulation_steps: **4**
|
| 147 |
+
- optimizer: **paged_adamw_32bit**
|
| 148 |
+
- learning rate: **2e-4**, weight decay: **1e-3**
|
| 149 |
+
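- warmup_ratio: **0.03**, scheduler: **constant** (note: the plain `constant` schedule in `transformers` applies no warmup; `constant_with_warmup` is the variant that uses the ratio)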
|
| 150 |
+
- bf16: **True** (fp16: False), group_by_length: True
|
| 151 |
+
- logging_steps: 10, save/eval every 50 steps
|
| 152 |
+
- report_to: tensorboard
|
| 153 |
+
- **Saving:** `trainer.save_model("./phi3-lab-report-coder-final")` (adapter folder)
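To put the hyperparameters above in perspective, the following back-of-the-envelope arithmetic estimates how many optimizer steps the run takes. It assumes a single GPU and rounds the steps per epoch up; the exact count depends on how the trainer handles the final partial accumulation window, so treat the totals as approximate.

```python
import math

train_items = 37
per_device_batch = 1
grad_accum = 4
epochs = 10
lora_r, lora_alpha = 16, 32

effective_batch = per_device_batch * grad_accum             # examples per optimizer step
steps_per_epoch = math.ceil(train_items / effective_batch)  # upper bound; the trainer may round down
total_steps = steps_per_epoch * epochs
lora_scale = lora_alpha / lora_r                            # LoRA update is scaled by alpha / r
checkpoints = total_steps // 50                             # "save/eval every 50 steps"

print(effective_batch, steps_per_epoch, total_steps, lora_scale, checkpoints)
```

With so few total steps, saving and evaluating every 50 steps yields only one or two intermediate checkpoints, which limits how closely overfitting can be tracked during the run.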
|
| 154 |
+
|
| 155 |
+
### Speeds, Sizes, Times
|
| 156 |
+
- **Hardware:** Google Colab **T4 GPU** (per notebook metadata)
|
| 157 |
+
- **Adapter artifact:** LoRA weights only (load with the base model).
|
| 158 |
+
- **Wall-clock time:** not logged in the notebook.
|
| 159 |
|
| 160 |
+
---
|
| 161 |
|
| 162 |
## Evaluation
|
| 163 |
|
|
|
|
|
|
|
| 164 |
### Testing Data, Factors & Metrics
|
| 165 |
+
- **Eval set:** `eval.jsonl` (**6** items) with same schema.
|
| 166 |
+
- **Primary metric (planned):** ROUGE-L / ROUGE-1 against reference `code` (proxy for surface similarity).
|
| 167 |
+
- **Recommended additional checks:** unit tests on numeric outputs; pyflakes/ruff for syntax; run-time assertions.
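For reference, ROUGE-L is based on the longest common subsequence (LCS) between candidate and reference tokens. The sketch below is a simplified pure-Python version over whitespace tokens; published ROUGE packages tokenize and normalize differently, so its scores will not match them exactly.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if tok_a == tok_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (simplified: no stemming or normalization)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "g = 4 * pi**2 * L / T**2"
candidate = "g = 4 * math.pi**2 * L / T**2"
print(round(rouge_l_f1(candidate, reference), 3))
```

The example also shows why ROUGE is only a proxy here: two snippets can differ in a single token and still compute different numbers, or differ in many tokens and compute the same one.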
|
| 168 |
|
| 169 |
### Results
|
| 170 |
+
- No automated score recorded in the notebook.
|
| 171 |
+
- **Suggested protocol:**
|
| 172 |
+
1) Generate code for each eval item using the same prompt template.
|
| 173 |
+
2) Execute safely in a sandbox with provided observations.
|
| 174 |
+
3) Compare computed scalars (e.g., average \(g\), \(R\), Reynolds number) to ground-truth values within stated tolerances.
|
| 175 |
+
4) Report pass rate and ROUGE for readability/similarity.
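The four steps above could be wired together roughly as follows. This is a sketch under two loud assumptions: the generated code prints its target scalar as the last line of stdout, and a child process with a timeout is an acceptable stand-in for a sandbox during local experiments (it is not real isolation). All function names are illustrative.

```python
import ast
import math
import subprocess
import sys
from typing import Optional


def run_candidate(code: str, timeout_s: float = 10.0) -> Optional[str]:
    """Return stdout of the generated code, or None on syntax error, crash, or timeout."""
    try:
        ast.parse(code)  # cheap syntax check before executing anything
    except SyntaxError:
        return None
    try:
        # A child process with a timeout is NOT a real sandbox; use a container
        # or VM when running untrusted code.
        proc = subprocess.run([sys.executable, "-c", code], capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None


def passes(code: str, expected: float, rel_tol: float = 1e-2) -> bool:
    """Assumption: the generated code prints the target scalar as its last stdout line."""
    out = run_candidate(code)
    if not out or not out.strip():
        return False
    try:
        value = float(out.strip().splitlines()[-1])
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)


def pass_rate(items: list) -> float:
    """items: (generated_code, ground_truth_value) pairs."""
    return sum(passes(code, truth) for code, truth in items) / len(items)
```

With only six eval items, each failure moves the pass rate by about 17 percentage points, so per-item results are worth reporting alongside the aggregate.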
|
| 176 |
|
| 177 |
+
---
|
| 178 |
|
| 179 |
+
## Model Examination (optional)
|
| 180 |
+
- Inspect token-by-token attention to `OBSERVATIONS` keys (ablation: shuffle keys to test robustness).
|
| 181 |
+
- Add **unit-check helpers** (e.g., `pint`) in prompts to encourage explicit conversions.
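The key-shuffling ablation mentioned above can be prepared without touching the model. The sketch below only generates reordered copies of an observations dict (using the pendulum example from this card); feeding each variant through the prompt template and comparing the generated code is left to the evaluation loop. The helper name `shuffle_keys` is an assumption.

```python
import random


def shuffle_keys(obj, rng: random.Random):
    """Recursively reorder dict keys; values and list order are left unchanged."""
    if isinstance(obj, dict):
        keys = list(obj)
        rng.shuffle(keys)
        return {k: shuffle_keys(obj[k], rng) for k in keys}
    if isinstance(obj, list):
        return [shuffle_keys(item, rng) for item in obj]
    return obj


observations = {"readings": [{"L": 0.50, "T": 1.42}, {"L": 0.60, "T": 1.55}],
                "unit_L": "m", "unit_T": "s"}
# One seeded variant per seed, so the ablation is reproducible.
variants = [shuffle_keys(observations, random.Random(seed)) for seed in range(5)]
```

Because the variants carry identical data, any change in the numeric result of the generated code can be attributed to key order alone.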
|
| 182 |
|
| 183 |
+
---
|
| 184 |
|
| 185 |
## Environmental Impact
|
| 186 |
+
- **Hardware Type:** NVIDIA T4 (Colab)
|
| 187 |
+
- **Precision:** 4-bit QLoRA with `bfloat16` compute
|
| 188 |
+
- **Hours used:** Not recorded (dataset is small; expected low)
|
| 189 |
+
- **Cloud Provider/Region:** Colab (unspecified)
|
| 190 |
+
- **Carbon Emitted:** Not estimated (see [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute))
|
| 191 |
|
| 192 |
+
---
|
| 193 |
|
| 194 |
+
## Technical Specifications
|
| 195 |
|
| 196 |
+
### Architecture & Objective
|
| 197 |
+
- **Backbone:** `Phi-3-mini-4k-instruct` (decoder-only causal LM)
|
| 198 |
+
- **Objective:** Supervised fine-tuning to continue from `### CODE:` with correct, executable Python.
|
| 199 |
|
| 200 |
### Compute Infrastructure
|
| 201 |
+
- **Hardware:** Colab GPU (T4) + CPU RAM
|
| 202 |
+
- **Software:**
|
| 203 |
+
- `transformers`, `trl`, `peft`, `bitsandbytes`, `datasets`, `accelerate`, `torch`
|
| 204 |
+
- Notebook: `Untitled64 (1).ipynb`
|
| 205 |
|
| 206 |
+
---
|
| 207 |
|
| 208 |
+
## Citation
|
| 209 |
+
If you write about this model, please cite the base model and this repository. (Add BibTeX here if/when available.)
|
| 210 |
|
| 211 |
+
---
|
| 212 |
|
| 213 |
+
## Glossary
|
| 214 |
+
- **QLoRA:** Fine-tuning with low-rank adapters on a quantized base model (saves memory/compute).
|
| 215 |
+
- **LoRA (r, α):** `r` is the rank of the low-rank update matrices; `α` is the scaling factor applied to the update.
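In the standard LoRA formulation, the adapted weight can be written as below. The symbols \(W_0\) (frozen base weight), \(B\), \(A\), \(d\), and \(k\) are the usual notation from the LoRA literature and do not appear elsewhere in this card.

\[
W' = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
\]

With the values reported in this card (\(r = 16\), \(\alpha = 32\)), the update is scaled by \(\alpha / r = 2\).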
|
| 216 |
|
| 217 |
+
---
|
| 218 |
|
| 219 |
+
## More Information
|
| 220 |
+
- For better robustness, consider augmenting data with **unit-perturbation** and **noise-in-readings** variants, and add examples across more domains (materials, thermo, optics).
|
| 221 |
+
- Add an **eval harness** with numeric tolerances and syntax checks.
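A minimal sketch of the two augmentation ideas above, assuming observations follow the layout of the pendulum example in this card (a `readings` list of dicts plus `unit_<key>` tags). Function names and that key convention are assumptions about the dataset, not confirmed by the notebook.

```python
import copy
import random


def add_reading_noise(observations: dict, rel_sigma: float, rng: random.Random) -> dict:
    """Multiply every numeric reading by (1 + Gaussian noise with std rel_sigma)."""
    noisy = copy.deepcopy(observations)
    for reading in noisy["readings"]:
        for key, value in reading.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                reading[key] = value * (1 + rng.gauss(0, rel_sigma))
    return noisy


def metres_to_centimetres(observations: dict, key: str = "L") -> dict:
    """Unit perturbation: express one length field in cm and update its unit tag."""
    converted = copy.deepcopy(observations)
    assert converted.get(f"unit_{key}") == "m", "expected the field to be in metres"
    for reading in converted["readings"]:
        reading[key] = reading[key] * 100
    converted[f"unit_{key}"] = "cm"
    return converted
```

Each augmented example also needs a matching reference `code` field: unit-perturbed variants require a solution that converts back to SI before computing, otherwise the augmentation teaches the wrong behavior.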
|
| 222 |
|
| 223 |
+
---
|
| 224 |
|
| 225 |
+
## Model Card Authors
|
| 226 |
+
- You (model author/maintainer)
|
| 227 |
|
| 228 |
## Model Card Contact
|
| 229 |
+
- Add your preferred contact or HF discussion link.
|
| 230 |
+
|
| 231 |
+
---
|
| 232 |
|
| 233 |
+
### Notes on Assumptions & Gaps (for rigor)
|
| 234 |
+
- **Assumption:** The adapter folder `./phi3-lab-report-coder-final` contains PEFT weights (not a merged full model). The notebook’s `save_model` call is consistent with this, and the loading snippet relies on it.
|
| 235 |
+
- **Known gap:** No recorded objective metrics; this card avoids fabricating results. Add a small script to run eval and compute numeric accuracy + ROUGE.
|
| 236 |
+
- **Risk callout:** Dataset size is modest (37/6); without stronger regularization and more variety, generalization is limited.
|