Instructions to use jdsb06/lifestack-grpo-v4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps
- Unsloth Studio
How to use jdsb06/lifestack-grpo-v4 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jdsb06/lifestack-grpo-v4 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jdsb06/lifestack-grpo-v4 to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for jdsb06/lifestack-grpo-v4 to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="jdsb06/lifestack-grpo-v4", max_seq_length=2048, )
πͺ LifeStack-GRPO-v4
Qwen2.5-1.5B fine-tuned with episodic GRPO to resolve multi-domain life crises
Built for the Meta Γ HuggingFace PyTorch OpenEnv Hackathon 2026 Team BholeChature β Scaler School of Technology, Bangalore
The Friday 6:00 PM Problem
It's Friday evening. Your flight home was just cancelled. You open your banking app to rebook β your card is declined due to a "security flag." Simultaneously, a Slack notification: your boss moved Monday's 9:00 AM deadline to Sunday afternoon. You have $200 in cash, five hours of usable energy, and four different people expecting you in different places.
You turn to your AI assistant. It finds a cheaper flight β a 12-hour layover that kills your weekend. You ask it to message your boss, but the tone sounds defensive, triggering a "clarification" meeting that eats more of your time. Every "solution" applied in isolation creates a new wound elsewhere.
This is a Life Problem β cascading, interconnected, resource-constrained. No existing AI environment was built to handle it. We built LifeStack.
What is This Model?
This is v4 of our LoRA adapter for Qwen/Qwen2.5-1.5B-Instruct, fine-tuned using episodic GRPO (Group Relative Policy Optimization) via TRL 0.15.1 + Unsloth 2026.4.8 on the LifeStack environment β an OpenEnv-compatible RL environment with 23 interdependent metrics across 6 life-metric domains, connected by a 32-edge directed dependency graph. Training tasks are sampled across 8 task domains (the 6 life domains + transport_crisis + code_merge_crisis).
Given a structured life crisis, the model outputs a single JSON action plan:
{
"action_type": "negotiate",
"target_domain": "career",
"metric_changes": {"career.workload": -10.0, "mental_wellbeing.stress_level": -5.0},
"resource_cost": {"time": 0.5, "money": 0.0, "energy": 5.0},
"reasoning": "Request deadline extension before financial situation worsens"
}
v4 is the active production model powering the live HF Space demo.
Training Evidence
Reward Curve (v4 β real run from train_run_v4.log)
Loss Curve
Per-Function Reward Components
4-Panel Training Summary
Raw training log: train_run_v4.log (109 kB β full Unsloth/TRL output)
v4 Key Training Metrics (Stage 3 final)
| Metric | Value |
|---|---|
| Peak GRPO reward | 0.856 |
| Format reward (peak) | +0.660 |
| Episode return (avg) | +0.140 |
| Natural EOS terminations | First appeared (model self-terminating after JSON) |
| Training time | ~2 hrs (3 stages Γ ~39 min on A100 80GB) |
| Training hardware | A100 80GB |
| Framework | TRL 0.15.1 + Unsloth 2026.4.8 + Transformers 5.5.0 |
The Full Training Story
We did not get here in one clean run. Here is the honest story of every iteration.
| Run | Key Fix | Reward |
|---|---|---|
| Base (no LoRA) | β | β0.07 eval |
| Run 1 | β (broken JSON parser) | β0.47 eval β 571% worse than base |
| Run 2 | Shorter completions | β0.41 eval |
| Run 3 | Greedy regex re.search(r'\{.*\}', text, re.DOTALL) |
β0.010 β first real learning |
| Run 4 (v1) | 5-stage curriculum | β0.100 eval β consistency β |
| v3 | Episodic GRPO horizon=3 | +0.140 ep. return β new multi-step capability |
| v4 | Dropped dead-weight reward, doubled episode return weight | 0.856 peak |
The single fix that unlocked learning (Run 3):
# Before β always failed on trailing text:
data = json.loads(completion)
# After β extract first complete JSON object (greedy = critical for nested objects):
match = re.search(r'\{.*\}', completion, re.DOTALL)
data = json.loads(match.group())
Why greedy (.*) and not non-greedy (.*?)? Non-greedy stops at the first } β breaking on any nested object like "resource_cost": {"time": 1}. Greedy takes the outermost {...}.
Model Architecture
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Adapter type | LoRA (r=16, alpha=16) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | 18,464,768 / 1,562,179,072 (1.18%) |
| Adapter size | ~74 MB |
| Training hardware | A100 80GB |
| Precision | BF16 |
| Framework | TRL 0.15.1 + Unsloth 2026.4.8 + Transformers 5.5.0 |
Training: 3-Stage Episodic GRPO Curriculum
v4 uses episodic training β the model sees a 3-step sequence, observes the environment state change after each action, and is scored on discounted trajectory return. This is fundamentally harder than single-step scoring.
| Stage | Difficulty | Episodes | Episode Horizon | Avg Reward |
|---|---|---|---|---|
| 1 | 1 (easy) | 60 | 3 | β (not captured in log) |
| 2 | 2 (medium) | 60 | 3 | 0.687 |
| 3 | 3 (hard) | 60 | 3 | 0.717 (peak: 0.856) |
Reward Function Stack (v4)
| Function | Weight | Role |
|---|---|---|
reward_episode_format_fn |
1.0 | JSON validity + required keys + valid action_type |
reward_clean_eos_fn |
0.5 | Penalises trailing text after } |
reward_episode_plausibility_fn |
0.5 | Blocks zero-cost massive metric claims |
reward_episode_return_fn |
2.0 | 3-step discounted environment trajectory reward |
Removed in v4: reward_compact_fn β empirically constant β0.5 across all v3 steps, std=0 β zero gradient contribution. Removing it was as important as any hyperparameter tweak.
The LifeStack Environment
Trained inside LifeStack, an OpenEnv-compatible environment:
- 8 task domains: career, finances, relationships, physical_health, mental_wellbeing, time, transport_crisis, code_merge_crisis
- 32-edge directed dependency graph: stress propagates across connected metrics with 0.6Γ dampening per hop (Starcke & Brand, 2012)
- 23 sub-metrics with baselines and cascade dampening
- 5 personality profiles (Big Five / OCEAN) β same action scores differently for an anxious introvert vs a confident executive
- Exogenous events: probabilistic mid-episode disruptions
- ResourceBudget: time_hours, money_dollars, energy_units β scarcity degrades action effectiveness (Mullainathan & Shafir, 2013)
from core.lifestack_env import LifeStackEnv, LifeStackAction
env = LifeStackEnv()
obs = env.reset(task=task)
action = LifeStackAction(
action_type="negotiate",
target="career",
metric_changes={"career.workload": -15.0},
resource_cost={"time": 0.5, "money": 0.0, "energy": 5.0},
actions_taken=1
)
obs = env.step(action)
print(obs.reward) # float reward from environment simulation
How to Load
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-1.5B-Instruct",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("jdsb06/lifestack-grpo-v4")
model = PeftModel.from_pretrained(base, "jdsb06/lifestack-grpo-v4")
model.eval()
Or with Unsloth (2Γ faster):
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="jdsb06/lifestack-grpo-v4",
max_seq_length=1024,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
Research Grounding
| Principle | Source | Implementation |
|---|---|---|
| Stress cascade dampening (0.6/hop) | Starcke & Brand (2012) | DependencyGraph.cascade() |
| Scarcity bandwidth tax | Mullainathan & Shafir (2013) | Budget depletion blocks actions |
| Multi-objective reward weighting | Roijers et al. (2013) | 4 non-overlapping reward signals |
| Retrieval-augmented moderation | RAM / RAG | ChromaDB LifeStackMemory |
Model Versions
| Model | Description | Best for |
|---|---|---|
| jdsb06/lifestack-grpo | v1 β 5-stage single-step curriculum | Consistency (45/50 non-failing) |
| jdsb06/lifestack-grpo-v3 | v3 β episodic, EOS-aware | Multi-step baseline |
| jdsb06/lifestack-grpo-v4 | v4 β episodic curriculum, peak 0.856 | Production β active demo |
Citation
@misc{lifestack2026,
title = {LifeStack: Training AI to Handle Life's Cascading Crises},
author = {Team BholeChature, Scaler School of Technology},
year = {2026},
howpublished = {Meta Γ HuggingFace PyTorch OpenEnv Hackathon 2026},
url = {https://github.com/oki-dokii/Meta-R2}
}



