Llama-3.1-8B-Viveka

LoRA adapter trained on the Viveka OpenEnv with TRL GRPO and Unsloth 4-bit QLoRA. The reward is deterministic and has six components, computed over mocked Indian DPI (digital public infrastructure) services: UPI, DigiLocker, IRCTC, Banking, Telecom. Training ran for 200 episodes with tier mix 1: 0.4 / 2: 0.4 / 4: 0.2 (tier: share of episodes).
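To make the tier mix concrete, here is a minimal sketch of how 200 episodes could be split across tiers 1, 2, and 4. It assumes the mix values are per-tier sampling probabilities; the names `TIER_WEIGHTS`, `expected_counts`, and `sample_tiers` are hypothetical and not taken from the training code.

```python
import random

# Assumption: the tier mix is a per-tier sampling probability.
TIER_WEIGHTS = {1: 0.4, 2: 0.4, 4: 0.2}
NUM_EPISODES = 200


def expected_counts(weights, num_episodes):
    """Expected number of episodes per tier under the given mix."""
    return {tier: round(w * num_episodes) for tier, w in weights.items()}


def sample_tiers(weights, num_episodes, seed=0):
    """Draw one tier per episode; the seed keeps the draw reproducible."""
    rng = random.Random(seed)
    tiers = list(weights)
    probs = [weights[t] for t in tiers]
    return rng.choices(tiers, weights=probs, k=num_episodes)


print(expected_counts(TIER_WEIGHTS, NUM_EPISODES))  # {1: 80, 2: 80, 4: 40}
```

Under this reading the run would see roughly 80 tier-1, 80 tier-2, and 40 tier-4 episodes. A sampled schedule from `sample_tiers` will deviate from these counts by chance.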

Base model: meta-llama/Llama-3.1-8B-Instruct

Notes: Cross-family scale test. Evaluation loads Unsloth's open 4-bit mirror of the base model because the meta-llama repository is gated. The LoRA was trained against the canonical weights (via Unsloth's auto-redirect).
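One way to attach the adapter to the open 4-bit mirror for evaluation is sketched below, using `transformers` and `peft` rather than the original eval script. `ADAPTER_ID` is a hypothetical placeholder for wherever the adapter is hosted. The sketch assumes `unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit` is the mirror meant here, and that `bitsandbytes` and a CUDA GPU are available.

```python
# Assumption: this is the Unsloth 4-bit mirror the notes refer to.
BASE_MIRROR_ID = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
# Hypothetical placeholder: replace with the real adapter repo or local path.
ADAPTER_ID = "your-namespace/Llama-3.1-8B-Viveka"


def load_for_eval(base_id=BASE_MIRROR_ID, adapter_id=ADAPTER_ID):
    """Load the pre-quantized base model and attach the LoRA adapter."""
    # Imported lazily so the module can be imported without the GPU stack.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    # The mirror ships its quantization config, so none is passed here.
    base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()  # inference only; disables dropout in the adapter
    return model, tokenizer
```

Calling `load_for_eval()` downloads the mirror and the adapter on first use, so it needs network access and enough GPU memory for an 8B model in 4-bit.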