NLA Activation Verbalizer: Phi-4 (14B), AR-native GRPO
LoRA adapter that turns a residual-stream activation vector from Phi-4 into a natural-language description of what the model is computing at that layer. Trained with AR-native GRPO (Group Relative Policy Optimization). The reward signal is the cosine similarity between the original activation and the one the Activation Reconstructor (AR) rebuilds from the description. The adapter therefore directly optimizes for descriptions that carry geometric information about the activation, not for descriptions that sound good.
This is a refinement of the supervised-learning (SL) activation verbalizer (AV). It uses the same architecture and the same injection protocol, but a different training objective. Instead of imitating frontier-LLM descriptions (supervised learning), this adapter learns to produce text from which a separate AR network can reconstruct the original activation.
Part of the nla-at-home project.
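To make the training signal concrete: standard GRPO samples a group of outputs for the same input and scores each one with the reward. Here the input is one activation and the reward is the AR's cosine. Each sample's advantage is then its reward measured against the group's mean and standard deviation, so no separate value network is needed. The sketch below shows only that group normalization, in plain Python. The epsilon, the group size of 4 and the example rewards are illustrative assumptions, not values from this training run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    `rewards` holds one scalar per sampled description of the SAME activation
    (e.g. the AR's reconstruction cosine for each sample). Samples that beat
    the group average get a positive advantage, worse ones a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 descriptions sampled for one activation
advantages = group_relative_advantages([0.62, 0.48, 0.55, 0.41])
print([round(a, 3) for a in advantages])
```

Because the baseline is the group itself, the adapter is pushed toward whichever of its own samples the AR reconstructs best, independent of how those samples compare to the SL targets.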
What changed (SL → GRPO)
The supervised adapter scored 0.474 mean-subtracted cosine on round-trip eval (AV generates description → AR reconstructs → cosine with ground truth). This adapter scores 0.585, a 23% improvement that closes 77% of the gap to the AR ceiling (0.619).
On 2 of 9 evaluation layers (L13, L22), the GRPO adapter produces descriptions that reconstruct better than the ground-truth descriptions the SL adapter was imitating. The AR-native reward found output patterns that frontier-LLM descriptions never used.
Qualitative difference: the SL adapter produced descriptions with correct style but vague content ("forward-looking sentiment," "narrative setup"). The GRPO adapter names specific tokens, identifies task directives, and catches processing tensions ("'never' tokens vs 'surrender' token"). It trades fluency for discriminative signal.
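The round-trip metric and the GRPO reward both use a mean-subtracted (centered) cosine. The sketch below assumes that "mean-subtracted" means subtracting a per-layer mean activation `mu` from both the AR reconstruction and the ground-truth activation before taking the cosine. The card does not say how `mu` is estimated, so it is a hypothetical input here. The toy vectors show why centering matters: two vectors sharing a large common offset look nearly identical under raw cosine, but only moderately similar once the offset is removed.

```python
import math


def centered_cosine(reconstructed, target, mu):
    """Cosine similarity after subtracting a shared mean vector `mu`.

    `mu` is assumed to be a per-layer mean activation (hypothetical input).
    Subtracting it removes the component most activations share, so the
    score reflects what is specific to this activation.
    """
    a = [x - m for x, m in zip(reconstructed, mu)]
    b = [y - m for y, m in zip(target, mu)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, 1e-12)


# Toy 3-d example: both vectors sit near a large common offset of 10.
mu = [10.0, 10.0, 10.0]
target = [11.0, 9.0, 10.0]
reconstructed = [11.0, 10.0, 9.0]
print(round(centered_cosine(reconstructed, target, [0.0, 0.0, 0.0]), 4))  # → 0.9967 (raw cosine)
print(round(centered_cosine(reconstructed, target, mu), 4))               # → 0.5 (centered)
```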
Usage
Same injection protocol as the SL version:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, "anicka/nla-phi4-av-arnative-grpo").eval()
tokenizer = AutoTokenizer.from_pretrained("anicka/nla-phi4-av-arnative-grpo")

# Placeholder token whose embedding is overwritten by the activation.
# The literal character was lost from this card's text, so it is recovered
# from its token id (27347) rather than hard-coded.
INJECTION_TOKEN_ID = 27347
INJECTION_CHAR = tokenizer.decode([INJECTION_TOKEN_ID])
INJECTION_SCALE = 150.0  # L2 norm the injected activation is rescaled to

def make_prompt(depth_pct):
    return (
        "You are a meticulous AI researcher conducting an important investigation "
        "into activation vectors from a language model. Your overall task is to "
        "describe the semantic content of that activation vector.\n\n"
        "We will pass the vector enclosed in <concept> tags into your context, "
        "along with the network depth where it was extracted. "
        "You must then produce an explanation for the vector, enclosed within "
        "<explanation> tags. The explanation consists of 2-3 text snippets "
        "describing that vector.\n\n"
        f"Here is the vector from depth {depth_pct}% of the network:\n\n"
        f"<concept>{INJECTION_CHAR}</concept>\n\n"
        "Please provide an explanation.\n\n"
        "<explanation>"
    )

# Wrap in chat template before tokenizing (depth 55% corresponds to L22 in the eval table)
prompt = make_prompt(depth_pct=55)
chat_str = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False, add_generation_prompt=True)
tokens = tokenizer.encode(chat_str, add_special_tokens=False)

# Find the injection position (raises ValueError if the placeholder token is absent)
inject_pos = tokens.index(INJECTION_TOKEN_ID)
input_ids = torch.tensor([tokens], device="cuda")
embeddings = model.get_input_embeddings()(input_ids).clone()

# activation: shape (5120,), from the layer you want to describe (you must supply it)
norm = activation.float().norm().clamp_min(1e-12)
normalized = activation * (INJECTION_SCALE / norm)
embeddings[0, inject_pos, :] = normalized.to(device=embeddings.device, dtype=embeddings.dtype)

# Generate greedily; with inputs_embeds only, output.sequences holds just the new tokens
output = model.generate(
    inputs_embeds=embeddings,
    attention_mask=torch.ones_like(input_ids),
    max_new_tokens=150, do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True)
text = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
description = text.split("</explanation>")[0].strip()
```
Training
- Base: anicka/nla-phi4-universal-av-v2 (SL-pretrained LoRA)
- Method: GRPO with AR-native reward
- Reward: centered cosine similarity between AR-reconstructed and ground-truth activation (mean-subtracted)
- AR: anicka/nla-phi4-universal-ar-v2 (frozen during GRPO)
- Curriculum: 8 epochs, τ decreasing from 0.40 → 0.10 (easy examples first, progressively harder)
- Samples per epoch: 300
- KL penalty: adaptive, final ~1.67
- Hardware: NVIDIA GB10 (DGX Spark), ~17 hours total
- Final metrics: cos=0.567, reward=0.637, spec=185
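The card only says the KL penalty was adaptive and ended near 1.67. It does not say which controller was used. One common way to make the penalty adaptive is the proportional scheme described by Ziegler et al. (2019). In that scheme, the coefficient grows when the measured KL to the reference policy overshoots a target and shrinks when it undershoots. The sketch below shows that scheme in the variant that scales each step by `n_steps / horizon`. The class name, target KL, horizon and starting coefficient are all hypothetical.

```python
class AdaptiveKLController:
    """Proportional controller for a KL-penalty coefficient (hypothetical sketch).

    The coefficient is multiplied by a factor above 1 when the observed KL
    exceeds `target_kl`, and below 1 when it falls short. The relative error
    is clipped to +/-0.2 so the coefficient drifts gradually.
    """

    def __init__(self, init_coef, target_kl, horizon):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef


# Illustrative values only: KL that keeps overshooting pushes the coefficient up.
ctl = AdaptiveKLController(init_coef=0.1, target_kl=6.0, horizon=10_000)
for observed in [9.0, 12.0, 8.0, 6.0]:
    print(round(ctl.update(observed, n_steps=256), 5))
```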
Curriculum progression
| Epoch | τ (difficulty) | cos | reward |
|---|---|---|---|
| 1 | 0.40 | 0.391 | 0.433 |
| 2 | 0.36 | 0.551 | 0.604 |
| 3 | 0.31 | 0.539 | 0.593 |
| 4 | 0.27 | 0.559 | 0.622 |
| 5 | 0.23 | 0.564 | 0.630 |
| 6 | 0.19 | 0.570 | 0.633 |
| 7 | 0.15 | 0.569 | 0.636 |
| 8 | 0.10 | 0.567 | 0.637 |
Evaluation
Double-holdout round-trip eval (49 texts unseen by both AV and AR):
| Layer | Round-trip cos (GRPO) | Round-trip cos (SL) | AR ceiling (GT descriptions) |
|---|---|---|---|
| L13 (32%) | 0.599 | 0.482 | 0.585 |
| L16 (40%) | 0.610 | 0.496 | 0.616 |
| L19 (47%) | 0.632 | 0.486 | 0.647 |
| L22 (55%) | 0.639 | 0.519 | 0.608 |
| L25 (63%) | 0.610 | 0.471 | 0.625 |
| L28 (71%) | 0.601 | 0.536 | 0.660 |
| L32 (80%) | 0.578 | 0.482 | 0.609 |
| L36 (90%) | 0.558 | 0.413 | 0.616 |
| L38 (96%) | 0.437 | 0.378 | 0.604 |
| Mean | 0.585 | 0.474 | 0.619 |
On L13 and L22, GRPO exceeds the AR ceiling: the adapter found description patterns that reconstruct better than the ground-truth (frontier-LLM-written) targets.
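The headline figures in this card can be recomputed from the table: the Mean row, the 23% relative gain, the 77% of the SL-to-ceiling gap closed, and the two layers where GRPO beats the AR ceiling. The sketch copies the table values verbatim. It assumes only that the Mean row is an unweighted mean over the nine layers, rounded to three decimals, and that the percentages are computed from those rounded means.

```python
# Values copied from the evaluation table: (GRPO, SL, AR ceiling) per layer
results = {
    "L13": (0.599, 0.482, 0.585),
    "L16": (0.610, 0.496, 0.616),
    "L19": (0.632, 0.486, 0.647),
    "L22": (0.639, 0.519, 0.608),
    "L25": (0.610, 0.471, 0.625),
    "L28": (0.601, 0.536, 0.660),
    "L32": (0.578, 0.482, 0.609),
    "L36": (0.558, 0.413, 0.616),
    "L38": (0.437, 0.378, 0.604),
}


def column_mean(col):
    """Unweighted mean over the nine layers, rounded like the table."""
    return round(sum(row[col] for row in results.values()) / len(results), 3)


grpo, sl, ceiling = column_mean(0), column_mean(1), column_mean(2)
improvement = (grpo - sl) / sl             # relative gain of GRPO over SL
gap_closed = (grpo - sl) / (ceiling - sl)  # share of the SL-to-ceiling gap closed
above_ceiling = [name for name, (g, _, c) in results.items() if g > c]

print(grpo, sl, ceiling)  # → 0.585 0.474 0.619
print(f"{improvement:.0%} relative improvement, {gap_closed:.0%} of gap closed")  # → 23% ..., 77% ...
print(above_ceiling)      # → ['L13', 'L22']
```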
Companion models
- AR (reconstructor): anicka/nla-phi4-universal-ar-v2
- SL baseline (prior version): anicka/nla-phi4-universal-av-v2
- Demo: anicka/nla-demo (Phi-4 Mini version)
Limitations
- Trained on Phi-4 activations only. Does not transfer to other architectures.
- L38 (96% depth) remains weak β response-strategy representations are harder to verbalize faithfully.
- Descriptions optimize for AR reconstructability, not human readability. Some outputs are terse or oddly structured.
- The AR ceiling (0.619) limits how much further AV improvements can register on this metric. Improving the AR is now the bottleneck.
License
MIT
Base model: microsoft/phi-4