rnj-1-instruct-FP8-DYNAMIC

Model Description

This is an FP8 dynamic quantized version of EssentialAI/rnj-1-instruct, an 8.3B parameter dense language model trained from scratch by Essential AI, optimized for code and STEM tasks with strong agentic and tool-calling capabilities.

Quantization was performed using LLM Compressor v0.11.0 via a post-training one-shot method (no calibration data required). The checkpoint is saved in the compressed-tensors format, natively supported by vLLM and transformers.

Quantization Details

Property	Value
Base model	`EssentialAI/rnj-1-instruct`
Quantization method	`compressed-tensors` (via LLM Compressor `oneshot`)
Scheme	`FP8_DYNAMIC`
Weight quantization	FP8 (float-quantized), per-channel, symmetric
Activation quantization	FP8 (float-quantized), per-token, dynamic
Targets	All `Linear` layers
Ignored layers	`lm_head` (kept in original precision)
LLM Compressor version	0.11.0
compressed-tensors version	0.16.0
Calibration data	None required (dynamic activations)
Shard count	3
Total size on disk	~5.6 GB (down from ~11.2 GB original, ~50% reduction)

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false

Quantization Code

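from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer in the checkpoint's native precision.
model = AutoModelForCausalLM.from_pretrained("EssentialAI/rnj-1-instruct", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("EssentialAI/rnj-1-instruct")

# Quantize every Linear layer with the FP8_DYNAMIC scheme; lm_head stays unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# One-shot post-training quantization; no calibration dataset is passed.
oneshot(model=model, recipe=recipe)

# Save the quantized checkpoint together with its tokenizer.
model.save_pretrained("./rnj-1-instruct-FP8-DYNAMIC")
tokenizer.save_pretrained("./rnj-1-instruct-FP8-DYNAMIC")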

Why FP8 DYNAMIC?

No calibration data needed: activation scales are computed per token at runtime, so quantization requires no representative dataset.
Near-lossless accuracy: per-channel weight scales and per-token activation scales map values into the FP8 range, so accuracy degradation is typically minimal.
~50% size reduction: FP8 stores one byte per weight instead of the two bytes used by BF16, roughly halving the storage and memory footprint of the quantized layers.
Hardware acceleration: natively supported on NVIDIA Hopper (H100), Ada Lovelace (L40S / RTX 4090), and Blackwell GPUs.
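
To make the per-token scaling concrete, here is a minimal pure-Python sketch. It assumes the FP8 format is E4M3 with a largest finite value of 448, and that each scale is the row's absolute maximum divided by that value. The helper names are hypothetical, rounding to the FP8 grid is omitted, and the actual kernels used by vLLM or compressed-tensors may differ in detail.

```python
# Assumption: FP8 E4M3, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def per_token_scales(activations):
    """Return one scale per token (row) so that row / scale fits the FP8 range."""
    scales = []
    for row in activations:
        amax = max(abs(value) for value in row)
        # Guard against an all-zero row so the later division is safe.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(activations, scales):
    """Divide each row by its own scale (rounding to the FP8 grid is omitted)."""
    return [[value / s for value in row] for row, s in zip(activations, scales)]


# Two "tokens" with very different magnitudes each get their own scale.
activations = [[0.5, -2.0, 1.0], [100.0, -896.0, 3.0]]
scales = per_token_scales(activations)
scaled = scale_rows(activations, scales)

for row, s in zip(scaled, scales):
    print(f"scale={s:.6f} max_abs={max(abs(v) for v in row):.1f}")
```

Because each scale is derived from the live activation values, nothing has to be estimated ahead of time from a calibration set.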

Model Architecture

Based on the Gemma 3 text-only architecture with full global attention and YaRN RoPE scaling for long-context extrapolation.

Hyperparameter	Value
Architecture	`Gemma3ForCausalLM`
Model type	`gemma3_text`
Total parameters	8,837,345,280 (~8.8B) as stored in this checkpoint; the base model is described as 8.3B
Layers	32
Hidden size	4096
MLP intermediate size	16384
Attention heads	32
KV heads (GQA)	8
Head dimension	128
Vocabulary size	128,256
Max position embeddings	32,768 (32K)
Sliding window	32,768
Activation	GeGLU (`gelu_pytorch_tanh`)
RoPE theta	10,000
RoPE scaling	YaRN (factor=4.0, original_max_position_embeddings=8192)
Final logit softcapping	30.0
RMS norm epsilon	1e-6
Tied embeddings	Yes (`lm_head.weight` = `model.embed_tokens.weight`)
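
The parameter figures can be cross-checked from the hyperparameters in the table. The sketch below is arithmetic only and rests on assumptions that the source does not state: a Gemma 3 style layer with four RMSNorm weights of hidden size plus query and key norms of head dimension, one final norm, and, for this checkpoint, a separately stored `lm_head` copy plus one per-channel scale per output row of each quantized `Linear`. Under those assumptions the tied model comes to about 8.31B and the checkpoint total matches the table's figure, which suggests, but does not prove, that this is where the ~8.8B number comes from.

```python
# Hyperparameters copied from the table above.
hidden, layers, intermediate = 4096, 32, 16384
heads, kv_heads, head_dim, vocab = 32, 8, 128, 128_256

q_out = heads * head_dim       # query projection output width
kv_out = kv_heads * head_dim   # key/value projection output width (GQA)

# Weight matrices per layer: q, k, v, o projections plus gate, up, down MLP.
attention = hidden * q_out + 2 * hidden * kv_out + q_out * hidden
mlp = 3 * hidden * intermediate
embedding = vocab * hidden

# Assumed norm layout (not stated in the source): 4 hidden-size norms and
# 2 head_dim-size q/k norms per layer, plus one final hidden-size norm.
norms = layers * (4 * hidden + 2 * head_dim) + hidden

# Tied embeddings: the embedding matrix is counted once.
tied_total = layers * (attention + mlp) + embedding + norms

# Assumed extras in the quantized checkpoint: an untied lm_head copy and
# one scale per output channel of every quantized Linear layer.
scales = layers * (q_out + 2 * kv_out + hidden + 2 * intermediate + hidden)
checkpoint_total = tied_total + embedding + scales

print(f"tied model:       {tied_total:,}")
print(f"checkpoint total: {checkpoint_total:,}")
```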

Long-Context Extrapolation (up to 128K)

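Like the original model, this quantized checkpoint supports extrapolation to 128K context via YaRN RoPE scaling. To enable it, update config.json as follows; the new factor is the target length divided by the original training length, 131072 / 8192 = 16: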

-  "max_position_embeddings": 32768,
+  "max_position_embeddings": 131072,
-  "sliding_window": 32768,
+  "sliding_window": 131072,
   "rope_scaling": {
-    "factor": 4.0,
+    "factor": 16.0,
     ...
   }
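
The same edit can be applied programmatically. The sketch below works on a plain dictionary with only the keys shown in the diff; `extend_context` is a hypothetical helper, not part of any library, and it assumes `original_max_position_embeddings` is present under `rope_scaling` as listed in the architecture table. Any other keys in `rope_scaling` are passed through unchanged.

```python
import json


def extend_context(config, target_len):
    """Return a copy of config with YaRN settings adjusted for target_len."""
    rope = config["rope_scaling"]
    original = rope["original_max_position_embeddings"]
    if target_len % original != 0:
        raise ValueError("target_len should be a multiple of the original length")
    updated = dict(config)
    # The YaRN factor is the ratio of target length to original length.
    updated["rope_scaling"] = {**rope, "factor": float(target_len // original)}
    updated["max_position_embeddings"] = target_len
    updated["sliding_window"] = target_len
    return updated


# Only the fields from the diff above; a real config.json has many more.
config = {
    "max_position_embeddings": 32768,
    "sliding_window": 32768,
    "rope_scaling": {"factor": 4.0, "original_max_position_embeddings": 8192},
}

extended = extend_context(config, 131072)
print(json.dumps(extended, indent=2))
```

In practice you would load the real config.json with `json.load`, pass it through the helper, and write it back with `json.dump`.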

Capabilities

This quantized model is intended to preserve the capabilities of the original rnj-1-instruct:

Code generation: strong on HumanEval+, MBPP+, BigCodeBench, LiveCodeBench v6, and multi-language generation (MultiPL-E).
Agentic coding: 20.8% on SWE-bench Verified (bash-only), competitive with much larger models.
Tool calling: structured tool use via Hermes-compatible <tool_call> and </tool_call> tags; with vLLM, enable it using vllm serve --enable-auto-tool-choice --tool-call-parser hermes.
Math and science: strong on GSM8k, Minerva-MATH-500, AIME '24/'25, GPQA-Diamond, and SuperGPQA.
Code infilling (FIM): supports fill-in-the-middle with <|pre_fim|>, <|suf_fim|>, <|mid_fim|> tokens.
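
When the model is served without a tool-call parser, the raw text contains the tool calls inline. The sketch below shows one way to extract them, assuming the common Hermes convention of a JSON object with `name` and `arguments` keys between the tags. `extract_tool_calls` and the `get_weather` tool are hypothetical, and the sample output is hand-written for illustration, not produced by the model.

```python
import json
import re

# Matches the payload between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            parsed = json.loads(payload)
        except json.JSONDecodeError:
            # Skip malformed payloads rather than failing the whole response.
            continue
        if isinstance(parsed, dict) and "name" in parsed:
            calls.append(
                {"name": parsed["name"], "arguments": parsed.get("arguments", {})}
            )
    return calls


# Hand-written sample in the assumed format.
sample = (
    "Let me check.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(sample)
print(calls)
```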

How to Use

vLLM (recommended for production)

pip install vllm
vllm serve barryke/rnj-1-instruct-FP8-DYNAMIC

With tool-calling support:

vllm serve barryke/rnj-1-instruct-FP8-DYNAMIC \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
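
Once the server is running with tool calling enabled, a client can send tool definitions through the OpenAI-compatible chat completions endpoint. The sketch below only builds the request with the standard library; it assumes vLLM's default port 8000 on localhost, and the `get_weather` tool is a hypothetical example. The system prompt and temperature follow the recommendations in this card.

```python
import json
import urllib.request

MODEL_ID = "barryke/rnj-1-instruct-FP8-DYNAMIC"

payload = {
    "model": MODEL_ID,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather in Paris?"},
    ],
    # Hypothetical tool definition in the OpenAI function-calling schema.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
    "temperature": 0.2,
}


def build_request(base_url, body):
    """Build (but do not send) a POST request for the chat completions API."""
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed address: vLLM's default port on the local machine.
request = build_request("http://localhost:8000", payload)
print(request.full_url)
```

To actually send it, pass `request` to `urllib.request.urlopen` and decode the JSON response; parsed calls, if any, appear under `choices[0].message.tool_calls`.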

transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "barryke/rnj-1-instruct-FP8-DYNAMIC"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    device_map="auto",
)

# Include a system prompt, as recommended below.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# return_dict=True returns input_ids and attention_mask together, which
# behaves the same way across transformers versions.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.2,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
response = tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
print(response)

SGLang

pip install sglang
python3 -m sglang.launch_server \
  --model-path barryke/rnj-1-instruct-FP8-DYNAMIC \
  --host 0.0.0.0 \
  --port 30000

Recommendations

Always use a system prompt: for example, "You are a helpful assistant." Omitting it can cause truncated outputs or unprompted code generation.
Use temperature in [0, 0.2]: higher temperatures may degrade coherence.
Hardware requirement: FP8 inference requires an NVIDIA GPU with compute capability >= 8.9 (Ada Lovelace, Hopper, or Blackwell). For other GPUs, use the original BF16 model or an INT4-quantized variant.
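
The hardware requirement can be expressed as a simple comparison on the (major, minor) compute capability pair. The sketch below is a hypothetical helper that takes the pair as plain integers so it runs without a GPU; with PyTorch installed, `torch.cuda.get_device_capability()` returns such a pair for the current device.

```python
# Minimum compute capability stated in the recommendation above.
MIN_FP8_CAPABILITY = (8, 9)


def supports_native_fp8(major, minor):
    """Return True if a GPU with this compute capability meets the FP8 requirement."""
    # Tuples compare element by element, so (9, 0) >= (8, 9) is True.
    return (major, minor) >= MIN_FP8_CAPABILITY


# Example capabilities: A100 is 8.0, RTX 4090 and L40S are 8.9, H100 is 9.0.
for name, capability in [("A100", (8, 0)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
    print(name, supports_native_fp8(*capability))
```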

Known Limitations

Hallucinations: the base model is primarily a coding/STEM model and is not optimized for factual recall.
Identity confusion: the model may occasionally misidentify itself as a model from another provider.
No knowledge cutoff: the model was not trained with a specific knowledge cutoff date and may hallucinate dates when asked.

License

This model inherits the Apache License 2.0 from the base model.

Citation

@misc{rnj1_instruct,
  title  = {{Rnj-1-Instruct}},
  author = {Ashish Vaswani and Mike Callahan and Adarsh Chaluvaraju and Aleksa Gordic and Devaansh Gupta and Yash Jain and Divya Mansingka and Philip Monk and Khoi Nguyen and Mohit Parmar and Michael Pust and Tim Romanski and Peter Rushton and Ali Shehper and Divya Shivaprasad and Somanshu Singla and Kurt Smith and Saurabh Srivastava and Anil Thomas and Alok Tripathy and Yash Vanjani and Ameya Velingker and {{Essential AI}}},
  year   = {2025},
  url    = {https://huggingface.co/EssentialAI/rnj-1-instruct},
  note   = {Instruction-tuned model release}
}
