Libraries

How to use nur-dev/farabi-1.7b-agent-rag with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-1.7b-agent-rag")
messages = [
    {"role": "user", "content": "Who are you?"},
]
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-1.7b-agent-rag")
# The model is a Qwen3-architecture causal LM, so load it with AutoModelForCausalLM.
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-1.7b-agent-rag")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use nur-dev/farabi-1.7b-agent-rag with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-1.7b-agent-rag"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-1.7b-agent-rag",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use nur-dev/farabi-1.7b-agent-rag with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-1.7b-agent-rag" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-1.7b-agent-rag",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-1.7b-agent-rag" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-1.7b-agent-rag",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nur-dev/farabi-1.7b-agent-rag with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-1.7b-agent-rag
```

Farabi-1.7B Agent-RAG

A 1.7B-parameter, Qwen3-architecture instruction model for retrieval-augmented generation (RAG) and agentic tool use in Kazakh, Russian, and English. Served with vLLM, it exposes an OpenAI-compatible API and emits Hermes-style tool calls, so it can be used with the OpenAI Python SDK and the OpenAI Agents SDK without an adapter.

Capabilities

Multilingual (kk / ru / en). Understands and answers in Kazakh, Russian, and English, including mixed-language prompts.
Grounded RAG. Answers from provided passages/documents, ties claims to the supplied evidence, and abstains when the context is insufficient instead of hallucinating.
Agentic tool calling (Hermes / function calling). Decides whether a tool is needed, asks for missing required arguments, confirms before destructive or mutating actions, emits a valid tool call, and grounds the final answer in the tool result.
Multi-step tool chaining & error recovery. Sequences dependent calls without answering prematurely, and recovers gracefully from not_found / denied / empty results.
Numeric & rule reasoning. Table/fee arithmetic, deadline/eligibility/business-day rules, and structured-output / slot-completion tasks.
Clean, no-think outputs. Training targets are final answers and tool calls (no exposed chain-of-thought), so responses are production-ready.

How to use

Serve with vLLM (OpenAI-compatible, Hermes tool calls)

vllm serve nur-dev/farabi-1.7b-agent-rag \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

The chat template (chat_template.jinja) ships with the model. If your vLLM version does not auto-apply it, add --chat-template chat_template.jinja.

Chat (OpenAI Python SDK)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        # Kazakh: "What will the weather be like in Almaty tomorrow?"
        {"role": "user", "content": "Алматыдағы ауа райы қандай болады ертең?"},
    ],
)
print(resp.choices[0].message.content)

Tool calling (function calling)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[{"role": "user", "content": "What's the weather in Astana?"}],
    tools=tools,
)
msg = resp.choices[0].message
# msg.tool_calls -> [{function: {name: "get_weather", arguments: '{"city": "Astana"}'}}]
# Run the tool, append the tool result as a {"role": "tool", ...} message,
# then call the API again to get the grounded final answer.
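
The round trip described in the comments above can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the `get_weather` body, the `TOOL_REGISTRY` mapping, and the `tool_messages` helper are hypothetical names introduced for illustration. The demonstration at the end uses a stand-in object shaped like an SDK tool call, so it runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    """Hypothetical stand-in tool; a real one would query a weather service."""
    return {"city": city, "forecast": "placeholder"}


# Hypothetical mapping from tool names (as declared in `tools`) to callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def tool_messages(tool_calls, registry=TOOL_REGISTRY):
    """Run each tool call and build the {"role": "tool"} messages to append."""
    messages = []
    for call in tool_calls:
        name = call.function.name
        try:
            # The arguments arrive as a JSON string, not as a dict.
            args = json.loads(call.function.arguments)
            result = registry[name](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            # Report the failure to the model instead of raising,
            # so it can retry or explain the problem.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages


# Offline demonstration with an object shaped like an SDK tool call.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Astana"}'),
)
demo = tool_messages([fake_call])
print(demo)
```

Against a live server, append `msg` itself to the message list, extend the list with `tool_messages(msg.tool_calls)`, and call `client.chat.completions.create` again with the same `tools` to get the final answer.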

RAG (answer from provided context)

context = """[1] The library is open 09:00–18:00 on weekdays.
[2] On Saturdays it closes at 14:00. It is closed on Sundays."""

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. "
                                       "If the context is insufficient, say so."},
        {"role": "user", "content": f"{context}\n\nWhen does the library close on Saturday?"},
    ],
)
print(resp.choices[0].message.content)
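
When the passages come from a retriever rather than a hand-written string, the same prompt layout can be assembled programmatically. The helper below is a sketch: `build_rag_messages` is a hypothetical name, and the `[n]` numbering simply mirrors the example above rather than a format the model is documented to require.

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say so."
)


def build_rag_messages(passages, question):
    """Number the passages as [1], [2], ... and place them before the question."""
    context = "\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ]


rag_messages = build_rag_messages(
    [
        "The library is open 09:00-18:00 on weekdays.",
        "On Saturdays it closes at 14:00. It is closed on Sundays.",
    ],
    "When does the library close on Saturday?",
)
print(rag_messages[1]["content"])
```

The returned list can be passed directly as `messages=` to `client.chat.completions.create`.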

Inference notes

Architecture: Qwen3-compatible causal LM (1.7B), bfloat16.
Context length: 8192 tokens.
Tool-call format: Hermes (--tool-call-parser hermes).
Works with the OpenAI Agents SDK via base_url + any placeholder api_key.
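
When the model runs without vLLM's parser (for example, through Transformers directly), tool calls arrive as raw text. The sketch below assumes the common Hermes convention of a JSON object wrapped in `<tool_call>...</tool_call>` tags; check this against the shipped chat_template.jinja before relying on it. `parse_hermes_tool_calls` is a hypothetical helper, not part of any library.

```python
import json
import re

# Assumed Hermes-style wrapping: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Extract tool calls from raw model output; skip blocks that are not valid JSON."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls


raw_output = (
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"city": "Astana"}}\n'
    '</tool_call>'
)
parsed_calls = parse_hermes_tool_calls(raw_output)
print(parsed_calls)
```

An empty list means the model answered in plain text and no tool needs to run.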

Safetensors: model size 2B params, tensor type BF16.