Farabi-1.7B Agent-RAG

A 1.7B-parameter, Qwen3-architecture instruction model for retrieval-augmented generation (RAG) and agentic tool use in Kazakh, Russian, and English. Served with vLLM, it exposes an OpenAI-compatible API and emits Hermes-style tool calls, so it works with the OpenAI Python SDK and the OpenAI Agents SDK without adapter code.

Capabilities

  • Multilingual (kk / ru / en). Understands and answers in Kazakh, Russian, and English, including mixed-language prompts.
  • Grounded RAG. Answers from provided passages/documents, ties claims to the supplied evidence, and abstains when the context is insufficient instead of hallucinating.
  • Agentic tool calling (Hermes / function calling). Decides whether a tool is needed, asks for missing required arguments, confirms before destructive or mutating actions, emits a valid tool call, and grounds the final answer in the tool result.
  • Multi-step tool chaining & error recovery. Sequences dependent calls without answering prematurely, and recovers gracefully from not_found / denied / empty results.
  • Numeric & rule reasoning. Table/fee arithmetic, deadline/eligibility/business-day rules, and structured-output / slot-completion tasks.
  • Clean, no-think outputs. Training targets are final answers and tool calls only (no exposed chain-of-thought), so responses can be used as-is without stripping reasoning text.

How to use

Serve with vLLM (OpenAI-compatible, Hermes tool calls)

vllm serve nur-dev/farabi-1.7b-agent-rag \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len 8192

The chat template (chat_template.jinja) ships with the model. If your vLLM version does not auto-apply it, add --chat-template chat_template.jinja.
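
Before running the examples below, it can help to confirm that the server is up and reports the expected model ID. The following is a minimal standard-library sketch, assuming the default vLLM address http://localhost:8000 used throughout this card. The helper names parse_model_ids and list_served_models are illustrative and not part of vLLM.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Query the running server and return the model IDs it serves."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Example (requires a running server): print(list_served_models())
```

If the server was started with an API key, this unauthenticated request will be rejected and an Authorization header would need to be added.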

Chat (OpenAI Python SDK)

from openai import OpenAI

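# vLLM accepts any placeholder key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        # Kazakh: "What will the weather be like in Almaty tomorrow?"
        {"role": "user", "content": "Алматыдағы ауа райы қандай болады ертең?"},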
    ],
)
print(resp.choices[0].message.content)

Tool calling (function calling)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# `client` is the OpenAI client created in the chat example above.
resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[{"role": "user", "content": "What's the weather in Astana?"}],
    tools=tools,
)
msg = resp.choices[0].message
# Expected shape: one call to get_weather with arguments '{"city": "Astana"}'.
for call in msg.tool_calls or []:
    print(call.function.name, call.function.arguments)
# Next: run the tool, append its result as a {"role": "tool", ...} message,
# then call the API again to get the grounded final answer.
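
The remaining steps (run the tool, append the result, call the API again) can be sketched as a small loop. This is one possible arrangement, not the author's reference implementation: get_weather is a hypothetical stub with a fixed reply, and TOOL_REGISTRY, assistant_message, execute_tool_calls, answer_with_tools, and the max_rounds cap are names and choices made up for illustration. Tool errors are returned to the model as JSON so it has a chance to recover rather than the loop crashing.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Maps tool names from the `tools` schema to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def assistant_message(msg) -> dict:
    """Convert the SDK's assistant message into a plain dict for the history."""
    return {
        "role": "assistant",
        "content": msg.content or "",
        "tool_calls": [
            {
                "id": call.id,
                "type": "function",
                "function": {
                    "name": call.function.name,
                    "arguments": call.function.arguments,
                },
            }
            for call in msg.tool_calls or []
        ],
    }


def execute_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list[dict]:
    """Run each requested tool and return {"role": "tool"} messages."""
    results = []
    for call in tool_calls or []:
        try:
            fn = registry.get(call.function.name)
            if fn is None:
                raise ValueError(f"unknown tool: {call.function.name}")
            output = fn(**json.loads(call.function.arguments or "{}"))
        except Exception as exc:  # report the failure back to the model
            output = {"error": str(exc)}
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output, ensure_ascii=False),
        })
    return results


def answer_with_tools(client, model, messages, tools, max_rounds=4):
    """Call the model, run any requested tools, and repeat until it answers."""
    history = list(messages)
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model=model, messages=history, tools=tools
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        history.append(assistant_message(msg))
        history.extend(execute_tool_calls(msg.tool_calls))
    raise RuntimeError("model kept requesting tools; max_rounds exceeded")
```

With the client and tools defined above, a call would look like answer_with_tools(client, "nur-dev/farabi-1.7b-agent-rag", [{"role": "user", "content": "What's the weather in Astana?"}], tools).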

RAG (answer from provided context)

context = """[1] The library is open 09:00–18:00 on weekdays.
[2] On Saturdays it closes at 14:00. It is closed on Sundays."""

# `client` is the OpenAI client created in the chat example above.
resp = client.chat.completions.create(
    model="nur-dev/farabi-1.7b-agent-rag",
    messages=[
        {"role": "system", "content": "Answer only from the provided context. "
                                       "If the context is insufficient, say so."},
        {"role": "user", "content": f"{context}\n\nWhen does the library close on Saturday?"},
    ],
)
print(resp.choices[0].message.content)
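
When passages come from a retriever rather than a hard-coded string, the same prompt layout can be assembled programmatically. The sketch below mirrors the example above: numbered passages, the same system instruction, and the question after a blank line. The names format_context, build_rag_messages, and SYSTEM_PROMPT are illustrative, and the [n] numbering is simply the convention used in this card's example, not a format the model is documented to require.

```python
SYSTEM_PROMPT = (
    "Answer only from the provided context. "
    "If the context is insufficient, say so."
)


def format_context(passages: list[str]) -> str:
    """Number passages as [1], [2], ... so answers can point back to them."""
    return "\n".join(
        f"[{index}] {text.strip()}" for index, text in enumerate(passages, start=1)
    )


def build_rag_messages(passages: list[str], question: str) -> list[dict]:
    """Build the system + user messages for a context-grounded question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{format_context(passages)}\n\n{question}"},
    ]
```

The returned list can be passed directly as the messages argument of client.chat.completions.create.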

Inference notes

  • Architecture: Qwen3-compatible causal LM (1.7B), bfloat16.
  • Context length: 8192 tokens.
  • Tool-call format: Hermes (--tool-call-parser hermes).
  • Works with the OpenAI Agents SDK: point the client's base_url at the vLLM server and pass any placeholder api_key.
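
For the Agents SDK note above, a minimal wiring sketch follows. It assumes the openai-agents package is installed and that the vLLM server from this card is running on localhost:8000. The get_weather stub, the agent name, and the instructions string are placeholders chosen for illustration. Tracing is disabled because the SDK's default trace export targets OpenAI's platform, which a placeholder key cannot authenticate against.

```python
BASE_URL = "http://localhost:8000/v1"  # assumption: local vLLM server
MODEL_ID = "nur-dev/farabi-1.7b-agent-rag"


def get_weather(city: str) -> str:
    """Get the weather for a city (hypothetical stub with a fixed reply)."""
    return f"The weather in {city} is sunny, 21 C."


def build_agent():
    """Create an Agents SDK agent backed by the local vLLM server."""
    # Imported lazily so this module loads even without the SDK installed.
    from agents import (
        Agent,
        OpenAIChatCompletionsModel,
        function_tool,
        set_tracing_disabled,
    )
    from openai import AsyncOpenAI

    set_tracing_disabled(True)  # no trace export with a placeholder key

    client = AsyncOpenAI(base_url=BASE_URL, api_key="x")
    return Agent(
        name="Farabi assistant",
        instructions="Use tools when needed and answer from their results.",
        model=OpenAIChatCompletionsModel(model=MODEL_ID, openai_client=client),
        tools=[function_tool(get_weather)],
    )


def ask(question: str) -> str:
    """Run one agent turn synchronously and return the final text answer."""
    from agents import Runner

    result = Runner.run_sync(build_agent(), question)
    return result.final_output


# Example (requires a running server): print(ask("What's the weather in Astana?"))
```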