OkeyMeta committed · verified
Commit 5348cd5 · 1 Parent(s): 52da7b7

Add-openai-compatible-runtime-docs

README.md CHANGED
@@ -100,6 +100,14 @@ Then send one JSON object per line:
  {"prompt":"Who won the most recent mayoral runoff in Rivergate?","tool_results":[{"name":"web.search","ok":true,"source":{"title":"Local Civic Wire","url":"https://example.org/rivergate-runoff","snippet":"Mara Ibekwe won the Rivergate mayoral runoff with 52.4 percent of the vote."}}],"max_tokens":80}
  ```
 
+ For OpenAI-style chat completion JSON:
+
+ ```bash
+ python -m reframr chat-completion --model model.safetensors < request.json
+ ```
+
+ Set `"stream": true` in the request to receive SSE-style `data: ...` chunks ending with `data: [DONE]`. See `docs/openai_compat.md` for chat, streaming, and host-side tool-loop examples.
+
  ## OpenAI-Style Tool Format
 
  Reframr v2 can consume OpenAI-style `messages` and tool results through the included `compose_generation_context` helper. The model does not browse by itself from static weights; your app provides tool outputs, and Reframr writes the final answer from that evidence.
docs/openai_compat.md ADDED
@@ -0,0 +1,102 @@
# Reframr OpenAI-Compatible Runtime

Reframr v3 runtime work includes an OpenAI-style adapter so apps can plug Reframr into existing chat, support, and tool-orchestration systems without writing custom prompt glue.

## Chat Completion

```python
from pathlib import Path

from reframr import ReframrModel, build_chat_completion_response

model = ReframrModel.load(Path("model.safetensors"))

response = build_chat_completion_response(
    model,
    {
        "model": "reframr-v3",
        "messages": [
            {"role": "system", "content": "Be concise and cite sources when tool results are provided."},
            {"role": "user", "content": "Summarize this customer support issue."},
        ],
        "max_tokens": 160,
        "temperature": 0.58,
    },
)

print(response["choices"][0]["message"]["content"])
```
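`build_chat_completion_response` returns an OpenAI-style `chat.completion` payload. A trimmed sketch of the fields this adapter fills (id, timestamp, and token counts illustrative; usage is approximated by whitespace word count):

```json
{
  "id": "chatcmpl-3fa8...",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "reframr-v3",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 42, "completion_tokens": 30, "total_tokens": 72}
}
```

When the model requests a tool, `content` is empty, `message.tool_calls` is populated, and `finish_reason` becomes `"tool_calls"`.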
## Streaming

```python
from reframr.openai_compat import iter_sse_chat_completion

for event in iter_sse_chat_completion(model, request):
    send_to_browser(event)
```

The stream emits OpenAI-style `chat.completion.chunk` SSE events and ends with:

```text
data: [DONE]
```
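Earlier events each carry an incremental delta: the first chunk sets `{"role": "assistant"}`, the following chunks carry content slices (12 characters each by default), and the final chunk carries the finish reason. A representative content chunk (id and timestamp illustrative):

```text
data: {"id":"chatcmpl-3fa8...","object":"chat.completion.chunk","created":1700000000,"model":"reframr-v3","choices":[{"index":0,"delta":{"content":"Thanks for "},"finish_reason":null}]}
```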
## Tool Loop

Register real tools in the host application. Reframr can request a tool with `<tool_call>`, the host executes the function, and the result is fed back as `<tool_result>` / `<source>` evidence.
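Concretely, `parse_tool_call` recognizes a completion that starts with the `<tool_call>` marker, followed by the tool name and JSON arguments (the query below is illustrative):

```text
<tool_call>web.search {"query": "latest official release notes"}
```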
```python
from reframr.openai_compat import run_tool_loop

def web_search(arguments: dict[str, object]) -> dict[str, object]:
    query = str(arguments["query"])
    result = your_search_client.search(query)
    return {
        "ok": True,
        "source": {
            "title": result.title,
            "url": result.url,
            "snippet": result.snippet,
        },
    }

response = run_tool_loop(
    model,
    {
        "model": "reframr-v3",
        "messages": [
            {"role": "user", "content": "What changed in the latest official release notes?"}
        ],
    },
    tools={"web.search": web_search},
    max_rounds=3,
)
```

If a tool is missing or fails, the adapter sends the failure back as a tool result instead of crashing. That lets Reframr answer honestly, retry with a different tool if the model requests one, or ask the user for source evidence.
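For example, a call to a tool that was never registered reaches the model as this tool result (tool name illustrative):

```json
{"ok": false, "error": "tool not registered: news.fetch"}
```

A tool that raises an exception is reported the same way, with the exception message in `error`.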
## CLI

```bash
python -m reframr chat-completion --model model.safetensors < request.json
```
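The request can also be read from a file instead of stdin via the `--request` flag:

```bash
python -m reframr chat-completion --model model.safetensors --request request.json
```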
For SSE output:

```json
{
  "model": "reframr-v3",
  "stream": true,
  "messages": [
    {"role": "user", "content": "Write a short support reply."}
  ]
}
```
## Deployment Notes

- Keep real tools outside the model runtime and pass their outputs back as data.
- Treat source quality as part of the product: validate URLs, timestamps, permissions, and user access.
- Do not let the model fabricate tool results. If no tool result exists for a fresh fact, the app should ask for retrieval or return an uncertainty-aware answer.
- Use `session_id` with `python -m reframr serve` when you want conversation memory in the JSONL server (see the sketch below).
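A minimal sketch of a JSONL request that opts into conversation memory, assuming the same request fields as the README's JSONL examples plus a top-level `session_id` key (the value is illustrative):

```json
{"prompt":"And what was the turnout?","session_id":"support-1423","max_tokens":80}
```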
reframr/__init__.py CHANGED
@@ -13,6 +13,7 @@ from .config import ReframrConfig
  from .embeddings import EmbeddingModel, fit_ppmi_embedding
  from .hippo import AnalyticalMemoryUnit, hippo_legs_matrix
  from .model import ReframrModel
+ from .openai_compat import build_chat_completion_response, run_tool_loop
  from .reasoning import REASONING_CONTROL_TOKENS, REASONING_PROFILES, TOKENIZER_NAME
  from .tokenizer import NativeTokenizer
 
@@ -25,8 +26,10 @@ __all__ = [
      "ReframrConfig",
      "ReframrModel",
      "TOKENIZER_NAME",
+     "build_chat_completion_response",
      "fit_ppmi_embedding",
      "hippo_legs_matrix",
      "inspect_checkpoint",
      "read_safetensor_file",
+     "run_tool_loop",
  ]
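With this export change, both helpers are importable from the package root alongside the model class:

```python
from reframr import ReframrModel, build_chat_completion_response, run_tool_loop
```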
reframr/cli.py CHANGED
@@ -261,6 +261,17 @@ def build_parser() -> argparse.ArgumentParser:
          help="Override the checkpoint's default reasoning-control profile.",
      )
 
+     chat_completion = subparsers.add_parser(
+         "chat-completion",
+         help="Run one OpenAI-compatible chat completion request from stdin or a JSON file.",
+     )
+     chat_completion.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
+     chat_completion.add_argument(
+         "--request",
+         default="",
+         help="Optional path to a JSON request. Defaults to stdin.",
+     )
+
      trace = subparsers.add_parser("trace", help="Trace REFRAMR reasoning components through generation steps.")
      trace.add_argument("--model", required=True, help="Path to a serialized REFRAMR model.")
      trace.add_argument("--context", required=True, help="Prompt or starting context text.")
@@ -1204,6 +1215,29 @@ def command_serve(args: argparse.Namespace) -> int:
      return 0
 
 
+ def command_chat_completion(args: argparse.Namespace) -> int:
+     from .openai_compat import build_chat_completion_response, iter_sse_chat_completion
+
+     request_path = str(getattr(args, "request", "")).strip()
+     if request_path:
+         request_text = Path(request_path).read_text(encoding="utf-8")
+     else:
+         request_text = sys.stdin.read()
+     request = json.loads(request_text)
+     if not isinstance(request, dict):
+         raise ValueError("chat-completion request must be a JSON object")
+     model = ReframrModel.load(args.model)
+     if bool(request.get("stream", False)):
+         for event in iter_sse_chat_completion(model, request):
+             sys.stdout.write(event)
+             sys.stdout.flush()
+         return 0
+     response = build_chat_completion_response(model, request)
+     sys.stdout.write(json.dumps(response, ensure_ascii=False, separators=(",", ":")) + "\n")
+     sys.stdout.flush()
+     return 0
+
+
  def command_trace(args: argparse.Namespace) -> int:
      model = ReframrModel.load(args.model)
      payload = model.trace_generation(
@@ -1452,6 +1486,8 @@ def main(argv: list[str] | None = None) -> int:
          return command_generate_batch(args)
      if args.command == "serve":
          return command_serve(args)
+     if args.command == "chat-completion":
+         return command_chat_completion(args)
      if args.command == "trace":
          return command_trace(args)
      if args.command == "inspect":
reframr/openai_compat.py ADDED
@@ -0,0 +1,253 @@
from __future__ import annotations

import json
import time
import uuid
from typing import Any, Callable

from .cli import compose_generation_context


def build_chat_completion_response(model: Any, request: dict[str, Any]) -> dict[str, Any]:
    """Run a Reframr model behind an OpenAI-style chat-completions shape."""

    model_name = str(request.get("model", "reframr"))
    context = compose_generation_context(
        str(request.get("prompt", "")),
        system=str(request.get("system", "")),
        messages=request.get("messages"),
        tool_results=request.get("tool_results", request.get("toolResults")),
    )
    generated_text = str(
        model.generate_text(
            context,
            max_tokens=int(request.get("max_tokens", request.get("max_completion_tokens", 120))),
            reasoning_mode=request.get("reasoning_mode", request.get("reasoningMode")),
            temperature=float(request.get("temperature", 0.58)),
            top_k=int(request.get("top_k", request.get("decode_top_k", 64))),
            top_p=float(request.get("top_p", request.get("decode_top_p", 0.92))),
            repetition_penalty=float(request.get("repetition_penalty", 1.25)),
        )
    ).strip()
    tool_call = parse_tool_call(generated_text)
    if tool_call is None:
        message = {"role": "assistant", "content": generated_text}
        finish_reason = "stop"
    else:
        message = {"role": "assistant", "content": "", "tool_calls": [tool_call]}
        finish_reason = "tool_calls"
    prompt_tokens = _approx_token_count(context)
    completion_tokens = _approx_token_count(generated_text)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "choices": [
            {
                "index": 0,
                "message": message,
                "finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


def iter_chat_completion_chunks(
    model: Any,
    request: dict[str, Any],
    *,
    chunk_size: int = 12,
) -> Any:
    """Yield OpenAI-style streaming chunk dictionaries for a Reframr response."""

    full_response = build_chat_completion_response(model, request)
    model_name = str(full_response["model"])
    response_id = str(full_response["id"])
    created = int(full_response["created"])
    choice = full_response["choices"][0]
    message = choice["message"]
    yield _stream_chunk(
        response_id,
        model_name,
        created,
        {"role": "assistant"},
        finish_reason=None,
    )
    tool_calls = message.get("tool_calls") if isinstance(message, dict) else None
    if isinstance(tool_calls, list) and tool_calls:
        yield _stream_chunk(
            response_id,
            model_name,
            created,
            {"tool_calls": tool_calls},
            finish_reason=None,
        )
    else:
        content = str(message.get("content", "")) if isinstance(message, dict) else ""
        for part in _split_stream_content(content, chunk_size=max(1, int(chunk_size))):
            yield _stream_chunk(
                response_id,
                model_name,
                created,
                {"content": part},
                finish_reason=None,
            )
    yield _stream_chunk(
        response_id,
        model_name,
        created,
        {},
        finish_reason=str(choice.get("finish_reason", "stop")),
    )


def iter_sse_chat_completion(
    model: Any,
    request: dict[str, Any],
    *,
    chunk_size: int = 12,
) -> Any:
    for chunk in iter_chat_completion_chunks(model, request, chunk_size=chunk_size):
        yield f"data: {json.dumps(chunk, ensure_ascii=False, separators=(',', ':'))}\n\n"
    yield "data: [DONE]\n\n"


def run_tool_loop(
    model: Any,
    request: dict[str, Any],
    *,
    tools: dict[str, Callable[[dict[str, Any]], Any]],
    max_rounds: int = 3,
) -> dict[str, Any]:
    """Run chat completions, executing registered tools when the model asks."""

    messages = [dict(message) for message in request.get("messages", []) if isinstance(message, dict)]
    current_request = dict(request)
    last_response: dict[str, Any] | None = None
    for _ in range(max(1, int(max_rounds))):
        current_request["messages"] = messages
        last_response = build_chat_completion_response(model, current_request)
        choice = last_response["choices"][0]
        message = choice["message"]
        if choice.get("finish_reason") != "tool_calls":
            return last_response
        tool_calls = message.get("tool_calls", [])
        if not isinstance(tool_calls, list) or not tool_calls:
            return last_response
        messages.append({"role": "assistant", "content": "", "tool_calls": tool_calls})
        for tool_call in tool_calls:
            tool_result = _execute_tool_call(tool_call, tools)
            function_payload = tool_call.get("function", {}) if isinstance(tool_call, dict) else {}
            tool_name = str(function_payload.get("name", "tool"))
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": str(tool_call.get("id", "")) if isinstance(tool_call, dict) else "",
                    "name": tool_name,
                    "content": json.dumps(tool_result, ensure_ascii=False, separators=(",", ":")),
                }
            )
    return last_response if last_response is not None else build_chat_completion_response(model, request)


def parse_tool_call(text: str) -> dict[str, Any] | None:
    stripped = text.strip()
    marker = "<tool_call>"
    if not stripped.startswith(marker):
        return None
    payload = stripped[len(marker) :].strip()
    if not payload:
        return _tool_call_payload("tool", {})
    name, _, raw_arguments = payload.partition(" ")
    name = name.strip() or "tool"
    arguments = _normalize_tool_arguments(raw_arguments.strip())
    return _tool_call_payload(name, arguments)


def _execute_tool_call(
    tool_call: Any,
    tools: dict[str, Callable[[dict[str, Any]], Any]],
) -> dict[str, Any]:
    if not isinstance(tool_call, dict):
        return {"ok": False, "error": "tool_call must be an object"}
    function_payload = tool_call.get("function", {})
    function = function_payload if isinstance(function_payload, dict) else {}
    tool_name = str(function.get("name", ""))
    arguments = _normalize_tool_arguments(str(function.get("arguments", "")))
    tool = tools.get(tool_name)
    if tool is None:
        return {"ok": False, "error": f"tool not registered: {tool_name}"}
    try:
        result = tool(arguments)
    except Exception as exc:  # pragma: no cover - defensive surface for app tools.
        return {"ok": False, "error": str(exc)}
    if isinstance(result, dict):
        return result
    return {"ok": True, "content": result}


def _tool_call_payload(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    return {
        "id": f"call_{uuid.uuid4().hex[:12]}",
        "type": "function",
        "function": {
            "name": name,
            "arguments": json.dumps(arguments, ensure_ascii=False, separators=(",", ":")),
        },
    }


def _stream_chunk(
    response_id: str,
    model_name: str,
    created: int,
    delta: dict[str, Any],
    *,
    finish_reason: str | None,
) -> dict[str, Any]:
    return {
        "id": response_id,
        "object": "chat.completion.chunk",
        "created": created,
        "model": model_name,
        "choices": [
            {
                "index": 0,
                "delta": delta,
                "finish_reason": finish_reason,
            }
        ],
    }


def _split_stream_content(content: str, *, chunk_size: int) -> list[str]:
    if not content:
        return []
    chunks: list[str] = []
    start = 0
    while start < len(content):
        chunks.append(content[start : start + chunk_size])
        start += chunk_size
    return chunks


def _normalize_tool_arguments(raw_arguments: str) -> dict[str, Any]:
    if not raw_arguments:
        return {}
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return {"input": raw_arguments}
    if isinstance(parsed, dict):
        return parsed
    return {"input": parsed}


def _approx_token_count(text: str) -> int:
    return len([part for part in text.split() if part])