broken-model (fixed)

HuggingFace Repo: https://huggingface.co/suyashdb/broken-model-fixed/tree/main

Changes Made

1. README.md: base_model corrected

  • Before: meta-llama/Meta-Llama-3.1-8B
  • After: Qwen/Qwen3-8B
  • Why: The model architecture (Qwen3ForCausalLM), tokenizer class (Qwen2Tokenizer), vocabulary size (151936), and all config values exactly match Qwen3-8B, not Llama-3.1-8B. The wrong base_model declaration was misleading but not the functional blocker.
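The comparison behind this fix can be sketched in a few lines. This is a minimal illustration, not the script actually used: the dictionaries are hand-copied excerpts of the relevant config.json fields (nothing is downloaded), and the helper name config_mismatches is hypothetical.

```python
# Hand-copied excerpts of config.json fields; illustrative, not fetched from the Hub.
LLAMA_31_8B = {"architectures": ["LlamaForCausalLM"], "vocab_size": 128256}
QWEN3_8B = {"architectures": ["Qwen3ForCausalLM"], "vocab_size": 151936}


def config_mismatches(config: dict, reference: dict) -> dict:
    """Return {key: (config_value, reference_value)} for every key that differs."""
    return {
        key: (config.get(key), ref_value)
        for key, ref_value in reference.items()
        if config.get(key) != ref_value
    }


# Fields taken from this repo's config.json, as described above.
repo_config = {"architectures": ["Qwen3ForCausalLM"], "vocab_size": 151936}

print("vs Llama-3.1-8B:", config_mismatches(repo_config, LLAMA_31_8B))
print("vs Qwen3-8B:    ", config_mismatches(repo_config, QWEN3_8B))
```

An empty result against the Qwen3-8B reference and two mismatches against the Llama reference is what justifies changing base_model.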

2. tokenizer_config.json: chat_template added

  • Before: The chat_template field was entirely absent from tokenizer_config.json.
  • After: Added the full Jinja2 chat template from the canonical Qwen/Qwen3-8B model.
  • Why this broke inference: Any OpenAI-compatible inference server (vLLM, TGI, FriendliAI engine) calls tokenizer.apply_chat_template() to convert the messages array in a /chat/completions request into a single prompt string. Without a chat_template, this call fails with an error reporting that no chat template is set for the tokenizer (the exact wording depends on the transformers version), and the server cannot process any request. The model weights themselves are intact; only the tokenizer configuration was missing this critical field.

The added template handles:

  • System / user / assistant message formatting using <|im_start|> / <|im_end|> tokens
  • Tool call formatting (<tool_call> / <tool_response>)
  • Thinking mode: when enable_thinking=False is passed, the template injects <think>\n\n</think> to suppress chain-of-thought output
  • Multi-turn reasoning content (reasoning_content field on assistant messages)
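To make the first and third bullets concrete, here is a simplified pure-Python stand-in for what the template does. It is a sketch under assumptions, not the real Jinja2 template: render_chatml is a hypothetical helper, and it deliberately omits tool calls and reasoning_content handling.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Simplified stand-in for the chat template: plain messages only."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # An empty think block tells the model to skip chain-of-thought.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]
print(render_chatml(messages))
print(render_chatml(messages, enable_thinking=False))
```

With the real tokenizer, the equivalent switch is passed as a keyword argument: tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False).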

3. Vocab/tokenizer files added

  • vocab.json, tokenizer.json, and special_tokens_map.json were uploaded from the canonical Qwen/Qwen3-8B model.
  • The original broken repo was missing these, making it impossible to load the tokenizer standalone.
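A quick way to catch this kind of gap is to compare a repo's file listing against the tokenizer files named in this write-up. The sketch below is illustrative: missing_tokenizer_files is a hypothetical helper, and the example listing is made up rather than copied from the original repo.

```python
# Tokenizer files mentioned in this write-up.
EXPECTED_TOKENIZER_FILES = (
    "tokenizer_config.json",
    "vocab.json",
    "tokenizer.json",
    "special_tokens_map.json",
)


def missing_tokenizer_files(repo_files):
    """Return the expected tokenizer files that are absent from repo_files."""
    present = set(repo_files)
    return [name for name in EXPECTED_TOKENIZER_FILES if name not in present]


# Hypothetical listing, for illustration only.
example_listing = ["config.json", "tokenizer_config.json", "model.safetensors"]
print(missing_tokenizer_files(example_listing))
```

In practice the listing could come from huggingface_hub.list_repo_files("suyashdb/broken-model-fixed"), which requires network access.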

Verification

You can verify the fix without model weights — just the tokenizer:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("suyashdb/broken-model-fixed")

messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected output:
# <|im_start|>user
# What is 2+2?<|im_end|>
# <|im_start|>assistant

Part B — Why reasoning_effort Does Nothing

If you've tried passing reasoning_effort: "low" or reasoning_effort: "high" in your requests and noticed zero difference in the output — you're not imagining it. Here's why.

The short answer

This model has no idea what reasoning_effort means. It was never trained to respond to it.

The longer answer

reasoning_effort is a parameter from OpenAI's o-series API (o1, o3, o4-mini). The idea is that you can tell the model how hard to think: "low" means give me a quick answer, "high" means really work through it. OpenAI has not published the training details, but the behavior is consistent with budget-aware training (often called budget-forcing): during training, the model is given a token budget and rewarded for getting the right answer within that budget, so over time it learns to compress or expand its reasoning based on the hint.

Qwen3-8B was not trained that way. It has two modes: thinking (where it produces a <think>...</think> block before answering) and non-thinking (where it skips that entirely). That's a binary on/off switch, not a dial. When you send reasoning_effort: "medium", nothing in the model or its chat template recognizes the value, so it is ignored. Apart from ordinary sampling variation, the output is the same regardless of what value you pass.

What would need to change to make it work

  1. The model needs to be retrained with budget-forcing. During fine-tuning, you'd prepend a budget token to each prompt (something like <budget>512</budget>) and train the model to produce correct answers within that many tokens. This teaches it to actually reason more efficiently when the budget is tight, rather than just cutting off mid-thought.

  2. The inference server needs to translate reasoning_effort into a concrete token limit and either inject it into the prompt in a format the model understands, or hard-stop the <think> block after N tokens by force-injecting </think>. The second approach is blunt — it truncates reasoning but doesn't make the model reason smarter.

  3. The API layer (whatever sits between the client and the model) needs to map "low" / "medium" / "high" to actual numbers and pass them through correctly. Right now most serving stacks just forward unknown parameters to the model, which silently ignores them.

  4. Realistically, the easiest path is to use a model that already supports this natively, such as a Qwen3 variant served through FriendliAI's serverless API, which exposes max_thinking_tokens, or OpenAI's o-series, which was purpose-built for reasoning_effort. Retrofitting budget-forcing onto an existing model requires retraining, not just a config change.
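Steps 2 and 3 above can be sketched roughly as follows. Everything here is an assumption made for illustration: the budget numbers are arbitrary, thinking_budget and generate_with_cap are hypothetical helpers, and next_token stands in for one decoding step of a real engine. Tokens are plain strings rather than tokenizer IDs.

```python
# Hypothetical mapping from reasoning_effort to a thinking-token budget.
EFFORT_TO_BUDGET = {"low": 256, "medium": 1024, "high": 4096}


def thinking_budget(effort="medium"):
    """Translate a reasoning_effort value into a token budget (step 3)."""
    try:
        return EFFORT_TO_BUDGET[effort]
    except KeyError:
        raise ValueError(f"unsupported reasoning_effort: {effort!r}") from None


def generate_with_cap(next_token, budget, max_tokens=64):
    """Decode token by token, force-injecting </think> once the budget is spent (step 2).

    next_token(context) -> str is a stand-in for one decoding step.
    """
    out = ["<think>"]
    thinking, used = True, 0
    while len(out) < max_tokens:
        if thinking and used >= budget:
            out.append("</think>")  # blunt hard stop: reasoning is truncated
            thinking = False
            continue
        token = next_token(out)
        out.append(token)
        if token == "<|im_end|>":
            break
        if thinking:
            if token == "</think>":
                thinking = False  # the model closed the block on its own
            else:
                used += 1
    return out


def toy_model(context):
    """Toy decoder: reasons forever unless the think block has been closed."""
    if "</think>" not in context:
        return "step"
    return "4" if context[-1] == "</think>" else "<|im_end|>"


print(thinking_budget("low"))
print(generate_with_cap(toy_model, budget=3))
```

The toy model never closes its own think block, so the cap is the only thing that ends the reasoning. That is exactly why this approach truncates rather than improves reasoning: the model was never trained to plan around the limit.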
