broken-model (fixed)
HuggingFace Repo: https://huggingface.co/suyashdb/broken-model-fixed/tree/main
Changes Made
1. README.md — base_model corrected
- Before: meta-llama/Meta-Llama-3.1-8B
- After: Qwen/Qwen3-8B
- Why: The model architecture (Qwen3ForCausalLM), tokenizer class (Qwen2Tokenizer), vocabulary size (151936), and all config values exactly match Qwen3-8B, not Llama-3.1-8B. The wrong base_model declaration was misleading but not the functional blocker.
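The comparison described above can be sketched as a small field-by-field check. This is only an illustration: the expected values are the ones quoted in this section, while the `mismatches` helper and the sample `config` dict are hypothetical and not part of the repo.

```python
# Values quoted above for this repo's config (architecture and vocabulary size).
EXPECTED_QWEN3_8B = {
    "architectures": ["Qwen3ForCausalLM"],
    "vocab_size": 151936,
}


def mismatches(config: dict, expected: dict) -> dict:
    """Return {key: (found, expected)} for every field that differs."""
    return {
        key: (config.get(key), value)
        for key, value in expected.items()
        if config.get(key) != value
    }


# Hypothetical excerpt of a config.json, already parsed into a dict.
config = {"architectures": ["Qwen3ForCausalLM"], "vocab_size": 151936}

# An empty dict means every checked field agrees with the expected base model.
print(mismatches(config, EXPECTED_QWEN3_8B))
```

In practice the dict would come from `json.load` on the repo's config.json; more fields can be added to the expected mapping as needed.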
2. tokenizer_config.json — chat_template added
- Before: The chat_template field was entirely absent from tokenizer_config.json.
- After: Added the full Jinja2 chat template from the canonical Qwen/Qwen3-8B model.
- Why this broke inference: Any OpenAI-compatible inference server (vLLM, TGI, FriendliAI engine) calls tokenizer.apply_chat_template() to convert the messages array in a /chat/completions request into a single prompt string. Without a chat_template, this call fails with an error along the lines of "No chat template is set for this tokenizer" (the exact wording varies by library version), and the server cannot process any request. The model weights themselves are intact; only the tokenizer configuration was missing this critical field.
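A quick way to spot this failure before starting a server is to inspect the tokenizer configuration directly. The sketch below assumes the template lives in tokenizer_config.json, as described above; `has_chat_template` is a hypothetical helper, and the two JSON strings are made-up excerpts rather than the real files.

```python
import json


def has_chat_template(tokenizer_config: dict) -> bool:
    """True if the parsed tokenizer_config.json defines a non-empty chat_template."""
    return bool(tokenizer_config.get("chat_template"))


# Made-up excerpts standing in for the broken and the fixed tokenizer_config.json.
broken = json.loads('{"tokenizer_class": "Qwen2Tokenizer"}')
fixed = json.loads(
    '{"tokenizer_class": "Qwen2Tokenizer", "chat_template": "{{ messages }}"}'
)

print(has_chat_template(broken))  # False
print(has_chat_template(fixed))   # True
```

For a real check, load the file with `json.load(open("tokenizer_config.json"))` and pass the result to the helper.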
The added template handles:
- System / user / assistant message formatting using <|im_start|> / <|im_end|> tokens
- Tool call formatting (<tool_call> / <tool_response>)
- Thinking mode: when enable_thinking=False is passed, the template injects <think>\n\n</think> to suppress chain-of-thought output
- Multi-turn reasoning content (reasoning_content field on assistant messages)
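To make the first and third points concrete, here is a deliberately simplified Python stand-in for the template. It is not the real Jinja2 template: it handles plain text messages only, ignores tool calls and reasoning_content, and the `render_chatml` name is hypothetical. The blank line after the closing think tag is an assumption about the template's exact whitespace.

```python
def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Render plain text messages in <|im_start|> / <|im_end|> format."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # Pre-fill an empty think block so the model skips straight to the answer.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]
print(render_chatml(messages))
print(render_chatml(messages, enable_thinking=False))
```

With the real tokenizer, the equivalent call would pass enable_thinking=False as an extra keyword argument to apply_chat_template.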
3. Vocab/tokenizer files added
- vocab.json, tokenizer.json, and special_tokens_map.json were uploaded from the canonical Qwen/Qwen3-8B model.
- The original broken repo was missing these, making it impossible to load the tokenizer standalone.
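A small presence check along these lines can catch the same problem in other repos. The file list is the one named above plus tokenizer_config.json; `missing_tokenizer_files` is a hypothetical helper, and whether every file is strictly required depends on the tokenizer class and library version.

```python
# Files this section names as needed to load the tokenizer standalone.
TOKENIZER_FILES = {
    "vocab.json",
    "tokenizer.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
}


def missing_tokenizer_files(present_files) -> list:
    """Return the expected tokenizer files absent from `present_files`, sorted."""
    return sorted(TOKENIZER_FILES - set(present_files))


# Hypothetical listing of a repo that only ships weights and configs.
listing = ["config.json", "model.safetensors", "tokenizer_config.json"]
print(missing_tokenizer_files(listing))
```

For a local checkout, `os.listdir(path)` gives a suitable `present_files` argument.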
Verification
You can verify the fix without model weights — just the tokenizer:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("suyashdb/broken-model-fixed")
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected output:
# <|im_start|>user
# What is 2+2?<|im_end|>
# <|im_start|>assistant
Part B — Why reasoning_effort Does Nothing
If you've tried passing reasoning_effort: "low" or reasoning_effort: "high" in your requests and noticed zero difference in the output — you're not imagining it. Here's why.
The short answer
This model has no idea what reasoning_effort means. It was never trained to respond to it.
The longer answer
reasoning_effort is a parameter from OpenAI's API for its o-series reasoning models (o1, o3, o4-mini). The idea is that you can tell the model how hard to think: "low" means give me a quick answer, "high" means really work through it. OpenAI has not published the training details, but the behavior is consistent with what this document calls budget-forcing: during training, the model is given a token budget and rewarded for getting the right answer within that budget, so that over time it learns to compress or expand its reasoning based on the hint.
Qwen3-8B was not trained that way. It has two modes: thinking (where it produces a <think>...</think> block before answering) and non-thinking (where it skips that entirely). That's a binary on/off switch, not a dial. When you send reasoning_effort: "medium", nothing in the chat template or the model recognizes the value, so it is ignored. Apart from ordinary sampling variation, the output is the same regardless of what value you pass.
What would need to change to make it work
- The model needs to be retrained with budget-forcing. During fine-tuning, you'd prepend a budget token to each prompt (something like <budget>512</budget>) and train the model to produce correct answers within that many tokens. This teaches it to actually reason more efficiently when the budget is tight, rather than just cutting off mid-thought.
- The inference server needs to translate reasoning_effort into a concrete token limit and either inject it into the prompt in a format the model understands, or hard-stop the <think> block after N tokens by force-injecting </think>. The second approach is blunt: it truncates reasoning but doesn't make the model reason smarter.
- The API layer (whatever sits between the client and the model) needs to map "low" / "medium" / "high" to actual numbers and pass them through correctly. Right now most serving stacks just forward unknown parameters to the model, which silently ignores them.

Realistically, the easiest path is to use a model that already supports this natively, like a Qwen3 variant served through FriendliAI's serverless API, which exposes max_thinking_tokens, or OpenAI's o-series, which was purpose-built for reasoning_effort. Retrofitting budget-forcing onto an existing model requires retraining, not just a config change.
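The server-side and API-layer steps above can be sketched in a few lines. Everything here is hypothetical: the budget numbers, the `think_budget` and `cap_thinking` names, and the choice to work on an already generated token list (a real server would enforce the cap during decoding). It illustrates the blunt hard-stop approach only, not retraining.

```python
# Hypothetical budgets; real values would need tuning per model and task.
EFFORT_TO_THINK_TOKENS = {"low": 256, "medium": 1024, "high": 4096}


def think_budget(reasoning_effort, default="medium"):
    """Map an effort label to a thinking-token limit, falling back to the default."""
    return EFFORT_TO_THINK_TOKENS.get(
        reasoning_effort, EFFORT_TO_THINK_TOKENS[default]
    )


def cap_thinking(think_tokens, budget, close_tag="</think>"):
    """Keep at most `budget` reasoning tokens, then force the closing tag.

    `think_tokens` are the tokens generated after <think>. A trace that
    closed itself within the budget is returned up to its own closing tag.
    """
    if close_tag in think_tokens[: budget + 1]:
        return think_tokens[: think_tokens.index(close_tag) + 1]
    return think_tokens[:budget] + [close_tag]


trace = ["First", "add", "the", "numbers", "then", "check"]
print(cap_thinking(trace, budget=3))
```

The cap shortens the trace but, as noted above, does nothing to make the remaining reasoning better; that part still requires training.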