Instructions to use suyashdb/broken-model-fixed with libraries and local apps.

Libraries

How to use suyashdb/broken-model-fixed with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="suyashdb/broken-model-fixed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("suyashdb/broken-model-fixed")
model = AutoModelForCausalLM.from_pretrained("suyashdb/broken-model-fixed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use suyashdb/broken-model-fixed with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "suyashdb/broken-model-fixed"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "suyashdb/broken-model-fixed",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```
docker model run hf.co/suyashdb/broken-model-fixed
```

SGLang

How to use suyashdb/broken-model-fixed with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "suyashdb/broken-model-fixed" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "suyashdb/broken-model-fixed",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "suyashdb/broken-model-fixed" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "suyashdb/broken-model-fixed",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use suyashdb/broken-model-fixed with Docker Model Runner:
```
docker model run hf.co/suyashdb/broken-model-fixed
```

broken-model (fixed)

HuggingFace Repo: https://huggingface.co/suyashdb/broken-model-fixed/tree/main

Changes Made

1. `README.md` — `base_model` corrected

Before: meta-llama/Meta-Llama-3.1-8B
After: Qwen/Qwen3-8B
Why: The model architecture (Qwen3ForCausalLM), tokenizer class (Qwen2Tokenizer), vocabulary size (151936), and all config values exactly match Qwen3-8B, not Llama-3.1-8B. The wrong base_model declaration was misleading but not the functional blocker.
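
The mismatch can be checked mechanically. The sketch below compares two fields of a `config.json` against the published values for the two candidate base models. `KNOWN_SIGNATURES` and `guess_base_model` are illustrative names rather than part of any library, and the inline `config` dict stands in for the repo's real file.

```python
# Hypothetical helper: guess the base model from two config.json fields.
KNOWN_SIGNATURES = {
    "Qwen/Qwen3-8B": {"architecture": "Qwen3ForCausalLM", "vocab_size": 151936},
    "meta-llama/Meta-Llama-3.1-8B": {"architecture": "LlamaForCausalLM", "vocab_size": 128256},
}


def guess_base_model(config: dict) -> list[str]:
    """Return the candidate base models whose signature matches the config."""
    architectures = config.get("architectures", [])
    vocab_size = config.get("vocab_size")
    return [
        name
        for name, sig in KNOWN_SIGNATURES.items()
        if sig["architecture"] in architectures and sig["vocab_size"] == vocab_size
    ]


# Stand-in for the values reported above; in practice, json.load the repo's config.json.
config = {"architectures": ["Qwen3ForCausalLM"], "vocab_size": 151936}
print(guess_base_model(config))  # ['Qwen/Qwen3-8B']
```

Two fields are enough to tell these two families apart, but a fuller check would also compare hidden size, layer count, and the tokenizer class.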

2. `tokenizer_config.json` — `chat_template` added

Before: The chat_template field was entirely absent from tokenizer_config.json.
After: Added the full Jinja2 chat template from the canonical Qwen/Qwen3-8B model.
Why this broke inference: Any OpenAI-compatible inference server (vLLM, TGI, FriendliAI engine) applies the tokenizer's chat template (the equivalent of calling tokenizer.apply_chat_template()) to convert the messages array in a /chat/completions request into a single prompt string. Without a chat_template, this step fails with an error reporting that no chat template is set, so the server cannot process any chat request. The model weights themselves are intact; only the tokenizer configuration was missing this critical field.

The added template handles:

- System / user / assistant message formatting using `<|im_start|>` / `<|im_end|>` tokens
- Tool call formatting (`<tool_call>` / `<tool_response>`)
- Thinking mode: when `enable_thinking=False` is passed, the template injects `<think>\n\n</think>` to suppress chain-of-thought output
- Multi-turn reasoning content (`reasoning_content` field on assistant messages)
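
To make the template's job concrete, here is a deliberately simplified stand-in written in plain Python. It is not the real template (which is Jinja2 and also covers tool calls and `reasoning_content`), and the function name `render_chatml` is invented for illustration. It only shows the message framing and the thinking switch described above.

```python
# Simplified stand-in for the Qwen3 chat template (illustrative only).
# The real template is Jinja2 and also handles tools and reasoning_content.

def render_chatml(messages, add_generation_prompt=True, enable_thinking=True):
    """Render a list of {"role", "content"} dicts into a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
        if not enable_thinking:
            # An empty think block signals the model to skip chain-of-thought.
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "What is 2+2?"}]
print(render_chatml(messages))
print(render_chatml(messages, enable_thinking=False))
```

Without a template there is no agreed way to produce this string from a messages array, which is why chat requests fail while the weights remain usable.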

3. Vocab/tokenizer files added

vocab.json, tokenizer.json, and special_tokens_map.json were uploaded from the canonical Qwen/Qwen3-8B model.
The original broken repo was missing these, making it impossible to load the tokenizer standalone.

Verification

You can verify the fix without model weights — just the tokenizer:

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("suyashdb/broken-model-fixed")

messages = [{"role": "user", "content": "What is 2+2?"}]
# tokenize=False returns the rendered prompt string instead of token IDs
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected output:
# <|im_start|>user
# What is 2+2?<|im_end|>
# <|im_start|>assistant
```

Part B — Why `reasoning_effort` Does Nothing

If you've tried passing reasoning_effort: "low" or reasoning_effort: "high" in your requests and noticed zero difference in the output — you're not imagining it. Here's why.

The short answer

This model has no idea what reasoning_effort means. It was never trained to respond to it.

The longer answer

reasoning_effort is a parameter from OpenAI's API for its o-series reasoning models (o1, o3, o4-mini). The idea is that you can tell the model how hard to think: "low" means give me a quick answer, "high" means really work through it. OpenAI has not published how those models were trained, but honoring the parameter requires a model that has learned to condition its reasoning length on the hint. One way to get there, called budget-forcing in this section, is to give the model a token budget during training and reward it for getting the right answer within that budget. Over time the model learns to actually compress or expand its reasoning based on the hint.

Qwen3-8B was not trained that way. It has two modes: thinking (where it produces a <think>...</think> block before answering) and non-thinking (where it skips that entirely). That's a binary on/off switch, not a dial. When you send reasoning_effort: "medium", the value never reaches the model in a form it understands: the chat template has no use for it, so the rendered prompt is the same whatever value you pass. Any difference you see between runs comes from sampling, not from the parameter.
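
The on/off switch that does exist is reached through the chat template, not through `reasoning_effort`. As a sketch, vLLM's OpenAI-compatible server accepts a `chat_template_kwargs` object in the request body and forwards it to the template. Other servers may use a different mechanism, so treat the field name as something to verify against your server's documentation. The `build_request` helper is hypothetical and only builds the JSON body without sending it.

```python
import json

MODEL_ID = "suyashdb/broken-model-fixed"


def build_request(prompt: str, enable_thinking: bool) -> str:
    """Build a /v1/chat/completions body that toggles Qwen3 thinking mode."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server forwards this object to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return json.dumps(body, indent=2)


print(build_request("What is the capital of France?", enable_thinking=False))
```

The resulting string can be passed to `curl --data` against the endpoints shown earlier.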

What would need to change to make it work

1. The model needs to be retrained with budget-forcing. During fine-tuning, you'd prepend a budget token to each prompt (something like <budget>512</budget>) and train the model to produce correct answers within that many tokens. This teaches it to actually reason more efficiently when the budget is tight, rather than just cutting off mid-thought.
2. The inference server needs to translate reasoning_effort into a concrete token limit and either inject it into the prompt in a format the model understands, or hard-stop the <think> block after N tokens by force-injecting </think>. The second approach is blunt: it truncates reasoning but doesn't make the model reason smarter.
3. The API layer (whatever sits between the client and the model) needs to map "low" / "medium" / "high" to actual numbers and pass them through correctly. Right now most serving stacks either drop unrecognized parameters or pass them along without effect, so the request succeeds and nothing changes.

Realistically, the easiest path is to use a model that already supports this natively, like a Qwen3 variant served through FriendliAI's serverless API, which exposes max_thinking_tokens, or OpenAI's o-series, which was purpose-built for reasoning_effort. Retrofitting budget-forcing onto an existing model requires retraining, not just a config change.
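
The server-side and API-layer pieces can be sketched in a few lines. Everything here is hypothetical: the budget numbers in `EFFORT_TO_BUDGET` are made up, and `cap_thinking` works on a plain list of token strings rather than a real decoding loop. It illustrates the blunt hard-stop approach only, not trained budget-forcing.

```python
# Hypothetical mapping from reasoning_effort to a thinking-token budget.
EFFORT_TO_BUDGET = {"low": 256, "medium": 1024, "high": 4096}


def budget_for(effort: str) -> int:
    """Translate a reasoning_effort value into a token budget (API-layer step)."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown reasoning_effort: {effort!r}")
    return EFFORT_TO_BUDGET[effort]


def cap_thinking(tokens, budget):
    """Copy tokens, force-closing the <think> block after `budget` thinking tokens.

    If the cap is hit, the result ends with an injected "</think>"; a real
    server would resume generation from that point to produce the answer.
    """
    output = []
    in_think = False
    used = 0
    for token in tokens:
        if token == "<think>":
            in_think = True
        elif token == "</think>":
            in_think = False
        elif in_think:
            if used >= budget:
                output.append("</think>")  # blunt cut: reasoning is truncated
                return output
            used += 1
        output.append(token)
    return output


stream = ["<think>", "2", "+", "2", "is", "4", "</think>", "4"]
print(cap_thinking(stream, budget=3))
print(cap_thinking(stream, budget=budget_for("low")))
```

With a budget of 3, the reasoning is cut after three tokens and closed by force. With the "low" budget of 256, this short stream passes through unchanged. Neither case changes how well the model reasons, which is the limitation the list above describes.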

Safetensors

Model size: 8B params
Tensor type: BF16

Model tree for suyashdb/broken-model-fixed

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B