Custom group-beam-search changes deterministic output when cache is enabled

by lavrenko - opened 7 days ago

Discussion

lavrenko

7 days ago

Hi! I think there is a cache / beam-state incompatibility in the current transformers-community/group-beam-search custom generation implementation.

With deterministic decoding (do_sample=False), the same model, same prompt, and same grouped-beam parameters produce different token IDs depending on whether cache is enabled. Since cache should be an optimization, use_cache=True and use_cache=False should not change generation semantics.

Minimal Colab reproduction, using the current Colab environment without installing a custom transformers version:

import torch, transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
).eval()

messages = [{"role": "user", "content": "List three different fruits."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

common = dict(
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
    do_sample=False,
    max_new_tokens=24,
    num_beams=3,
    num_beam_groups=3,
    num_return_sequences=3,
    diversity_penalty=1.0,
    pad_token_id=tok.eos_token_id,
)

def run(use_cache):
    kwargs = {} if use_cache is None else {"use_cache": use_cache}
    with torch.inference_mode():
        return model.generate(**inputs, **common, **kwargs).cpu()

cached = run(None)
nocache = run(False)

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("same token ids:", torch.equal(cached, nocache))

n = inputs.input_ids.shape[1]

def show(title, out):
    print(f"\n=== {title} ===")
    for i, seq in enumerate(out):
        txt = tok.decode(seq[n:], skip_special_tokens=True).replace("\n", " ")
        print(f"[beam {i}] {txt}")

show("cached/default", cached)
show("use_cache=False", nocache)

Observed output:

transformers: 5.12.0
torch: 2.11.0+cu128
same token ids: False

=== cached/default ===
[beam 0] Three different fruits
[beam 1] Sure,Apple, Appleshee, Applessee, I'myewsystem 1. I can beavers
[beam 2] Sure, Kiwi, I'm, apples, You (Apple, Hello, I'm, you can you can you

=== use_cache=False ===
[beam 0] Sure, here are three different fruits:  1. Apple 2. Orange 3. Kiwi
[beam 1] Three different fruits are:  1. Apple 2. Banana 3. Orange
[beam 2] Sure! Here are three different fruits:  1. **Strawberry** - A sweet, juicy red fruit with a

The cached/default outputs appear corrupted, while use_cache=False produces coherent outputs. More importantly, the token IDs differ under deterministic decoding.

Optional stability check:

cached2 = run(None)
nocache2 = run(False)

print("cached stable:", torch.equal(cached, cached2))
print("nocache stable:", torch.equal(nocache, nocache2))
print("cache == nocache:", torch.equal(cached, nocache))
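
If the runs disagree, it can help to know where each returned sequence first departs from its counterpart. The following is a minimal sketch, not part of the original reproduction: `first_divergence` and `divergence_report` are hypothetical helper names, and they work on plain lists of token IDs (for example `cached.tolist()`), so they do not depend on any transformers API. Row `i` of one run is compared with row `i` of the other, which assumes both runs return beams in a comparable order.

```python
from typing import List, Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token-id sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the overlap: one sequence may still be a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def divergence_report(
    cached_rows: Sequence[Sequence[int]],
    nocache_rows: Sequence[Sequence[int]],
    prompt_len: int = 0,
) -> List[Optional[int]]:
    """Per-row first differing position, counted from the end of the prompt."""
    return [
        first_divergence(row_a[prompt_len:], row_b[prompt_len:])
        for row_a, row_b in zip(cached_rows, nocache_rows)
    ]
```

With the variables from the reproduction above, `divergence_report(cached.tolist(), nocache.tolist(), prompt_len=n)` would give one entry per returned sequence. An early index suggests the cached path goes wrong within the first few decoding steps rather than drifting late in the generation.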

Expected behavior:

use_cache=True and use_cache=False should produce the same token IDs under deterministic decoding, or at least should not produce corrupted outputs only in the cached path.

Likely cause:

The custom group-beam-search loop appears to manually manage cache / cache position / beam reordering, and this may be incompatible with recent transformers cache behavior.

Possible minimal correctness-first fix:

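from transformers import GenerationMixin

def generate(model, *args, **kwargs):
    # Default to the no-cache path unless the caller explicitly sets use_cache.
    kwargs.setdefault("use_cache", False)

    # _group_beam_search is the decoding loop already defined in this repo's custom generate code.
    generation_outputs = GenerationMixin.generate(
        model, *args, custom_generate=_group_beam_search, **kwargs
    )

    return generation_outputs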

This may not be the optimal performance fix, but it seems like a safe compatibility fix if the cached path currently changes deterministic generation semantics.

I am happy to prepare a PR if this diagnosis makes sense.

RaushanTurganbay

Transformers Community org 7 days ago

This might be caused by one of the many recent changes in cache structure. The outputs can sometimes differ slightly with and without cache because of numerical precision, but that shouldn't cause garbage output. I will take a look a bit later. I don't think forcing the cache to be False is a good solution.

lavrenko

6 days ago

Thanks, I agree that forcing use_cache=False is only a workaround, not the right fix.

I opened #4 with a more targeted change. Instead of disabling the cache, the PR updates the custom generation loop to pass next_sequence_length to prepare_inputs_for_generation: it keeps the full sequence for the first step / no-cache decoding, and passes next_sequence_length=1 for cached non-first decoding steps.
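
The per-step decision described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `next_sequence_length_for_step` and `tokens_to_feed` are hypothetical helpers, using `None` to mean "keep the full sequence" is an assumption, and the real `prepare_inputs_for_generation` is not called here.

```python
from typing import List, Optional


def next_sequence_length_for_step(use_cache: bool, is_first_step: bool) -> Optional[int]:
    """Hypothetical helper: None means feed the full sequence, 1 means feed only the newest token."""
    # With a populated cache, earlier positions are already stored,
    # so only the most recent token needs to go through the model.
    if use_cache and not is_first_step:
        return 1
    # First step (cache still empty) or no-cache decoding: the whole sequence is needed.
    return None


def tokens_to_feed(input_ids: List[int], next_sequence_length: Optional[int]) -> List[int]:
    """Show the effect of the value above on a single sequence of token ids."""
    if next_sequence_length is None:
        return list(input_ids)
    return list(input_ids[-next_sequence_length:])
```

For example, `tokens_to_feed([10, 11, 12], next_sequence_length_for_step(True, False))` keeps only the last token, while the same call with `is_first_step=True` or `use_cache=False` keeps all three.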

This seems to match the recent Transformers cache handling more directly while preserving use_cache=True. I tested it locally with the reproducer above on both transformers==4.57.1 and transformers==5.12.1.

RaushanTurganbay

Transformers Community org 6 days ago

Closing as resolved

RaushanTurganbay changed discussion status to closed 6 days ago
