Custom group-beam-search changes deterministic output when cache is enabled
Hi! I think there is a cache / beam-state incompatibility in the current transformers-community/group-beam-search custom generation implementation.
With deterministic decoding (do_sample=False), the same model, same prompt, and same grouped-beam parameters produce different token IDs depending on whether cache is enabled. Since cache should be an optimization, use_cache=True and use_cache=False should not change generation semantics.
Minimal Colab reproduction, using the current Colab environment without installing a custom transformers version:
import torch, transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
).eval()
messages = [{"role": "user", "content": "List three different fruits."}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
common = dict(
    custom_generate="transformers-community/group-beam-search",
    trust_remote_code=True,
    do_sample=False,
    max_new_tokens=24,
    num_beams=3,
    num_beam_groups=3,
    num_return_sequences=3,
    diversity_penalty=1.0,
    pad_token_id=tok.eos_token_id,
)
def run(use_cache):
    # use_cache=None leaves the library default in place (cache enabled).
    kwargs = {} if use_cache is None else {"use_cache": use_cache}
    with torch.inference_mode():
        return model.generate(**inputs, **common, **kwargs).cpu()
cached = run(None)
nocache = run(False)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("same token ids:", torch.equal(cached, nocache))
n = inputs.input_ids.shape[1]
def show(title, out):
    print(f"\n=== {title} ===")
    for i, seq in enumerate(out):
        # Decode only the generated continuation, not the prompt.
        txt = tok.decode(seq[n:], skip_special_tokens=True).replace("\n", " ")
        print(f"[beam {i}] {txt}")
show("cached/default", cached)
show("use_cache=False", nocache)
Observed output:
transformers: 5.12.0
torch: 2.11.0+cu128
same token ids: False
=== cached/default ===
[beam 0] Three different fruits
[beam 1] Sure,Apple, Appleshee, Applessee, I'myewsystem 1. I can beavers
[beam 2] Sure, Kiwi, I'm, apples, You (Apple, Hello, I'm, you can you can you
=== use_cache=False ===
[beam 0] Sure, here are three different fruits: 1. Apple 2. Orange 3. Kiwi
[beam 1] Three different fruits are: 1. Apple 2. Banana 3. Orange
[beam 2] Sure! Here are three different fruits: 1. **Strawberry** - A sweet, juicy red fruit with a
The cached/default outputs appear corrupted, while use_cache=False produces coherent outputs. More importantly, the token IDs differ under deterministic decoding.
Optional stability check:
cached2 = run(None)
nocache2 = run(False)
print("cached stable:", torch.equal(cached, cached2))
print("nocache stable:", torch.equal(nocache, nocache2))
print("cache == nocache:", torch.equal(cached, nocache))
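To narrow down where the two runs part ways, a small helper like the one below can report the first differing generated position per beam. This is only a sketch: `first_divergence` and `report_divergence` are hypothetical names, not part of transformers, and they operate on plain lists of token IDs (for example, the result of `.tolist()`), shown here on toy data.

```python
from typing import Dict, Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token-ID sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the shared prefix: they differ only if the lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))


def report_divergence(
    cached_rows: Sequence[Sequence[int]],
    nocache_rows: Sequence[Sequence[int]],
    prompt_len: int,
) -> Dict[int, Optional[int]]:
    """Map beam index -> offset of the first differing generated token (None if equal)."""
    report = {}
    for beam, (c, n) in enumerate(zip(cached_rows, nocache_rows)):
        # Skip the prompt so offsets count generated tokens only.
        report[beam] = first_divergence(c[prompt_len:], n[prompt_len:])
    return report


# Toy data (made up for illustration): beam 0 matches, beam 1 diverges at the second new token.
prompt = [101, 102, 103]
cached_rows = [prompt + [7, 8, 9], prompt + [7, 5, 5]]
nocache_rows = [prompt + [7, 8, 9], prompt + [7, 8, 9]]
result = report_divergence(cached_rows, nocache_rows, prompt_len=len(prompt))
print(result)
```

With the reproducer above, this would be called as `report_divergence(cached.tolist(), nocache.tolist(), n)`. A divergence at offset 0 or 1 would point at the first cached decoding step rather than at slow numerical drift.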
Expected behavior:
use_cache=True and use_cache=False should produce the same token IDs under deterministic decoding, or at least should not produce corrupted outputs only in the cached path.
Likely cause:
The custom group-beam-search loop appears to manually manage cache / cache position / beam reordering, and this may be incompatible with recent transformers cache behavior.
Possible minimal correctness-first fix:
def generate(model, *args, **kwargs):
    # Default to the uncached path unless the caller explicitly opts in.
    kwargs.setdefault("use_cache", False)
    generation_outputs = GenerationMixin.generate(
        model, *args, custom_generate=_group_beam_search, **kwargs
    )
    return generation_outputs
This may not be the optimal performance fix, but it seems like a safe compatibility fix if the cached path currently changes deterministic generation semantics.
I am happy to prepare a PR if this diagnosis makes sense.
This might be caused by one of the many recent changes in cache structure. The outputs can sometimes differ slightly between cached and uncached runs due to numerical precision, but that shouldn't cause garbage output. I will take a look a bit later. I don't think forcing the cache to be False is a good solution.
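As a rough illustration of the numerical-precision point, the sketch below uses plain Python floats rather than real model logits. It shows that summing the same values in a different order can change the last bit of the result, and that this is enough to flip an argmax only when two candidates are nearly tied. The `argmax` helper and the toy score lists are made up for the example.

```python
import math

# Same three numbers, two summation orders (a stand-in for cached vs. uncached math).
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)                            # exact equality fails
print(math.isclose(a, b, rel_tol=1e-9))  # but the values agree within tolerance


def argmax(xs):
    """Index of the largest value; ties resolve to the earliest index."""
    return max(range(len(xs)), key=xs.__getitem__)


# Toy "scores": candidate 0 is fixed at 0.6, candidate 1 comes from either summation order.
scores_path_a = [0.6, a]
scores_path_b = [0.6, b]
print(argmax(scores_path_a), argmax(scores_path_b))
```

A flip like this changes one near-tied token choice. It does not explain the degenerate text in the cached output above, which is why a cache-handling bug is the more likely cause.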
Thanks, I agree that forcing use_cache=False is only a workaround, not the right fix.
I opened #4 with a more targeted change. Instead of disabling the cache, the PR updates the custom generation loop to pass next_sequence_length to prepare_inputs_for_generation: it keeps the full sequence for the first step / no-cache decoding, and passes next_sequence_length=1 for cached non-first decoding steps.
This seems to match the recent Transformers cache handling more directly while preserving use_cache=True. I tested it locally with the reproducer above on both transformers==4.57.1 and transformers==5.12.1.
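The rule described above can be sketched in isolation. This is not the code from the PR: `choose_next_sequence_length` and `slice_model_input` are hypothetical helpers, token sequences are plain lists, and `None` is used here as an assumed stand-in for "keep the full sequence".

```python
from typing import List, Optional


def choose_next_sequence_length(is_first_step: bool, use_cache: bool) -> Optional[int]:
    """Hypothetical rule: None = feed the full sequence, 1 = feed only the newest token."""
    if is_first_step or not use_cache:
        return None
    return 1


def slice_model_input(input_ids: List[int], next_sequence_length: Optional[int]) -> List[int]:
    """Toy stand-in for the slicing that cache-aware input preparation performs."""
    if next_sequence_length is None:
        return list(input_ids)
    return list(input_ids[-next_sequence_length:])


def simulate(use_cache: bool, steps: int = 3) -> List[List[int]]:
    """Record what would be fed to the model at each decoding step (toy token IDs)."""
    input_ids = [11, 12, 13]  # made-up prompt tokens
    fed = []
    for step in range(steps):
        n = choose_next_sequence_length(is_first_step=(step == 0), use_cache=use_cache)
        fed.append(slice_model_input(input_ids, n))
        input_ids = input_ids + [100 + step]  # pretend the model emitted a new token
    return fed


print(simulate(use_cache=True))
print(simulate(use_cache=False))
```

With the cache on, only the first step sees the whole prompt and later steps see a single new token. With the cache off, every step sees the full running sequence.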
Closing as resolved