query-crafter-multilingual
hotchpotch/query-crafter-multilingual is a multilingual query generation model that produces search-oriented outputs from a given passage. It is built on top of Qwen3/Qwen3-1.7B and tuned for fast generation of short retrieval-style texts such as natural-language questions, keyword queries, FAQ prompts, and compact summaries. Despite its relatively small 1.7B parameter size, it is designed to generate simulated search queries quickly and efficiently from large text collections.
The model is intended for large-scale synthetic query generation from raw corpora. A typical use case is to take a large text collection and create many virtual search queries that could plausibly retrieve each passage.
Query Types
The model supports the following type values:
- query: A natural-language question.
- alt_query: A rephrased version of query that preserves the same meaning while changing wording.
- keywords: A compact keyword-style query, typically around three salient terms.
- synonym_keywords: A paraphrased keyword-style query using alternate wording rather than literal lexical overlap.
- title: A short title representing the passage as a whole.
- faq: A short FAQ-style question whose answer is the passage.
- summary: A very short summary containing only the core fact.
Supported Languages
The model was trained for the following 21 languages:
- Arabic
- Chinese
- Czech
- Danish
- Dutch
- English
- French
- German
- Greek
- Hungarian
- Indonesian
- Italian
- Japanese
- Persian
- Polish
- Portuguese
- Russian
- Spanish
- Swedish
- Turkish
- Vietnamese
Input Format
Training and inference use the same logical chat structure. In practice, it is easiest to think of the input as the following JSON-style payload before applying the model's chat template:
{
  "system_prompt": "language: english\ntype: query\n",
  "user_prompt": "Lithium iron phosphate batteries are widely used in electric buses and grid storage systems because they offer strong thermal stability and long cycle life."
}
The corresponding chat messages are:
[
  {
    "role": "system",
    "content": "language: english\ntype: query\n"
  },
  {
    "role": "user",
    "content": "Lithium iron phosphate batteries are widely used in electric buses and grid storage systems because they offer strong thermal stability and long cycle life."
  }
]
language should be one of the 21 supported languages written as a lowercase English language name, and type should be one of the 7 supported query types listed above.
Limitations
- The model was trained primarily on passages in the 70-700 token range, measured with the Qwen3 tokenizer. Performance may degrade outside that range.
- Long documents should be chunked before generation rather than passed as a single input.
- When the model cannot produce a grounded output reliably, it may emit the literal string NONE.
- The model is optimized for short retrieval-oriented outputs, not for long-form generation, open-ended dialogue, or role-play.
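A minimal sketch of the chunking and NONE-handling advice above. The helper names `chunk_by_tokens` and `is_usable_output` are hypothetical, and the whitespace tokenizer is only a stand-in so the snippet runs without downloading anything. With transformers, one could presumably pass the Qwen3 tokenizer's `encode` and `decode` methods instead (an assumption, untested here). Tails shorter than the lower bound are dropped in this sketch, which is one possible policy rather than the author's.

```python
from typing import Callable, Sequence

MIN_TOKENS = 70   # lower bound of the training range stated above
MAX_TOKENS = 700  # upper bound of the training range stated above


def chunk_by_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    max_tokens: int = MAX_TOKENS,
    min_tokens: int = MIN_TOKENS,
) -> list[str]:
    """Split text into consecutive windows of at most max_tokens tokens.

    Windows shorter than min_tokens (typically the tail) are dropped, so a
    document shorter than min_tokens yields an empty list.
    """
    tokens = list(encode(text))
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        window = tokens[start : start + max_tokens]
        if len(window) >= min_tokens:
            chunks.append(decode(window))
    return chunks


def is_usable_output(generated: str) -> bool:
    """Return False for empty outputs and for the literal NONE marker."""
    cleaned = generated.strip()
    return bool(cleaned) and cleaned != "NONE"


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def ws_encode(text: str) -> list[str]:
    return text.split()


def ws_decode(tokens: Sequence) -> str:
    return " ".join(tokens)


document = " ".join(f"w{i}" for i in range(1500))
chunks = chunk_by_tokens(document, ws_encode, ws_decode)
print([len(ws_encode(c)) for c in chunks])  # prints [700, 700, 100]
```

Each chunk would then be sent to the model separately, and any generation for which `is_usable_output` returns False would be discarded.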
Synthetic Query Bias
The model tends to generate queries that are closely aligned with the surface form of the input passage and are often simpler than real-world user queries. In particular, it has limited ability to produce queries that require implicit reasoning, background knowledge, or multi-hop associations to identify the relevant document.
As a result, the distribution of generated queries may differ from that of real user queries. Training retrieval models solely on queries produced by this model may introduce bias toward shallow or lexically grounded matching, potentially reducing robustness to more complex, ambiguous, or context-dependent information needs.
To mitigate this effect, it is recommended to combine these synthetic queries with complementary data sources or apply additional query diversification strategies.
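One possible diagnostic for this bias, sketched under assumptions that are not part of the source: measure how much of a generated query is copied from the passage's surface vocabulary, then down-sample or supplement the most lexically aligned queries. The function name `lexical_overlap` and the 0.8 threshold are hypothetical, and the regex tokenization is only reasonable for space-delimited languages.

```python
import re

OVERLAP_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the model card


def _terms(text: str) -> list[str]:
    """Lowercased word tokens; a crude stand-in for real tokenization."""
    return re.findall(r"\w+", text.lower())


def lexical_overlap(query: str, passage: str) -> float:
    """Fraction of query terms that also occur in the passage (0.0 to 1.0)."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    passage_terms = set(_terms(passage))
    hits = sum(1 for term in query_terms if term in passage_terms)
    return hits / len(query_terms)


passage = (
    "Lithium iron phosphate batteries offer strong thermal stability "
    "and long cycle life."
)
queries = [
    "lithium iron phosphate batteries thermal stability",
    "why are LFP cells considered safe for buses",
]

for query in queries:
    score = lexical_overlap(query, passage)
    label = "surface-aligned" if score >= OVERLAP_THRESHOLD else "more abstract"
    print(f"{score:.2f}  {label}  {query}")
```

A high overlap score does not make a query bad. It only indicates that the query could be matched by lexical overlap alone, which is the kind of example the paragraph above warns against over-representing.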
Inference Example
The following example uses transformers for single-machine inference. The configuration below assumes an NVIDIA GPU environment with FlashAttention 2 installed, which the script selects through attn_implementation="flash_attention_2". For large-scale or production-style generation, using a fast inference server such as vLLM is strongly recommended.
#!/usr/bin/env python3
from __future__ import annotations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# This example assumes:
# - an NVIDIA GPU
# - torch with CUDA support
# - FlashAttention 2 installed and available via `flash_attention_2`

MODEL_NAME = "hotchpotch/query-crafter-multilingual"

LANGUAGES = {
    "arabic",
    "chinese",
    "czech",
    "danish",
    "dutch",
    "english",
    "french",
    "german",
    "greek",
    "hungarian",
    "indonesian",
    "italian",
    "japanese",
    "persian",
    "polish",
    "portuguese",
    "russian",
    "spanish",
    "swedish",
    "turkish",
    "vietnamese",
}

TYPES = {
    "query",
    "alt_query",
    "faq",
    "title",
    "summary",
    "keywords",
    "synonym_keywords",
}


def normalize_language(language: str) -> str:
    language = language.strip().lower()
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return language


def normalize_type(output_type: str) -> str:
    output_type = output_type.strip().lower()
    if output_type not in TYPES:
        raise ValueError(f"unsupported output_type: {output_type}")
    return output_type


def build_messages(text: str, language: str, output_type: str) -> list[dict[str, str]]:
    language = normalize_language(language)
    output_type = normalize_type(output_type)
    return [
        {
            "role": "system",
            "content": f"language: {language}\ntype: {output_type}\n",
        },
        {
            "role": "user",
            "content": text,
        },
    ]


@torch.inference_mode()
def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    model.eval()

    text = (
        "Lithium iron phosphate batteries are widely used in electric buses and grid "
        "storage systems because they offer strong thermal stability, long cycle life, "
        "and lower risk of thermal runaway than some nickel-rich chemistries. However, "
        "their lower electronic conductivity can reduce high-rate performance. To address "
        "this, researchers coated LiFePO4 particles with a thin carbon layer, reduced "
        "particle size to shorten lithium diffusion paths, and optimized electrode "
        "porosity to improve ion transport. In a 2024 pilot study, a modified cell design "
        "retained more than 90 percent of its initial capacity after 2,000 cycles under "
        "controlled temperature conditions while also improving fast-charging behavior."
    )

    messages = build_messages(
        text=text,
        language="english",
        output_type="query",
    )

    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    generated = tokenizer.decode(
        output[0, input_ids.shape[1] :],
        skip_special_tokens=True,
    ).strip()
    print(generated)


if __name__ == "__main__":
    main()
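For the large-scale case mentioned above, a batch run through vLLM's offline `LLM.chat` interface might look like the following sketch. The helper names `build_conversations` and `generate_with_vllm` are hypothetical, the sampling settings simply mirror the greedy decoding used in the transformers example, and the vLLM call has not been run against this model here. The import is deferred so the message-building part works without vLLM installed.

```python
from itertools import product

MODEL_NAME = "hotchpotch/query-crafter-multilingual"
QUERY_TYPES = (
    "query", "alt_query", "keywords", "synonym_keywords",
    "title", "faq", "summary",
)


def build_conversations(passages, language, types=QUERY_TYPES):
    """Return one system + user conversation per (passage, type) pair."""
    return [
        [
            {"role": "system", "content": f"language: {language}\ntype: {t}\n"},
            {"role": "user", "content": passage},
        ]
        for passage, t in product(passages, types)
    ]


def generate_with_vllm(conversations, model_name=MODEL_NAME):
    """Generate one output per conversation (requires vllm and a GPU)."""
    from vllm import LLM, SamplingParams  # optional dependency, imported lazily

    llm = LLM(model=model_name)
    # temperature=0.0 gives greedy decoding, matching do_sample=False above.
    params = SamplingParams(temperature=0.0, max_tokens=128)
    results = llm.chat(conversations, params)
    return [result.outputs[0].text.strip() for result in results]


passages = ["first passage text ...", "second passage text ..."]
conversations = build_conversations(passages, language="english")
print(len(conversations))  # prints 14 (2 passages x 7 types)

# On a machine with vLLM installed:
# outputs = generate_with_vllm(conversations)
```

Submitting all conversations in one call lets the engine schedule and batch them itself, which is usually the main reason to prefer it over a per-passage loop.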
How The Model And Data Were Built
This model was developed through a multi-stage multilingual synthetic data pipeline designed for retrieval-oriented supervision rather than general instruction following.
1. Source Corpora
The training pipeline starts from two web-scale text sources on Hugging Face:
- FineWeb2-HQ: https://huggingface.co/datasets/epfml/FineWeb2-HQ
- fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
English data comes from fineweb-edu, while the remaining 20 languages come from FineWeb2-HQ. Each language subset is processed independently so that multilingual balance can be controlled during later stages.
2. Passage Filtering And Preprocessing
The raw corpora are converted into language-specific subsets and normalized into a common schema. During preprocessing:
- each passage is truncated to at most 10,000 characters before token measurement
- token length is measured with the Qwen3/Qwen3-1.7B tokenizer
- passages are filtered to a target token range of 70-700
- Greek is a special case and may require a higher upper bound because tokenization can become inflated due to byte-level fallback behavior
- metadata such as row_id, id, date, dump, file_path, token_count, and a best-effort quality score are retained for provenance
The purpose of this stage is to keep passages long enough to support grounded query generation, while excluding extremely short or overly long documents that are less suitable for retrieval-style supervision.
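The filtering rules above can be sketched as follows. This is an illustration rather than the project's actual preprocessing code: the row fields `text` and `language`, the function name `filter_passages`, and the Greek upper bound of 900 are assumptions, and the whitespace token counter stands in for the real tokenizer so the snippet runs offline.

```python
from typing import Callable, Iterable, Iterator, Optional

MAX_CHARS = 10_000   # truncation applied before token measurement
MIN_TOKENS = 70
MAX_TOKENS = 700


def filter_passages(
    rows: Iterable[dict],
    count_tokens: Callable[[str], int],
    max_tokens_by_language: Optional[dict] = None,
) -> Iterator[dict]:
    """Yield rows whose truncated text falls inside the target token range."""
    overrides = max_tokens_by_language or {}
    for row in rows:
        text = row["text"][:MAX_CHARS]            # truncate first, then measure
        token_count = count_tokens(text)
        upper = overrides.get(row.get("language"), MAX_TOKENS)
        if MIN_TOKENS <= token_count <= upper:
            # Keep the measured length alongside the row for provenance.
            yield {**row, "text": text, "token_count": token_count}


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per word."""
    return len(text.split())


rows = [
    {"id": "a", "language": "english", "text": "w " * 50},    # too short
    {"id": "b", "language": "english", "text": "w " * 300},   # kept
    {"id": "c", "language": "english", "text": "w " * 800},   # too long
    {"id": "d", "language": "greek", "text": "w " * 800},     # kept via override
]
kept = list(filter_passages(rows, whitespace_count, {"greek": 900}))
print([row["id"] for row in kept])  # prints ['b', 'd']
```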
3. Instruction Assignment
Each passage is paired with one of seven retrieval-oriented generation targets:
- query
- alt_query
- keywords
- synonym_keywords
- title
- faq
- summary
These instruction types are designed to capture different retrieval behaviors. Some are explicitly question-like, while others are more lexical or summarization-oriented. This lets the model generate multiple useful retrieval formulations for the same document rather than only one canonical question.
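The text does not say how passages are matched to instruction types. One plausible scheme, shown purely as an assumption, is to shuffle passage ids with a fixed seed and then cycle through the seven types so that each type receives a near-equal share. The function name `assign_types` is hypothetical.

```python
import random
from collections import Counter

INSTRUCTION_TYPES = (
    "query", "alt_query", "keywords", "synonym_keywords",
    "title", "faq", "summary",
)


def assign_types(passage_ids, seed=0):
    """Map each passage id to one instruction type, balanced across types."""
    ids = list(passage_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    return {
        pid: INSTRUCTION_TYPES[i % len(INSTRUCTION_TYPES)]
        for i, pid in enumerate(ids)
    }


assignment = assign_types(f"doc-{n}" for n in range(14))
print(Counter(assignment.values()))
```

With 14 passages each of the seven types is assigned exactly twice. Which passage receives which type depends on the seed.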
4. Teacher-Generated Synthetic Targets
After preprocessing, a teacher model is used to generate the target outputs for each passage. The generation prompt is written in English and includes explicit quality rules:
- outputs must stay grounded in the source passage
- entities, dates, and facts must not be invented
- outputs should remain useful as standalone retrieval queries
- if a grounded output is not possible, the teacher is allowed to return NONE
The teacher is asked to produce short retrieval-oriented outputs rather than verbose answers. This stage creates the synthetic supervision signal that is later used for fine-tuning.
5. Supervised Fine-Tuning Dataset Construction
Teacher outputs are merged back with passage metadata into an SFT dataset using the final system + user format. The resulting dataset keeps both the generated target and the original passage text so that each example is directly usable for instruction-style fine-tuning.
The main synthetic dataset used in this project is published here:
This dataset is the central training artifact for the released model family.
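To make the system + user format concrete, the sketch below assembles one SFT record from a passage and a teacher output. The record layout (an `id` plus a `messages` list ending in an assistant turn) and the function name `to_sft_example` are assumptions for illustration. The published dataset may use a different schema.

```python
import json


def to_sft_example(passage: dict, language: str, output_type: str, target: str) -> dict:
    """Combine passage text and teacher output into a chat-style record."""
    return {
        "id": passage["id"],
        "messages": [
            {
                "role": "system",
                "content": f"language: {language}\ntype: {output_type}\n",
            },
            {"role": "user", "content": passage["text"]},        # original passage
            {"role": "assistant", "content": target},            # teacher output
        ],
    }


passage = {
    "id": "doc-1",
    "text": "Lithium iron phosphate batteries are widely used in electric buses.",
}
example = to_sft_example(
    passage,
    language="english",
    output_type="keywords",
    target="lithium iron phosphate electric buses",
)
print(json.dumps(example, indent=2))
```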
6. Model Fine-Tuning
The final model is obtained by supervised fine-tuning Qwen3/Qwen3-1.7B with LoRA on the synthetic multilingual dataset. After training, the adapter is merged into a standalone checkpoint for inference and release.
The design goal was not to maximize general chat capability, but to produce a compact model that can reliably generate short retrieval-friendly outputs across many languages.
Evaluation Method
Evaluation is based on semantic alignment between the generated query and the source passage.
Embedding Model
All evaluation is performed with BAAI/bge-m3:
Procedure
For each held-out example:
- The generated query is embedded with BGE-M3.
- The source passage is embedded separately with the same model.
- Cosine similarity is computed between the two embeddings.
Rows whose output is empty or equal to NONE are excluded from scoring.
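The scoring procedure can be sketched as below. The `embed` argument is any function that maps text to a vector. In a real run it could presumably wrap BGE-M3, for example through sentence-transformers (an assumption, not shown or tested here). The toy lookup table keeps the snippet self-contained, and the names `cosine` and `mean_alignment` are hypothetical.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_alignment(rows, embed) -> float:
    """Mean query-passage cosine similarity, skipping empty and NONE outputs."""
    scores = []
    for row in rows:
        output = row["output"].strip()
        if not output or output == "NONE":
            continue  # excluded from scoring, as described above
        scores.append(cosine(embed(output), embed(row["passage"])))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy embeddings (assumption): fixed 2-d vectors instead of a real model.
TOY_VECTORS = {
    "q1": [1.0, 0.0], "p1": [1.0, 0.0],   # identical direction, similarity 1.0
    "q2": [0.0, 1.0], "p2": [1.0, 0.0],   # orthogonal, similarity 0.0
    "p3": [1.0, 1.0],
}
rows = [
    {"output": "q1", "passage": "p1"},
    {"output": "q2", "passage": "p2"},
    {"output": "NONE", "passage": "p3"},  # skipped
    {"output": "  ", "passage": "p3"},    # skipped
]
print(mean_alignment(rows, TOY_VECTORS.__getitem__))  # prints 0.5
```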
This evaluation does not measure usefulness in a full retrieval stack directly. Instead, it serves as an offline proxy for semantic faithfulness: if the generated output remains close to its source passage in embedding space, it is more likely to function as a retrieval-oriented reformulation of that passage.
It is important not to interpret these scores as end-to-end retrieval quality. BGE-M3 cosine similarity is a convenient offline semantic-alignment signal, but it is not the same as measuring search metrics such as recall, MRR, or nDCG in a real retrieval system.
It is also important to note that this score measures similarity between the generated query and the source passage, but does not assess whether the query is truly effective or well-formed for retrieval (e.g., being concise yet precise). In particular, queries that are overly similar or simply restate the passage may receive high scores, even if they are not optimal for retrieval purposes.
Evaluation Setup
The evaluation compares the teacher-generated reference data and multiple fine-tuned model variants under the same preprocessing pipeline. The reported analysis includes:
- overall mean cosine similarity
- per-language mean similarity
- manual inspection of low-scoring examples
The same framework is used to compare smaller and larger Qwen3-based variants.
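Given per-example scores, the three reported views can be derived with a few lines of aggregation. The sketch below assumes each scored row carries hypothetical `language` and `score` fields, and the numbers are made up for illustration only.

```python
from collections import defaultdict


def overall_mean(scored_rows) -> float:
    """Mean score across all rows."""
    return sum(row["score"] for row in scored_rows) / len(scored_rows)


def per_language_means(scored_rows) -> dict:
    """Mean score per language, with languages in alphabetical order."""
    buckets = defaultdict(list)
    for row in scored_rows:
        buckets[row["language"]].append(row["score"])
    return {lang: sum(v) / len(v) for lang, v in sorted(buckets.items())}


def lowest_scoring(scored_rows, k: int = 3) -> list:
    """The k lowest-scoring rows, as candidates for manual inspection."""
    return sorted(scored_rows, key=lambda row: row["score"])[:k]


# Illustrative values only, not results from the evaluation.
scored_rows = [
    {"id": "a", "language": "english", "score": 0.75},
    {"id": "b", "language": "english", "score": 0.25},
    {"id": "c", "language": "japanese", "score": 0.5},
    {"id": "d", "language": "greek", "score": 1.0},
]
print(overall_mean(scored_rows))
print(per_language_means(scored_rows))
print([row["id"] for row in lowest_scoring(scored_rows, k=2)])
```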
Summary Results
On the held-out test split, the reported mean cosine similarities are:
| Model | Mean cosine similarity |
|---|---|
| deepseek-v3.2-chat | 0.7811 |
| qwen3-1.7b-bs32 | 0.7808 |
| qwen3-4b-bs32 | 0.7820 |
The larger 4B variant scores slightly higher (0.7820 versus 0.7808), but the margin of 0.0012 is very small. In practice, the 1.7B model was chosen as the main release candidate because it provides nearly the same semantic alignment at a lower inference cost.
These numbers should be read as comparative offline diagnostics, not as direct evidence that one model will always retrieve better in production.
Per-Language Behavior
Language-level means remain relatively stable across the 21 supported languages. In the current evaluation record, the best and worst average language scores differ only by a modest margin, suggesting that the model preserves multilingual coverage rather than collapsing toward a few dominant languages.
Manual Analysis
Low-scoring cases tend to cluster around difficult passages such as narrative text, poetic text, ambiguous FAQ-style content, or passages where a concise retrieval-oriented formulation is inherently hard to express. Manual inspection is therefore used alongside cosine similarity to distinguish genuinely poor generations from semantically valid but embedding-challenging examples.
Recommended Use
This model is a good fit for:
- multilingual synthetic query generation
- document-to-query expansion
- retrieval dataset creation
- multilingual IR and reranking experiments
- converting raw passages into short search-style prompts
License
Apache License 2.0
Author
Yuichi Tateno (@hotchpotch)
