query-crafter-multilingual

hotchpotch/query-crafter-multilingual is a multilingual query generation model that produces search-oriented outputs from a given passage. It is built on top of Qwen/Qwen3-1.7B and tuned for fast generation of short retrieval-style texts such as natural-language questions, keyword queries, FAQ prompts, and compact summaries. At 1.7B parameters, it is small enough to generate simulated search queries quickly and efficiently across large text collections.

The model is intended for large-scale synthetic query generation from raw corpora. A typical use case is to take a large text collection and create many virtual search queries that could plausibly retrieve each passage.

Query Types

The model supports the following type values:

  • query: A natural-language question.
  • alt_query: A rephrased version of query that preserves the same meaning while changing wording.
  • keywords: A compact keyword-style query, typically around three salient terms.
  • synonym_keywords: A paraphrased keyword-style query using alternate wording rather than literal lexical overlap.
  • title: A short title representing the passage as a whole.
  • faq: A short FAQ-style question whose answer is the passage.
  • summary: A very short summary containing only the core fact.

Supported Languages

The model was trained for the following 21 languages:

  • Arabic
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • French
  • German
  • Greek
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Persian
  • Polish
  • Portuguese
  • Russian
  • Spanish
  • Swedish
  • Turkish
  • Vietnamese

Input Format

Training and inference use the same logical chat structure. In practice, it is easiest to think of the input as the following JSON-style payload before applying the model's chat template:

{
  "system_prompt": "language: english\ntype: query\n",
  "user_prompt": "Lithium iron phosphate batteries are widely used in electric buses and grid storage systems because they offer strong thermal stability and long cycle life."
}

The corresponding chat messages are:

[
  {
    "role": "system",
    "content": "language: english\ntype: query\n"
  },
  {
    "role": "user",
    "content": "Lithium iron phosphate batteries are widely used in electric buses and grid storage systems because they offer strong thermal stability and long cycle life."
  }
]

language should be one of the 21 supported languages written as a lowercase English language name, and type should be one of the 7 supported query types listed above.
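
As a minimal sketch of the format above, the helper below expands one passage into a chat-message list per query type, which is convenient when several formulations are wanted for the same passage. The names `system_prompt` and `messages_for_all_types` are hypothetical and not part of the model's tooling; only the two-line system-prompt layout is taken from this section.

```python
from __future__ import annotations

QUERY_TYPES = (
    "query",
    "alt_query",
    "keywords",
    "synonym_keywords",
    "title",
    "faq",
    "summary",
)


def system_prompt(language: str, query_type: str) -> str:
    """Format the two-line system prompt described above."""
    return f"language: {language}\ntype: {query_type}\n"


def messages_for_all_types(passage: str, language: str) -> dict[str, list[dict[str, str]]]:
    """Return one system + user message list per supported query type."""
    return {
        query_type: [
            {"role": "system", "content": system_prompt(language, query_type)},
            {"role": "user", "content": passage},
        ]
        for query_type in QUERY_TYPES
    }
```

Each value in the returned dictionary can be passed to the tokenizer's chat template on its own, so the seven requests for a passage can be batched together.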

Limitations

  • The model was trained primarily on passages in the 70-700 token range, measured with the Qwen3 tokenizer. Performance may degrade outside that range.
  • Long documents should be chunked before generation rather than passed as a single input.
  • When the model cannot produce a grounded output reliably, it may emit the literal string NONE.
  • The model is optimized for short retrieval-oriented outputs, not for long-form generation, open-ended dialogue, or role-play.
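
The chunking and NONE handling described above could be sketched as follows. This is an illustration under assumptions, not the author's pipeline: the helper names `chunk_by_tokens` and `clean_output` are hypothetical, `count_tokens` is any callable that returns a token count (for example one wrapping the Qwen3 tokenizer), and splitting on sentence-ending punctuation is a simplification that will not suit every language.

```python
from __future__ import annotations

import re
from typing import Callable

# Split after Western sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def chunk_by_tokens(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 700,
) -> list[str]:
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept as its own oversized
    chunk, and sentences inside a chunk are re-joined with one space.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def clean_output(generated: str) -> str | None:
    """Return the stripped output, or None for empty or literal NONE results."""
    value = generated.strip()
    if not value or value == "NONE":
        return None
    return value
```

The final chunk of a document can fall below the 70-token lower bound; whether to drop it or merge it into the previous chunk is a choice left to the caller.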

Synthetic Query Bias

The model tends to generate queries that are closely aligned with the surface form of the input passage and are often simpler than real-world user queries. In particular, it has limited ability to produce queries that require implicit reasoning, background knowledge, or multi-hop associations to identify the relevant document.

As a result, the distribution of generated queries may differ from that of real user queries. Training retrieval models solely on queries produced by this model may introduce bias toward shallow or lexically grounded matching, potentially reducing robustness to more complex, ambiguous, or context-dependent information needs.

To mitigate this effect, it is recommended to combine these synthetic queries with complementary data sources or apply additional query diversification strategies.

Inference Example

The following example uses transformers for single-machine inference. The configuration below assumes an NVIDIA GPU environment with FlashAttention 2 installed, selected through attn_implementation="flash_attention_2". For large-scale or production-style generation, using a fast inference server such as vLLM is strongly recommended.

#!/usr/bin/env python3
from __future__ import annotations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# This example assumes:
# - an NVIDIA GPU
# - torch with CUDA support
# - FlashAttention 2 installed and available via `flash_attention_2`

MODEL_NAME = "hotchpotch/query-crafter-multilingual"

LANGUAGES = {
    "arabic",
    "chinese",
    "czech",
    "danish",
    "dutch",
    "english",
    "french",
    "german",
    "greek",
    "hungarian",
    "indonesian",
    "italian",
    "japanese",
    "persian",
    "polish",
    "portuguese",
    "russian",
    "spanish",
    "swedish",
    "turkish",
    "vietnamese",
}

TYPES = {
    "query",
    "alt_query",
    "faq",
    "title",
    "summary",
    "keywords",
    "synonym_keywords",
}


def normalize_language(language: str) -> str:
    language = language.strip().lower()
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return language


def normalize_type(output_type: str) -> str:
    output_type = output_type.strip().lower()
    if output_type not in TYPES:
        raise ValueError(f"unsupported output_type: {output_type}")
    return output_type


def build_messages(text: str, language: str, output_type: str) -> list[dict[str, str]]:
    language = normalize_language(language)
    output_type = normalize_type(output_type)
    return [
        {
            "role": "system",
            "content": f"language: {language}\ntype: {output_type}\n",
        },
        {
            "role": "user",
            "content": text,
        },
    ]


@torch.inference_mode()
def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    model.eval()

    text = (
        "Lithium iron phosphate batteries are widely used in electric buses and grid "
        "storage systems because they offer strong thermal stability, long cycle life, "
        "and lower risk of thermal runaway than some nickel-rich chemistries. However, "
        "their lower electronic conductivity can reduce high-rate performance. To address "
        "this, researchers coated LiFePO4 particles with a thin carbon layer, reduced "
        "particle size to shorten lithium diffusion paths, and optimized electrode "
        "porosity to improve ion transport. In a 2024 pilot study, a modified cell design "
        "retained more than 90 percent of its initial capacity after 2,000 cycles under "
        "controlled temperature conditions while also improving fast-charging behavior."
    )

    messages = build_messages(
        text=text,
        language="english",
        output_type="query",
    )

    # Request a dict so both input_ids and attention_mask are returned,
    # regardless of the library's default return type.
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

    # Decode only the newly generated tokens, not the prompt.
    generated = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1] :],
        skip_special_tokens=True,
    ).strip()
    print(generated)


if __name__ == "__main__":
    main()

How The Model And Data Were Built

This model was developed through a multi-stage multilingual synthetic data pipeline designed for retrieval-oriented supervision rather than general instruction following.

1. Source Corpora

The training pipeline starts from two web-scale text sources on Hugging Face: fineweb-edu for English and FineWeb2-HQ for the remaining 20 languages. Each language subset is processed independently so that multilingual balance can be controlled during later stages.

2. Passage Filtering And Preprocessing

The raw corpora are converted into language-specific subsets and normalized into a common schema. During preprocessing:

  • each passage is truncated to at most 10,000 characters before token measurement
  • token length is measured with the Qwen/Qwen3-1.7B tokenizer
  • passages are filtered to a target token range of 70-700
  • Greek is a special case and may require a higher upper bound because tokenization can become inflated due to byte-level fallback behavior
  • metadata such as row_id, id, date, dump, file_path, token_count, and a best-effort quality score are retained for provenance

The purpose of this stage is to keep passages long enough to support grounded query generation, while excluding extremely short or overly long documents that are less suitable for retrieval-style supervision.
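
The filtering rules above can be sketched roughly as below. This is not the project's actual preprocessing code: the function name `filter_passages`, the `text` field name, and the `count_tokens` callable are assumptions made for illustration. The upper bound is a parameter so that a language such as Greek can be given a higher limit; no specific value for that case is implied here.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

MAX_CHARS = 10_000  # character cap applied before token measurement
MIN_TOKENS = 70
MAX_TOKENS = 700


def filter_passages(
    rows: Iterable[dict],
    count_tokens: Callable[[str], int],
    min_tokens: int = MIN_TOKENS,
    max_tokens: int = MAX_TOKENS,
) -> Iterator[dict]:
    """Yield rows whose truncated text falls inside the target token range."""
    for row in rows:
        # Truncate first so very long documents are cheap to measure.
        text = row["text"][:MAX_CHARS]
        token_count = count_tokens(text)
        if min_tokens <= token_count <= max_tokens:
            # Keep the original metadata and record the measured length.
            yield {**row, "text": text, "token_count": token_count}
```

With a loaded tokenizer, `count_tokens` could be something like `lambda t: len(tokenizer.encode(t, add_special_tokens=False))`.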

3. Instruction Assignment

Each passage is paired with one of seven retrieval-oriented generation targets:

  • query
  • alt_query
  • keywords
  • synonym_keywords
  • title
  • faq
  • summary

These instruction types are designed to capture different retrieval behaviors. Some are explicitly question-like, while others are more lexical or summarization-oriented. This lets the model generate multiple useful retrieval formulations for the same document rather than only one canonical question.

4. Teacher-Generated Synthetic Targets

After preprocessing, a teacher model is used to generate the target outputs for each passage. The generation prompt is written in English and includes explicit quality rules:

  • outputs must stay grounded in the source passage
  • entities, dates, and facts must not be invented
  • outputs should remain useful as standalone retrieval queries
  • if a grounded output is not possible, the teacher is allowed to return NONE

The teacher is asked to produce short retrieval-oriented outputs rather than verbose answers. This stage creates the synthetic supervision signal that is later used for fine-tuning.

5. Supervised Fine-Tuning Dataset Construction

Teacher outputs are merged back with passage metadata into an SFT dataset using the final system + user format. The resulting dataset keeps both the generated target and the original passage text so that each example is directly usable for instruction-style fine-tuning.
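
One way to picture a merged record is the sketch below. The field names `messages`, `text`, and `metadata`, and the function name `build_sft_example`, are assumptions for illustration; the published dataset's schema may differ. Only the system + user layout comes from the Input Format section.

```python
from __future__ import annotations


def build_sft_example(
    passage: str,
    language: str,
    query_type: str,
    target: str,
    metadata: dict | None = None,
) -> dict:
    """Combine a passage, its teacher output, and metadata into one SFT record."""
    return {
        "messages": [
            {"role": "system", "content": f"language: {language}\ntype: {query_type}\n"},
            {"role": "user", "content": passage},
            # The teacher-generated target becomes the assistant turn.
            {"role": "assistant", "content": target},
        ],
        # The original passage is kept alongside the chat-formatted example.
        "text": passage,
        "metadata": dict(metadata or {}),
    }
```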

The main synthetic dataset used in this project has been published separately and is the central training artifact for the released model family.

6. Model Fine-Tuning

The final model is obtained by supervised fine-tuning Qwen/Qwen3-1.7B with LoRA on the synthetic multilingual dataset. After training, the adapter is merged into a standalone checkpoint for inference and release.

The design goal was not to maximize general chat capability, but to produce a compact model that can reliably generate short retrieval-friendly outputs across many languages.

Evaluation Method

Evaluation is based on semantic alignment between the generated query and the source passage.

Embedding Model

All evaluation is performed with BAAI/bge-m3.

Procedure

For each held-out example:

  1. The generated query is embedded with BGE-M3.
  2. The source passage is embedded separately with the same model.
  3. Cosine similarity is computed between the two embeddings.

Rows whose output is empty or equal to NONE are excluded from scoring.
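
The scoring procedure can be sketched in a few lines. This is a simplified illustration rather than the evaluation script itself: `embed` stands for any callable that maps text to a dense vector (in the real setup it would wrap BGE-M3), and the row field names `output` and `passage` are assumptions.

```python
from __future__ import annotations

import math
from typing import Callable, Iterable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_rows(rows: Iterable[dict], embed: Callable[[str], Vector]) -> list[float]:
    """Score each row, skipping outputs that are empty or the literal NONE."""
    scores: list[float] = []
    for row in rows:
        output = row["output"].strip()
        if not output or output == "NONE":
            continue
        # Query and passage are embedded separately with the same model.
        scores.append(cosine_similarity(embed(output), embed(row["passage"])))
    return scores
```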

This evaluation does not measure usefulness in a full retrieval stack directly. Instead, it serves as an offline proxy for semantic faithfulness: if the generated output remains close to its source passage in embedding space, it is more likely to function as a retrieval-oriented reformulation of that passage.

It is important not to interpret these scores as end-to-end retrieval quality. BGE-M3 cosine similarity is a convenient offline semantic-alignment signal, but it is not the same as measuring search metrics such as recall, MRR, or nDCG in a real retrieval system.

It is also important to note that this score measures similarity between the generated query and the source passage, but does not assess whether the query is truly effective or well-formed for retrieval (e.g., being concise yet precise). In particular, queries that are overly similar or simply restate the passage may receive high scores, even if they are not optimal for retrieval purposes.

Evaluation Setup

The evaluation compares the teacher-generated reference data and multiple fine-tuned model variants under the same preprocessing pipeline. The reported analysis includes:

  • overall mean cosine similarity
  • per-language mean similarity
  • manual inspection of low-scoring examples

The same framework is used to compare smaller and larger Qwen3-based variants.
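
The overall and per-language means listed above could be aggregated as in the sketch below. It assumes each scored row is a dictionary with hypothetical `language` and `score` fields; the function name `summarize_scores` is likewise illustrative.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean
from typing import Iterable


def summarize_scores(scored_rows: Iterable[dict]) -> tuple[float, dict[str, float]]:
    """Return (overall mean, per-language means) for rows with language and score."""
    by_language: dict[str, list[float]] = defaultdict(list)
    for row in scored_rows:
        by_language[row["language"]].append(row["score"])
    # The overall mean is taken over all rows, not over the language means.
    overall = mean(score for scores in by_language.values() for score in scores)
    per_language = {lang: mean(scores) for lang, scores in sorted(by_language.items())}
    return overall, per_language
```

Averaging over rows rather than over languages means that larger language subsets weigh more in the overall figure; a macro-average over languages is the alternative if equal weighting is preferred.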

Summary Results

On the held-out test split, the reported mean cosine similarities are:

Model                 Mean cosine similarity
deepseek-v3.2-chat    0.7811
qwen3-1.7b-bs32       0.7808
qwen3-4b-bs32         0.7820

The larger 4B variant performs slightly better numerically, but the margin is very small. In practice, the 1.7B model was chosen as the main release candidate because it provides nearly the same semantic alignment at a lower inference cost.

These numbers should be read as comparative offline diagnostics, not as direct evidence that one model will always retrieve better in production.

Per-Language Behavior

Language-level means remain relatively stable across the 21 supported languages. In the current evaluation record, the best and worst average language scores differ only by a modest margin, suggesting that the model preserves multilingual coverage rather than collapsing toward a few dominant languages.

Manual Analysis

Low-scoring cases tend to cluster around difficult passages such as narrative text, poetic text, ambiguous FAQ-style content, or passages where a concise retrieval-oriented formulation is inherently hard to express. Manual inspection is therefore used alongside cosine similarity to distinguish genuinely poor generations from semantically valid but embedding-challenging examples.

Recommended Use

This model is a good fit for:

  • multilingual synthetic query generation
  • document-to-query expansion
  • retrieval dataset creation
  • multilingual IR and reranking experiments
  • converting raw passages into short search-style prompts

License

Apache License 2.0

Author

Yuichi Tateno (@hotchpotch)
