Libraries

How to use GX-XinGao/Qwen2.5-7B-R-Select-100k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="GX-XinGao/Qwen2.5-7B-R-Select-100k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GX-XinGao/Qwen2.5-7B-R-Select-100k")
model = AutoModelForCausalLM.from_pretrained("GX-XinGao/Qwen2.5-7B-R-Select-100k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use GX-XinGao/Qwen2.5-7B-R-Select-100k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "GX-XinGao/Qwen2.5-7B-R-Select-100k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GX-XinGao/Qwen2.5-7B-R-Select-100k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
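
The same call can be made from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`; the helper name `build_chat_request` is hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:8000",
                       model="GX-XinGao/Qwen2.5-7B-R-Select-100k"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Passing `base_url="http://localhost:30000"` should target an SGLang server in the same way, since both expose the same endpoint path.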

Use Docker

docker model run hf.co/GX-XinGao/Qwen2.5-7B-R-Select-100k

SGLang

How to use GX-XinGao/Qwen2.5-7B-R-Select-100k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "GX-XinGao/Qwen2.5-7B-R-Select-100k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GX-XinGao/Qwen2.5-7B-R-Select-100k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "GX-XinGao/Qwen2.5-7B-R-Select-100k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "GX-XinGao/Qwen2.5-7B-R-Select-100k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use GX-XinGao/Qwen2.5-7B-R-Select-100k with Docker Model Runner:
```
docker model run hf.co/GX-XinGao/Qwen2.5-7B-R-Select-100k
```

Qwen2.5-7B-R-Select-100k

Qwen2.5-7B-R-Select-100k is a supervised fine-tuned (SFT) model built on Qwen2.5-7B-Base and trained on the R-Select-100k dataset.

🧠 Model Summary

Base Model: Qwen/Qwen2.5-7B-Base
Training Data: OpenDataArena/R-Select-100k
Domain Coverage: General, Math, Code, Reasoning
Scale (selected training set): 100K samples

📄 About R-Select

R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models is a KDD 2026 paper that studies how to select high-quality SFT data from large heterogeneous instruction-tuning pools. R-Select is designed to move beyond single-metric filtering or manually designed aggregation rules by formulating data selection as a multi-metric weight optimization problem.

Specifically, R-Select annotates each sample with 30 quality metrics, clusters correlated metrics into functional groups, and learns a hierarchical selection policy through proxy-model validation. The learned policy is then applied to a source pool of over 3.4M samples to select 100K high-value samples for downstream SFT.

Code: https://github.com/OpenDataArena/R-Select

⚙️ Data Curation Pipeline

Figure: Overview of R-Select.

R-Select-100k is built by applying the R-Select data selection framework to a large-scale heterogeneous SFT data pool.

1️⃣ Source Pool Construction

We construct a candidate pool from 21 publicly available instruction-tuning datasets, covering diverse domains and paradigms such as reasoning, code, math, and general instruction following. The source pool includes datasets such as OpenThoughts3, AM-Thinking-v1-Distilled, OpenThoughts, OmniThought, and LIMO.

2️⃣ Multi-Metric Annotation

Each sample is annotated with 30 complementary data-quality metrics using the OpenDataArena toolkit. These metrics cover three broad categories:

Model-based metrics: e.g., IFD, PPL, SkyworkRM.
Heuristic metrics: e.g., token length, token entropy, MTLD.
LLM-as-Judge metrics: e.g., complexity.

Before optimization, metric values are standardized through outlier mitigation, Min-Max normalization, and score inversion for metrics where lower values indicate better quality. The scored data, together with detailed metric descriptions, is released at OpenDataArena/OpenDataArena-scored-data-2603.
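
The standardization steps can be sketched roughly as follows. This is an illustration, not the toolkit's implementation: the card does not say how outliers are mitigated, so quantile clipping at 1% and 99% is an assumption, and the function names and sample values are hypothetical.

```python
def clip_outliers(values, lower_q=0.01, upper_q=0.99):
    """Clamp values to an empirical quantile range (assumed outlier rule)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower_q * last)]
    high = ordered[round(upper_q * last)]
    return [min(max(v, low), high) for v in values]


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardize(values, lower_is_better=False):
    """Clip, normalize, and optionally invert so that higher always means better."""
    scaled = min_max(clip_outliers(values))
    return [1.0 - s for s in scaled] if lower_is_better else scaled


# Hypothetical perplexity values: lower is better, so the scores are inverted.
ppl = [12.0, 8.0, 20.0, 10.0]
print(standardize(ppl, lower_is_better=True))
```

After this step every metric lies in [0, 1] with the same orientation, which is what makes a weighted combination of metrics meaningful.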

3️⃣ Hierarchical Bayesian Optimization

R-Select formulates data selection as a learnable multi-metric weighting problem. Instead of relying on a single metric or a manually designed aggregation rule, it learns how to combine metrics automatically.

The optimization process has two stages:

Intra-group refinement: correlated metrics are clustered with Ward hierarchical clustering, and local weights are optimized within each metric group.
Inter-group integration: group-level signals are combined through a second-stage optimization to balance global quality dimensions.

The search uses a lightweight proxy model, Qwen3-1.7B-Base, and the Optuna TPE optimizer. The proxy validation set contains 2,255 samples from Omni-Math, GPQA, MMLU, BigCodeBench, and MBPP.
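
To make the two-level structure concrete, here is a minimal sketch of how a final score could be assembled from intra-group and inter-group weights. The linear weighted-average form is an assumption, and the group names, group membership, and weight values are hypothetical; in R-Select the groups come from clustering and the weights from the search.

```python
def group_score(sample, metric_weights):
    """Stage 1 (intra-group): weighted average of normalized metrics in one group."""
    total = sum(metric_weights.values())
    return sum(w * sample[m] for m, w in metric_weights.items()) / total


def final_score(sample, groups, group_weights):
    """Stage 2 (inter-group): weighted average of the group scores."""
    total = sum(group_weights.values())
    return sum(
        group_weights[name] * group_score(sample, weights)
        for name, weights in groups.items()
    ) / total


# Hypothetical metric groups and weights, for illustration only.
groups = {
    "difficulty": {"IFD": 0.7, "PPL": 0.3},
    "diversity": {"token_entropy": 0.5, "MTLD": 0.5},
}
group_weights = {"difficulty": 0.6, "diversity": 0.4}

# Metric values are assumed to be normalized to [0, 1] already.
sample = {"IFD": 0.8, "PPL": 0.4, "token_entropy": 0.5, "MTLD": 0.9}
print(round(final_score(sample, groups, group_weights), 3))
```

An optimizer such as Optuna's TPE sampler would propose the weight values, and each proposal would be judged by fine-tuning the proxy model on the resulting selection and measuring validation performance.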

4️⃣ Final Selection

After optimization, the learned weight vector is applied to the full candidate pool. Each sample receives a final scalar quality score, and the top 100K samples are selected to form R-Select-100k.
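
The final step amounts to a top-k selection over scalar scores. A minimal sketch, assuming each sample is a dictionary carrying an `overall_score` field; the sample IDs and score values are invented.

```python
import heapq


def select_top_k(pool, k):
    """Return the k samples with the highest overall_score, best first."""
    return heapq.nlargest(k, pool, key=lambda s: s["overall_score"])


# Hypothetical three-sample pool; the real pool holds over 3.4M samples and k is 100,000.
pool = [
    {"id": "a", "overall_score": 0.42},
    {"id": "b", "overall_score": 0.91},
    {"id": "c", "overall_score": 0.67},
]
print([s["id"] for s in select_top_k(pool, k=2)])
```

`heapq.nlargest` avoids sorting the whole pool, which matters when k is small relative to the pool size.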

📚 Source Composition

| Source | Selected count | Percentage |
| --- | ---: | ---: |
| AM-Thinking-v1-Distilled-math | 28,434 | 28.43% |
| OpenThoughts | 22,904 | 22.90% |
| AM-Thinking-v1-Distilled-code | 12,670 | 12.67% |
| Tulu-3-Persona-MATH | 8,693 | 8.69% |
| OpenO1-SFT | 6,670 | 6.67% |
| OpenThoughts3 | 4,062 | 4.06% |
| NuminaMath-TIR | 3,825 | 3.83% |
| Raiden-DeepSeek-R1 | 3,104 | 3.10% |
| OmniThought | 1,735 | 1.74% |
| Magpie-Pro-GPT4o-mini | 1,576 | 1.58% |
| Fast-Math-R1-SFT | 1,493 | 1.49% |
| Tulu-3-Persona-Python | 1,469 | 1.47% |
| Tulu-3-Persona-IF | 1,403 | 1.40% |
| SYNTHETIC-2-SFT-verified | 1,014 | 1.01% |
| Tulu-3-Persona-Algebra | 428 | 0.43% |
| Tulu-3-Persona-GSM | 190 | 0.19% |
| FLAN-v2 | 105 | 0.11% |
| LIMO | 99 | 0.10% |
| Evol-CodeAlpaca | 80 | 0.08% |
| SciRIFF | 46 | 0.05% |
| No-Robots | 0 | 0.00% |
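
As a quick check on the table, the percentages are each source's count divided by the 100,000 selected samples. The sketch below recomputes them for the three largest sources, using counts copied from the table.

```python
# Selected counts for the three largest sources, copied from the table.
counts = {
    "AM-Thinking-v1-Distilled-math": 28434,
    "OpenThoughts": 22904,
    "AM-Thinking-v1-Distilled-code": 12670,
}
TOTAL = 100_000  # size of R-Select-100k

for name, count in counts.items():
    print(f"{name}: {count / TOTAL:.2%}")

# Combined share of the three largest sources.
top3_share = sum(counts.values()) / TOTAL
print(f"top-3 share: {top3_share:.2%}")
```

Together these three sources account for roughly 64% of the selected samples.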

📚 Data Format

{
  "id": "unique_identifier",
  "source": "source_dataset",
  "instruction": "textual question or instruction",
  "output": "textual response",
  "scores": {
    "AtheneScore": ...,
    "CleanlinessScore": ...,
    ...
  },
  "overall_score": ...
}
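
Records in this shape can be filtered on both the overall score and individual metrics. The sketch below assumes that `scores` is a mapping from metric name to a numeric value and that records are stored one JSON object per line. The records show only a subset of fields, and all values and thresholds are invented for illustration.

```python
import json

# Hypothetical JSON Lines content; values are made up and only some fields are shown.
lines = [
    '{"id": "ex-1", "source": "OpenThoughts", "scores": {"CleanlinessScore": 0.9}, "overall_score": 0.7}',
    '{"id": "ex-2", "source": "LIMO", "scores": {"CleanlinessScore": 0.4}, "overall_score": 0.3}',
]
records = [json.loads(line) for line in lines]


def keep(record, min_overall=0.5, min_cleanliness=0.5):
    """True when the overall score and one per-metric score both clear their thresholds."""
    cleanliness = record["scores"].get("CleanlinessScore", 0.0)
    return record["overall_score"] >= min_overall and cleanliness >= min_cleanliness


kept_ids = [r["id"] for r in records if keep(r)]
print(kept_ids)
```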

📈 Performance

R-Select-100k is evaluated as an SFT corpus for both Qwen2.5-7B-Base and Qwen3-8B-Base. Evaluation is conducted with OpenCompass on unseen benchmarks across four domains:

General: DROP, IFEval, MMLU-Pro
Math: MATH500, OlympiadBench, AIME2024
Code: HumanEval, HumanEval+, LiveCodeBench v5
Reasoning: ARC-C, BBH, KOR-Bench

R-Select-100k achieves the best reported average performance among the compared open-source SFT datasets while using only 100K samples.

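Comparison between R-Select and representative open-source high-quality SFT datasets. Best scores in **bold**, second-best underlined. Column abbreviations: MMLU-P = MMLU-Pro, OLYMP = OlympiadBench, HE = HumanEval, HE+ = HumanEval+, LCB = LiveCodeBench v5, K.B. = KOR-Bench.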
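Base model: Qwen2.5-7B-Base

| Dataset | Size | DROP | IFEval | MMLU-P | MATH500 | OLYMP | AIME'24 | HE | HE+ | LCB | ARC-C | BBH | K.B. | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Base | - | 68.3 | 35.5 | 44.2 | 50.2 | 35.9 | 6.7 | 77.4 | 43.3 | 8.2 | 36.6 | 69.5 | 33.3 | 45.8 |
| LIMO | 817 | 73.5 | 53.3 | 54.7 | 66.8 | 34.9 | 4.6 | 83.5 | 59.8 | 17.6 | 90.9 | 55.6 | 48.3 | 53.6 |
| Light-R1-SFT | 79k | 79.4 | 38.5 | 46.3 | 88.0 | 60.2 | 38.3 | 45.7 | 40.2 | 3.2 | 78.0 | 72.7 | 52.3 | 53.6 |
| SYNTHETIC-2-SFT | 105k | 64.4 | 54.9 | 35.5 | 90.0 | 67.4 | 45.0 | 42.1 | 40.2 | 11.1 | 93.2 | 81.1 | 58.8 | 57.0 |
| OpenThoughts | 114k | 68.5 | 41.2 | 55.6 | 83.6 | 53.3 | 22.5 | 68.9 | 68.9 | 17.2 | 90.5 | 72.4 | 51.0 | 57.8 |
| OmniThought | 365k | 52.8 | 34.3 | 38.1 | 89.8 | 68.1 | 50.4 | 57.9 | 51.8 | 17.9 | 90.5 | 76.8 | 58.9 | 57.2 |
| MiroMind | 719k | 82.3 | 30.6 | 38.6 | 91.6 | 66.3 | 55.0 | 32.3 | 26.2 | 5.4 | 84.1 | 74.9 | 51.4 | 53.2 |
| Tulu3-sft-mixture | 939k | 62.3 | 72.1 | 46.0 | 55.5 | 29.0 | 4.4 | 79.1 | 66.7 | 12.3 | 80.7 | 61.9 | 48.5 | 51.5 |
| R-Select | 100k | 65.7 | 46.6 | 46.9 | 86.6 | 64.4 | 31.7 | 65.9 | 63.6 | 19.0 | 87.5 | 66.8 | 52.2 | 58.1 |

Base model: Qwen3-8B-Base

| Dataset | Size | DROP | IFEval | MMLU-P | MATH500 | OLYMP | AIME'24 | HE | HE+ | LCB | ARC-C | BBH | K.B. | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B-Base | - | 71.5 | 45.9 | 56.2 | 79.6 | 47.2 | 6.7 | 82.9 | 34.2 | 16.9 | 37.3 | 78.1 | 46.6 | 50.3 |
| LIMO | 817 | 72.0 | 52.3 | 49.8 | 69.0 | 31.3 | 12.5 | 81.1 | 62.2 | 13.6 | 83.4 | 48.7 | 45.7 | 51.8 |
| Light-R1-SFT | 79k | 83.4 | 46.8 | 57.7 | 92.6 | 69.7 | 54.6 | 81.7 | 65.9 | 14.7 | 92.2 | 84.5 | 60.4 | 67.0 |
| SYNTHETIC-2-SFT | 105k | 38.7 | 67.6 | 56.8 | 93.8 | 71.5 | 58.8 | 81.1 | 42.7 | 18.3 | 92.9 | 86.6 | 64.1 | 64.4 |
| OpenThoughts | 114k | 79.6 | 43.6 | 38.7 | 92.2 | 71.5 | 47.9 | 72.0 | 75.0 | 31.5 | 91.2 | 82.3 | 57.7 | 65.2 |
| OmniThought | 365k | 50.3 | 49.7 | 46.2 | 95.4 | 74.9 | 67.9 | 91.5 | 64.6 | 29.4 | 93.9 | 86.8 | 65.0 | 67.9 |
| MiroMind | 719k | 85.0 | 43.5 | 55.1 | 96.8 | 77.0 | 62.9 | 82.3 | 71.3 | 21.2 | 92.9 | 86.5 | 62.0 | 69.7 |
| Tulu3-sft-mixture | 939k | 56.3 | 69.6 | 44.7 | 56.4 | 34.4 | 4.2 | 72.0 | 62.8 | 15.1 | 85.4 | 64.7 | 47.0 | 51.0 |
| R-Select | 100k | 81.8 | 56.4 | 61.1 | 94.4 | 71.1 | 56.7 | 81.1 | 75.6 | 28.0 | 92.9 | 80.6 | 59.1 | 69.9 |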
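
The AVG column appears to be the unweighted mean of the twelve benchmark scores. That is an assumption, but it reproduces the reported values for the R-Select rows, as the sketch below shows with scores copied from the table.

```python
# Twelve benchmark scores for the R-Select rows, copied from the table (DROP through K.B.).
r_select_qwen25 = [65.7, 46.6, 46.9, 86.6, 64.4, 31.7, 65.9, 63.6, 19.0, 87.5, 66.8, 52.2]
r_select_qwen3 = [81.8, 56.4, 61.1, 94.4, 71.1, 56.7, 81.1, 75.6, 28.0, 92.9, 80.6, 59.1]


def average(scores):
    """Unweighted mean, rounded to one decimal place as in the table."""
    return round(sum(scores) / len(scores), 1)


print(average(r_select_qwen25), average(r_select_qwen3))
```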

🚀 Usage

Model repo: OpenDataArena/Qwen2.5-7B-R-Select-100k. Below is a minimal runnable example for loading and inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-R-Select-100k"

# Load the tokenizer and model. device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
]

# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample up to 256 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# outputs[0] holds the prompt tokens followed by the generated tokens, so both are printed.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

🌐 About OpenDataArena

OpenDataArena is an open research platform for discovering, evaluating, and advancing high-quality datasets for AI post-training. R-Select uses the OpenDataArena toolkit for multi-metric data annotation and quality analysis.

Key Features:

🏆 Dataset Leaderboard — helps researchers identify valuable and high-quality datasets across different domains.
📊 Detailed Evaluation Scores — provides rich metric annotations and scored data to assess data quality, complexity, difficulty, and related properties.
🧰 Data Scoring Toolkit — provides an open-source toolkit for scoring datasets with multiple quality metrics.
🧬 Data Lineage — analyzes relationships among datasets by exploring their composition and source overlap.

If you find our work helpful, please consider ⭐ starring and subscribing to support our research.

📚 Citation

@inproceedings{gao2026rselect,
  title={R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models},
  author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Liu, Zheng and He, Conghui and Wu, Lijun},
  booktitle={Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
  year={2026},
  publisher={ACM},
  doi={10.1145/3770855.3817656}
}

@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}

Model size: 8B params (Safetensors). Tensor type: BF16.
