Qwen2.5-7B-R-Select-100k
Qwen2.5-7B-R-Select-100k is a supervised fine-tuned (SFT) model built on top of Qwen2.5-7B-Base, trained with R-Select-100k.
Model Summary
- Base Model: Qwen/Qwen2.5-7B-Base
- Training Data: OpenDataArena/R-Select-100k
- Domain Coverage: General, Math, Code, Reasoning
- Scale (selected training set): 100K samples
About R-Select
R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models is a KDD 2026 paper that studies how to select high-quality SFT data from large heterogeneous instruction-tuning pools. R-Select is designed to move beyond single-metric filtering or manually designed aggregation rules by formulating data selection as a multi-metric weight optimization problem.
Specifically, R-Select annotates each sample with 30 quality metrics, clusters correlated metrics into functional groups, and learns a hierarchical selection policy through proxy-model validation. The learned policy is then applied to a source pool of over 3.4M samples to select 100K high-value samples for downstream SFT.
Code: https://github.com/OpenDataArena/R-Select
Data Curation Pipeline
Overview of R-Select.
R-Select-100k is built by applying the R-Select data selection framework to a large-scale heterogeneous SFT data pool.
1. Source Pool Construction
We construct a candidate pool from 21 publicly available instruction-tuning datasets, covering diverse domains and paradigms such as reasoning, code, math, and general instruction following. The source pool includes datasets such as OpenThoughts3, AM-Thinking-v1-Distilled, OpenThoughts, OmniThought, and LIMO.
2. Multi-Metric Annotation
Each sample is annotated with 30 complementary data-quality metrics using the OpenDataArena toolkit. These metrics cover three broad categories:
- Model-based metrics: e.g., IFD, PPL, SkyworkRM.
- Heuristic metrics: e.g., token length, token entropy, MTLD.
- LLM-as-Judge: e.g., complexity.
Before optimization, metric values are standardized through outlier mitigation, Min-Max normalization, and score inversion for metrics where lower values indicate better quality. The scored version of the data has also been released at: OpenDataArena/OpenDataArena-scored-data-2603. Detailed metric information can also be found there.
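The standardization step can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function name `normalize_metric`, the quantile-clipping rule used for outlier mitigation, and the sample values are all assumptions.

```python
def normalize_metric(values, lower_is_better=False, clip_q=(0.01, 0.99)):
    """Clip outliers to quantiles, Min-Max scale to [0, 1], optionally invert."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed outlier rule: clip to nearest-rank quantiles of the metric.
    lo = ordered[int(clip_q[0] * (n - 1))]
    hi = ordered[int(clip_q[1] * (n - 1))]
    clipped = [min(max(v, lo), hi) for v in values]
    span = hi - lo
    if span == 0:
        # A constant metric carries no ranking signal; map it to the midpoint.
        scaled = [0.5] * n
    else:
        scaled = [(v - lo) / span for v in clipped]
    if lower_is_better:
        # Score inversion, e.g. for perplexity-style metrics.
        scaled = [1.0 - s for s in scaled]
    return scaled


# Hypothetical perplexity values with one extreme outlier.
ppl = [10.0, 12.0, 11.0, 500.0, 9.0]
ppl_scores = normalize_metric(ppl, lower_is_better=True, clip_q=(0.0, 0.75))
```

After this step every metric lies in [0, 1] with higher meaning better, so weights learned later are comparable across metrics.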
3. Hierarchical Bayesian Optimization
R-Select formulates data selection as a learnable multi-metric weighting problem. Instead of using a single metric or manually designed aggregation rule, it learns how to combine metrics automatically.
The optimization process has two stages:
- Intra-group refinement: correlated metrics are clustered with Ward hierarchical clustering, and local weights are optimized within each metric group.
- Inter-group integration: group-level signals are combined through a second-stage optimization to balance global quality dimensions.
The search uses a lightweight proxy model, Qwen3-1.7B-Base, and the Optuna TPE optimizer. The proxy validation set contains 2,255 samples from Omni-Math, GPQA, MMLU, BigCodeBench, and MBPP.
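To make the two-stage structure concrete, the sketch below combines normalized metric values into a single score using intra-group and inter-group weights. The weighted-sum form, the group names, and all weight values are assumptions for illustration; in R-Select the weights come from the proxy-model search, which is not reproduced here.

```python
def hierarchical_score(sample, groups, intra_weights, inter_weights):
    """Combine normalized metric values into one scalar in two stages."""
    total = 0.0
    for group_name, metric_names in groups.items():
        # Stage 1 (intra-group): weighted sum of the metrics in one group.
        group_score = sum(intra_weights[m] * sample[m] for m in metric_names)
        # Stage 2 (inter-group): weight the group-level signal.
        total += inter_weights[group_name] * group_score
    return total


# Hypothetical grouping and weights; only the metric names come from the text.
groups = {"difficulty": ["IFD", "PPL"], "diversity": ["MTLD", "token_entropy"]}
intra_weights = {"IFD": 0.5, "PPL": 0.5, "MTLD": 0.25, "token_entropy": 0.75}
inter_weights = {"difficulty": 0.5, "diversity": 0.5}

sample = {"IFD": 1.0, "PPL": 0.5, "MTLD": 0.0, "token_entropy": 1.0}
score = hierarchical_score(sample, groups, intra_weights, inter_weights)
```

Splitting the weights this way keeps each search small: the optimizer tunes a handful of weights per group, then a handful of group weights, rather than all 30 metric weights at once.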
4. Final Selection
After optimization, the learned weight vector is applied to the full candidate pool. Each sample receives a final scalar quality score, and the top 100K samples are selected to form R-Select-100k.
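The top-K step itself is simple once each sample has a scalar score. A minimal sketch, assuming records are dictionaries carrying an `overall_score` field; the helper name `select_top_k` and the toy pool are hypothetical.

```python
import heapq


def select_top_k(samples, k, score_key="overall_score"):
    """Return the k highest-scoring samples, best first."""
    # nlargest avoids sorting the whole pool, which helps when the pool
    # holds millions of records and k is comparatively small.
    return heapq.nlargest(k, samples, key=lambda s: s[score_key])


# Toy pool with made-up scores.
pool = [
    {"id": "a", "overall_score": 0.9},
    {"id": "b", "overall_score": 0.1},
    {"id": "c", "overall_score": 0.5},
    {"id": "d", "overall_score": 0.7},
]
selected_ids = [s["id"] for s in select_top_k(pool, k=2)]
```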
Source Composition
| Source | Selected count | Percentage |
|---|---|---|
| AM-Thinking-v1-Distilled-math | 28,434 | 28.43% |
| OpenThoughts | 22,904 | 22.90% |
| AM-Thinking-v1-Distilled-code | 12,670 | 12.67% |
| Tulu-3-Persona-MATH | 8,693 | 8.69% |
| OpenO1-SFT | 6,670 | 6.67% |
| OpenThoughts3 | 4,062 | 4.06% |
| NuminaMath-TIR | 3,825 | 3.83% |
| Raiden-DeepSeek-R1 | 3,104 | 3.10% |
| OmniThought | 1,735 | 1.74% |
| Magpie-Pro-GPT4o-mini | 1,576 | 1.58% |
| Fast-Math-R1-SFT | 1,493 | 1.49% |
| Tulu-3-Persona-Python | 1,469 | 1.47% |
| Tulu-3-Persona-IF | 1,403 | 1.40% |
| SYNTHETIC-2-SFT-verified | 1,014 | 1.01% |
| Tulu-3-Persona-Algebra | 428 | 0.43% |
| Tulu-3-Persona-GSM | 190 | 0.19% |
| FLAN-v2 | 105 | 0.11% |
| LIMO | 99 | 0.10% |
| Evol-CodeAlpaca | 80 | 0.08% |
| SciRIFF | 46 | 0.05% |
| No-Robots | 0 | 0.00% |
Data Format
{
  "id": "unique_identifier",
  "source": "source_dataset",
  "instruction": "textual question or instruction",
  "output": "textual response",
  "scores": {
    "AtheneScore": ...,
    "CleanlinessScore": ...,
    ...
  },
  "overall_score": ...
}
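As a rough illustration of working with records in this layout, the sketch below parses a few JSON lines and counts samples per source above a score threshold. It assumes the data is stored one JSON object per line and that `scores` maps metric names to numbers. All record values are made up, and the response field is left out of the toy records for brevity.

```python
import json
from collections import Counter

# Made-up records in the documented layout (values are illustrative only).
jsonl_lines = [
    '{"id": "ex-0", "source": "LIMO", "instruction": "What is 2 + 2?", '
    '"scores": {"AtheneScore": 0.5, "CleanlinessScore": 0.9}, "overall_score": 0.7}',
    '{"id": "ex-1", "source": "OpenThoughts", "instruction": "Sort a list.", '
    '"scores": {"AtheneScore": 0.2, "CleanlinessScore": 0.8}, "overall_score": 0.4}',
    '{"id": "ex-2", "source": "LIMO", "instruction": "Prove a lemma.", '
    '"scores": {"AtheneScore": 0.9, "CleanlinessScore": 1.0}, "overall_score": 0.9}',
]


def count_by_source(lines, min_score=0.0):
    """Count records per source whose overall_score is at least min_score."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record["overall_score"] >= min_score:
            counts[record["source"]] += 1
    return counts


kept = count_by_source(jsonl_lines, min_score=0.5)
```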
Performance
R-Select-100k is evaluated as an SFT corpus for both Qwen2.5-7B-Base and Qwen3-8B-Base. Evaluation is conducted with OpenCompass on unseen benchmarks across four domains:
- General: DROP, IFEval, MMLU-Pro
- Math: MATH500, OlympiadBench, AIME2024
- Code: HumanEval, HumanEval+, LiveCodeBench v5
- Reasoning: ARC-C, BBH, KOR-Bench
R-Select-100k achieves the best reported average performance among the compared open-source SFT datasets while using only 100K samples.
| Dataset | Size | DROP | IFEval | MMLU-Pro | MATH500 | OlympiadBench | AIME2024 | HumanEval | HumanEval+ | LiveCodeBench v5 | ARC-C | BBH | KOR-Bench | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Base | | | | | | | | | | | | | | |
| Qwen2.5-7B-Base | - | 68.3 | 35.5 | 44.2 | 50.2 | 35.9 | 6.7 | 77.4 | 43.3 | 8.2 | 36.6 | 69.5 | 33.3 | 45.8 |
| LIMO | 817 | 73.5 | 53.3 | 54.7 | 66.8 | 34.9 | 4.6 | 83.5 | 59.8 | 17.6 | 90.9 | 55.6 | 48.3 | 53.6 |
| Light-R1-SFT | 79k | 79.4 | 38.5 | 46.3 | 88.0 | 60.2 | 38.3 | 45.7 | 40.2 | 3.2 | 78.0 | 72.7 | 52.3 | 53.6 |
| SYNTHETIC-2-SFT | 105k | 64.4 | 54.9 | 35.5 | 90.0 | 67.4 | 45.0 | 42.1 | 40.2 | 11.1 | 93.2 | 81.1 | 58.8 | 57.0 |
| OpenThoughts | 114k | 68.5 | 41.2 | 55.6 | 83.6 | 53.3 | 22.5 | 68.9 | 68.9 | 17.2 | 90.5 | 72.4 | 51.0 | 57.8 |
| OmniThought | 365k | 52.8 | 34.3 | 38.1 | 89.8 | 68.1 | 50.4 | 57.9 | 51.8 | 17.9 | 90.5 | 76.8 | 58.9 | 57.2 |
| MiroMind | 719k | 82.3 | 30.6 | 38.6 | 91.6 | 66.3 | 55.0 | 32.3 | 26.2 | 5.4 | 84.1 | 74.9 | 51.4 | 53.2 |
| Tulu3-sft-mixture | 939k | 62.3 | 72.1 | 46.0 | 55.5 | 29.0 | 4.4 | 79.1 | 66.7 | 12.3 | 80.7 | 61.9 | 48.5 | 51.5 |
| R-Select | 100k | 65.7 | 46.6 | 46.9 | 86.6 | 64.4 | 31.7 | 65.9 | 63.6 | 19.0 | 87.5 | 66.8 | 52.2 | 58.1 |
| Qwen3-8B-Base | | | | | | | | | | | | | | |
| Qwen3-8B-Base | - | 71.5 | 45.9 | 56.2 | 79.6 | 47.2 | 6.7 | 82.9 | 34.2 | 16.9 | 37.3 | 78.1 | 46.6 | 50.3 |
| LIMO | 817 | 72.0 | 52.3 | 49.8 | 69.0 | 31.3 | 12.5 | 81.1 | 62.2 | 13.6 | 83.4 | 48.7 | 45.7 | 51.8 |
| Light-R1-SFT | 79k | 83.4 | 46.8 | 57.7 | 92.6 | 69.7 | 54.6 | 81.7 | 65.9 | 14.7 | 92.2 | 84.5 | 60.4 | 67.0 |
| SYNTHETIC-2-SFT | 105k | 38.7 | 67.6 | 56.8 | 93.8 | 71.5 | 58.8 | 81.1 | 42.7 | 18.3 | 92.9 | 86.6 | 64.1 | 64.4 |
| OpenThoughts | 114k | 79.6 | 43.6 | 38.7 | 92.2 | 71.5 | 47.9 | 72.0 | 75.0 | 31.5 | 91.2 | 82.3 | 57.7 | 65.2 |
| OmniThought | 365k | 50.3 | 49.7 | 46.2 | 95.4 | 74.9 | 67.9 | 91.5 | 64.6 | 29.4 | 93.9 | 86.8 | 65.0 | 67.9 |
| MiroMind | 719k | 85.0 | 43.5 | 55.1 | 96.8 | 77.0 | 62.9 | 82.3 | 71.3 | 21.2 | 92.9 | 86.5 | 62.0 | 69.7 |
| Tulu3-sft-mixture | 939k | 56.3 | 69.6 | 44.7 | 56.4 | 34.4 | 4.2 | 72.0 | 62.8 | 15.1 | 85.4 | 64.7 | 47.0 | 51.0 |
| R-Select | 100k | 81.8 | 56.4 | 61.1 | 94.4 | 71.1 | 56.7 | 81.1 | 75.6 | 28.0 | 92.9 | 80.6 | 59.1 | 69.9 |
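For the two R-Select rows, the AVG column is consistent with an unweighted mean of the twelve benchmark scores, rounded to one decimal place. The check below reproduces that arithmetic; whether every other row was aggregated the same way is an assumption this sketch does not verify.

```python
from statistics import mean

# Benchmark scores copied from the R-Select rows of the table above,
# in column order (DROP through KOR-Bench).
r_select_rows = {
    "Qwen2.5-7B-Base": [65.7, 46.6, 46.9, 86.6, 64.4, 31.7,
                        65.9, 63.6, 19.0, 87.5, 66.8, 52.2],
    "Qwen3-8B-Base": [81.8, 56.4, 61.1, 94.4, 71.1, 56.7,
                      81.1, 75.6, 28.0, 92.9, 80.6, 59.1],
}

# Unweighted mean over the twelve benchmarks, rounded like the AVG column.
averages = {base: round(mean(scores), 1) for base, scores in r_select_rows.items()}
```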
Usage
Model repo: OpenDataArena/Qwen2.5-7B-R-Select-100k. Below is a minimal runnable example for loading and inference:
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-R-Select-100k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
]

# Render the chat template to text, then tokenize it for generation.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# The decoded string contains the prompt followed by the generated reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
About OpenDataArena
OpenDataArena is an open research platform for discovering, evaluating, and advancing high-quality datasets for AI post-training. R-Select uses the OpenDataArena toolkit for multi-metric data annotation and quality analysis.
Key Features:
- Dataset Leaderboard: helps researchers identify valuable, high-quality datasets across different domains.
- Detailed Evaluation Scores: provides rich metric annotations and scored data to assess data quality, complexity, difficulty, and related properties.
- Data Scoring Toolkit: provides an open-source toolkit for scoring datasets with multiple quality metrics.
- Data Lineage: analyzes relationships among datasets by exploring their composition and source overlap.
If you find our work helpful, please consider starring and subscribing to support our research.
Citation
@inproceedings{gao2026rselect,
title={R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models},
author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Liu, Zheng and He, Conghui and Wu, Lijun},
booktitle={Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
year={2026},
publisher={ACM},
doi={10.1145/3770855.3817656}
}
@article{cai2025opendataarena,
title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
journal={arXiv preprint arXiv:2512.14051},
year={2025}
}