MOP-RL-model

Introduction

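MOP-RL-model is a large language model built for multi-objective mixed-integer linear programming (MO-MILP) tasks. It is based on the Qwen2.5-7B architecture and was aligned with a novel multi-objective programming reinforcement learning framework (MOP-RL).

In complex resource scheduling, smart manufacturing, and modern logistics decisions, conventional large models often struggle to balance conflicting objectives, and they are prone to "logical hallucination" or reward hacking during long-sequence reasoning. MOP-RL-model targets this bottleneck: beyond constructing basic operations research constraints, it learns to trade off competing objectives and locate solutions on the Pareto front in high-dimensional decision spaces.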

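The model's core technical highlights include:

Two-stage Curriculum Learning: The model is aligned in steps, from single-objective foundational training (dense rewards) to multi-objective extension training (sparse Pareto rewards), which effectively suppresses policy oscillation and the collapse of foundational capabilities during reinforcement learning.
Pareto-Aware Reward Shaping: Instead of a conventional scalar-approximation reward, the framework introduces a Pareto verifier based on dominance tests run with an underlying solver (such as Gurobi), turning a complex mathematical proof into precise, solver-verified feedback.
REINFORCE++ Algorithm: An improved critic-free policy gradient algorithm (no value network) that combines in-batch advantage normalization with probability-ratio clipping, significantly improving convergence stability for structured chain-of-thought (Structured CoT) reasoning that spans more than a thousand tokens.
Structured CoT Output: The model is required to follow a strict "problem analysis -> modeling and scalarization -> executable code generation" format, which keeps the generated results executable and logically self-consistent.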
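
As a rough illustration of how a Pareto-aware reward and a critic-free clipped update could fit together, here is a minimal sketch in plain Python. It is not the training code from the MOP-RL repository. The function names, the fixed list of reference points standing in for a solver-based dominance test, the convention that all objectives are minimized, and `clip_eps=0.2` are all assumptions made for illustration.

```python
import math

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_reward(candidate, reference_points):
    """Sparse reward: 1.0 if no reference point dominates the candidate, else 0.0.

    Assumption: reference_points stands in for whatever the real verifier
    obtains from the solver; the actual framework may work differently.
    """
    if any(dominates(ref, candidate) for ref in reference_points):
        return 0.0
    return 1.0

def normalized_advantages(rewards, eps=1e-8):
    """Critic-free advantages: rewards standardized across the batch."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)

if __name__ == "__main__":
    references = [(2.0, 2.0)]
    candidates = [(1.0, 3.0), (3.0, 3.0), (2.0, 1.0), (4.0, 2.0)]
    rewards = [pareto_reward(c, references) for c in candidates]
    print(rewards)
    print(normalized_advantages(rewards))
```

Because the reward is binary, batch normalization is what turns it into a usable learning signal: non-dominated samples get positive advantages and dominated ones get negative advantages, without a learned value network.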

Requirements

The code for running MOP-RL-model is the same as for a standard Qwen2.5 model (see https://huggingface.co/Qwen/Qwen2.5-7B for details). We recommend using the latest version of transformers.

pip install "transformers>=4.37.0"
# To run the generated optimization code locally, also install the solver
pip install gurobipy

Warning: The generated code often imports gurobipy. Ensure you have a valid Gurobi license to execute the generated solver scripts in your local environment.
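
Before running generated scripts, it can help to confirm that both packages are importable in the current environment. The sketch below is one simple way to do that; the helper name `has_module` is hypothetical, and the check only looks for the installed package. It does not validate a Gurobi license.

```python
import importlib.util

def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    for pkg in ("transformers", "gurobipy"):
        status = "found" if has_module(pkg) else "missing"
        print(f"{pkg}: {status}")
```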

Quickstart

The following example uses the transformers library to load the model and run it on a multi-objective modeling task:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "YourOrg/MOP-RL-model"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

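# Example prompt for a multi-objective scheduling problem.
# {problem_desc} and {json_template} are placeholders: substitute your own text
# (for example with str.replace) before building `messages`.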
prompt = """
**[Task]**
Based on the [Problem Description] and [Target JSON Format] below, please write a complete Python script.
The script must use the `gurobipy` library to model and solve the multi-objective problem, and encapsulate the final solution results into JSON data, writing them to a file named `input.json`.

**[Input Information]**
1.  **Multi-objective Problem Description**:
    \"\"\"
    {problem_desc}
    \"\"\"
2.  **Target Output JSON Format Example**:
    \"\"\"
    {json_template}
    \"\"\"

**[Requirements]**
1.  **Modeling Logic**:
    - Clearly define decision variables and constraints.
    - **Multi-objective Handling**: Choose an appropriate multi-objective processing method based on the context (e.g., weighted sum, hierarchical sequence/lexicographic, or Pareto frontier). If weights are not specified, assume equal weights or provide adjustable parameters.
2.  **Code Standards**:
    - The code must be a complete, runnable Python script.
    - Include necessary comments explaining the mathematical model.
    - Must include checks for model solution status (e.g., checking for `GRB.OPTIMAL`).
3.  **Data Output**:
    - After solving, extract variable values.
    - **Format Matching**: Construct a Python dictionary that strictly matches the [Target Output JSON Format Example].
    - **File Writing**: Use `json.dump` to save the result as `input.json`.
4.  **Final Output**:
    - Please wrap your code in a Python markdown block, i.e., starts with ```python and ends with ```.
    - Provide only the Python code block."""

messages = [
    {"role": "system", "content": "You are an algorithm expert proficient in Operations Research and the Python Gurobi solver. You excel at translating complex business scenarios into mathematical models and writing robust code."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
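
The prompt asks the model to return only a Python code block, so a natural next step is to pull that block out of `response` before saving or inspecting it. Here is a minimal sketch, assuming the reply contains at least one block fenced with three backticks and the `python` tag. The helper name `extract_python_block` is hypothetical and not part of the model's tooling.

```python
import re

# Built programmatically so the fence characters do not appear literally here.
FENCE = "`" * 3

def extract_python_block(response):
    """Return the body of the first python-fenced block in `response`, or None."""
    pattern = re.escape(FENCE) + r"python[ \t]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

if __name__ == "__main__":
    demo = "Here is the script:\n" + FENCE + "python\nprint('hello')\n" + FENCE
    print(extract_python_block(demo))
```

Generated solver scripts are untrusted code, so review them or run them in an isolated environment rather than executing them directly.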

RL code

The RL training code is available at: https://github.com/xuebozhang525-alt/MOP_RL

Evaluation & Performance

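MOP-RL-model was evaluated with a strict cascaded set of metrics on a highly challenging, industrial-grade multi-objective mixed-integer linear programming test set containing 1,892 complex long-tail cases.

Compared with baselines that run zero-shot inference directly on multi-objective data or that use SFT alone, the model achieves a substantial performance gain.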

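Metric / Evaluation Level	Description	Score
Format accuracy	Generated code has no formatting errors	1.000 (100%)
Code executability rate	Probability that the code omits no constraints, extracts the variables, obtains a valid feasible solution, and raises no solver syntax errors	88.01%
Conditional Pareto rate	Probability that the solution set passes the dominance test, given that the model is fully correct	77.43%
Overall Pareto success rate	Overall probability of obtaining a Pareto-optimal solution	68.15%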

Comparison with baseline models:

Model	Format accuracy	Code executability rate	Conditional Pareto rate	Overall Pareto success rate
ChatGPT 5 (closed-source)	0.985	0.615	0.732	45.0%
DeepSeek-R1(671B)	0.968	0.821	0.582	47.8%
Qwen3-Max(1T)	0.975	0.862	0.689	59.4%
MOP-RL(7B) (Ours)	1.000	0.880	0.774	68.1%
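
The reported figures are numerically consistent with the overall Pareto rate being the product of the code executability rate and the conditional Pareto rate. That relationship is inferred from the numbers, not stated by the authors, so treat the sketch below as a sanity check under that assumption. The values are copied from the tables in this section.

```python
# (executability rate, conditional Pareto rate, reported overall rate)
reported = {
    "ChatGPT 5": (0.615, 0.732, 0.450),
    "DeepSeek-R1(671B)": (0.821, 0.582, 0.478),
    "Qwen3-Max(1T)": (0.862, 0.689, 0.594),
    "MOP-RL(7B)": (0.880, 0.774, 0.681),
    "MOP-RL(7B), detailed table": (0.8801, 0.7743, 0.6815),
}

def overall_rate(executable, conditional):
    """Assumed cascade: P(executable) * P(Pareto | executable)."""
    return executable * conditional

if __name__ == "__main__":
    for name, (ex, cond, overall) in reported.items():
        print(f"{name}: computed {overall_rate(ex, cond):.4f}, reported {overall:.4f}")
```

Format accuracy does not appear to enter the product: including it would, for example, lower the ChatGPT 5 figure below the reported 45.0%.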

Citation

If you use this model in your research, please cite the following paper:

📢 Note: The dataset and paper will be released soon.


Safetensors: model size 8B params, tensor type F32
