MOP-RL-model

Introduction

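MOP-RL-model is a large language model built for multi-objective mixed-integer linear programming (MO-MILP) tasks. It is based on the Qwen2.5-7B architecture and was aligned with a novel multi-objective programming reinforcement learning framework (MOP-RL).

In complex resource scheduling, smart manufacturing, and modern logistics decisions, conventional large models often struggle to balance conflicting objectives, and they are prone to "logic hallucination" or reward hacking during long-sequence reasoning. MOP-RL-model targets this bottleneck: beyond building basic operations-research constraints, it is trained to find trade-offs along the Pareto front in high-dimensional decision spaces.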

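The core technical highlights of the model are:

  • Two-stage Curriculum Learning: the model is aligned in steps, from single-objective base training (dense rewards) to multi-objective extension training (sparse Pareto rewards), which suppresses policy oscillation and the collapse of base capabilities during reinforcement learning.
  • Pareto-Aware Reward Shaping: instead of a conventional scalar-approximation reward, a Pareto verifier based on dominance tests run by an underlying solver (such as Gurobi) turns a complex mathematical proof into exact, solver-verified feedback.
  • REINFORCE++ algorithm: an improved critic-free policy-gradient algorithm (no value network), combined with in-batch advantage normalization and probability-ratio clipping, which markedly improves convergence stability for structured chain-of-thought (Structured CoT) reasoning that spans more than a thousand tokens.
  • Structured CoT output: the model is required to follow the strict sequence "problem analysis -> modeling and scalarization -> executable code generation", which keeps the generated results executable and logically self-consistent.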
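
As a rough illustration of the dominance test behind a Pareto-aware reward, the sketch below checks dominance over a finite list of objective vectors. It assumes every objective is minimized, and the helper names `dominates` and `pareto_front` are hypothetical. The verifier described above relies on a solver such as Gurobi rather than on enumeration, so this shows only the underlying definition.

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative objective vectors (cost, delay); not taken from the model's test set.
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(candidates)
print(front)
```

Here `(3, 3)` is dropped because `(2, 2)` is at least as good on both objectives and strictly better on one; the remaining three points are mutually non-dominated.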

Requirements

The code for running MOP-RL-model is the same as for a standard Qwen2.5 model (see https://huggingface.co/Qwen/Qwen2.5-7B for details). The latest version of transformers is recommended:

pip install "transformers>=4.37.0"
# To run the generated optimization code locally, also install the solver
pip install gurobipy

Warning: The generated code often imports gurobipy. Ensure you have a valid Gurobi license to execute the generated solver scripts in your local environment.
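
Before running any generated script, it may help to confirm that gurobipy is importable and that a model can be created. The following is a minimal sketch; the function name `check_gurobi` and its return strings are hypothetical, and creating an empty model is used here only as a simple probe of the installation and license.

```python
def check_gurobi():
    """Report whether gurobipy is importable and can create a model."""
    try:
        import gurobipy as gp
    except ImportError:
        return "not-installed"
    try:
        # Creating a model starts the default environment, which fails
        # with GurobiError when no usable license is found.
        gp.Model("license_check")
    except gp.GurobiError as exc:
        return f"license-error: {exc}"
    return "ok"


status = check_gurobi()
print(status)
```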

Quickstart

The following example uses the transformers library to load the model and run it on a multi-objective modeling task:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "YourOrg/MOP-RL-model"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt template for an example multi-objective scheduling problem
prompt = """
**[Task]**
Based on the [Problem Description] and [Target JSON Format] below, please write a complete Python script.
The script must use the `gurobipy` library to model and solve the multi-objective problem, and encapsulate the final solution results into JSON data, writing them to a file named `input.json`.

**[Input Information]**
1.  **Multi-objective Problem Description**:
    \"\"\"
    {problem_desc}
    \"\"\"
2.  **Target Output JSON Format Example**:
    \"\"\"
    {json_template}
    \"\"\"

**[Requirements]**
1.  **Modeling Logic**:
    - Clearly define decision variables and constraints.
    - **Multi-objective Handling**: Choose an appropriate multi-objective processing method based on the context (e.g., weighted sum, hierarchical sequence/lexicographic, or Pareto frontier). If weights are not specified, assume equal weights or provide adjustable parameters.
2.  **Code Standards**:
    - The code must be a complete, runnable Python script.
    - Include necessary comments explaining the mathematical model.
    - Must include checks for model solution status (e.g., checking for `GRB.OPTIMAL`).
3.  **Data Output**:
    - After solving, extract variable values.
    - **Format Matching**: Construct a Python dictionary that strictly matches the [Target Output JSON Format Example].
    - **File Writing**: Use `json.dump` to save the result as `input.json`.
4.  **Final Output**:
    - Please wrap your code in a Python markdown block, i.e., starts with ```python and ends with ```.
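    - Provide only the Python code block."""

# Fill in the template placeholders before building the chat messages;
# otherwise the literal "{problem_desc}" and "{json_template}" reach the model.
# The two values below are stand-ins: replace them with your own problem text and JSON example.
problem_desc = "<your multi-objective problem description>"
json_template = "<your target JSON format example>"
prompt = prompt.format(problem_desc=problem_desc, json_template=json_template)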

messages = [
    {"role": "system", "content": "You are an algorithm expert proficient in Operations Research and the Python Gurobi solver. You excel at translating complex business scenarios into mathematical models and writing robust code."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
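
Since the prompt asks the model to wrap its answer in a Python markdown block, the script usually has to be pulled out of `response` before it can be run. A minimal sketch of that step follows; the helper name `extract_python_code` and the sample string are hypothetical and are not part of the released code.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
CODE_BLOCK = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_code(response):
    """Return the body of the first python-fenced block, or None if there is none."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else None


# Stand-in for a model response; a real one would contain a full gurobipy script.
sample_response = FENCE + "python\nprint('solver script goes here')\n" + FENCE
code = extract_python_code(sample_response)
print(code)
```

In practice, `extract_python_code(response)` could be written to a file and executed in an environment where gurobipy is licensed. Review generated code before running it.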

RL Code

The RL training code is available at: https://github.com/xuebozhang525-alt/MOP_RL
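
For readers unfamiliar with the two ingredients named in the introduction, in-batch advantage normalization and probability-ratio clipping, here is a simplified, dependency-free sketch. It is not taken from the repository above. It works at the sequence level rather than per token, the function names are hypothetical, and the clipping range of 0.2 is an assumed default.

```python
import math


def normalize_advantages(rewards, eps=1e-8):
    """Standardize rewards across the batch (critic-free advantage estimate)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective, negated so that lower is better."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# Illustrative batch: sparse 0/1 rewards such as a pass/fail Pareto check.
advantages = normalize_advantages([1.0, 0.0, 1.0, 0.0])
loss = clipped_surrogate_loss([-1.0] * 4, [-1.0] * 4, advantages)
print(advantages, loss)
```

With identical old and new log-probabilities every ratio is 1, so the loss reduces to the negated mean of the normalized advantages, which is zero for this batch.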

Evaluation & Performance

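MOP-RL-model was evaluated with a strict set of cascaded metrics on a highly difficult, industrial-grade multi-objective mixed-integer linear programming test set (1,892 complex long-tail cases).

Compared with baselines that run zero-shot inference directly on multi-objective data, or that rely on SFT alone, the model achieves markedly higher scores.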

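| Metric / Evaluation Level | Description | Score |
| --- | --- | --- |
| Format accuracy | Generated code has no formatting errors | 100% |
| Code executability rate | Probability that no constraint is omitted, variables are extracted successfully, a valid feasible solution is obtained, and the solver reports no syntax errors | 88.01% |
| Conditional Pareto rate | Probability that the solution set passes the dominance test, given that the model is fully correct | 77.43% |
| Overall Pareto success rate | Overall probability of obtaining a Pareto solution | 68.15% |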

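Comparison with baseline models:

| Model | Format accuracy | Code executability rate | Conditional Pareto rate | Overall Pareto success rate |
| --- | --- | --- | --- | --- |
| ChatGPT 5 (closed-source) | 98.5% | 61.5% | 73.2% | 45.0% |
| DeepSeek-R1 (671B) | 96.8% | 82.1% | 58.2% | 47.8% |
| Qwen3-Max (1T) | 97.5% | 86.2% | 68.9% | 59.4% |
| MOP-RL (7B) (Ours) | 100.0% | 88.0% | 77.4% | 68.1% |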
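
The reported figures appear consistent with a cascade in which the overall Pareto success rate is the code executability rate multiplied by the conditional Pareto rate. This is an observation drawn from the numbers above, not a formula stated by the authors. Format accuracy does not seem to enter as a separate factor. The short check below reproduces the arithmetic; the function name `overall_rate` is hypothetical.

```python
def overall_rate(executable, conditional):
    """Cascade: P(Pareto) = P(executable) * P(Pareto | executable)."""
    return executable * conditional


# (executability, conditional Pareto, reported overall), copied from the tables above.
reported = {
    "ChatGPT 5": (0.615, 0.732, 0.450),
    "DeepSeek-R1": (0.821, 0.582, 0.478),
    "Qwen3-Max": (0.862, 0.689, 0.594),
    "MOP-RL (7B)": (0.880, 0.774, 0.681),
}

gaps = {}
for name, (executable, conditional, overall) in reported.items():
    estimate = overall_rate(executable, conditional)
    gaps[name] = abs(estimate - overall)
    print(f"{name}: estimated {estimate:.3f}, reported {overall:.3f}")
```

Each estimate matches the reported value to within rounding (less than 0.001).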

Citation

If you use this model in your research, please cite the following paper:

📢 Note: the related dataset and paper will be released soon.

