MOP-RL-model
Introduction
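MOP-RL-model is a large language model built for multi-objective mixed-integer linear programming (MO-MILP) tasks. It is based on the Qwen2.5-7B architecture and aligned with a multi-objective programming reinforcement learning framework (MOP-RL).
In complex resource scheduling, smart manufacturing, and logistics decisions, general-purpose LLMs often struggle to balance conflicting objectives and are prone to "logical hallucination" or reward hacking during long-sequence reasoning. MOP-RL-model targets this bottleneck: beyond constructing basic operations research constraints, it is trained to trade off objectives and locate the Pareto front in high-dimensional decision spaces.
The main technical features of the model are: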
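- Two-stage Curriculum Learning: the model is aligned in steps, from single-objective foundation training (dense reward) to multi-objective extension training (sparse Pareto reward), which suppresses policy oscillation and the collapse of basic capabilities during reinforcement learning.
- Pareto-Aware Reward Shaping: instead of a conventional scalar-approximation reward, a Pareto verifier based on dominance tests run by the underlying solver (e.g., Gurobi) turns what would otherwise require a mathematical proof into precise, verifiable feedback.
- REINFORCE++: a critic-free (no value network) policy-gradient algorithm that combines in-batch advantage normalization with probability-ratio clipping, improving convergence stability on structured chain-of-thought (Structured CoT) reasoning that runs to over a thousand tokens.
- Structured CoT output: the model is required to follow the sequence "problem analysis → modeling and scalarization → executable code generation", so that its output is executable and logically consistent.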
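To make the REINFORCE++ bullet more concrete, here is a minimal pure-Python sketch of the two ingredients it names: in-batch advantage normalization and probability-ratio clipping. It is an illustration under simplifying assumptions, not the training code of this model: it uses one log-probability per sampled response rather than per token, omits any KL penalty, and the function names and the `clip_eps=0.2` default are hypothetical.

```python
import math
from statistics import mean, pstdev

def normalize_advantages(rewards, eps=1e-8):
    """Critic-free advantages: center and scale rewards within the batch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective, averaged over the batch and negated into a loss."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)

# Toy batch: sparse 0/1 rewards, e.g. from a pass/fail Pareto check (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = normalize_advantages(rewards)
logp_old = [-2.0, -1.5, -1.0, -2.5]
logp_new = [-1.8, -1.6, -1.0, -2.0]
loss = clipped_surrogate_loss(logp_new, logp_old, advantages)
print(advantages, loss)
```

Because the advantages are computed from batch statistics alone, no value network is needed; the clipping keeps a single update from moving the policy too far from the one that generated the samples.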
Requirements
MOP-RL-model loads and runs with the same code as a standard Qwen2.5 model (see https://huggingface.co/Qwen/Qwen2.5-7B). We recommend using the latest version of transformers.
pip install "transformers>=4.37.0"
# To run the generated optimization code locally, also install the solver
pip install gurobipy
Warning: The generated code often imports `gurobipy`. Ensure you have a valid Gurobi license to execute the generated solver scripts in your local environment.
Quickstart
The following example uses the transformers library to load the model and prompt it to formulate a multi-objective problem:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "YourOrg/MOP-RL-model"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
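# Prompt template for a multi-objective scheduling problem;
# {problem_desc} and {json_template} are placeholders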
prompt = """
**[Task]**
Based on the [Problem Description] and [Target JSON Format] below, please write a complete Python script.
The script must use the `gurobipy` library to model and solve the multi-objective problem, and encapsulate the final solution results into JSON data, writing them to a file named `input.json`.
**[Input Information]**
1. **Multi-objective Problem Description**:
\"\"\"
{problem_desc}
\"\"\"
2. **Target Output JSON Format Example**:
\"\"\"
{json_template}
\"\"\"
**[Requirements]**
1. **Modeling Logic**:
- Clearly define decision variables and constraints.
- **Multi-objective Handling**: Choose an appropriate multi-objective processing method based on the context (e.g., weighted sum, hierarchical sequence/lexicographic, or Pareto frontier). If weights are not specified, assume equal weights or provide adjustable parameters.
2. **Code Standards**:
- The code must be a complete, runnable Python script.
- Include necessary comments explaining the mathematical model.
- Must include checks for model solution status (e.g., checking for `GRB.OPTIMAL`).
3. **Data Output**:
- After solving, extract variable values.
- **Format Matching**: Construct a Python dictionary that strictly matches the [Target Output JSON Format Example].
- **File Writing**: Use `json.dump` to save the result as `input.json`.
4. **Final Output**:
- Please wrap your code in a Python markdown block, i.e., starts with ```python and ends with ```.
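- Provide only the Python code block."""
# Fill in the two placeholders before building the chat messages.
# The values below are stand-ins; replace them with your own inputs.
problem_desc = "<your multi-objective problem description>"
json_template = "<your target JSON format example>"
prompt = prompt.format(problem_desc=problem_desc, json_template=json_template)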
messages = [
    {"role": "system", "content": "You are an algorithm expert proficient in Operations Research and the Python Gurobi solver. You excel at translating complex business scenarios into mathematical models and writing robust code."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
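Since the prompt asks the model to wrap its answer in a Python markdown block, the script usually has to be pulled out of `response` before it can be saved or run. The following is a minimal sketch of that step; the helper name `extract_python_block` and the sample string are hypothetical, and the fence marker is built programmatically only so that this example nests cleanly inside markdown.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, assembled to avoid a literal fence in this example
_BLOCK_RE = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the first python-fenced block, or None if there is none."""
    match = _BLOCK_RE.search(response)
    return match.group(1).strip() if match else None

# Stand-in for a model response (not real model output).
sample = f"{FENCE}python\nimport json\nprint('ok')\n{FENCE}"
code = extract_python_block(sample)
missing = extract_python_block("no code here")
print(code)
```

The extracted script is model-generated and should be treated as untrusted: review it, or run it in an isolated environment, before executing it against real data.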
RL code
The RL training code is available at: https://github.com/xuebozhang525-alt/MOP_RL
Evaluation & Performance
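MOP-RL-model was evaluated with a cascade of metrics on a challenging industrial-grade MO-MILP test set containing 1,892 complex long-tail cases.
Compared with baseline models that perform zero-shot inference directly on multi-objective data, or that are trained with SFT alone, the model achieves substantially higher scores.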
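| Metric / Evaluation Level | Description | Score |
|---|---|---|
| Format accuracy | Generated code has no formatting errors | 100% |
| Code executability rate | Probability that the code runs without solver syntax errors, omits no constraints, extracts the variables, and returns a valid feasible solution | 88.01% |
| Conditional Pareto rate | Probability that the solution set passes the dominance test, given a fully correct model | 77.43% |
| Overall Pareto success rate | Overall probability of producing a Pareto-optimal solution | 68.15% |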
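The dominance test behind the Pareto metrics can be illustrated with a small sketch. It is a simplification under stated assumptions: all objectives are minimized, and a solution is compared only against a finite list of candidate objective vectors, whereas the verifier described in this card relies on the solver to test dominance. The function names and the candidate values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (cost, makespan) pairs for five candidate schedules.
candidates = [(10, 5), (8, 7), (9, 9), (12, 4), (8, 8)]
front = pareto_front(candidates)
print(front)
```

In this toy set, `(9, 9)` and `(8, 8)` are both dominated by `(8, 7)`, so they fail the test, while the remaining three points trade one objective against the other.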
Comparison with baseline models:
| Model | Format accuracy | Code executability rate | Conditional Pareto rate | Overall Pareto success rate |
|---|---|---|---|---|
| ChatGPT 5 (closed-source) | 98.5% | 61.5% | 73.2% | 45.0% |
| DeepSeek-R1 (671B) | 96.8% | 82.1% | 58.2% | 47.8% |
| Qwen3-Max (1T) | 97.5% | 86.2% | 68.9% | 59.4% |
| MOP-RL (7B) (Ours) | 100.0% | 88.0% | 77.4% | 68.1% |
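The numbers in the comparison table are consistent with reading the overall Pareto rate as the code executability rate multiplied by the conditional Pareto rate. This is an observation drawn from the reported figures, not a definition stated in this card; the short check below reproduces the arithmetic using the table values.

```python
# (code executability rate, conditional Pareto rate, reported overall Pareto rate)
results = {
    "ChatGPT 5": (0.615, 0.732, 0.450),
    "DeepSeek-R1": (0.821, 0.582, 0.478),
    "Qwen3-Max": (0.862, 0.689, 0.594),
    "MOP-RL": (0.880, 0.774, 0.681),
}

# Absolute gap between the product of the two rates and the reported overall rate.
gaps = {}
for name, (executable, conditional, overall) in results.items():
    product = executable * conditional
    gaps[name] = abs(product - overall)
    print(f"{name}: {executable} * {conditional} = {product:.3f} (reported {overall})")
```

Under this reading, a model can lose overall score either by producing code that does not run or by producing runnable code whose solutions are dominated, which is why the two intermediate rates are reported separately.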
Citation
If you use this model in your research, please cite the following paper:
📢 Note: the dataset and paper will be released soon.