vllm-ltr-predictors: output-length predictors for LTR LLM-inference scheduling
Self-trained output-length predictors for Learning-to-Rank (LTR) scheduling in vLLM, plus the baseline latency measurements. Capstone artifact (CSCI 6806, FDU; advisor Prof. Jeeho Ryoo), reproducing & extending vllm-ltr (Fu et al., NeurIPS'24) and testing PARS (Tao et al., 2025, arXiv:2510.03243).
A predictor estimates a request's output length so the scheduler can approximate Shortest-Job-First (SJF), cutting head-of-line blocking. All numbers below are our own measurements; nothing is fabricated.
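To make the head-of-line blocking argument concrete, here is a toy single-server sketch. It is not how vLLM's scheduler is implemented (vLLM batches requests continuously), and the `Request` fields, scores, and service times are all made up for illustration. It only shows why ordering by a predicted length ranking reduces the average wait.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    predicted_score: float  # hypothetical predictor output: higher = longer expected output
    service_time: float     # illustrative time the request occupies the server


def mean_wait(queue):
    """Average time a request waits before it starts, serving one request at a time."""
    clock = 0.0
    total_wait = 0.0
    for req in queue:
        total_wait += clock        # this request waited until the server became free
        clock += req.service_time  # the server is busy until this request finishes
    return total_wait / len(queue)


# Arrival order: one long job arrives first and blocks two short ones.
arrivals = [
    Request("long", predicted_score=0.9, service_time=10.0),
    Request("short-a", predicted_score=0.1, service_time=1.0),
    Request("short-b", predicted_score=0.2, service_time=2.0),
]

fcfs = arrivals                                          # first come, first served
sjf = sorted(arrivals, key=lambda r: r.predicted_score)  # approximate SJF via the ranking

print("FCFS mean wait:", mean_wait(fcfs))
print("SJF  mean wait:", mean_wait(sjf))
```

In this toy case the mean wait drops from 7.0 to about 1.33 time units. Only the relative order of the scores matters, which is why a ranking objective can be enough for scheduling.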
Predictors (predictors/)
| dir | backbone | loss | role | Kendall's Tau (in-dist) |
|---|---|---|---|---|
| listMLE-opt125m | OPT-125M | listMLE | LTR baseline (reproduced) | 0.559 |
| classification-opt125m | OPT-125M | cross-entropy (10 buckets) | ranking-vs-classification control | 0.194 |
| PARS-bert | BERT-base | margin(1.0)+δ(0.2) | optimization (ours) | 0.596 |
| A1-opt125m-margin | OPT-125M | margin, δ=0.2 | ablation: isolate the loss | 0.543 |
| A2-bert-nofilter | BERT-base | margin, δ=0 | ablation: isolate the backbone | 0.598 |
Each dir holds finetuned/model.safetensors + finetuned/config.json + usage_config.json.
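For readers unfamiliar with the metric in the table's last column, the following is a minimal sketch of Kendall's Tau on made-up data. It computes the Tau-a variant in plain Python. Which variant the evaluation scripts in this repo use is not stated here, so treat this as an illustration of the idea only. (`scipy.stats.kendalltau` computes Tau-b by default, which coincides with Tau-a when there are no ties.)

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's Tau-a: (concordant - discordant) / total pairs. Ties count as neither."""
    if len(predicted) != len(actual):
        raise ValueError("inputs must have the same length")
    if len(predicted) < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


scores = [0.2, 0.9, 0.4, 0.7]   # made-up predictor scores
lengths = [35, 410, 180, 120]   # made-up true output lengths in tokens
print(round(kendall_tau(scores, lengths), 3))
```

Here five of the six pairs are ordered correctly and one is swapped, giving (5 - 1) / 6 ≈ 0.667. A value of 1.0 means the predictor ranks every pair correctly, and 0 means the ranking is no better than chance.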
Key results
- Baseline (FCFS vs LTR, results/): at high load LTR cuts TTFT ~2.86× (rate 16: 17.3 s → 6.0 s); honest cost: per-token latency (p99 TPOT) is worse, which is the SJF trade-off.
- Ranking ≫ classification: Tau 0.559/0.596 vs 0.194, which confirms the base paper's design choice.
- Generalization (PARS vs listMLE): cross-distribution (ShareGPT) Tau 0.315 → 0.361 (+15% rel.).
- Ablation (honest negative): the cross-dist gain is driven almost entirely by the BERT backbone (+0.065); the margin loss (−0.012) and δ-filter (−0.007) did not help on this single-GPU 8B setup.
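The cross-distribution figures above can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers reported in the list. The variable names are ours, and treating the three ablation deltas as additive is an assumption made for this check, not a claim about how the ablation was run.

```python
# Cross-distribution (ShareGPT) Kendall's Tau, as reported above.
tau_listmle = 0.315
tau_pars = 0.361

# Reported per-factor contributions from the ablation.
delta_backbone = +0.065  # BERT backbone
delta_margin = -0.012    # margin loss
delta_filter = -0.007    # delta-filter

total_gain = tau_pars - tau_listmle
relative_gain = total_gain / tau_listmle
sum_of_parts = delta_backbone + delta_margin + delta_filter

print(f"total gain:    {total_gain:+.3f}")
print(f"relative gain: {relative_gain:+.1%}")
print(f"sum of parts:  {sum_of_parts:+.3f}")
```

The total gain and the sum of the three deltas both come to +0.046, and the relative gain is about 14.6%, consistent with the rounded "+15% rel." figure.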
Limitation (disclosed)
The classification predictor trained fine (Tau 0.194) but its end-to-end latency could not be measured:
the fork's tpt-class10 schedule path returns 0 completions across all request rates (a serving bug,
not a config artifact). No classification latency numbers were produced or fabricated.
Base models & data (links only; not redistributed here)
- Serving model: meta-llama/Meta-Llama-3-8B-Instruct (gated, Meta license)
- Backbones: facebook/opt-125m (non-commercial research only), google-bert/bert-base-uncased (Apache-2.0)
- In-dist / training data: lmsys/lmsys-chat-1m (gated)
- Cross-dist data: ShareGPT (community mirrors, e.g. anon8231489123/ShareGPT_Vicuna_unfiltered; provenance varies)
Derived length-labelled traces are not redistributed (upstream terms); regenerate them from the sources.
License
Mixed: the OPT-125M-derived weights (listMLE, classification, A1) are research / non-commercial only
(inherit OPT's terms); the BERT-derived weights (PARS, A2) are Apache-2.0. Use for academic research only.
Reproduction & citation
Code + reproduction guides: https://github.com/TaliesinYang/vllm-ltr-optimization
(docs/REPRODUCE-BASELINE.md, docs/REPRODUCE-PREDICTORS.md).
- Fu et al., Efficient LLM Scheduling by Learning to Rank, NeurIPS 2024.
- Tao et al., PARS (pairwise margin ranking), 2025, arXiv:2510.03243.