vllm-ltr-predictors: output-length predictors for LTR LLM-inference scheduling

Self-trained output-length predictors for Learning-to-Rank (LTR) scheduling in vLLM, plus the baseline latency measurements. Capstone artifact (CSCI 6806, FDU; advisor Prof. Jeeho Ryoo), reproducing & extending vllm-ltr (Fu et al., NeurIPS'24) and testing PARS (Tao et al., 2025, arXiv:2510.03243).

A predictor estimates a request's output length so the scheduler can approximate Shortest-Job-First, cutting head-of-line blocking. All numbers below are our own measurements; nothing is fabricated.
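To make the scheduling idea concrete, here is a minimal sketch of why ordering by predicted length helps. It is a toy model, not the vLLM scheduler: it assumes a single server that runs requests one at a time, whereas vLLM batches continuously. The request tuples, the scores, and the `mean_wait` helper are all hypothetical.

```python
from itertools import accumulate

def mean_wait(service_times):
    """Mean time a request waits before it starts, on one serial server."""
    # Request k starts once all requests before it have finished.
    starts = [0.0] + list(accumulate(service_times))[:-1]
    return sum(starts) / len(starts)

# Hypothetical requests: (id, true output length, predictor score).
# Only the ORDER of the scores matters to the scheduler, not their scale.
requests = [("a", 900, 0.8), ("b", 20, 0.1), ("c", 300, 0.7), ("d", 50, 0.3)]

# FCFS serves in arrival order; the LTR-style order sorts by predicted score.
fcfs = [length for _, length, _ in requests]
ltr = [length for _, length, _ in sorted(requests, key=lambda r: r[2])]

fcfs_wait = mean_wait(fcfs)
ltr_wait = mean_wait(ltr)
print(fcfs_wait, ltr_wait)
```

In this toy case the long request "a" arrives first and blocks everything behind it under FCFS. Sorting by score moves it to the back, so the short requests start sooner.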

Predictors (predictors/)

| dir | backbone | loss | role | Kendall's Tau (in-dist) |
|---|---|---|---|---|
| listMLE-opt125m | OPT-125M | listMLE | LTR baseline (reproduced) | 0.559 |
| classification-opt125m | OPT-125M | cross-entropy (10 buckets) | ranking-vs-classification control | 0.194 |
| PARS-bert | BERT-base | margin(1.0)+δ(0.2) | optimization (ours) | 0.596 |
| A1-opt125m-margin | OPT-125M | margin, δ=0.2 | ablation: isolate the loss | 0.543 |
| A2-bert-nofilter | BERT-base | margin, δ=0 | ablation: isolate the backbone | 0.598 |
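For readers unfamiliar with the metric, the following sketch computes Kendall's Tau between predictor scores and true output lengths. It implements the simple Tau-a form, in which tied pairs count as neither concordant nor discordant. This README does not state which variant was used for the reported numbers, so treat the choice as an assumption. The sample scores and lengths are made up.

```python
from itertools import combinations

def kendall_tau(predicted, actual):
    """Kendall's Tau-a: (concordant - discordant) / number of pairs."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        # Positive product: the pair is ordered the same way in both lists.
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical predictor scores and true output lengths for four requests.
scores = [0.1, 0.4, 0.3, 0.9]
lengths = [12, 80, 150, 700]
tau = kendall_tau(scores, lengths)
print(round(tau, 3))
```

A Tau of 1 means the predictor orders every pair correctly, 0 means no better than chance, and -1 means a fully reversed order. Because the scheduler only uses the ordering, a rank metric is a natural fit.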

Each dir holds finetuned/model.safetensors + finetuned/config.json + usage_config.json.
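As a quick sanity check after downloading, a small helper along these lines can confirm that a predictor directory has the layout described above. The function name `missing_files` is hypothetical and not part of this repository. The sketch only checks that the files exist and does not load or validate them.

```python
from pathlib import Path

# Relative paths every predictor directory is expected to contain.
EXPECTED_FILES = (
    "finetuned/model.safetensors",
    "finetuned/config.json",
    "usage_config.json",
)

def missing_files(predictor_dir):
    """Return the expected files that are absent from predictor_dir."""
    root = Path(predictor_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    # Example: check one of the directories listed in the table.
    print(missing_files("predictors/PARS-bert"))
```

An empty list means all three files are present.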

Key results

  • Baseline (FCFS vs LTR, results/): at high load LTR cuts TTFT ~2.86× (rate 16: 17.3 s → 6.0 s); honest cost: per-token latency (p99 TPOT) is worse, the SJF trade-off.
  • Ranking ≫ classification: Tau 0.559/0.596 vs 0.194, which confirms the base paper's design choice.
  • Generalization (PARS vs listMLE): cross-distribution (ShareGPT) Tau 0.315 → 0.361 (+15% rel.).
  • Ablation (honest negative): the cross-dist gain is driven almost entirely by the BERT backbone (+0.065); the margin loss (−0.012) and δ-filter (−0.007) did not help on this single-GPU 8B setup.
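The generalization and ablation figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the bullets. The variable names are illustrative, and treating the three per-factor deltas as additive is an assumption, though the figures are consistent with it.

```python
# Cross-distribution (ShareGPT) Kendall's Tau, as quoted above.
listmle_tau = 0.315
pars_tau = 0.361

abs_gain = pars_tau - listmle_tau
rel_gain = abs_gain / listmle_tau

# Per-factor deltas from the ablation bullet.
contributions = {
    "BERT backbone": 0.065,
    "margin loss": -0.012,
    "delta filter": -0.007,
}
total = sum(contributions.values())

print(f"absolute gain: {abs_gain:+.3f}")
print(f"relative gain: {rel_gain:+.1%}")
print(f"sum of ablation deltas: {total:+.3f}")
```

The absolute gain and the sum of the ablation deltas both come to 0.046, and the relative gain is about 14.6%, which rounds to the +15% quoted above.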

Limitation (disclosed)

The classification predictor trained fine (Tau 0.194) but its end-to-end latency could not be measured: the fork's tpt-class10 schedule path returns 0 completions across all request rates (a serving bug, not a config artifact). No classification latency numbers were produced or fabricated.

Base models & data (links only; not redistributed here)

Derived length-labelled traces are not redistributed (upstream terms); regenerate them from the sources.

License

Mixed: the OPT-125M-derived weights (listMLE, classification, A1) are research / non-commercial only (inherit OPT's terms); the BERT-derived weights (PARS, A2) are Apache-2.0. Use for academic research only.

Reproduction & citation

Code + reproduction guides: https://github.com/TaliesinYang/vllm-ltr-optimization (docs/REPRODUCE-BASELINE.md, docs/REPRODUCE-PREDICTORS.md).

  • Fu et al., Efficient LLM Scheduling by Learning to Rank, NeurIPS 2024.
  • Tao et al., PARS (pairwise margin ranking), 2025, arXiv:2510.03243.