vllm-ltr-predictors — output-length predictors for LTR LLM-inference scheduling

Self-trained output-length predictors for Learning-to-Rank (LTR) scheduling in vLLM, plus the baseline latency measurements. Capstone artifact (CSCI 6806, FDU; advisor Prof. Jeeho Ryoo) that reproduces and extends vllm-ltr (Fu et al., NeurIPS'24) and tests PARS (Tao et al., 2025, arXiv:2510.03243).

A predictor estimates a request's output length so the scheduler can approximate Shortest-Job-First (SJF), which reduces head-of-line blocking. All numbers below are our own measurements; none are fabricated.
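To illustrate why ordering by predicted length helps, here is a minimal sketch that compares first-come-first-served (FCFS) order with an order sorted by a predictor score. It assumes a toy single-server model that serves one request at a time and charges one time unit per output token, which is far simpler than vLLM's batched serving. `Request`, `pred_score`, and `mean_wait` are hypothetical names for this example, not part of the repository.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    true_len: int      # actual output length in tokens (unknown when scheduling)
    pred_score: float  # hypothetical predictor output; higher means "longer"


def mean_wait(order):
    """Average time a request waits before it starts being served,
    assuming one request at a time and one time unit per output token."""
    clock, total = 0, 0
    for req in order:
        total += clock
        clock += req.true_len
    return total / len(order)


# Arrival order: a long request arrives first and blocks the short ones.
arrivals = [
    Request("long", 900, 0.92),
    Request("short", 20, 0.10),
    Request("medium", 200, 0.41),
]

fcfs = arrivals                                     # serve in arrival order
ltr = sorted(arrivals, key=lambda r: r.pred_score)  # serve by predicted rank

print(f"FCFS mean wait: {mean_wait(fcfs):.1f}")  # FCFS mean wait: 606.7
print(f"LTR mean wait:  {mean_wait(ltr):.1f}")   # LTR mean wait:  80.0
```

Only the relative order of the scores matters here, not their absolute values, which is why a ranking metric is used to evaluate the predictors.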

Predictors (`predictors/`)

dir	backbone	loss	role	Kendall's Tau (in-dist)
`listMLE-opt125m`	OPT-125M	listMLE	LTR baseline (reproduced)	0.559
`classification-opt125m`	OPT-125M	cross-entropy (10 buckets)	ranking-vs-classification control	0.194
`PARS-bert`	BERT-base	margin(1.0)+δ(0.2)	optimization (ours)	0.596
`A1-opt125m-margin`	OPT-125M	margin, δ=0.2	ablation: isolate the loss	0.543
`A2-bert-nofilter`	BERT-base	margin, δ=0	ablation: isolate the backbone	0.598
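For readers unfamiliar with the metric in the last column, the following is a minimal sketch of Kendall's Tau computed over all request pairs. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant produced the numbers in the table is not stated here, so treat this as an illustration of the metric rather than the evaluation code. The function name and the sample values are made up for the example.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (p1, a1), (p2, a2) in combinations(zip(predicted, actual), 2):
        sign = (p1 - p2) * (a1 - a2)
        if sign > 0:
            concordant += 1   # predictor orders this pair correctly
        elif sign < 0:
            discordant += 1   # predictor orders this pair the wrong way
    n = len(predicted)
    return (concordant - discordant) / (n * (n - 1) // 2)


# Hypothetical predictor scores and true output lengths for four requests.
scores = [0.1, 0.4, 0.3, 0.9]
lengths = [20, 200, 350, 900]

print(round(kendall_tau(scores, lengths), 3))  # 0.667 (5 concordant, 1 discordant)
```

A value of 1.0 means the predicted order matches the true order exactly, 0 means no rank agreement, and -1.0 means the order is fully reversed.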

Each dir holds `finetuned/model.safetensors`, `finetuned/config.json`, and `usage_config.json`.
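After downloading, the layout above can be checked with a short script. This is a minimal sketch using only the standard library; `missing_files` is a hypothetical helper, and the `predictors/PARS-bert` path assumes the script runs from the repository root.

```python
from pathlib import Path

# Files each predictor directory is expected to contain, per the layout above.
EXPECTED = (
    "finetuned/model.safetensors",
    "finetuned/config.json",
    "usage_config.json",
)


def missing_files(predictor_dir):
    """Return the expected files that are absent from predictor_dir."""
    root = Path(predictor_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed location; adjust to wherever the predictors were downloaded.
    print(missing_files("predictors/PARS-bert"))
```

An empty list means the directory is complete.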

Key results

Baseline (FCFS vs LTR, results/): at high load, LTR cuts time to first token (TTFT) by ~2.86× (rate 16: 17.3 s → 6.0 s). The cost is that per-token latency (p99 TPOT, time per output token) gets worse, which is the SJF trade-off.
Ranking ≫ classification: Tau 0.559/0.596 vs 0.194 — confirms the base paper's design choice.
Generalization (PARS vs listMLE): cross-distribution (ShareGPT) Tau rises from 0.315 (listMLE) to 0.361 (PARS), a +15% relative gain.
Ablation (honest negative): the cross-dist gain is driven almost entirely by the BERT backbone (+0.065); the margin loss (−0.012) and δ-filter (−0.007) did not help on this single-GPU 8B setup.
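The generalization and ablation figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded values reported in this card and verifies that the three per-factor deltas add up to the total cross-distribution gain; how each delta was attributed is described in the reproduction guides, not here, and the variable names are made up for the example.

```python
# Cross-distribution (ShareGPT) Kendall's Tau, as reported above.
listmle_tau = 0.315
pars_tau = 0.361

total_gain = pars_tau - listmle_tau
relative_gain = total_gain / listmle_tau

# Per-factor deltas from the ablation, as reported above.
contributions = {
    "BERT backbone": 0.065,
    "margin loss": -0.012,
    "delta filter": -0.007,
}
summed = sum(contributions.values())

print(f"total gain:    {total_gain:+.3f}")     # total gain:    +0.046
print(f"relative gain: {relative_gain:+.1%}")  # relative gain: +14.6%
print(f"sum of deltas: {summed:+.3f}")         # sum of deltas: +0.046
```

The backbone's +0.065 exceeds the total gain of +0.046, which is why the margin loss and the δ-filter register as small negatives rather than as contributors.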

Limitation (disclosed)

The classification predictor trained fine (Tau 0.194) but its end-to-end latency could not be measured: the fork's tpt-class10 schedule path returns 0 completions across all request rates (a serving bug, not a config artifact). No classification latency numbers were produced or fabricated.

Base models & data (links only — not redistributed here)

Serving model: meta-llama/Meta-Llama-3-8B-Instruct (gated, Meta license)
Backbones: facebook/opt-125m (non-commercial research only), google-bert/bert-base-uncased (Apache-2.0)
In-dist / training data: lmsys/lmsys-chat-1m (gated)
Cross-dist data: ShareGPT (community mirrors, e.g. anon8231489123/ShareGPT_Vicuna_unfiltered; provenance varies)

Derived length-labelled traces are not redistributed (upstream terms); regenerate them from the sources.

License

Mixed: the OPT-125M-derived weights (listMLE, classification, A1) are research / non-commercial only (inherit OPT's terms); the BERT-derived weights (PARS, A2) are Apache-2.0. Use for academic research only.

Reproduction & citation

Code + reproduction guides: https://github.com/TaliesinYang/vllm-ltr-optimization (docs/REPRODUCE-BASELINE.md, docs/REPRODUCE-PREDICTORS.md).

Fu et al., Efficient LLM Scheduling by Learning to Rank, NeurIPS 2024.
Tao et al., PARS (pairwise margin ranking), 2025, arXiv:2510.03243.

Paper for Luna2099/vllm-ltr-predictors: Prompt-Aware Scheduling for Low-Latency LLM Serving (arXiv:2510.03243, published Oct 10, 2025).