Optimization heuristics (Greedy and Adaptive Greedy) for allocating mixed-scale LLMs across heterogeneous GPU clusters to satisfy Service Level Objectives (SLOs) and budget constraints.
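The greedy allocation described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the tier names, per-model latency estimates, and the `greedy_allocate` helper are all hypothetical, and a real scheduler would also model throughput, replication, and request routing.

```python
from dataclasses import dataclass, field

@dataclass
class GpuTier:
    name: str
    hourly_cost: float
    # Hypothetical per-model latency estimates (ms) on this tier.
    latency_ms: dict = field(default_factory=dict)

def greedy_allocate(model_names, tiers, slo_ms, budget):
    """Greedy heuristic sketch: assign each model to the cheapest GPU tier
    that meets the latency SLO, skipping models that would exceed the budget."""
    allocation, spend = {}, 0.0
    for model in model_names:
        feasible = [t for t in tiers
                    if t.latency_ms.get(model, float("inf")) <= slo_ms]
        if not feasible:
            continue  # no tier satisfies the SLO for this model
        best = min(feasible, key=lambda t: t.hourly_cost)
        if spend + best.hourly_cost > budget:
            continue  # cheapest feasible tier still breaks the budget
        allocation[model] = best.name
        spend += best.hourly_cost
    return allocation, spend

# Example: a large model only meets the SLO on the expensive tier,
# while the small model can be served cheaply.
tiers = [
    GpuTier("A100", 4.0, {"llama-70b": 80, "llama-8b": 20}),
    GpuTier("T4", 0.5, {"llama-8b": 90, "llama-70b": 400}),
]
allocation, spend = greedy_allocate(["llama-70b", "llama-8b"],
                                    tiers, slo_ms=100, budget=5.0)
```

An adaptive variant would re-run this loop as load or prices change, re-ranking tiers by observed rather than estimated latency.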
Defensibility
citations: 0
co_authors: 2
This project is a research artifact (9 days old, 0 stars) providing a mathematical approach to the 'packing and routing' problem for LLM inference. While the optimization heuristics (GH/AGH) solve a critical problem—balancing cost, latency, and model accuracy across varied GPU tiers—the code lacks the infrastructure to be a standalone product. It is highly vulnerable to 'feature absorption' by existing orchestration and serving frameworks. Specific competitors include SkyPilot (for cloud orchestration), Ray Serve (for inference scaling), and vLLM's internal scheduling logic. Frontier labs and hyperscalers (AWS, Azure, Google) already utilize similar internal MILP-based or heuristic schedulers for their managed LLM services (Bedrock, Vertex AI). The primary value of this work is as a reference implementation for engineers building in-house inference platforms rather than a defensible open-source project.
TECH STACK
INTEGRATION: algorithm_implementable
READINESS