Hybrid auto-scaling (vertical + horizontal) and fine-grained GPU resource partitioning for SLO-aware serverless inference workloads.
Defensibility
citations: 0
co_authors: 5
HAS-GPU addresses a core inefficiency in GPU serverless computing: the mismatch between rigid resource allocation and fluctuating inference workloads. By combining vertical scaling (resizing the GPU slice of an already-running container) with horizontal scaling (spinning up new instances), it aims to reduce cold starts and improve utilization.

Quantitatively, the project has 0 stars but 5 forks, a classic signature of an academic repository that peers are cloning for evaluation rather than one seeing community adoption.

From a competitive standpoint, defensibility is low. The moat consists entirely of the orchestration algorithms described in the paper. Major platforms such as AWS (SageMaker) and Google Cloud (Vertex AI), along with specialized GPU clouds like CoreWeave, are already building similar proprietary schedulers to lower their COGS and offer better pricing. Open-source alternatives such as KServe, Ray Serve, and vLLM are also moving toward more granular resource management. While the hybrid scaling approach is a novel combination of existing techniques, it is more likely to be absorbed as a feature of larger orchestration frameworks than to survive as a standalone product. The displacement horizon is short: the industry is aggressively moving toward dynamic fractional-GPU allocation as a standard requirement for LLM inference at scale.
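To make the vertical-first, horizontal-fallback idea concrete, here is a minimal sketch of such a hybrid policy. This is an illustration of the general technique, not HAS-GPU's actual algorithm; the `Replica`, `scale`, `MAX_FRACTION`, and `STEP` names, the 0.25 slice granularity, and the load-ratio trigger are all assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    gpu_fraction: float  # share of one GPU assigned to this container

MAX_FRACTION = 1.0   # assumed cap: at most one full GPU per container
STEP = 0.25          # assumed slice granularity for fractional GPUs

def scale(replicas: list[Replica], load_ratio: float) -> list[Replica]:
    """Hybrid policy sketch: when demand exceeds capacity
    (load_ratio > 1), first try vertical scaling (grow an existing
    replica's GPU slice in place, avoiding a cold start); only when
    every replica is already at MAX_FRACTION fall back to horizontal
    scaling (launch a new replica, which does incur a cold start)."""
    if load_ratio <= 1.0:
        return replicas  # capacity meets demand; nothing to do
    for r in replicas:
        if r.gpu_fraction + STEP <= MAX_FRACTION:
            r.gpu_fraction += STEP       # vertical: resize in place
            return replicas
    replicas.append(Replica(gpu_fraction=STEP))  # horizontal: new instance
    return replicas
```

The ordering is the point: vertical resizes are cheap and fast, so the policy exhausts them before paying the cold-start cost of a new container, which is exactly the trade-off the paragraph above describes.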
TECH STACK
INTEGRATION: reference_implementation
READINESS