FaST-GShare is an SLO-aware GPU scheduling framework that enables both spatial multiplexing (partitioning SMs and memory) and temporal multiplexing (time-slicing) for deep learning inference in serverless environments.
Defensibility
citations: 0
co_authors: 5
FaST-GShare represents a typical academic contribution to the field of GPU resource management. While technically sound, addressing the real inefficiency of coarse-grained GPU allocation in FaaS, it lacks market defensibility. Quantitatively, the repository shows zero stars and minimal activity nearly three years post-release, indicating it has not transitioned from a paper artifact into a living open-source tool.

Competitively, this space is dominated by infrastructure giants and hardware vendors. NVIDIA's Multi-Instance GPU (MIG) and Multi-Process Service (MPS) provide the sharing primitives, while orchestration layers such as Kubernetes (via device plugins) and CSP-specific offerings (AWS Lambda with GPU, Google Cloud Run) are the natural homes for this scheduling logic. Projects such as Alibaba's AntMan and NTHU's KubeShare offer more mature, community-backed alternatives.

For an investor or analyst, the risk is high because the functionality is a feature of the platform rather than a standalone product: frontier labs and cloud providers are incentivized to build this directly into their control planes to improve their own margins and offer lower prices, rendering third-party scheduling shims obsolete.
TECH STACK
INTEGRATION: reference_implementation
READINESS