Cost modeling and estimation tool for mixture-of-experts (MoE) LLM inference serving, focusing on KV cache transfer overhead, expert parallelism (EP) scaling dynamics, and expert parallelism load balancing (EPLB) rebalancing costs.
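The repository's actual formulas are not reproduced here, but the kind of back-of-envelope calculation it performs is easy to sketch. The snippet below is a hypothetical illustration, assuming a standard multi-head KV layout (K and V tensors per layer) and a point-to-point link of known bandwidth; all function names and parameter values are illustrative, not taken from the project.

```python
def kv_cache_bytes(num_layers: int, seq_len: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of one sequence's KV cache: a K and a V tensor per layer,
    each [seq_len, kv_heads, head_dim] at the given element width."""
    return 2 * num_layers * seq_len * kv_heads * head_dim * bytes_per_elem


def transfer_seconds(nbytes: int, link_gb_per_s: float) -> float:
    """Naive time to move a KV cache across a link, ignoring latency
    and protocol overhead (a lower bound on the real transfer cost)."""
    return nbytes / (link_gb_per_s * 1e9)


# Illustrative numbers: an 8k-token sequence on a 32-layer model with
# 8 KV heads of dim 128 in fp16 is exactly 1 GiB of KV cache...
size = kv_cache_bytes(num_layers=32, seq_len=8192, kv_heads=8, head_dim=128)
# ...which takes ~21 ms to migrate over a 50 GB/s link.
t = transfer_seconds(size, link_gb_per_s=50.0)
```

Even this crude model makes the paper's-worth-of-intuition concrete: KV cache migration during disaggregated or rebalanced serving is a per-sequence cost that grows linearly with context length.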
Stars: 1 · Forks: 0
This is a 35-day-old personal research project with 1 star, no forks, and zero activity velocity. It appears to be a solo effort to quantify the cost characteristics of MoE inference serving—a legitimate technical problem but presented as back-of-envelope calculations rather than a production system or comprehensive framework. The novelty lies in combining KV cache transfer analysis with EPLB rebalancing cost models, which is a useful analytical angle for the growing MoE inference space, but the implementation is nascent and experimental.

DEFENSIBILITY (2/10): No user adoption, no community, no moat. This is a personal tool/research artifact that anyone familiar with MoE inference could recreate in a weekend. The insights are valuable but not proprietary; the code is likely illustrative rather than production-hardened.

PLATFORM DOMINATION (medium): Cloud providers (AWS SageMaker, Google Vertex, Azure ML) and LLM serving platforms (vLLM, TensorRT-LLM, Ollama) are all actively optimizing MoE inference costs. A platform could trivially absorb cost-modeling utilities as diagnostic features within 12-18 months. OpenAI, Anthropic, and Meta (running their own MoE models) have strong incentives to internalize this analysis.

MARKET CONSOLIDATION (low): There is no incumbent cost-modeling vendor in MoE inference specifically. The problem is niche enough that startups haven't yet emerged to own it. Acquisition is unlikely unless traction grows dramatically.

DISPLACEMENT HORIZON (1-2 years): Platforms will build native cost dashboards and simulators as MoE inference scales. This project has a narrow window to either become a community standard (low probability at current velocity) or be absorbed into a broader inference optimization framework. The technical insights are solid but the artifact itself is fragile and easily displaced.
COMPOSABILITY: The code is likely useful as reference material or as an algorithm to embed in a cost calculator, but at this maturity it's more of an academic exercise than a reusable component.

IMPLEMENTATION DEPTH: Prototype—demonstrates the analytical framework but lacks the robustness, testing, and documentation for production use.
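The "algorithm to embed in a cost calculator" point is easiest to see with the EPLB side of the model. The sketch below is a hypothetical stand-in, not the repository's code: it assumes each expert is a standard gated FFN (gate, up, and down projections) and that migrated experts are copied serially over one link; every name and constant here is an assumption for illustration.

```python
def expert_bytes(hidden_dim: int, ffn_dim: int, bytes_per_elem: int = 2) -> int:
    """Parameter bytes for one FFN expert, assuming three projection
    matrices (gate, up, down) of shape [hidden_dim, ffn_dim]."""
    return 3 * hidden_dim * ffn_dim * bytes_per_elem


def rebalance_seconds(experts_moved: int, expert_nbytes: int,
                      link_gb_per_s: float,
                      per_move_overhead_s: float = 0.0) -> float:
    """Serialized cost of an EPLB rebalance event: each migrated expert
    pays a weight-copy time plus a fixed per-move overhead (e.g. for
    routing-table updates). Overlapped copies would lower this bound."""
    per_expert = expert_nbytes / (link_gb_per_s * 1e9) + per_move_overhead_s
    return experts_moved * per_expert


# Illustrative numbers: a small expert (hidden 4096, FFN 2048, fp16)
# is ~50 MB; moving 16 such experts over a 25 GB/s link costs ~32 ms.
nbytes = expert_bytes(hidden_dim=4096, ffn_dim=2048)
cost = rebalance_seconds(experts_moved=16, expert_nbytes=nbytes,
                         link_gb_per_s=25.0)
```

A calculator embedding this logic would then weigh that one-off migration cost against the steady-state throughput gain from better expert load balance, which is presumably the trade-off the project's analysis targets.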
TECH STACK
INTEGRATION: reference_implementation
READINESS