High-performance FP8 and INT4 inference kernels optimized for Mixture-of-Experts (MoE) workloads on NVIDIA Blackwell (sm_120) architecture.
Defensibility
STARS
0
The project addresses a critical performance bottleneck: MoE inference on the latest NVIDIA Blackwell hardware. While the claimed 8.3x speedup over cuBLASLt is impressive, the project has zero stars and is only a day old, indicating it is currently a solo research or reference implementation. Defensibility is low because low-level kernel optimizations for new hardware are rapidly superseded by official vendor libraries (NVIDIA TensorRT-LLM, CUTLASS) and high-velocity open-source inference engines (vLLM, sglang). In particular, as NVIDIA's Transformer Engine matures for sm_120, these hand-rolled kernels will likely be sherlocked by the official stack. The primary value here is the early-mover advantage for developers who need Blackwell-specific MoE kernels before they land in the main branches of the larger frameworks. Displacement risk is extremely high within a 6-month window as standard libraries catch up to the new architecture.
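For context on what such kernels accelerate, the following is a minimal, naive CUDA sketch of a routed MoE expert GEMM with FP8 weights, FP16 activations, and FP32 accumulation. It is not taken from the repository; all names, shapes, and the per-channel scale layout are illustrative assumptions, and an optimized Blackwell kernel would replace this loop with tensor-core MMA instructions, grouped scheduling across experts, and fused dequantization.

```cuda
// Illustrative sketch only (not the project's code): per-token routed
// expert GEMM with FP8 weights, FP16 activations, FP32 accumulation,
// and per-output-channel dequantization scales.
#include <cuda_fp8.h>
#include <cuda_fp16.h>

__global__ void moe_expert_gemm_naive(
    const __nv_fp8_e4m3* __restrict__ W,         // [num_experts][N][K] FP8 weights
    const __half*        __restrict__ X,         // [M][K] routed token activations
    const float*         __restrict__ scales,    // [num_experts][N] per-channel scales
    const int*           __restrict__ expert_id, // [M] expert selected per token
    float*               __restrict__ Y,         // [M][N] output
    int M, int N, int K)
{
    int m = blockIdx.y;                            // token index
    int n = blockIdx.x * blockDim.x + threadIdx.x; // output channel
    if (m >= M || n >= N) return;

    int e = expert_id[m];                          // expert this token was routed to
    const __nv_fp8_e4m3* w = W + ((size_t)e * N + n) * (size_t)K;
    const __half*        x = X + (size_t)m * K;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Dequantize the FP8 weight and accumulate in FP32.
        acc += static_cast<float>(w[k]) * __half2float(x[k]);
    }
    Y[(size_t)m * N + n] = acc * scales[(size_t)e * N + n];
}
```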
TECH STACK
INTEGRATION
reference_implementation
READINESS