multi-token-prediction-speculative-decoding

transform

HiddenState -> List<TokenProposal>

Employ auxiliary parallel prediction heads during inference to forecast subsequent tokens simultaneously, generating draft sequences for single-step verification.

Problem it solves

Auto-regressive generation is memory-bandwidth bound; traditional speculative decoding requires a separate, mismatched draft model.

Consumes

HiddenState

Emits

List<TokenProposal>

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoningarxiv