dynamic-capacity-switching

transform

DynamicRoutingMask, Activations -> RoutedExpertOutputs

Adjust expert capacity and parallel routing configuration at runtime, without tensor padding or a performance penalty.

Problem it solves

Static expert capacity allocations break down when the token-to-expert distribution shifts at runtime: underloaded experts are padded up to their fixed capacity, and overloaded experts drop the tokens that do not fit.
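To make the cost concrete, here is a minimal sketch that counts padded slots and dropped tokens under a fixed per-expert capacity, then sizes each expert's buffer to the observed load instead. The function names, the top-1 assignment list, and the example batch are all assumptions for illustration; none of them come from the source project.

```python
from collections import Counter


def static_capacity_cost(assignments, num_experts, capacity):
    """Return (padded_slots, dropped_tokens) for a fixed per-expert capacity.

    `assignments` is a hypothetical top-1 routing: one expert index per token.
    """
    load = Counter(assignments)
    # Experts below capacity waste slots on padding.
    padded = sum(max(capacity - load.get(e, 0), 0) for e in range(num_experts))
    # Experts above capacity drop the overflow.
    dropped = sum(max(load.get(e, 0) - capacity, 0) for e in range(num_experts))
    return padded, dropped


def dynamic_capacities(assignments, num_experts):
    """Size each expert's buffer to its observed load for this batch."""
    load = Counter(assignments)
    return [load.get(e, 0) for e in range(num_experts)]


# Skewed batch: expert 0 receives 5 of 8 tokens, expert 3 receives none.
assignments = [0, 0, 0, 0, 0, 1, 2, 2]
padded, dropped = static_capacity_cost(assignments, num_experts=4, capacity=2)
capacities = dynamic_capacities(assignments, num_experts=4)

print(padded, dropped)  # 3 3
print(capacities)       # [5, 1, 2, 0]
```

With a fixed capacity of 2, this batch wastes 3 slots on padding and loses 3 tokens at the same time. Sizing buffers per batch leaves neither, at the price of buffer shapes that change from step to step.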

Consumes

DynamicRoutingMask, Activations

Emits

RoutedExpertOutputs
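The transform's shape can be sketched as follows. This is an illustration only: the source does not specify how DynamicRoutingMask is encoded, so the sketch assumes a top-1 list with one expert index per token, and `route_dropless` and the toy experts are hypothetical names. Each expert receives a group whose size matches its load for the batch, so nothing is padded or dropped.

```python
def route_dropless(routing_mask, activations, experts):
    """Map (routing mask, activations) to routed expert outputs.

    routing_mask: assumed encoding, one expert index per token (top-1).
    activations:  one feature vector (list of floats) per token.
    experts:      callables that take a list of vectors and return one
                  output vector per input vector.
    """
    # Group token positions by expert; group sizes vary from batch to batch.
    groups = {}
    for pos, expert_id in enumerate(routing_mask):
        groups.setdefault(expert_id, []).append(pos)

    outputs = [None] * len(activations)
    for expert_id, positions in groups.items():
        batch = [activations[p] for p in positions]
        results = experts[expert_id](batch)
        # Scatter results back to the original token order.
        for p, r in zip(positions, results):
            outputs[p] = r
    return outputs


# Two toy experts standing in for real expert networks.
experts = [
    lambda batch: [[2.0 * x for x in v] for v in batch],
    lambda batch: [[x + 1.0 for x in v] for v in batch],
]
routing_mask = [0, 1, 0, 0]
activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
routed = route_dropless(routing_mask, activations, experts)
```

Here expert 0 processes three tokens and expert 1 processes one, and every token gets an output in its original position.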

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.