latent-mixture-of-experts

transform

Tensor<Batch, Seq, Dim> -> Tensor<Batch, Seq, Dim>

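Project input representations into a lower-dimensional latent space before applying Mixture-of-Experts routing and the per-expert matrix multiplications, then project the result back up to the model dimension so the output has the same shape as the input.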
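A minimal NumPy sketch of the forward pass. It assumes details the source does not specify: routing is computed on the latent vectors, gates are a softmax over the top-k router logits, each expert is a two-layer ReLU MLP working entirely at latent width, and the down- and up-projections are shared by all experts. The function and parameter names (`latent_moe`, `w_down`, `w_router`, `w_in`, `w_out`, `w_up`, `top_k`) are hypothetical, and load balancing, capacity limits and any shared expert are omitted.

```python
import numpy as np


def latent_moe(x, w_down, w_router, w_in, w_out, w_up, top_k=2):
    """Hypothetical latent MoE layer: Tensor<Batch, Seq, Dim> -> Tensor<Batch, Seq, Dim>.

    Shapes (all names are illustrative assumptions):
      x:        (batch, seq, dim)
      w_down:   (dim, latent)                shared down-projection
      w_router: (latent, n_experts)          router acting on latent tokens
      w_in:     (n_experts, latent, hidden)  expert input matrices
      w_out:    (n_experts, hidden, latent)  expert output matrices
      w_up:     (latent, dim)                shared up-projection
    """
    batch, seq, dim = x.shape
    z = x.reshape(-1, dim) @ w_down  # (tokens, latent)

    # Top-k routing in latent space; gates are a softmax over the selected logits.
    logits = z @ w_router  # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]
    sel = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(sel - sel.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)  # (tokens, top_k)

    # Each expert runs at latent width; a token appears at most once per expert.
    out = np.zeros_like(z)
    for e in range(w_in.shape[0]):
        rows, slot = np.nonzero(top == e)  # tokens routed to expert e
        if rows.size == 0:
            continue
        h = np.maximum(z[rows] @ w_in[e], 0.0)
        out[rows] += gates[rows, slot][:, None] * (h @ w_out[e])

    return (out @ w_up).reshape(batch, seq, dim)  # back to model width


# Usage with small, arbitrary sizes: Dim=16, latent=4, hidden=8, 4 experts.
rng = np.random.default_rng(0)
B, S, D, L, H, E = 2, 3, 16, 4, 8, 4
x = rng.standard_normal((B, S, D))
y = latent_moe(
    x,
    rng.standard_normal((D, L)),
    rng.standard_normal((L, E)),
    rng.standard_normal((E, L, H)),
    rng.standard_normal((E, H, L)),
    rng.standard_normal((L, D)),
)
print(y.shape)  # (2, 3, 16)
```

Only `w_down` and `w_up` touch the full `Dim`; every expert matrix is sized by the latent width, which is where the parameter and bandwidth savings come from.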

Problem it solves

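Standard Mixture-of-Experts layers carry a large parameter count and a high memory-bandwidth cost relative to the FLOPs they spend per token: every expert's weights are sized by the full model dimension, all of them must be stored, and the weights of each selected expert must be read for every token even though each token activates only a few experts.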
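The size of the saving can be checked with simple arithmetic. This sketch assumes each expert is a two-matrix MLP with the same hidden width in both variants, and that the latent variant adds one shared down-projection and one shared up-projection; the helper name and the example sizes are hypothetical, not taken from the source.

```python
def expert_param_counts(dim, latent, hidden, n_experts):
    """Hypothetical comparison of MoE expert parameter counts.

    Standard MoE: each expert holds a (dim x hidden) and a (hidden x dim) matrix.
    Latent MoE: each expert holds (latent x hidden) and (hidden x latent) matrices,
    plus one shared (dim x latent) down-projection and (latent x dim) up-projection.
    """
    standard = n_experts * 2 * dim * hidden
    latent_variant = n_experts * 2 * latent * hidden + 2 * dim * latent
    return standard, latent_variant


std, lat = expert_param_counts(dim=4096, latent=1024, hidden=2048, n_experts=64)
print(std, lat)             # 1073741824 276824064
print(round(lat / std, 3))  # 0.258
```

With these assumed sizes, each expert matrix shrinks by `latent / dim = 0.25`, so the expert weights read per token for a fixed top-k shrink by the same factor, and the shared projections add only a small fixed cost.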

Consumes

Tensor<Batch, Seq, Dim>

Emits

Tensor<Batch, Seq, Dim>

Distilled from 1 source

The real projects this mechanism was found in.