overlapping sparse lookup with dense computation

read

Batch<N+1> -> PrefetchedEmbeddings<N+1>

Prefetch and route the sparse embedding features for the next training iteration while the dense neural network layers for the current iteration execute.
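The overlap described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NestPipe implementation. The names `EMBEDDING_TABLE`, `lookup_embeddings`, `dense_forward`, and `train_pipelined` are hypothetical. A background thread stands in for whatever asynchronous lookup and communication path a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy embedding table: feature ID -> embedding vector.
EMBEDDING_TABLE = {i: [float(i), float(i) * 0.5] for i in range(100)}


def lookup_embeddings(batch):
    """Stand-in for the sparse lookup and routing step (assumed I/O-bound)."""
    return [EMBEDDING_TABLE[feature_id] for feature_id in batch]


def dense_forward(embeddings):
    """Stand-in for the dense layers: sums every embedding component."""
    return sum(sum(vector) for vector in embeddings)


def train_pipelined(batches):
    """Run dense compute on batch N while batch N+1 is being prefetched."""
    outputs = []
    if not batches:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        # Prime the pipeline: fetch embeddings for the first batch.
        pending = prefetcher.submit(lookup_embeddings, batches[0])
        for next_index in range(1, len(batches) + 1):
            current = pending.result()  # wait for PrefetchedEmbeddings<N>
            if next_index < len(batches):
                # Start the lookup for batch N+1 before dense compute begins.
                pending = prefetcher.submit(lookup_embeddings, batches[next_index])
            outputs.append(dense_forward(current))  # overlaps with the prefetch
    return outputs
```

Calling `train_pipelined([[1, 2], [3]])` gives the same results as running lookup and dense compute strictly in sequence; only the scheduling differs. In this toy version the thread shows the ordering of the two stages rather than a real speedup. On actual accelerators the overlap would more likely come from separate device streams or asynchronous communication, which this sketch does not model.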

Problem it solves

Embedding lookup and network communication stall the accelerator's compute units during large-scale recommendation training.

Consumes

BatchID

Emits

PrefetchedEmbeddings

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining (arXiv)