Model -> List<ModelShard>
Split the weight matrices of the feedforward and attention projection layers across multiple GPUs, so that each device stores only its shard of the weights and computes its slice of every matrix multiplication in parallel with the others.
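The splitting described above can be illustrated with a minimal sketch. It is not taken from any specific project: plain Python lists stand in for per-GPU tensors, and the helper names (`split_columns`, `split_rows`, `sharded_mlp`) are hypothetical. It assumes one common arrangement for a two-layer feedforward block, where the first weight matrix is split by columns and the second by rows, so the only cross-device step is a final sum of partial outputs (the role an all-reduce would play on real hardware).

```python
# Tensor-parallel sketch in pure Python. Each list element of a "shards"
# list stands in for the weights held by one simulated device.

def matmul(a, b):
    """Multiply a (m x k) by b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def relu(m):
    """Elementwise ReLU, safe to apply independently on each shard."""
    return [[max(0, v) for v in row] for row in m]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_columns(w, n_shards):
    """Split w into n_shards equal column blocks."""
    cols = len(w[0])
    assert cols % n_shards == 0, "column count must divide evenly"
    step = cols // n_shards
    return [[row[i * step:(i + 1) * step] for row in w]
            for i in range(n_shards)]

def split_rows(w, n_shards):
    """Split w into n_shards equal row blocks."""
    rows = len(w)
    assert rows % n_shards == 0, "row count must divide evenly"
    step = rows // n_shards
    return [w[i * step:(i + 1) * step] for i in range(n_shards)]

def sharded_mlp(x, w1, w2, n_shards):
    """Compute relu(x @ w1) @ w2 with w1 column-split and w2 row-split."""
    w1_shards = split_columns(w1, n_shards)
    w2_shards = split_rows(w2, n_shards)
    # Each simulated device works only on its own pair of shards.
    partials = [matmul(relu(matmul(x, a)), b)
                for a, b in zip(w1_shards, w2_shards)]
    # Summing the partial outputs stands in for an all-reduce.
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)
    return out

def reference_mlp(x, w1, w2):
    """Unsharded computation, used to check the sharded result."""
    return matmul(relu(matmul(x, w1)), w2)

x = [[1, -2], [3, 4]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]
w2 = [[1, 2], [0, 1], [-1, 0], [2, 1]]

print(sharded_mlp(x, w1, w2, n_shards=2))  # [[8, 4], [15, 28]]
print(reference_mlp(x, w1, w2))            # [[8, 4], [15, 28]]
```

Because the activation is elementwise, each column block of the first layer can be passed straight into the matching row block of the second layer without exchanging data between devices. In this sketch, memory per device shrinks roughly in proportion to the number of shards, at the cost of one reduction per block.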
Problem it solves
The weights of a large model do not fit in the memory of a single GPU.
Consumes
Emits
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.