Image -> DepthAugmentedVisualTokens
Inject monocular depth-estimation features alongside RGB frames to provide explicit spatial priors to a vision-language-action model.
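As a rough illustration of the idea, the sketch below turns an RGB frame and a depth map into one token sequence. The card does not say how the depth features are fused, so everything here is an assumption: `estimate_depth` is a hypothetical stand-in for a pretrained monocular depth network, the projection matrices are random rather than learned, and appending depth tokens after the RGB tokens along the sequence axis is only one of several plausible fusion choices.

```python
import numpy as np


def patchify(x: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) array into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = x.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = x.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_h, grid_w, patch, patch, C)
    return x.reshape(-1, patch * patch * c)


def estimate_depth(rgb: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for a monocular depth estimator.

    Returns an (H, W, 1) map. Here it is just mean intensity; a real
    system would call a pretrained depth network instead.
    """
    return rgb.mean(axis=-1, keepdims=True)


def depth_augmented_tokens(rgb, w_rgb, w_depth, patch=4):
    """Project RGB and depth patches to a shared width and concatenate them.

    Assumed fusion: depth tokens are appended after the RGB tokens along
    the sequence axis, so the output has twice as many tokens as patches.
    """
    rgb_tokens = patchify(rgb, patch) @ w_rgb
    depth_tokens = patchify(estimate_depth(rgb), patch) @ w_depth
    return np.concatenate([rgb_tokens, depth_tokens], axis=0)


rng = np.random.default_rng(0)
patch, dim = 4, 16
rgb = rng.random((8, 8, 3))                              # toy 8x8 RGB frame
w_rgb = rng.standard_normal((patch * patch * 3, dim))    # untrained projection
w_depth = rng.standard_normal((patch * patch * 1, dim))  # untrained projection

tokens = depth_augmented_tokens(rgb, w_rgb, w_depth, patch)
print(tokens.shape)  # (8, 16): 4 RGB tokens followed by 4 depth tokens
```

Sequence concatenation keeps the two modalities separable for the downstream model, at the cost of a longer sequence. Adding or channel-concatenating the depth embedding onto each RGB token would keep the sequence length fixed instead.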
Problem it solves
Standard VLMs trained on 2D images lack the precise depth perception and 3D geometric awareness that robotic manipulation tasks require.
Consumes
Image
Emits
DepthAugmentedVisualTokens
The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.