point-based spatial grounding

AI / MLtransform

(Image, TextQuery) -> PointAnnotatedText

Translate natural language target queries into normalized 2D pixel coordinate tokens embedded directly inside text output.

Problem it solves

Traditional bounding box formats are overly verbose and computationally complex for generative language decoders to output.

Consumes

ImageTextQuery

Emits

PointAnnotatedText

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.