spatial-coordinate text embedding

AI / ML · transform

TextTokens + BoundingBoxes -> SpatialTextEmbeddings

Encode 2D spatial coordinates of word bounding boxes into coordinate embeddings and sum them with standard text token embeddings to retain document layout geometry.
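As a rough illustration of the summation described above, the sketch below looks up one embedding per box coordinate and adds them to the token embedding. It is a minimal pure-Python sketch, not the code from the cited project: the class name `SpatialTextEmbedder`, the helper `normalize_box`, the 0..1000 coordinate grid, the shared x/y tables, and the tiny embedding width are all assumptions, and the random tables stand in for weights that a real model would learn.

```python
import random

GRID = 1000  # assumption: coordinates are scaled to an integer grid 0..GRID
DIM = 8      # assumption: tiny embedding width, for illustration only


def make_table(rows, dim, rng):
    """Randomly initialised lookup table standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) onto the integer grid 0..GRID."""
    x0, y0, x1, y1 = box
    sx, sy = GRID / page_width, GRID / page_height
    scaled = (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
    return tuple(min(GRID, max(0, round(v))) for v in scaled)


class SpatialTextEmbedder:
    """Hypothetical embedder: token embedding plus four coordinate embeddings."""

    def __init__(self, vocab_size, dim=DIM, seed=0):
        rng = random.Random(seed)
        self.token_table = make_table(vocab_size, dim, rng)
        # One table per axis; x0/x1 share the x table, y0/y1 share the y table.
        self.x_table = make_table(GRID + 1, dim, rng)
        self.y_table = make_table(GRID + 1, dim, rng)

    def embed(self, token_ids, boxes):
        if len(token_ids) != len(boxes):
            raise ValueError("need exactly one bounding box per token")
        out = []
        for tok, (x0, y0, x1, y1) in zip(token_ids, boxes):
            parts = [
                self.token_table[tok],
                self.x_table[x0],
                self.y_table[y0],
                self.x_table[x1],
                self.y_table[y1],
            ]
            # Element-wise sum keeps the output the same width as a text embedding.
            out.append([sum(vals) for vals in zip(*parts)])
        return out


# Usage: the same token id at two different page positions.
embedder = SpatialTextEmbedder(vocab_size=100)
pixel_boxes = [(72, 90, 140, 104), (72, 400, 140, 414)]
boxes = [normalize_box(b, page_width=612, page_height=792) for b in pixel_boxes]
vectors = embedder.embed([5, 5], boxes)
```

Because the coordinate embeddings are summed rather than concatenated, the result has the same width as a plain token embedding, so it could feed a downstream encoder without changing its input size. The two vectors above differ even though the token id is identical, which is the layout signal the mechanism is meant to preserve.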

Problem it solves

Standard 1D text tokenization discards the visual and spatial layout of scanned or PDF documents, such as which words share a row, a column, or a table cell.
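A small example may make the loss concrete. In the sketch below, two hypothetical layouts of the same two words (one side by side, as in a table row, and one stacked far apart) collapse to an identical 1D token sequence; only the bounding boxes tell them apart. The words, coordinates, and the helper `to_1d_tokens` are invented for illustration.

```python
# Each word is (text, (x0, y0, x1, y1)); the boxes are hypothetical page coordinates.
side_by_side = [("Total", (50, 100, 110, 115)), ("42.00", (400, 100, 460, 115))]
stacked = [("Total", (50, 100, 110, 115)), ("42.00", (50, 300, 110, 315))]


def to_1d_tokens(words):
    """Keep only the text in reading order, as a plain 1D tokenizer would see it."""
    return [text for text, _box in words]


same_tokens = to_1d_tokens(side_by_side) == to_1d_tokens(stacked)
same_boxes = [box for _text, box in side_by_side] == [box for _text, box in stacked]
print(same_tokens, same_boxes)  # True False
```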

Consumes

TextTokens, BoundingBoxes

Emits

SpatialTextEmbeddings

Distilled from 1 source

The real projects this mechanism was found in. Attribution is the point — this is how the best teams actually do it.

microsoft/unilm (GitHub)