A robotic grasping framework that integrates Vision-Language Models (VLMs) with asynchronous, closed-loop spatial perception to enable open-vocabulary object manipulation while mitigating VLM spatial hallucinations.
Defensibility
citations: 0
co_authors: 8
CLASP addresses a critical bottleneck in VLM-based robotics: the 'spatial hallucination' problem, where models understand 'what' an object is but fail at the 'where' (precise 3D coordinates). The asynchronous control loop is a clever engineering answer to the high latency of modern VLMs, letting the robot act at a higher frequency than perception updates arrive. However, with 0 citations and 8 co-authors, the project currently exists primarily as an academic reference implementation rather than a production tool. Its defensibility is low because the core contribution is methodological rather than a proprietary dataset or ecosystem. Frontier labs such as OpenAI (with Figure), Google (RT-2/RT-X), and NVIDIA (Isaac Lab/Foundation Models) are aggressively building general-purpose grasping models that internalize these spatial reasoning capabilities. While CLASP offers a valuable modular approach today, it is highly likely to be superseded within the next 12-24 months by native end-to-end VLA (Vision-Language-Action) models that handle closed-loop feedback internally.
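The decoupling described above can be illustrated with a minimal sketch: a slow perception thread refreshes a shared 3D target estimate at VLM latency, while a fast control loop reads the latest estimate at a much higher rate. This is not CLASP's actual code; all names, coordinates, and the stubbed VLM call are hypothetical.

```python
# Minimal sketch of asynchronous perception + high-rate control (illustrative only).
import threading
import time
import random


class SharedTargetEstimate:
    """Latest 3D grasp-target estimate, shared between perception and control threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._xyz = None

    def update(self, xyz):
        with self._lock:
            self._xyz = xyz

    def read(self):
        with self._lock:
            return self._xyz


def slow_vlm_perception(target: SharedTargetEstimate, stop: threading.Event):
    """Stand-in for a high-latency VLM + depth query (~1.5 s per update)."""
    while not stop.is_set():
        time.sleep(1.5)  # simulated VLM latency
        # Fabricated coordinates purely for illustration of an updated estimate.
        target.update((0.40 + random.uniform(-0.01, 0.01),
                       0.10 + random.uniform(-0.01, 0.01),
                       0.05))


def fast_control_loop(target: SharedTargetEstimate, stop: threading.Event, hz: float = 50.0):
    """Servo toward the most recent estimate at a much higher rate than perception."""
    period = 1.0 / hz
    ticks = 0
    while not stop.is_set():
        xyz = target.read()
        if xyz is not None and ticks % int(hz) == 0:
            # A real controller would command the arm here; we just log once per second.
            print(f"servoing toward latest estimate: {xyz}")
        ticks += 1
        time.sleep(period)


if __name__ == "__main__":
    target = SharedTargetEstimate()
    stop = threading.Event()
    threads = [
        threading.Thread(target=slow_vlm_perception, args=(target, stop), daemon=True),
        threading.Thread(target=fast_control_loop, args=(target, stop), daemon=True),
    ]
    for t in threads:
        t.start()
    time.sleep(5.0)  # run the demo briefly
    stop.set()
    print("final estimate:", target.read())
```

The key design point is that the controller never blocks on the VLM: it always acts on the most recent (possibly stale) estimate, which is what allows the action rate to exceed the perception rate.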
TECH STACK
INTEGRATION
READINESS: reference_implementation