Translates natural language prompts into a structured 'Graph of Events in Space and Time' (GEST), which a 3D game engine then executes to produce semantically accurate, physically consistent video with automated ground-truth annotations.
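The description implies a pipeline of prompt → event graph → engine execution. As a rough illustration of what a GEST-style structure might contain, here is a minimal Python sketch; the project's actual schema is not published here, so every name below (EventNode, action, start_time, and so on) is an assumption for illustration only.

```python
# Hypothetical sketch of a GEST ("Graph of Events in Space and Time")
# data structure. All field names are assumptions, not the project's schema.
from dataclasses import dataclass, field


@dataclass
class EventNode:
    """One event: an actor performing an action at a place and time."""
    event_id: str
    action: str                            # e.g. "pick_up", "walk_to"
    actor: str                             # entity performing the action
    position: tuple[float, float, float]   # world-space location (x, y, z)
    start_time: float                      # seconds from scene start
    duration: float                        # seconds


@dataclass
class EventEdge:
    """A relation between two events, e.g. ordering or causality."""
    source: str                            # event_id of the earlier event
    target: str                            # event_id of the later event
    relation: str                          # e.g. "before", "causes"


@dataclass
class GEST:
    nodes: list[EventNode] = field(default_factory=list)
    edges: list[EventEdge] = field(default_factory=list)
```

Under this reading, the explicit coordinates, timestamps, and relations are what let the engine enforce physical consistency and emit ground-truth labels as a by-product of execution, rather than inferring them from pixels.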
Defensibility
citations: 0
co_authors: 3
The project represents a pivot from 'pixel-first' video generation (like Sora or Runway) to 'logic-first' generation. By using an LLM to generate a Graph of Events in Space and Time (GEST) rather than raw pixels, it solves the semantic drift and hallucination issues inherent in diffusion models. This makes it highly valuable for synthetic data generation where ground-truth labels are required.

However, the defensibility is currently low (Score: 4) due to the lack of community traction (0 stars) and the fact that the 'moat' relies on the specific GEST schema, which is easily reproducible. Frontier labs like OpenAI are already moving toward 'World Simulators' that likely use similar internal spatial-temporal representations. Furthermore, game engine giants like Epic Games (Unreal) or Unity could trivially implement an LLM-to-Blueprint/Scene-Graph layer, effectively absorbing this methodology (see the sketch below).

The displacement horizon is short because the intersection of LLM planning and 3D simulation is one of the most active research areas in both robotics and AI-generated media.
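To make the reproducibility argument concrete, here is a minimal sketch of the kind of LLM-to-scene-graph layer described above: prompt any LLM for a JSON event graph, validate the fields, and hand the result to an engine-side executor. The llm callable, prompt wording, and required keys are all hypothetical, not the project's interface.

```python
# Minimal sketch of an LLM-to-scene-graph layer. The prompt, the llm()
# callable, and the key names are assumptions for illustration only.
import json

REQUIRED_NODE_KEYS = {"event_id", "action", "actor",
                      "position", "start_time", "duration"}

PROMPT_TEMPLATE = (
    "Convert the following scene description into a JSON object with "
    "'nodes' (events with keys {keys}) and 'edges' "
    "(relations with keys source, target, relation).\n\nScene: {scene}"
)


def prompt_to_graph(scene: str, llm) -> dict:
    """llm is any callable str -> str (e.g. a chat-completion wrapper)."""
    raw = llm(PROMPT_TEMPLATE.format(keys=sorted(REQUIRED_NODE_KEYS),
                                     scene=scene))
    graph = json.loads(raw)
    # Reject graphs whose nodes are missing required spatio-temporal fields.
    for node in graph["nodes"]:
        missing = REQUIRED_NODE_KEYS - node.keys()
        if missing:
            raise ValueError(f"node {node.get('event_id')} missing {missing}")
    return graph
```

The validated graph would then be compiled to engine-native constructs (Unreal Blueprints, Unity scene objects); the brevity of such a layer is precisely why the moat is judged thin.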
TECH STACK
INTEGRATION: reference_implementation
READINESS