A benchmark and evaluation framework for assessing the trajectory-level performance of tool-using LLMs on long-horizon financial tasks.
Defensibility
citations: 0
co_authors: 14
FinTrace addresses a critical gap in LLM evaluation: the shift from 'atomic' tool-calling accuracy (did the model call the right API once?) to 'trajectory' accuracy (did it solve a complex, multi-step financial problem end to end?). The project's primary moat is its 800 expert-annotated trajectories across 34 financial tasks, which are expensive and time-consuming to produce. Quantitative signals show 14 forks within 2 days despite 0 stars, indicating strong immediate interest from the research community (likely clones by researchers ahead of any social-media promotion). While frontier labs such as OpenAI and Anthropic are improving general reasoning (e.g., o1-preview), they often lack domain-specific ground-truth datasets for niche sectors like finance. As an evaluation benchmark, however, its defensibility is capped: it is a diagnostic tool rather than infrastructure with switching costs. Its survival depends on becoming a cited standard in financial AI research, competing with existing benchmarks such as FinQA and TAT-QA by offering deeper, trajectory-based insights.
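The atomic-vs-trajectory distinction can be made concrete with a minimal scoring sketch. Note this is an illustrative assumption, not FinTrace's actual metric or API: the function names, the tool-call tuples, and the all-or-nothing trajectory rule are all hypothetical.

```python
# Hypothetical sketch contrasting atomic (per-call) with trajectory-level
# scoring. None of these names come from FinTrace; the real benchmark's
# scoring rules may differ.

def atomic_accuracy(predicted, expected):
    """Fraction of individual tool calls that match the reference."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / max(len(expected), 1)

def trajectory_accuracy(predicted, expected):
    """All-or-nothing: the full multi-step trajectory must match."""
    return 1.0 if predicted == expected else 0.0

# A model that gets 2 of 3 steps right scores well atomically but
# fails at the trajectory level, because one wrong intermediate call
# (EUR/USD instead of USD/EUR) invalidates the whole solution.
ref = [("get_price", "AAPL"), ("get_fx", "USD/EUR"), ("multiply", None)]
hyp = [("get_price", "AAPL"), ("get_fx", "EUR/USD"), ("multiply", None)]

print(atomic_accuracy(hyp, ref))      # 2 of 3 calls correct
print(trajectory_accuracy(hyp, ref))  # whole trajectory fails
```

The design point this illustrates: a benchmark reporting only the first number would rank this model highly, while a trajectory-level benchmark would not.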
TECH STACK
INTEGRATION: reference_implementation
READINESS