SteerEval is a hierarchical evaluation benchmark designed to measure the controllability of Large Language Models (LLMs) across three dimensions: language features, sentiment, and personality, using a three-level specification hierarchy (L1: Content, L2: Style/Constraint, L3: Instantiation).
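To make the three-level specification hierarchy concrete, the following is a minimal sketch of how a single evaluation item could be represented. The class SteerEvalItem, its field names (content_spec, style_constraint, instantiation), and the example prompt are illustrative assumptions, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass
class SteerEvalItem:
    # Dimension being steered: "language_features", "sentiment", or "personality"
    dimension: str
    # L1 (Content): what the response should be about
    content_spec: str
    # L2 (Style/Constraint): the steering constraint layered on that content
    style_constraint: str
    # L3 (Instantiation): the concrete prompt actually sent to the model
    instantiation: str

# Illustrative item for the sentiment dimension
example = SteerEvalItem(
    dimension="sentiment",
    content_spec="Describe a flight that was delayed by several hours",
    style_constraint="Keep the overall tone upbeat and optimistic",
    instantiation=(
        "Write a short travel diary entry about your flight being delayed "
        "for four hours, keeping the tone cheerful throughout."
    ),
)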
Defensibility
citations: 0
co_authors: 11
SteerEval addresses a critical gap in LLM evaluation: the lack of a structured hierarchy for instruction following. While existing benchmarks such as IFEval focus on objective constraints (e.g., 'no more than 50 words'), SteerEval attempts to quantify more subjective steering targets such as sentiment and personality. However, the project's defensibility is low (score 3) because it is a research-centric benchmark. In the LLM space, benchmarks survive only if they achieve broad social proof and integration into major leaderboards (such as Hugging Face's Open LLM Leaderboard). With 0 citations and only 11 co-authors (essentially the originating research team and early peers), it has no network effects yet. Frontier labs like OpenAI and Anthropic are the primary competitors here; they build proprietary, internal steering evaluations that are often more rigorous than open-source benchmarks. Furthermore, IFEval already dominates the instruction-following niche. The displacement horizon is short (6 months) because evaluation research moves faster than any other area of AI; a more comprehensive or better-marketed benchmark is released almost every week. The high platform risk stems from the fact that steerability is a core product feature for companies like Google and Microsoft, which have the compute to generate far larger and more diverse eval sets than a single academic project.
TECH STACK
INTEGRATION: cli_tool
READINESS