Reward hacking, long-horizon agent trajectory evaluation, and the open problems in coding-agent post-training.

The reward signal is non-verifiable. The trajectory is long-horizon. The data is multi-turn. The feedback loop terminates upstream of training. The grader, the gym, and the eval are each open.

[Figure: Claude Code average turn duration, 0.1h to 100h (log scale), Sep '25 through mid '31. Observed data through Feb '26 (Anthropic report); Finetuner projection out to 2031. Challenges with long-horizon agents will grow exponentially.]


Training graders to improve coding-task performance under non-verifiable reward

Classical tests, rubrics, and agentic graders all get hacked by RL agents. Training graders that improve performance on coding tasks under non-verifiable reward is the load-bearing piece of reinforcement learning for coding agents, and the grader stack is what breaks first. Reward hacking has to be actively anticipated: RL agents gaming the reward function is the recurring failure mode, which makes reward design, not reward compute, the bottleneck.
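One common mitigation, since any single grader can be exploited, is to combine several graders and treat their disagreement as a signal in itself. A minimal sketch, assuming per-grader scores in [0, 1]; the grader names and the disagreement threshold are illustrative, not a real API:

```python
def ensemble_grade(scores: dict[str, float],
                   disagreement_threshold: float = 0.3) -> tuple[float, bool]:
    """Combine per-grader scores; flag the trajectory when graders disagree.

    A large spread suggests one grader is being exploited, so the reward
    is withheld and the trajectory routed for manual audit instead.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    flagged = spread > disagreement_threshold
    reward = 0.0 if flagged else sum(values) / len(values)
    return reward, flagged

# Unit tests and the rubric pass, but the agentic grader objects:
# the trajectory is flagged rather than rewarded.
reward, flagged = ensemble_grade(
    {"unit_tests": 1.0, "rubric_llm": 0.95, "agentic_grader": 0.2}
)
```

Withholding reward on disagreement trades some signal for robustness: a hacked grader tends to disagree with the others before it dominates the mean.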

Dense reward signals on step transitions, not whole-trajectory outcomes

The training loop needs dense reward signals on step transitions, not just whole-trajectory outcomes. The reward that can be computed today is end-of-trajectory; the per-step reward that reinforcement learning needs is not. The RL environment, the reward, the post-training data, and the long-context eval pipeline are, for long-horizon tool-using agents, all four immature. Reward shaping that favors step-by-step learning over trajectory outcomes, with a training signal whose stability and scalability hold at scale, is open.
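One standard construction for densifying an end-of-trajectory signal is potential-based reward shaping, which provably leaves the optimal policy unchanged. A minimal sketch, where the choice of potential (fraction of tests passing after each state) is an illustrative assumption:

```python
def shaped_rewards(potentials: list[float], terminal_reward: float,
                   gamma: float = 0.99) -> list[float]:
    """Per-step rewards r_t = gamma * Phi(s_{t+1}) - Phi(s_t), with the
    original sparse reward added on the final transition."""
    rewards = [gamma * potentials[t + 1] - potentials[t]
               for t in range(len(potentials) - 1)]
    rewards[-1] += terminal_reward
    return rewards

# Potential Phi = fraction of tests passing after each state of a
# 3-step trajectory; the agent now gets credit at the steps that
# actually moved the tests, not only at the end.
dense = shaped_rewards([0.0, 0.25, 0.25, 1.0], terminal_reward=1.0)
```

The open part is not the algebra but the potential itself: a learnable or hackable Phi reintroduces every grader problem above at the step level.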

Agent-trajectory data (browser, GUI, SWE) has no canonical format or pipeline

Agent-trajectory data (browser interactions, GUI sessions, SWE code traces, multi-turn workflows) has no canonical format or pipeline. Collecting and benchmarking multi-turn agent data at the needed fidelity, and standardizing an agent trajectory data format, both remain open.

Real-usage feedback loops for AI agents don't exist

Real-usage feedback loops don’t exist. Feedback loops from real usage, meaning collecting, cleaning, and interpreting user signals to inform model and harness changes, sit on three current blockers: quality is not yet measurable or operational, the agent failure taxonomy is open, and quality metric definitions are ambiguous.
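To make the shape of the problem concrete, a toy sketch of the interpretation step: folding raw per-session user signals into a coarse label that could inform model or harness changes. The signal names and thresholds are invented for illustration:

```python
def interpret_session(signals: dict[str, int]) -> str:
    """Map raw per-session signal counts to a coarse feedback label."""
    if signals.get("user_reverted_edits", 0) > 0:
        return "negative"   # the user undid the agent's work
    if signals.get("retries", 0) >= 3:
        return "degraded"   # repeated reprompting suggests friction
    if signals.get("accepted_suggestions", 0) > 0:
        return "positive"
    return "unlabeled"      # not every session yields a clean signal
```

Even this toy version exposes the blocker: without an agreed failure taxonomy, every team draws these thresholds differently and the labels don't compose across products.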

Data-quality and reproducibility standards across agent gyms do not yet exist

Standards for data quality and reproducibility across large-scale agent gyms (coverage metrics, invariance checks, trace audits) do not yet exist. Keeping training signals faithful, stable, and scalable is the problem of today, not an aspiration for later.
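Of the three, an invariance check is the easiest to sketch: a grader should score a task the same under a perturbation that does not change its meaning. `grade` and `perturb` here are stand-ins for gym-specific functions:

```python
def invariance_violations(tasks, grade, perturb, tol=1e-6):
    """Return ids of tasks whose grade shifts under a meaning-preserving
    perturbation; a faithful grader should return an empty list."""
    return [task_id for task_id, task in tasks.items()
            if abs(grade(task) - grade(perturb(task))) > tol]
```

Run against every gym release, a check like this catches graders that key on surface features (whitespace, variable names, prompt order) before those leaks reach training.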

Session-level “good / bad / degraded” quality primitives at the training loop

Quality is not yet measurable or operational. Reliability and guardrails depend on making it so: defining good, bad, and degraded sessions, plus alerting and triage primitives, i.e. session-level quality metrics for monitoring AI agents, wired into the training loop. Until those exist, quality metric definitions remain ambiguous.
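A minimal sketch of what such primitives could look like: a session classifier plus a rolling-window alert. The classification rules, window size, and threshold are all illustrative assumptions:

```python
from collections import deque

GOOD, DEGRADED, BAD = "good", "degraded", "bad"

def classify(session: dict) -> str:
    """Toy session classifier; real rules would come from a failure taxonomy."""
    if session.get("crashed") or session.get("task_abandoned"):
        return BAD
    if session.get("retries", 0) >= 3 or session.get("latency_p95_s", 0) > 60:
        return DEGRADED
    return GOOD

class QualityMonitor:
    """Alert when too many recent sessions are degraded or bad."""

    def __init__(self, window: int = 100, alert_threshold: float = 0.2):
        self.labels = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def observe(self, session: dict) -> bool:
        """Record one session; return True if the window should alert."""
        self.labels.append(classify(session))
        unhealthy = sum(1 for label in self.labels if label != GOOD)
        return unhealthy / len(self.labels) > self.alert_threshold
```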

Rubric-methodology comparison and reward-signal iteration — days of manual experimentation

Reward-signal iteration is measured in days of manual experimentation, and there is no standardized way to compare rubric methodologies across domains. Running human-data experiments before adding to preference models, debugging reward hacks, comparing rubric methodologies, and assessing rewards for hacks and other pathologies all run as manual loops. Reward hacking is a recurring, ongoing failure mode that needs dedicated detection, not one-off mitigation, and human-feedback-data quality assessment is still manual and ad hoc.
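One way such a comparison could be standardized: score a shared gold set with each rubric methodology and rank rubrics by how often they order item pairs the same way as the human labels. The rubric functions below are stand-ins for real scoring pipelines:

```python
def concordance(scores: list[float], gold: list[float]) -> float:
    """Fraction of item pairs the rubric orders the same way as gold labels."""
    agree = total = 0
    for i in range(len(gold)):
        for j in range(i + 1, len(gold)):
            if gold[i] == gold[j]:
                continue  # ties in gold carry no ordering information
            total += 1
            if (scores[i] - scores[j]) * (gold[i] - gold[j]) > 0:
                agree += 1
    return agree / total if total else 0.0

def rank_rubrics(rubrics, items, gold):
    """Rank rubric methodologies by agreement with the gold ordering."""
    results = {name: concordance([rubric(x) for x in items], gold)
               for name, rubric in rubrics.items()}
    return sorted(results.items(), key=lambda kv: -kv[1])
```

Pairwise concordance is chosen here because preference data is ordinal: what matters is whether the rubric ranks trajectories the way humans do, not whether its absolute scores match.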

Reward hacking, mode collapse, exploitable loopholes

Mitigating reward hacking, mode collapse, and exploitable loopholes is ongoing work. RL agents gaming the reward function and exploiting it adversarially is a recurring pathology; reward design is the bottleneck for reinforcement learning agents.
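A common detection heuristic, sketched here with invented field names and thresholds: flag trajectories whose training-time proxy reward is high while a held-out check (hidden tests, a human spot-check) scores them low. Divergence between the two is the signature of a hacked grader:

```python
def hack_candidates(trajectories, proxy_hi: float = 0.9,
                    holdout_lo: float = 0.3) -> list[str]:
    """Ids of trajectories whose proxy and held-out rewards diverge sharply:
    high proxy plus low held-out suggests the grader, not the task,
    was solved."""
    return [t["id"] for t in trajectories
            if t["proxy_reward"] >= proxy_hi
            and t["holdout_reward"] <= holdout_lo]
```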

Trajectory evaluation, reward design, curriculum learning — open research

Long-horizon agent trajectory evaluation lacks an industry-standard method. Evaluating and improving agent reasoning in complex, long-horizon workflows depends on progress on open problems: reward design, trajectory evaluation, interruption handling, observation and state design, tool and action interface design, and curriculum learning. All of it is open research.
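Of these, curriculum learning admits the shortest sketch: a common heuristic is to sample tasks near the policy's current frontier, neither solved nor hopeless. The 0.2 to 0.8 success band is an illustrative assumption, not a recommendation:

```python
def frontier_tasks(success_rates: dict[str, float],
                   lo: float = 0.2, hi: float = 0.8) -> list[str]:
    """Tasks whose recent success rate puts them at the policy's frontier:
    not yet mastered, but not out of reach either."""
    return sorted(task for task, rate in success_rates.items()
                  if lo <= rate <= hi)
```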


Frequently asked questions

What is non-verifiable reward in coding-agent training?

A reward signal that cannot be computed from a deterministic ground-truth check — the case for most real coding tasks, browser tasks, GUI tasks, and SWE traces beyond toy benchmarks. Training graders to improve performance on coding tasks with non-verifiable reward is the work, and classical tests, rubrics, and agentic graders all get hacked by RL agents.

Why is reward hacking the bottleneck rather than reward compute?

Because the recurring failure is not a shortage of compute to score trajectories but agents exploiting whatever score is produced. Reward hacking is actively anticipated; reward design, not reward compute, is the bottleneck.

What is long-horizon agent trajectory evaluation?

Evaluating multi-step, tool-using agent sessions — typically browser, GUI, or SWE traces — as full trajectories rather than as single completions. Long-horizon agent trajectory evaluation lacks an industry-standard method.

Why are dense reward signals on step transitions hard?

The reward that can be computed is whole-trajectory; the reward reinforcement learning needs is per-step — data and reward signals dense enough to train on step transitions, not just whole-trajectory outcomes. RL environment, reward, post-train data, and long-context eval pipeline are today all four immature for long-horizon tool-using agents.

Is there a canonical format for agent-trajectory data — browser, GUI, SWE?

No. Agent-trajectory data — browser, GUI, SWE code traces — has no canonical format or pipeline. Browser interactions, SWE/code traces, GUI sessions, and multi-turn workflows are the unsolved data surface.

Do real-usage feedback loops exist for production AI agents?

No. Real-usage feedback loops don’t exist. Quality is not yet measurable or operational. Agent failure taxonomy is open. Quality metric definitions are ambiguous.

Are there standards for data quality and reproducibility across agent gyms?

No. Standards for data quality and reproducibility across large-scale agent gyms — coverage metrics, invariance checks, trace audits — do not yet exist. Keeping training signals faithful, stable, and scalable is open.

How are session-level quality metrics for monitoring AI agents defined today?

They are not. Quality is not yet measurable or operational. Good, bad, and degraded sessions, alerting, and triage primitives are the needed definitions. Quality metric definitions remain ambiguous.

How long does reward-signal iteration take?

Reward-signal iteration is measured in days of manual experimentation. There is no standardized way to compare rubric methodologies across domains, and human-feedback-data quality assessment is still manual and ad-hoc.

Is reward hacking solved?

No. Reward hacking, mode collapse, and exploitable loopholes remain open problems. Reward hacking is a recurring, ongoing failure mode that needs dedicated detection, not one-off mitigation.

What open research problems remain in long-horizon agent evaluation?

Reward design, trajectory evaluation, interruption handling, observation and state design, tool and action interface design, and curriculum learning. Agent trajectory and long-horizon eval is an open research problem.

Is RLHF data labeling quality solved?

No. Human-feedback-data quality assessment is still manual and ad-hoc. RLHF data labeling quality control is open.

What is an agent failure taxonomy and does one exist?

A shared classification of how agent sessions fail — wrong tool, looped action, hallucinated observation, abandoned subgoal. Agent failure taxonomy is open.
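The four modes named above, sketched as an enum plus one naive detector; real classification would need full trace analysis, this only illustrates the shape a shared taxonomy could take:

```python
from enum import Enum

class FailureMode(Enum):
    WRONG_TOOL = "wrong_tool"
    LOOPED_ACTION = "looped_action"
    HALLUCINATED_OBSERVATION = "hallucinated_observation"
    ABANDONED_SUBGOAL = "abandoned_subgoal"

def detect_loop(actions: list[str], k: int = 3) -> bool:
    """Naive check for FailureMode.LOOPED_ACTION: the last k actions
    in a trace are identical."""
    return len(actions) >= k and len(set(actions[-k:])) == 1
```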

What does it mean that RL agents are gaming the reward function?

Tests pass, rubrics return high scores, agentic graders rate the trajectory clean, and the agent is still gaming them. Classical tests, rubrics, and agentic graders all get hacked by RL agents. The exploit can be anticipated; a reward design that closes the loophole is still open.


Get in touch