The reward signal is non-verifiable. The trajectory is long-horizon. The data is multi-turn. The feedback loop terminates upstream of training. The grader, the gym, and the eval are each open.
Claude Code average turn duration through Feb ’26 (Anthropic report), with a Finetuner projection out to 2031: challenges with long-horizon agents will grow exponentially.
Classical tests, rubrics, and agentic graders all get hacked by RL agents. Training graders that improve performance on coding tasks with non-verifiable reward is the load-bearing piece of coding-agent RL, and the grader stack is what breaks. Reward hacking is actively anticipated: the bottleneck is reward design, not reward compute, and agents gaming the reward function is the recurring failure mode.
The training loop needs dense reward signals on step transitions, not just whole-trajectory outcomes. The reward that can be computed today is end-of-trajectory; the per-step reward RL needs to train on is not. For long-horizon tool-using agents, all four of environment, reward, post-training data, and long-context eval pipeline are immature. Reward shaping that supports step-by-step learning over trajectory outcomes, with a training signal that stays stable and scalable, is open.
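One standard way to densify an end-of-trajectory reward without changing the optimal policy is potential-based shaping. A minimal sketch, assuming a hypothetical progress estimator (fraction of target tests passing); the potential function and state fields are illustrative, not a prescription:

```python
# Potential-based reward shaping: r'(s, s') = gamma * phi(s') - phi(s),
# plus the sparse terminal reward on the last transition. The shaping
# term preserves the optimal policy while giving per-step signal.

GAMMA = 0.99

def potential(state: dict) -> float:
    """Hypothetical progress estimate: fraction of target tests passing."""
    return state["tests_passed"] / max(state["tests_total"], 1)

def shaped_rewards(trajectory: list[dict], final_reward: float) -> list[float]:
    """Spread a sparse end-of-trajectory reward into per-step signals."""
    rewards = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        rewards.append(GAMMA * potential(curr) - potential(prev))
    rewards[-1] += final_reward  # terminal outcome still dominates
    return rewards
```

The shaping term only redistributes credit; whether such a heuristic potential is itself hackable is exactly the open reward-design question.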
Agent-trajectory data (browser interactions, GUI sessions, SWE/code traces, multi-turn workflows) has no canonical format or pipeline. Collecting and benchmarking multi-turn agent data at the needed fidelity, and standardizing the trajectory format, remain open.
Real-usage feedback loops don’t exist. Collecting, cleaning, and interpreting user signals to inform model and harness changes sits on three current blockers: quality is not yet measurable or operational, the agent failure taxonomy is open, and quality metric definitions are ambiguous.
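The collect-and-clean half of that loop can be sketched. The event names below ("retry", "thumbs_down", "abandon") are assumptions; real products emit their own vocabulary. The cleaning step drops sessions too short to be informative:

```python
# Turn raw usage events into per-session features for downstream analysis.
from collections import Counter, defaultdict

def session_features(events, min_events=3):
    """events: iterable of (session_id, event_name) tuples."""
    by_session = defaultdict(Counter)
    for sid, name in events:
        by_session[sid][name] += 1
    return {
        sid: {
            "n_events": sum(c.values()),
            "retries": c["retry"],
            "explicit_negative": c["thumbs_down"] + c["abandon"],
        }
        for sid, c in by_session.items()
        if sum(c.values()) >= min_events   # drop uninformative sessions
    }
```

The hard part the sketch elides is interpretation: whether a retry means model failure, harness failure, or user exploration is exactly the open taxonomy question.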
Standards for data quality and reproducibility across large-scale agent gyms (coverage metrics, invariance checks, trace audits) do not yet exist. Keeping training signals faithful, stable, and scalable is a present requirement, not an aspiration.
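Two of the named audits can be made concrete; a sketch under assumptions about what such checks would look like: a coverage metric over the tool vocabulary a trace corpus actually exercises, and an invariance check that a paraphrased task prompt should not move a grader's score beyond a tolerance:

```python
def tool_coverage(traces: list[list[str]], tool_vocab: set[str]) -> float:
    """Fraction of the tool vocabulary exercised by at least one trace."""
    used = {tool for trace in traces for tool in trace}
    return len(used & tool_vocab) / len(tool_vocab)

def invariance_violations(grader, task_pairs, tol=0.1):
    """Flag (original, paraphrase) pairs whose grader scores diverge.

    grader: fn(prompt) -> float; task_pairs: list of (prompt, paraphrase).
    """
    return [
        (a, b) for a, b in task_pairs
        if abs(grader(a) - grader(b)) > tol
    ]
```

Neither check certifies a gym; both catch regressions cheaply, which is what a reproducibility standard would have to systematize.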
Quality is not yet measurable or operational. Reliability and guardrails depend on making it both: defining good, bad, and degraded sessions, plus the alerting and triage primitives, so that session-level quality metrics reach agent monitoring and the training loop. Quality metric definitions remain ambiguous.
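What "operational" could mean in code: the labels become computable and alertable. The metric names and thresholds below are placeholder assumptions, not proposed definitions:

```python
def classify_session(m: dict) -> str:
    """m: per-session metrics, e.g. {'task_done': bool, 'loops': int,
    'user_retries': int}. Returns 'good' | 'degraded' | 'bad'."""
    if not m["task_done"]:
        return "bad"
    if m["loops"] > 2 or m["user_retries"] > 1:
        return "degraded"
    return "good"

def alert(sessions: list[dict], bad_rate_threshold=0.2) -> bool:
    """Trip an alert when the bad-session rate crosses a threshold."""
    bad = sum(classify_session(m) == "bad" for m in sessions)
    return bad / len(sessions) > bad_rate_threshold
```

The ambiguity the section names lives entirely in the thresholds and in which metrics belong in `m`; the scaffolding itself is trivial.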
Reward-signal iteration is measured in days of manual experimentation. There is no standardized way to compare rubric methodologies across domains. Running human-data experiments before adding to preference models, debugging reward hacks, and assessing rewards, including detecting reward hacks and other pathologies, all run as manual loops. Reward hacking is a recurring failure mode that needs dedicated detection, not one-off mitigation, and human-feedback-data quality assessment is still manual and ad hoc.
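One shape dedicated detection could take, as opposed to one-off mitigation: flag trajectories where the proxy reward (grader score) is high but an independent audit signal (held-out tests, human spot-check) is low. The field names and thresholds are assumptions:

```python
def flag_reward_hacks(records, proxy_hi=0.8, audit_lo=0.3):
    """records: dicts with 'proxy_reward' and 'audit_score' in [0, 1].

    High proxy reward with low independent audit score is the signature
    of a gamed grader; flagged records go to manual triage.
    """
    return [
        r for r in records
        if r["proxy_reward"] >= proxy_hi and r["audit_score"] <= audit_lo
    ]
```

The detector is only as good as the independence of the audit signal; an audit the agent can also optimize against reproduces the original problem.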
Mitigating reward hacking, mode collapse, and exploitable loopholes is ongoing work. RL agents gaming the reward function and exploiting it adversarially is a recurring pathology, and reward design remains the bottleneck.
Long-horizon agent trajectory evaluation lacks an industry-standard method; it remains an open research problem. Evaluating and improving agent reasoning in complex, long-horizon workflows depends on progress on reward design, trajectory evaluation, interruption handling, observation and state design, tool and action interface design, and curriculum learning.
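Absent a standard method, a common pattern is to score each step transition with per-aspect judges and aggregate, rather than score only the final completion. A sketch where the judge functions stand in for rubric or model-based graders (all names here are illustrative):

```python
def eval_trajectory(steps, judges):
    """Trajectory-level evaluation over step transitions.

    steps:  list of (state, action) pairs in order.
    judges: dict name -> fn(prev_state, action, next_state) -> float in [0, 1].
    Returns mean score per judge across all transitions.
    """
    transitions = list(zip(steps, steps[1:]))
    scores = {
        name: [fn(s, a, s2) for (s, a), (s2, _) in transitions]
        for name, fn in judges.items()
    }
    return {name: sum(v) / len(v) for name, v in scores.items()}
```

Aggregation by mean is itself a design choice; min-over-steps or discounted sums weight early versus late failures very differently, which is part of why no standard has settled.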
A reward signal that cannot be computed from a deterministic ground-truth check, which is the case for most real coding, browser, and GUI tasks and for SWE traces beyond toy benchmarks. Training graders that improve performance under non-verifiable reward is the work, and classical tests, rubrics, and agentic graders all get hacked by RL agents.
Reward hacking is actively anticipated; reward design is the bottleneck, not reward compute.
Evaluating multi-step, tool-using agent sessions — typically browser, GUI, or SWE traces — as full trajectories rather than as single completions. Long-horizon agent trajectory evaluation lacks an industry-standard method.
The reward that can be computed is whole-trajectory; the reward RL needs is per-step: data and reward signals dense enough to train on step transitions, not just whole-trajectory outcomes. Environment, reward, post-training data, and long-context eval pipeline are all four immature for long-horizon tool-using agents.
No. Agent-trajectory data — browser, GUI, SWE code traces — has no canonical format or pipeline. Browser interactions, SWE/code traces, GUI sessions, and multi-turn workflows are the unsolved data surface.
No. Real-usage feedback loops don’t exist. Quality is not yet measurable or operational. Agent failure taxonomy is open. Quality metric definitions are ambiguous.
No. Standards for data quality and reproducibility across large-scale agent gyms — coverage metrics, invariance checks, trace audits — do not yet exist. Keeping training signals faithful, stable, and scalable is open.
They are not. Quality is not yet measurable or operational: the needed definitions are good, bad, and degraded sessions, plus alerting and triage primitives. Quality metric definitions remain ambiguous.
Reward-signal iteration is measured in days of manual experimentation. There is no standardized way to compare rubric methodologies across domains, and human-feedback-data quality assessment is still manual and ad-hoc.
No. Reward hacking, mode collapse, and exploitable loopholes are direct problem statements. Reward hacking is a recurring, ongoing failure mode that needs dedicated detection, not one-off mitigation.
Reward design, trajectory evaluation, interruption handling, observation and state design, tool and action interface design, and curriculum learning. Agent trajectory and long-horizon eval is an open research problem.
No. Human-feedback-data quality assessment is still manual and ad-hoc. RLHF data labeling quality control is open.
A shared classification of how agent sessions fail — wrong tool, looped action, hallucinated observation, abandoned subgoal. Agent failure taxonomy is open.
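The four categories just listed can be pinned down as code. A sketch assuming exactly those labels plus one crude looped-action detector; a real taxonomy needs many more categories and trace-derived assignment rules:

```python
from enum import Enum

class AgentFailure(Enum):
    WRONG_TOOL = "wrong_tool"              # chose an inapplicable tool
    LOOPED_ACTION = "looped_action"        # repeated an action without progress
    HALLUCINATED_OBSERVATION = "hallucinated_observation"
    ABANDONED_SUBGOAL = "abandoned_subgoal"

def tag_loops(actions: list[str], window: int = 3) -> bool:
    """Crude detector: same action repeated `window` times in a row."""
    return any(
        len(set(actions[i:i + window])) == 1
        for i in range(len(actions) - window + 1)
    )
```

Even this toy version shows why the taxonomy is open: "looped action" needs a progress notion, and "wrong tool" needs a task model, neither of which is in the trace itself.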
Tests pass, rubrics return high scores, agentic graders rate the trajectory clean, and the agent is still gaming them. Classical tests, rubrics, and agentic graders all get hacked by RL agents. The exploit is anticipated; a reward design that closes the loophole is still open.