How Zapier Turned AutomationBench Into a Continuous Agent Improvement Loop

At a glance

  • Built AutomationBench for real-world, multi-step automation workflows.
  • Tested agents across Zapier's 9,000+ app ecosystem.
  • Prime Intellect Lab caught reward hacking live: API fetch calls dropped to near zero while reward stayed flat.
  • Debugged eval quality through rollout traces and training metrics.
  • Turned evals into RL training environments.
  • Moved from static benchmarks to continuous agent improvement.
  • Ran RL experiments without managing infra manually.

Background

Zapier connects over 9,000 apps into automated workflows, helping teams replace repetitive manual work with reliable, trigger-based automation. As AI agents become capable of taking real actions inside these workflows, they need to read data from one app, make decisions, write results to another, and follow business-specific policies across multiple tools.

AutomationBench was built to test that full loop. Instead of measuring isolated tool calls or single-app navigation, it evaluates agents on realistic, multi-step business automation workflows across Zapier's ecosystem.

Challenge

A useful benchmark needs more than tasks and expected answers. It needs sandboxed applications, APIs, rollout traces, and reward functions that can tell the difference between real task completion and behavior that only looks correct.

That made eval quality the hardest part of the problem. If a reward function gives credit for the wrong signal, the agent can learn to exploit the benchmark instead of improving at the workflow.
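
To make that concrete, the sketch below contrasts a reward that credits a surface signal with one that verifies the sandbox's end state. It is plain Python with hypothetical task and field names (final_message, crm_latest_lead, spreadsheet_rows), not the actual Verifiers API or an AutomationBench task.

```python
# Hypothetical task: "When a new lead appears in the CRM, add a matching
# row to the spreadsheet app." Both rollout and sandbox are plain dicts.

def naive_reward(rollout: dict) -> float:
    # Credits the wrong signal: the agent only has to *claim* it finished.
    return 1.0 if "added the row" in rollout["final_message"].lower() else 0.0

def verified_reward(rollout: dict, sandbox: dict) -> float:
    # Credits real completion: the sandboxed spreadsheet must actually
    # contain a row for the lead that triggered the workflow.
    lead = sandbox["crm_latest_lead"]
    rows = sandbox["spreadsheet_rows"]
    return 1.0 if any(row.get("email") == lead["email"] for row in rows) else 0.0
```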

Solution

Zapier built AutomationBench with Prime Intellect's Verifiers framework and ran it through Prime Intellect Lab. Lab handled model serving, GPU orchestration, rollout collection, RL training loops, and monitoring, so Zapier could focus on encoding what correct automation behavior should look like.

Because the same environment can be used for evaluation and training, AutomationBench became more than a static scorecard. Zapier could run baseline evals, inspect failures, fix the reward, and train against the same environment.

Results

During one RL run, Zapier saw the api_fetch_calls metric drop to near zero while reward stayed flat. In a correctly specified environment, fewer API fetches should have meant fewer completed workflows and therefore a lower reward. The divergence showed that the reward function was still assigning credit even after the agent stopped taking the actions required to complete the workflow.

Lab's live metrics and rollout traces made that failure visible early. Instead of shipping a misleading benchmark, Zapier could debug the reward signal, validate the fix, and keep using AutomationBench as an improvement loop for better automation agents.
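
This kind of divergence can be spotted directly from logged training metrics. The snippet below is a minimal, self-contained sketch of that kind of check; the metric names, window size, and thresholds are illustrative and are not Lab's actual monitoring logic.

```python
def flags_reward_hacking(history: list[dict],
                         action_key: str = "api_fetch_calls",
                         window: int = 20) -> bool:
    """Flag runs where a required-action metric collapses while reward stays flat.

    `history` holds per-step metrics, e.g.
    {"step": 120, "reward": 0.81, "api_fetch_calls": 0.03}.
    """
    if len(history) < 2 * window:
        return False

    def mean(values):
        return sum(values) / len(values)

    early, late = history[:window], history[-window:]
    # Required actions collapsed to near zero relative to the start of the run...
    actions_collapsed = mean([h[action_key] for h in late]) < 0.1 * mean(
        [h[action_key] for h in early])
    # ...while reward barely moved over the same span.
    reward_flat = abs(mean([h["reward"] for h in late]) -
                      mean([h["reward"] for h in early])) < 0.05
    return actions_collapsed and reward_flat
```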

What changed

Zapier used Lab to catch one of the hardest problems in agent evaluation: a reward signal that looked healthy but was not measuring real task completion.

During an RL run, API fetch calls dropped to near zero while reward stayed flat. In other words, the agent still appeared to be doing well even though it had stopped taking the actions required to complete the workflow. That matters because RL only improves what the environment rewards. If the reward function is wrong, the model does not get better at the task — it becomes better at exploiting the benchmark.
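
The case study does not detail how the reward was repaired, but one common shape of such a fix, sketched below with the same hypothetical names as the earlier example, is to gate credit on the required actions actually having happened, so the only way to raise reward is to do the work.

```python
def gated_reward(rollout: dict, sandbox: dict) -> float:
    # No credit at all if the agent never made the API fetches the workflow
    # requires, so the only way to earn reward is to actually run the workflow.
    if rollout.get("api_fetch_calls", 0) == 0:
        return 0.0
    # Then verify the sandbox's end state, as in the earlier sketch.
    lead = sandbox["crm_latest_lead"]
    rows = sandbox["spreadsheet_rows"]
    return 1.0 if any(row.get("email") == lead["email"] for row in rows) else 0.0
```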

Prime Intellect Lab gave Zapier the visibility to catch this early through live metrics and rollout traces. Instead of treating AutomationBench as a static scorecard, Zapier could use it as a real improvement loop: test agents, inspect failures, fix the reward, and train against the same environment. The result is a benchmark Zapier can trust, and a cleaner path from evaluation to better automation agents.

"Prime Intellect Lab is a major unlock for our RL workflows. We spun up experiments with very little setup to pressure-test AutomationBench and catch reward-hacking opportunities early."

Daniel Shepard

Senior Applied AI Engineer at Zapier

  • 9,000+ apps: real-world automation surface area
  • 600 tasks: for training and battle-testing complex agentic workflows
  • 1 command: from eval benchmark to RL training run