Experience‑Native, Evolutionary Agents
Agents that finish the job
Programmatic reasoning. Isolated execution. Durable workflows.
Research spotlight: Agency efficiency (LIMI, 2025)
A recent study proposes an Agency Efficiency Principle: sophisticated agentic intelligence can emerge from strategically curated demonstrations rather than large data volumes. The authors define agency as autonomous problem discovery, hypothesis formation, and tool‑using execution. In their LIMI report, they state that 78 carefully designed samples reached 73.5% on comprehensive agency benchmarks, surpassing several larger systems and representing a claimed 53.7% improvement over models trained on 10,000 samples—suggesting that what you show matters more than how much you show for real‑workflow skills. [21][22]
Learn From Lived Experience
Agents improve when they learn from their own interaction histories with grounded rewards, not one‑off prompts. This is the core thesis of “The Era of Experience.” [2]
Built For Production (Not Demos)
The shift is happening now: enterprises are operationalizing agentic workflows, and 2025 has been called the “year of AI agents.” Reliability, governance, and observability are table stakes. [3]
Environments Where Work Happens
Reliable outcomes require instrumented workspaces where agents execute, test, and verify changes—moving beyond file I/O to close the loop from plan → action → validation. [1]

“Imitation hit its ceiling; experience is the new frontier.” (Sutton & Silver, 2025) [2]
“2025 is the year of agents; the discussion has shifted from novelty to production.” (Davos panel) [3]
“Agents need workspaces to validate outputs; otherwise they’re coding in Notepad.” (Daytona) [1]
Reliability primitives
- Temporal retry policies with backoff; non‑retryable classification. [4]
- Heartbeats for long tasks; Continue‑As‑New for long histories. [6]
- Durable event history for recovery and audits. [7]
Think (DSPy)
Programmatic intelligence you can version, test, and improve—turning prompting into an engineered, measurable policy.
Do (Daytona)
Durable execution inside AI workspaces where agents run, test, and validate their changes—closing the loop from plan → action → feedback. [1]
Persist (Temporal)
Durable workflows with state, retries, and history for long-horizon tasks—so agents survive timeouts, crashes, and restarts without losing the plot.
From Prompt to Durable Outcome: Our Process
Opulent OS treats autonomy as an ongoing stream of actions and observations, not isolated chat turns. We align the agent with grounded rewards that come from the environment itself (builds passing, PRs merging, SLAs met) rather than subjective stars—matching the experience‑native direction argued by Sutton & Silver (2025). [2]
We start by scoping the goal and guardrails: define the success signal, risk thresholds, and required artifacts. These are encoded as measurable KPIs and approval gates that become the objective function for the agent.
We implement programmatic reasoning so the policy can be versioned and tested like software, enabling steady improvement with auditable diffs. [5]
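As a sketch of what "policy as software" means in practice, the DSPy module below is ordinary Python that can be versioned, diffed, and unit tested; the signature fields and model identifier are illustrative assumptions, not our production policy.

```python
import dspy

# Illustrative model choice; swap in whatever LM your deployment uses.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class FixPlan(dspy.Signature):
    """Propose a minimal, testable change that satisfies a failing check."""
    failing_check: str = dspy.InputField(desc="CI/test output or policy violation")
    repo_context: str = dspy.InputField(desc="relevant files and diffs")
    plan: str = dspy.OutputField(desc="step-by-step change plan")
    risk: str = dspy.OutputField(desc="low | medium | high")

# The policy is ordinary code: it lives in version control, carries unit
# tests, and can be reviewed and rolled back like any other module.
propose_fix = dspy.ChainOfThought(FixPlan)

prediction = propose_fix(
    failing_check="test_auth_refresh failed: token expired",
    repo_context="auth/session.py, tests/test_auth.py",
)
print(prediction.plan, prediction.risk)
```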
Execution runs in isolated workspaces that enable commands, tests, and previews—going beyond file I/O to close the loop from plan to verification. [1]
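A minimal sketch of that closed loop, assuming a hypothetical `Workspace` adapter; the interface and command handling below are illustrative and do not depict Daytona's actual SDK surface.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RunResult:
    exit_code: int
    stdout: str
    stderr: str

class Workspace(Protocol):
    """Hypothetical adapter over an isolated sandbox (backed by, e.g., a Daytona workspace)."""
    def run(self, command: str, timeout_s: int = 600) -> RunResult: ...

def apply_and_verify(ws: Workspace, patch_cmd: str, test_cmd: str) -> bool:
    """Close the plan -> action -> validation loop inside the sandbox:
    apply the change, then require the checks to pass before anything is promoted."""
    if ws.run(patch_cmd).exit_code != 0:
        return False  # the change itself failed to apply
    return ws.run(test_cmd).exit_code == 0  # validation, not just file I/O
```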
Orchestration uses durable workflows designed for long‑running, recoverable processes with explicit retries, liveness checks, and resumable histories—providing replayable state for recovery and a durable audit trail. [4][6][7]
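A minimal sketch of such a workflow using Temporal's Python SDK; the activity names, timeouts, and error classification are illustrative assumptions rather than our production definitions.

```python
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class ChangeDeliveryWorkflow:
    """Durable orchestration: each step is an activity with explicit retries,
    and the recorded event history makes the run replayable after a crash."""

    @workflow.run
    async def run(self, task_id: str) -> str:
        retry = RetryPolicy(
            initial_interval=timedelta(seconds=2),
            backoff_coefficient=2.0,
            maximum_attempts=5,
            non_retryable_error_types=["SpecValidationError"],  # illustrative
        )
        plan = await workflow.execute_activity(
            "plan_change",  # activity names here are placeholders
            task_id,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=retry,
        )
        return await workflow.execute_activity(
            "apply_and_test",
            plan,
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(minutes=2),  # liveness check for the long step
            retry_policy=retry,
        )
```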
After each step, validation gates check artifacts against tests and policies. Safe changes can be promoted; higher‑risk changes route to a human for review. These gates, together with Daytona’s dry‑run capability and Temporal’s durable state, keep throughput high without sacrificing control.
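A simplified illustration of a promotion gate; the thresholds and risk labels are assumptions, and real gates are driven by the KPIs and approval rules scoped earlier.

```python
from dataclasses import dataclass

@dataclass
class GateDecision:
    promote: bool
    needs_human: bool
    reason: str

def validation_gate(tests_passed: bool, policy_violations: int, risk: str) -> GateDecision:
    """Hypothetical promotion gate; thresholds come from the engagement's KPIs
    and risk levels, not from these illustrative defaults."""
    if not tests_passed or policy_violations > 0:
        return GateDecision(False, False, "failed checks: change rejected")
    if risk == "high":
        return GateDecision(False, True, "checks passed, high risk: route to human review")
    return GateDecision(True, False, "checks passed, acceptable risk: promote")
```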
Finally, outcomes flow back as grounded feedback: success or failure signals update the policy and improve future trajectories. Over time, the system learns from its own experience stream, pushing beyond imitation and toward robust, production‑grade autonomy. [2]
Security & Privacy
- Data protection: encryption in transit and at rest; credential isolation with least‑privilege access.
- Isolated workspaces: sensitive operations run inside controlled Daytona sandboxes; no persistence outside secure contexts.
- Governance: audit trails for actions and promotions; configurable retention aligned to enterprise policies.
External trials (anonymized)
Representative pilot runs; detailed logs and dashboards available under NDA.
Partner A — Research brief automation
- Task: research brief with grounded evidence and executive summary.
- Outputs: compiled report, sources bundle, and shareable previews.
- Ops: Daytona sandbox for controlled browsing and search; diversified retrieval; verifiable citations. [18]
- Checks: evidence freshness and consistency; holds on ambiguous prompts.
- Evidence: workflow event history and artifact receipts for audit.
Partner B — Codebase maintenance
- Task: targeted refactor and stability fixes.
- Outputs: PRs, CI/test logs, review notes, and rollback plan.
- Ops: Ephemeral test runners; resilient workflows with retries and heartbeats; promotion gates to stable. [4][6]
- Checks: structural validation and no‑regression safeguards before merge.
- Evidence: replayable workflow event history with linked artifacts.
A production pipeline for agents that improve themselves, automatically
TL;DR. We operate a self‑evolving optimization loop that (1) observes failures, (2) reflects, (3) mutates instructions/tools/data, and (4) promotes only what wins in gated A/B trials. It’s model‑agnostic, scales at test time, and is engineered for strict isolation and reliable, resumable workflows—so uplift is steady, measurable, and production‑safe. We believe this is the new standard for shipping agent systems: agents that learn from their own mistakes in hours, not quarters.
Why agent performance stalls in production
- Fragile instructions. A single prompt (or even a prompt pack) cannot anticipate the combinatorial chaos of real users, tools, and data. Minor drift—formatting, arithmetic, grounding, tool choice—snowballs into brittle behavior.
- Slow improvement cycles. Manual prompt tuning, heavyweight fine‑tuning, or bespoke RL are expensive and slow. They don’t match the tempo of production where new edge cases appear daily.
Our answer: a self‑evolving loop
- Observe rich traces (plans, tool calls, citations, outputs) with full replay.
- Reflect to turn failure modes into precise, human‑readable rules. [5][8]
- Mutate prompt packs, tool policies, retrieval/citation guards, and safety rails via small, testable deltas.
- Select with deterministic, workflow‑level A/B gates; only non‑dominated wins are promoted with provenance.
How the loop works (high‑level)
- Reflection engine. We mine traces for recurring failure patterns and distill them into clear, human‑readable rules across instructions, tools, and policies. [5][8]
- Evolutionary mutation. We generate small, targeted candidates across those layers that are easy to test and compare.
- Selection & promotion. Rapid screening prunes weak variants; winning candidates advance to full validation in live workflows, with promotion based on evidence and provenance.
- Memory. Promoted changes become prior knowledge for future cycles—each loop starts smarter than the last.
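A compressed sketch of one cycle, where `reflect`, `mutate`, and `evaluate` are placeholders for the components above rather than our production harness:

```python
def evolve(policy, traces, reflect, mutate, evaluate, generations=5, pool=4):
    """Compressed outline of one observe -> reflect -> mutate -> select cycle.
    `reflect`, `mutate`, and `evaluate` stand in for the reflection engine,
    delta generator, and gated A/B evaluation described above."""
    best, best_score = policy, evaluate(policy)
    for _ in range(generations):
        rules = reflect(traces, best)                            # failure patterns -> readable rules
        candidates = [mutate(best, rules) for _ in range(pool)]  # small, testable deltas
        top_score, top = max(
            ((evaluate(c), c) for c in candidates), key=lambda sc: sc[0]
        )
        if top_score > best_score:                               # promote only evidenced wins
            best, best_score = top, top_score                    # the promoted delta becomes the new prior
    return best
```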
Built for production from day one
- Isolated execution lanes. Tool use runs inside tightly controlled workspaces with strict I/O policies and per‑task budgets. Blast radius is confined; trials are auditable. [1]
- Durable, resumable workflows. Long‑running optimizations use retries, liveness heartbeats, resumable histories, and durable event logs for recovery and audit. [4][6][7]
- Observability & governance. End‑to‑end traces cover generation ↔ verification. Gates are enforced at the workflow layer. Changes are explainable by design.

Long‑horizon execution: our optimization loop and Temporal workflows reliably resume and complete tasks that exceed typical time windows.
What this looks like in practice
On a tough GAIA‑style slice spanning diverse tasks, we ran the loop against a baseline policy and a candidate policy derived from reflection‑guided mutations:
- A/B results. The candidate delivered double‑digit percentage‑point gains on two segments while holding parity elsewhere.
- Why it worked. Reflection identified recurring extraction issues; targeted edits tightened formatting and applied task‑aware constraints to resolve them.
- Operational health. Unknown provider params were dropped by design; large payloads were chunked; endpoint identity was verified before external calls.
In a lightweight reflective program optimization, we likewise lifted a validation metric by roughly six percentage points (≈3×) with a handful of accepted mutations. High‑variance subsamples guided pruning; only deltas that generalized were promoted. None of this required model retraining—the uplift came from smarter instructions, policies, and guards, proven in the same execution fabric that serves users.
Why not “just do RL”? Why this beats one‑off prompt tuning
We love RL where it shines. Training can instill durable, long‑horizon skills and robust tool use. When those backbones exist, we combine them with our loop: training contributes durable, general skills, while the loop adds rapid, task‑level specialization. This avoids the trap of one‑off prompt tuning—hand‑editing prompts for each edge case doesn’t scale. Our loop industrializes the improvement cycle—reflect → mutate → select—with proof. It’s frugal in trials, friendly to API models, and tightly coupled to observable outcomes.
Test‑time search mindset. Diversity among candidates and real selection pressure keep the system out of local minima, reduce overfit, and converge on strategies that hold up under real users.
What sets our approach apart
- Tight causality. We tie every uplift to a promoted delta and its artifact trail: what changed, why it won, how it generalizes.
- Live‑system truth. Gates run at the workflow layer users touch. Wins aren’t simulated; they’re experienced.
- Speed with safety. Isolated execution and durable workflows make fast iteration compatible with strict controls.
- Model‑agnostic leverage. Works with frontier APIs and fine‑tuned backbones alike. No lock‑in; immediate portability.
Vision: the standard for self‑improving systems
The next wave of agent platforms won’t ship static reasoning templates. They’ll ship optimizers—systems that can introspect, evolve, and prove their own improvements. Our stack is that optimizer: pragmatic enough for today’s workloads, principled enough to serve as the backbone of tomorrow’s autonomous systems.
Get started / Work with us
- Pilot the loop. Pick a real workload; we’ll instrument traces, run reflective cycles, and stand up workflow‑level gates. Expect visible wins within a modest trial budget.
- Integrate the harness. Prefer to own it? We can provide a reflect‑mutate‑select harness with pluggable gates and metrics.
- Need a visual? We have a one‑pager (loop diagram + A/B table + Pareto sketch) to align your team quickly.
Reliability & SLOs
- Reliability: industry‑grade success rates with minimal rollbacks; workflows automatically resume after faults (verified in execution histories).
- Safeguards: dry‑run in Daytona workspaces; approval gates; audit logs; human‑in‑the‑loop on elevated‑risk changes.
Reliability primitives:
- Temporal retry policies with backoff and non-retryable classification. [4]
- Heartbeats for long-running activities and Continue-As-New to manage event history growth. [6]
- Durable Event History for crash recovery and audit. [7]
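Sketched with Temporal's Python SDK, assuming illustrative names and thresholds:

```python
from datetime import timedelta
from temporalio import activity, workflow

MAX_ROUNDS = 100  # illustrative Continue-As-New threshold

@activity.defn
async def process_round(round_no: int) -> None:
    # Heartbeats tell the server a long-running activity is still alive; if
    # the worker dies, the retry policy reschedules the activity elsewhere.
    activity.heartbeat(round_no)
    ...  # one bounded unit of work

@workflow.defn
class LongHorizonWorkflow:
    @workflow.run
    async def run(self, round_no: int) -> None:
        await workflow.execute_activity(
            process_round,
            round_no,
            start_to_close_timeout=timedelta(minutes=30),
            heartbeat_timeout=timedelta(minutes=2),
        )
        # Continue-As-New rolls over into a fresh execution carrying the cursor,
        # keeping the durable event history bounded on very long runs.
        if round_no + 1 < MAX_ROUNDS:
            workflow.continue_as_new(round_no + 1)
```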
Why we lead with durability: independent analyses suggest frontier capability will keep scaling, but infrastructure reliability remains the practical constraint—so we architect for recoverability first. [20]
Grounded Evaluation
“Good” isn’t a vibe—it’s a metric. We score agents by environmental signals (e.g., build passes, PR merges, response SLAs, revenue events), not just human stars. This reflects the argument of Sutton & Silver (2025) in “The Era of Experience” that grounded rewards unlock strategies beyond human prejudgment; it also matches industry guidance to prefer verifiable artifacts and checks over free‑form chat when stakes are high. [2][16]
Example KPIs:
- Coding: % PRs merged w/o rework; flaky-test reduction; MTTR for builds.
- Research: time-to-artifact; source coverage; hallucination rate ≤ target.
- Customer Ops: first-contact resolution; time-to-resolution; SLA adherence.
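For illustration, a grounded reward can be computed directly from such signals; the weights and target below are assumptions, not our production scoring.

```python
def grounded_reward(build_passed: bool, pr_merged_without_rework: bool,
                    sla_met: bool, hallucination_rate: float,
                    hallucination_target: float = 0.02) -> float:
    """Every term is an environmental signal (CI, version control, ticketing),
    not a human rating; the weights and target are illustrative assumptions."""
    score = 0.0
    score += 0.4 if build_passed else 0.0
    score += 0.3 if pr_merged_without_rework else 0.0
    score += 0.2 if sla_met else 0.0
    score += 0.1 if hallucination_rate <= hallucination_target else 0.0
    return score  # in [0, 1]; used as the objective for the optimization loop
```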
DSPy + GEPA: reflective prompt evolution
We implement GEPA principles on top of DSPy to evolve policies with language feedback, not just scalar rewards. A reflector analyzes traces (reasoning, tool use, evaluation feedback) and proposes targeted prompt edits; a genetic search with Pareto selection maintains diverse, high‑performing candidates and avoids local optima. This delivers rapid, sample‑efficient improvement that’s practical in production where rollouts are costly. [8][11]
- Language‑level learning: mutate detailed instructions from human‑readable critiques of failures (formatting, arithmetic, missing evidence, tool choice). [8]
- Pareto selection: promote non‑dominated candidates across slices to maintain strategy diversity and robustness. [11]
- Sample efficiency: GEPA reports double‑digit gains over an RL baseline with dramatically fewer rollouts; we adopt that mindset to keep trials frugal and fast. [11]
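Non‑dominated (Pareto) filtering itself is simple to state; below is a minimal sketch over per‑slice scores, with illustrative data shapes.

```python
def dominates(a: dict, b: dict) -> bool:
    """`a` dominates `b` if it is at least as good on every slice and strictly better on one."""
    return all(a[s] >= b[s] for s in b) and any(a[s] > b[s] for s in b)

def pareto_front(candidates: dict) -> list:
    """Keep every candidate that no other candidate dominates.
    `candidates` maps a candidate id to its per-slice scores, e.g.
    {"v1": {"coding": 0.71, "research": 0.64}, "v2": {"coding": 0.69, "research": 0.70}}."""
    return [
        name
        for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items() if other_name != name)
    ]
```

In the example above, neither candidate dominates the other, so both survive; that is exactly how strategy diversity is preserved across slices.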
Environment scaling and asynchronous evaluation
Research on environment scaling shows that broad, verifiable tool‑use domains and simulated agent–human interplay can systematically grow function‑calling competence. Tongyi’s AgentScaler constructs heterogeneous, database‑backed environments and uses two‑stage experience learning (general skills → domain specialization) to lift tool‑calling accuracy across τ‑bench/ACEBench—evidence that richer environments, not just bigger prompts, raise the ceiling. [12]
In parallel, Meta’s ARE frames agent evaluations as event‑driven, time‑based scenarios with rigorous verifiers (Gaia2). Budgets and asynchronous notifications expose failure modes unseen in static tests and reveal plateaus in simple scaffolds—motivating our focus on durable workflows, event history, and gated promotion. [13][14]
How Heavy maps to ARE concepts
- Apps → Tools/Workspaces: ARE apps that read/write state map to Daytona tool adapters in sandboxed workspaces. [1]
- Events → Workflow history: ARE events align with Temporal’s event history, retries, and signals for deterministic recovery. [4][6][7]
- Scenarios → Workflows + gates: ARE scenarios with verifiers correspond to Heavy workflows with validation gates and promotion rules. [13][14]
- Notifications → Signals/queues: asynchronous observability maps to workflow signals and progress channels.
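For the notifications‑to‑signals mapping, a small sketch using Temporal's Python SDK; the workflow, signal, and return values are illustrative.

```python
from typing import Optional
from temporalio import workflow

@workflow.defn
class HumanHoldWorkflow:
    """Asynchronous notifications as signals: the workflow pauses durably until
    an external decision arrives, and the wait survives worker restarts because
    it is recorded in the event history."""

    def __init__(self) -> None:
        self._approved: Optional[bool] = None

    @workflow.signal
    def approval(self, approved: bool) -> None:
        self._approved = approved

    @workflow.run
    async def run(self, change_id: str) -> str:
        # Block until a reviewer (or an upstream system) signals a decision.
        await workflow.wait_condition(lambda: self._approved is not None)
        return f"promote {change_id}" if self._approved else f"hold {change_id}"
```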
Limits & guardrails
Known failure modes and mitigations
- Tool auth expiry / quota: proactive re‑auth checks; retry with backoff; human hold if repeated.
- CAPTCHAs / bot walls: avoid unsanctioned bypass; escalate to human; record gap in artifact.
- Ambiguous specs: require crisp acceptance criteria; block on missing inputs via workflow signal.
- Flaky sources: diversify retrieval and require corroboration for claims. [18]
- Long histories: Continue‑As‑New thresholds keep histories bounded; idempotency on activities. [6][7]
Ready to move from demos to delivery?
Talk to an architect about deploying Opulent Heavy, or open the Platform to explore the experience-native model.
References
- [1] Daytona — Building Better AI Agents: The AI Enablement Stack
- [2] Sutton & Silver — The Era of Experience (preprint hosted on DeepMind media)
- [3] Axios — “2025 is the year of AI agents” (OpenAI CPO, Davos)
- [4] Temporal — Retry Policies
- [5] DSPy — Documentation
- [6] Temporal — Events and Event History (concepts)
- [7] Temporal — Event History (encyclopedia)
- [8] DSPy — GEPA tutorial (reflective prompt‑evolution)
- [9] Jeremy Berman — How I got the highest score on ARC‑AGI (again)
- [10] Tongyi — Deep Research pipeline (long‑horizon agents)
- [11] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (2025)
- [12] Tongyi Lab — Towards General Agentic Intelligence via Environment Scaling (AgentScaler, Sep 2025)
- [13] Meta — ARE: Scaling Up Agent Environments and Evaluations (2025)
- [14] Meta — ARE code (Gaia2 and platform, 2025)
- [15] Cursor — Online RL for Cursor Tab (Sep 11, 2025)
- [16] Simon Willison — On verifiable artifacts vs. free‑form outputs (Mar 2025)
- [18] Submodular optimization for diverse query selection (retrieval)
- [19] Artificial Analysis — Intelligence benchmarking methodology (updated 2025)
- [20] Epoch AI — Can AI scaling continue through 2030? (2025)
- [21] LIMI — Less Is More for Intelligent Agency (Hugging Face Papers index, 2025)
- [22] LIMI — arXiv preprint (PDF, 2025)