The bad version is when the cloud job is still green, the agent summary says completed, and the PnL is wrong in that annoying way where nothing is obviously broken, just slightly poisoned.

Gateway starts throwing 504s. The cheap model path times out. fallback_enabled is true, max_inference_usd_per_run is missing, so the planner spends the next few minutes on the heavy model trying to reason through a route that should have been rejected by a dumb cap before the LLM even got involved. While that is happening, the router quote is aging, the RPC is lagging by a few blocks, and slippage_bps is still 800 because somebody used the test config on a live pool and nobody wanted to deal with false fails during dry runs.

Then the agent executes against the stale quote.

Not because it is malicious. It just has enough permission to be stupid at full speed.

The trade hash exists. The bridge hash exists. The vault receipt exists. The UI has numbers. OpenLedger plus OctoClaw can make that path feel pretty smooth from the builder side, which is the dangerous part, because the smoother it feels to launch, the easier it is to forget that every default is now sitting between a strategy and actual money.

I keep seeing the same shape of failure in different clothes.

The first run lands, so the config gets treated like deployment garnish. temperature: 0.7 because “reasoning” sounded useful. quote_ttl_seconds: 90 because 12 seconds caused annoying rejects. bridge_finality_mode: submitted because waiting for finalized made the run look slow. allow_vault_deposit: true because the strategy needed yield access. retry_count: 3 because retrying sounds safer than failing.

All normal looking lines.

Under load they start interacting like junk.

An RPC node drops requests during a gas spike, so the quote refresh fails, then retry_on_quote_failure kicks in and reuses a route that was already stale. The bridge call returns submitted, not finalized, but the state object only has submitted and failed, so the planner treats it as usable enough. In the same run, the vault deposit receipt comes back before the accounting check pulls a fresh pricePerShare, and now the agent has shares, a half-true destination balance, and a route plan built from a number it should not have trusted.

This is the kind of thing that makes traces miserable to read.

Block height says one thing. Router quote timestamp says another. Bridge explorer says pending. Vault UI shows shares. The agent memory says the asset moved. The next step was sized off inferred inventory because the JSON never had a real limbo state like awaiting_finality or balance_not_spendable_yet.

So now the operator is not debugging one bug. They are debugging a config-shaped race condition.

The bridge did not fail cleanly. The quote did not expire hard enough. The vault did not lie exactly, it just exposed shares in a way the agent treated as a clean balance. The fallback model did not break the run, it made the run expensive while giving the planner more room to sound confident. And because the retry policy had no clue whether it was retrying a read, a quote, a write, or a cross-chain message already in flight, the logs start to read like someone trying to prove innocence after the fact.

I do not want retry_count: 3 unless the state machine is boring and mean.

Re-quote after timeout. Kill the run if RPC providers disagree on block height. Split max_slippage_bps by route and market condition. Put max_inference_usd_per_run in the config like an adult. fallback_model_allowed can be true for analysis and false for execution. bridge_finality_mode should not mean “we got a hash, close enough.” Vault actions need caps, simulation requirements, post-receipt accounting checks, and probably less trust in whatever balance field is easiest to parse.

None of this is glamorous. It is mostly just refusing to let a planner continue when the facts are squishy.

The config diff that matters is always ugly.

bridge_finality_mode: finalized instead of submitted.

quote_ttl_seconds down from 90 to 12.

max_route_cost_usd set to a real number instead of null.

max_inference_usd_per_run: 25.

fail_closed_on_rpc_mismatch: true.

require_post_vault_accounting_check: true.

And then you still get the run where the bridge is pending longer than usual, the LP moved, the RPC provider returned inconsistent block heights twice, and the agent almost recovered itself into a worse trade because recovery was allowed to reuse too much previous state.

That is the part people underrate with trading agents. The agent does not need some sci-fi failure mode. It just needs a few soft edges in the config and enough cloud uptime to keep acting on them.

By the time the LP pings on Telegram asking why your agent crossed the pool at a garbage price, you are not thinking about agent UX anymore. You are staring at a broken execution trace, a vault receipt, two stale quotes, three RPC responses, and a config file that allowed every single step.

#OpenLedger $OPEN @OpenLedger