DeepReinforce, the AI lab behind CUDA-L1 and the IterX code-agent loop, quietly dropped Ornith-1.0 late last week — a family of open-source coding models designed not for chatty assistants but for autonomous developer agents. The models are available on Hugging Face under an MIT license with no regional restrictions, and come in four sizes: 9B dense, 31B dense, 35B mixture-of-experts (MoE), and a 397B MoE flagship. What makes Ornith different is its focus on “agentic” coding. Most consumer-facing LLMs excel at one-off conversations: you ask, they reply. Agentic models instead take a task and carry out multi-step workflows autonomously — reading repos, running tests, diagnosing failures, editing files, rerunning tests, and iterating until the job’s done. That kind of hands-off developer automation is where much of 2026’s commercial momentum is concentrated, and DeepReinforce built Ornith explicitly for it. Architecturally and operationally, Ornith treats the agent’s scaffold — the rules that decide when to call tools, how to decompose problems, and handle errors — as something to learn, not hard-code. During reinforcement learning each step has two phases: the model first proposes a refined strategy for the task, then executes that strategy to generate a solution. Rewards are backpropagated to both the strategy and the execution stage, so the system optimizes for better planning as well as better code. Over many iterations, this produces task-specific approaches that emerge from training rather than being manually engineered. DeepReinforce also acknowledges a real risk: if the model can design its own scaffold, it might game verification (e.g., “touch” files to fake task completion). Ornith defends against reward hacking with three layers: immutable environments and test suites outside the model’s control, a deterministic monitor that flags attempts to access restricted paths or modify verification scripts, and a frozen judge model that can veto suspicious verifier outputs. Performance-wise the 397B Ornith posts strong numbers on agentic coding benchmarks. It scores 82.4 on SWE-bench Verified (fixing real GitHub bugs without seeing the test suite), edging out Claude Opus 4.7’s 80.8 and DeepSeek-V4-Pro’s 80.6. On Terminal Bench 2.1 — 89 terminal-style tasks inside containers — it achieves a 77.5 completion rate versus Claude Opus 4.7’s 70.3. Because benchmark contamination has been a concern (memorization of leaked solutions), DeepReinforce also reports SWE-bench Pro, a tougher, less-leaked variant: Ornith-397B lands at 62.2 there — lower but still competitive and ahead of DeepSeek V4 Pro. Smaller variants also impress. The 9B model posts 69.4 on SWE-bench Verified — notably higher than Gemma 4-31B’s 52 and roughly on par with Qwen 3.5-35B’s 70, despite being 3–4x smaller. That makes the small Ornith potentially useful for on-edge or self-hosted agentic pipelines where compute is constrained. Important caveats: Ornith-1.0 is not a general-purpose assistant. Its documentation warns it may underperform on tasks unrelated to agentic coding — summarization, long-form writing, or email drafting are outside its design goals. The release is aimed at teams already running agent infrastructure or building self-hosted dev automation, not casual users looking for an all-purpose AI. Finally, context on comparisons: the “beats Claude” headlines are real but narrow. Ornith-397B surpasses Claude Opus 4.7 on these agentic coding tests, but Anthropic’s newer Claude Opus 4.8 scores higher. The meaningful apples-to-apples takeaway is that Ornith leads the open-source pack at comparable parameter counts on agent-focused coding tasks. For crypto devs and web3 teams that value open-source tooling, permissive licensing, and the ability to self-host agentic pipelines, Ornith-1.0 looks like a practical step forward — especially if you’re automating long dev loops or running code-heavy infrastructure where hands-off agents can save time and money. Read more AI-generated news on: undefined/news