Chinese AI firm Z.ai has launched GLM-5.1, an open-source coding model it says is built for agentic software engineering. The release comes as AI vendors move beyond autocomplete-style coding tools toward systems that can handle software tasks over longer periods with less human input.
Z.ai said GLM-5.1 can sustain performance over hundreds of iterations, a capability it argues sets it apart from models that lose effectiveness in longer sessions.
As one example, the company said GLM-5.1 improved a vector database optimization task over more than 600 iterations and 6,000 tool calls, reaching 21,500 queries per second, about six times the best result achieved in a single 50-turn session.
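Z.ai has not published the harness behind that figure, but the description implies a measure-modify-verify loop. A minimal sketch of that pattern, with `propose_patch` and `run_benchmark` as hypothetical stand-ins for the model's tool calls rather than any published Z.ai API, might look like this:

```python
import random

# Illustrative stand-ins for the agent's tool calls; a real harness
# would edit source files, rebuild, and rerun a benchmark suite.
def propose_patch(history):
    """Stand-in for the model proposing a code change."""
    return {"attempt": len(history)}

def run_benchmark(patch=None):
    """Stand-in for profiling the patched build, in queries per second."""
    return random.uniform(3_000, 22_000)

def optimization_loop(max_iterations=600):
    best_qps = run_benchmark()          # establish a baseline
    history = []
    for _ in range(max_iterations):
        patch = propose_patch(history)
        qps = run_benchmark(patch)      # verify each change empirically
        if qps > best_qps:              # keep only measured improvements
            best_qps = qps
        history.append((patch, qps))    # feed results back into context
    return best_qps
```

The point of such a loop is that each iteration is checked against a measurement rather than the model's own judgment, which is what allows improvements to accumulate over hundreds of turns.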
In a research note, Z.ai said GLM-5.1 outperformed its predecessor, GLM-5, on several software engineering benchmarks and showed particular strength in repo generation, terminal-based problem solving, and repeated code optimization. The company said the model scored 58.4 on SWE-Bench Pro, compared with 55.1 for GLM-5, and above the scores it listed for OpenAI’s GPT-5.4, Anthropic’s Opus 4.6, and Google’s Gemini 3.1 Pro on that benchmark.
GLM-5.1 has been released under the MIT License and is available through the company’s developer platforms, with model weights also published for local deployment, the company said. That may appeal to enterprises seeking more control over how such tools are deployed.
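For teams weighing local deployment, a minimal sketch of loading openly published weights with the Hugging Face transformers library follows. The repository id `zai-org/GLM-5.1` is an assumption for illustration, not a confirmed location, and actual hardware requirements depend on the model's size.

```python
# A minimal sketch of local inference with Hugging Face transformers.
# The repo id "zai-org/GLM-5.1" is illustrative; check the vendor's
# published weights for the actual location and license terms.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zai-org/GLM-5.1"  # assumed repo id, not confirmed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",   # spread layers across available GPUs (needs accelerate)
    torch_dtype="auto",  # use the dtype the checkpoint ships with
    trust_remote_code=True,
)

prompt = "Write a function that deduplicates a list while preserving order."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```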
Longer-running coding agents
Z.ai positions long-running performance as a key differentiator, contrasting GLM-5.1 with models whose effectiveness degrades in extended sessions.
Analysts say that is because many current models still plateau or drift after a relatively small number of turns, limiting their usefulness on extended, multi-step software tasks.
Pareekh Jain, CEO of Pareekh Consulting, said the industry is now moving beyond tools that can answer prompts toward systems that can carry out longer assignments with less supervision.
The question, Jain said, is no longer “What can I ask this AI?” but “What can I assign to it for the next eight hours?”
For enterprises, that raises the prospect of assigning an agent a ticket in the morning and receiving an optimized solution by day’s end, after it has run hundreds of experiments and profiled the code.
“This capability aligns with real needs such as large refactors, migration programs, and continuous incident resolution,” said Charlie Dai, VP and principal analyst at Forrester. “It suggests that long-running autonomous agents are becoming more practical, provided enterprises layer in governance, monitoring, and escalation mechanisms to manage risk.”
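Dai did not describe specific mechanisms, but one simple pattern is to wrap the agent loop in step and wall-clock budgets with a human escalation path. A minimal sketch, assuming a hypothetical `agent.step()` interface rather than any vendor API:

```python
# Minimal sketch of governance guardrails around a long-running agent:
# an iteration budget, a wall-clock limit, and escalation to a human.
# The agent and result interfaces here are hypothetical.
import time

MAX_STEPS = 500
MAX_SECONDS = 8 * 60 * 60  # the "eight-hour assignment"

def run_with_guardrails(agent, task, escalate):
    start = time.monotonic()
    for _ in range(MAX_STEPS):
        result = agent.step(task)
        if result.needs_human:                      # agent flags uncertainty
            return escalate(task, reason=result.reason)
        if result.done:
            return result
        if time.monotonic() - start > MAX_SECONDS:  # wall-clock budget
            return escalate(task, reason="time budget exceeded")
    return escalate(task, reason="step budget exceeded")
```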
Open-source appeal grows
GLM-5.1’s release under the MIT License could be significant, especially for companies in regulated or security-sensitive sectors.
“This matters in four key ways,” Jain said. “First, cost. Pricing is far lower than for premium models, and self-hosting lets companies control expenses instead of paying per use. Second, data governance. Sensitive code and data do not have to be sent to external APIs, which is critical in sectors such as finance, healthcare, and defense. Third, customization. Companies can adapt the model to their own codebases and internal tools without restrictions.”
The fourth factor, according to Jain, is geopolitical risk. Although the model is open source, its links to Chinese infrastructure and entities could still raise compliance concerns for some US companies.
Dai said the MIT license makes it easier for companies to run the model on their own systems while adapting it to internal requirements and governance policies. “For many buyers, this makes GLM-5.1 a viable strategic option alongside commercial models, especially where regulatory constraints, IP sensitivity, or long-term platform control matter most,” Dai said.
Benchmark credibility
Z.ai cited three benchmarks: SWE-Bench Pro, which tests complex software engineering tasks; NL2Repo, which measures repository generation; and Terminal-Bench 2.0, which evaluates real-world terminal-based problem solving.
“These benchmarks are designed to test coding agents’ advanced capabilities, so topping them reflects strong coding performance, such as reliability from planning to execution, less prompt rework, and faster delivery,” said Lian Jye Su, chief analyst at Omdia. “However, they are still detached from typical enterprise realities.”
Su said public benchmarks still don’t capture the messiness of proprietary codebases, legacy systems, and code review workflows. He added that benchmark results come from controlled settings that differ from production, though the gap is closing as more teams adopt agentic setups.
This article originally appeared in Computerworld.
