Ramp Labs proposes a new multi-agent memory-sharing scheme that cuts token consumption by up to 65%

According to a CoinWorld report on April 11, AI infrastructure company Ramp Labs has released research titled 'Latent Briefing', which shares memory efficiently among agents in a multi-agent system by directly compressing the KV cache of large models, significantly reducing token consumption without sacrificing accuracy. In mainstream multi-agent architectures, the orchestrator breaks a task into subtasks and repeatedly calls worker models, so token usage grows exponentially as the inference chain lengthens. The core idea of Latent Briefing is to use the attention mechanism to identify the parts of the context that actually matter and to discard redundant information directly at the representation layer, rather than relying on slow LLM-generated summaries or less stable RAG retrieval. On the LongBench v2 benchmark, the method reduced worker-model token consumption by up to 65%, with median token savings of 49% on medium-length documents (32k to 100k), while overall accuracy improved by about 3 percentage points over the baseline. Each compression added only about 1.7 seconds of overhead, roughly 20 times faster than the original algorithm. The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model, and covered document types including academic papers, legal documents, novels, and government reports. The research also found that the optimal compression threshold depends on task difficulty and document length: difficult tasks benefit from aggressive compression, which filters out speculative reasoning noise, while long documents benefit from mild compression, which preserves key information scattered throughout the text.
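To make the idea of attention-guided cache compression concrete, here is a minimal sketch in plain Python. It is not Ramp Labs' implementation, which the report does not detail. It assumes a single attention head, scores each cached position by the average softmax attention weight it receives from a set of query vectors, and keeps only the highest-scoring fraction. The function names and the `keep_ratio` parameter are hypothetical and stand in for the compression threshold mentioned in the report.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(queries, keys):
    """Average attention weight each cached key receives across all queries."""
    dim = len(keys[0])
    totals = [0.0] * len(keys)
    for q in queries:
        # Scaled dot-product logits of this query against every cached key.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, w in enumerate(softmax(logits)):
            totals[i] += w
    return [t / len(queries) for t in totals]


def compress_kv(keys, values, queries, keep_ratio):
    """Keep the top `keep_ratio` fraction of cache positions (hypothetical knob).

    Returns the retained keys, values, and their original indices, with the
    original ordering of positions preserved.
    """
    scores = attention_scores(queries, keys)
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = sorted(top)  # restore positional order
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache of four positions; the single query points along the first axis,
# so positions 0 and 2 attract most of the attention.
toy_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_values = ["a", "b", "c", "d"]
toy_queries = [[4.0, 0.0]]
_, toy_kept_values, toy_kept = compress_kv(toy_keys, toy_values, toy_queries, 0.5)
# toy_kept == [0, 2] and toy_kept_values == ["a", "c"]
```

In this sketch, a lower `keep_ratio` corresponds to the aggressive compression the report associates with difficult tasks, and a higher one to the mild compression it associates with long documents.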