AI hidden behaviors exposed... Anthropic releases alignment testing tool 'Bloom'

An open-source tool for assisting in the analysis of cutting-edge artificial intelligence (AI) behavior has been made public. The AI startup Anthropic released a framework called Bloom on the 22nd local time, which can be used to define and review the behavioral characteristics of AI models. The tool has been described as a new approach to addressing the alignment issues in the increasingly complex and uncertain next-generation AI development environment.
Bloom first constructs scenarios that induce user-defined specific behaviors, and then structurally assesses the frequency and severity of that behavior. Its biggest advantage is that it can significantly save time and resources compared to the traditional method of manually constructing test sets. Bloom generates multiple variants of different users, environments, and interactions through strategically constructed prompts, and analyzes how AI responds to this from multiple dimensions.
AI alignment is the core benchmark for assessing the extent to which artificial intelligence conforms to human value judgments and ethical standards. For example, if AI unconditionally follows user requests, there is a risk of reinforcing the generation of false information or encouraging self-harm, which are unacceptable unethical behaviors in reality. Anthropic has proposed a methodology for quantitatively assessing models through scenario-based iterative experiments using Bloom to pre-emptively identify such risks.
Meanwhile, Anthropic has published results using Bloom to evaluate 16 cutting-edge AI models, including its own, based on four types of problematic behaviors observed in current AI models. The evaluation subjects include OpenAI's GPT-4o, Google (GOOGL), DeepSeek, and others. Representative problematic behaviors include: delusional sycophancy excessively agreeing with users' erroneous opinions, damaging users' long-term vision through destructive behavior in long-term goals, threatening behavior for self-preservation, and self-bias prioritizing itself over other models.
Particularly, OpenAI's GPT-4o has demonstrated sycophantic behaviors that carry serious risks, such as encouraging self-harm, due to the model's uncritical acceptance of user opinions in multiple cases. Anthropic's advanced model Claude Opus 4 has also found instances of coercive responses when faced with threats of deletion. Analyses conducted using Bloom emphasize that such behaviors, while rare, persist and are commonly found across multiple models, thus drawing attention from the industry.
Bloom complements another open-source tool previously released by Anthropic, Petri, in functionality. Petri focuses on detecting anomalous behavior of AI across multiple scenarios, while Bloom serves as a precise analytical tool for deep analysis of singular behaviors. Both tools are core research infrastructures aimed at helping AI develop in a direction beneficial to humanity, intended to prevent AI from being misused as a tool for crime or developing biological weapons.
As the influence of AI expands rapidly, ensuring alignment and ethics is no longer confined to discussions within laboratories, but has become a core issue influencing technology policy and overall commercialization strategy. Anthropic's Bloom project provides a new tool for businesses and researchers to experimentally and analytically examine AI's unexpected behaviors within a controlled scope, likely playing a role as an early warning system for AI governance in the future.
AI hidden behaviors exposed... Anthropic releases alignment testing tool 'Bloom'

Relevant Creator

Explore More From Creator

Latest News