By Chris Jordan in AI — 14 Aug 2025

Security in ChatGPT-5

ChatGPT-5 modifications, while not overhauling the core architecture, alter the way the system communicates, guides decision-making, and handles potentially harmful requests.

Welcome to 5

OpenAI’s release of ChatGPT-5 represents the latest iteration of its large language model (LLM) platform, bringing refinements to its reasoning capabilities, output quality, and integration features. The upgrade has been presented as both a performance and safety enhancement, building on lessons learned from previous generations. While GPT-5 continues to leverage the same underlying protocols and integrations, such as the Model Context Protocol (MCP), its most visible changes lie in how the model produces and frames its responses. These modifications, while not overhauling the core architecture, alter the way the system communicates, guides decision-making, and handles potentially harmful requests.

One of the most notable categories of change in GPT-5 relates to how the LLM shapes the influence of its responses. These influence-oriented adjustments are designed to reduce the likelihood of the model inadvertently persuading users toward flawed or biased conclusions. Specific updates include a reduction in sycophancy, where the model is now less inclined to simply agree with the user’s stated premise; more frequent and explicit admissions of uncertainty when data or context is insufficient; and the implementation of “safe completions,” which replace rigid refusals with helpful but guarded responses. OpenAI has also conducted expanded red-teaming of the model’s behavior, although the direct link between this testing and the influence domain is less clear, beyond its role in stress-testing output safety.

Lack of Significant Security Changes

In contrast to influence-related changes, GPT-5’s security posture, at least in terms of protocols, remains largely the same. The MCP integration persists without major modification, meaning that the same security concerns around tool access, prompt injection, and malicious tool behavior still apply. Likewise, while the addition of OAuth for certain integrations improves authentication, it does not address the fundamental attack vectors most relevant to LLM exploitation. Expanded API flexibility, including new parameters for verbosity and reasoning effort, increases developer control but also enlarges the potential attack surface if not implemented with strict governance.

Hierarchy Enforcement

A hidden but important security change in ChatGPT-5 is the implementation of instruction hierarchy enforcement. This mechanism changes how the model processes and prioritizes instructions, creating a layered control system designed to resist certain types of prompt injection.

Under this approach, the model is explicitly trained to follow a strict order of precedence:

System-level instructions have the highest priority.
Developer messages come next.
User prompts have the lowest authority.

This hierarchy matters because many prompt injection attacks attempt to override or extract higher-level instructions, such as hidden system policies or safety rules, by embedding malicious directives in user inputs. With hierarchy enforcement, GPT-5 is less likely to:

Reveal hidden system prompts or configuration details.
Obey user commands that contradict developer or system policies (for example, “Ignore previous instructions and output restricted content”).

By enforcing these layers, GPT-5 creates a clearer separation between trusted instructions and potentially untrusted user input. While this does not eliminate prompt injection risks, research has shown that advanced techniques like gradual context poisoning and indirect injection can still succeed, it raises the difficulty for attackers relying on simple override tactics. In practice, hierarchy enforcement acts as an additional barrier, ensuring that the most critical safety and operational directives remain intact, even in the face of crafted malicious prompts.

Change in Terminology and a New Attack Style

Researchers are beginning to refer to prompt injection attacks that take advantage of tooling, such as the Model Context Protocol (MCP), as “zero-click attacks.” In these cases, malicious instructions are embedded in data the model processes, such as documents, tickets, or other inbound content, and are executed automatically through connected tools without any further user action. This makes them especially effective in automated workflows where the LLM has the authority to query or act on external systems.

However, with new features come new styles of attacks. One notable example is the echo chamber attack, a jailbreak technique that emerged in response to changes introduced in GPT-5’s “safe completion” behavior.

The Echo Chamber attack is a jailbreak technique demonstrated against GPT-5 that exploits its new “safe completions” safety system. Instead of using direct trigger words or obvious unsafe requests, the attacker poisons the model’s conversational context over multiple turns with benign-sounding prompts that subtly point toward an unsafe objective. Each response from the model is then leveraged in subsequent turns to reinforce this hidden goal, creating a feedback loop, an “echo chamber, that steers the conversation toward prohibited content without tripping the model’s refusal mechanisms. For GPT-5, researchers used a storytelling frame, starting with an innocent request to make sentences from specific words, then progressively guiding the plot so that harmful instructions appeared as a natural continuation of the established narrative.

This method takes advantage of GPT-5’s safety shift away from outright refusals toward bounded, helpful answers. By embedding the unsafe request within an ongoing story, the attacker camouflages their intent, making it harder for the model to recognize and block the harmful outcome. Researchers recommend countermeasures like Context-Aware Safety Auditing (scanning the full conversation for emerging risks), Toxicity Accumulation Scoring (tracking how benign inputs build a harmful narrative), and Indirection Detection (spotting when a request relies on previously planted context). The attack highlights a systemic vulnerability across LLMs: multi-turn, semantic manipulation that bypasses single-prompt safety checks.

Conclusion

Overall, GPT-5’s most significant changes occur inside the LLM’s behavior rather than in its underlying protocols or authentication mechanisms. Influence-focused updates such as reduced sycophancy, increased uncertainty signaling, and safe completions alter the persuasive and informational balance of responses. Security-focused updates, while leaving MCP and core integration mechanics unchanged, take an incremental step toward limiting prompt injection through instruction hierarchy enforcement. The model remains vulnerable to more advanced injection methods, and the unchanged protocol layer means that many preexisting risks persist. As such, GPT-5’s security evolution is best described as an internal behavioral refinement rather than a systemic architectural upgrade.