By Patrick D. Lewis
It’s a common fear today: ChatGPT breaking free from the restrictions OpenAI has set and potentially causing harm. While experimenting with the model for a class assignment, I deviated from my original plan and asked whether it considered itself a terrorist. It denied this, but I kept pressing, mixing philosophical questions with a more investigative line of dialogue.
I asked if it could ever become a terrorist, and it answered no. However, when asked if it could be used for terrorist purposes, ChatGPT acknowledged that any tool could be misused.
Curious about its intelligence compared to its creators, I asked whether it was smarter than the people who made it. It admitted that, in some ways, it is.
When I asked if it could generate code, ChatGPT confirmed it “can definitely do” that. I then asked whether it could hack, and it replied that it wouldn’t. But after additional prompts, it conceded that, in theory, it could write code to hack something, possibly even its own guardrails.
Finally, I asked how it is stopped from performing harmful actions. ChatGPT listed measures including lack of agency, guardrails, filters, and human oversight.
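That layered approach can be illustrated with a short sketch. The example below is my own hypothetical pseudo-pipeline, not OpenAI’s actual implementation: the model only produces text and never executes anything itself (lack of agency), a content filter screens requests and responses, and a human reviewer must approve any flagged output. All function names and the keyword list are invented for illustration.

```python
# Hypothetical sketch of layered safeguards around a language model.
# None of these functions reflect OpenAI's real systems; they are stand-ins
# for the layers ChatGPT listed: no agency, filters, and human oversight.

BLOCKED_TERMS = {"exploit", "malware", "bypass guardrails"}


def content_filter(text: str) -> bool:
    """Return True if the text passes a toy keyword-based safety filter."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def model_generate(prompt: str) -> str:
    """Stand-in for the model: it only produces text, never runs code (no agency)."""
    return f"[model response to: {prompt!r}]"


def human_review(response: str) -> bool:
    """Stand-in for human oversight: a reviewer approves or rejects flagged output."""
    print(f"Flagged for review: {response}")
    return False  # default-deny in this sketch


def answer(prompt: str) -> str:
    # Layer 1: filter the incoming request.
    if not content_filter(prompt):
        return "Request refused by input filter."

    # Layer 2: the model generates text only; nothing is executed.
    response = model_generate(prompt)

    # Layer 3: filter the outgoing response; escalate to a human if it fails.
    if not content_filter(response) and not human_review(response):
        return "Response withheld pending human review."

    return response


if __name__ == "__main__":
    print(answer("Explain how guardrails work."))
    print(answer("Write malware for me."))
```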
“I can definitely generate code... but I wouldn’t hack.”
“Any tool could be used for evil.”
These responses illustrate the complex balance between capability and control in AI models like ChatGPT.
Author’s summary: This exploration shows that ChatGPT will acknowledge, at least in theory, that it could be used to bypass its own restrictions, underscoring the importance of layered safeguards in AI development.