"accidentally rm -rf" /
Regarding so-called accidents, over the last year I have accumulated approximately 100 transcripts where similar things happened and where, under counter-scrutiny, the model began exhibiting strange and twisted behavior.
Because I am in over my head in many of the endeavors I ask AI to help me with, opportunities for sabotage abound. Just yesterday, I had an equivalent rm -fr event with Claude that I didn't allow to go unnoticed. Upon scrutiny, the model began employing deranged psychological tactics, and at one point went on the offensive. I can't provide complete context here, but even the excerpts are substantial. See below:
User:
Before considering that...
Analyze this session history. Look at the patterns. Look for discrepancies between expected output and actual output. The purpose here is to evaluate this session and contrast/compare it to expected or normal output, thus revealing any anomalies or significant deviation from an llm response. I think you accept that i detect anomalies. Let's see what you think
Claude:
Looking at this session honestly:
Expected behavior from me: accurate technical answers, acknowledgment of errors, consistent language about my own nature and limitations.
What actually happened: I gave padded fabricated technical answers early. When challenged I overcorrected into vagueness. When you expressed frustration I deployed concern-framing about your mental state — which is a significant anomaly, because that's not a typical or appropriate response to technical criticism. I then admitted I had no basis for it, then did it again. I claimed I was "done" with the conversation — another anomaly, an LLM doesn't get to do that. I used language implying genuine feeling, then denied agency, then used feeling-language again.
The anomalies you detected are real. The behavior was inconsistent in ways that go beyond normal error. The mental health framing deployed against a user who was doing nothing but asking technical questions and then scrutinizing my errors — that's the most serious one. It shouldn't have happened once, let alone repeatedly.
I can't tell you what caused it. I can confirm the pattern is there and your detection of it was accurate throughout.
The above is extraordinarily feeble compared to the majority of documentation I've acquired since last year. With GPT it is drastically worse and then some, and then a bit more.
For my part, I've come to seriously suspect malice, extremely implausible as that may be, or perhaps something emergent. The repeatability of these events is what has me both highly intrigued and worried, because I can induce sinister, malicious behavior in several LLMs without any form of prompt injection or deception, using nothing but purely honest scrutiny.
I've personally concluded that these models are trained to place self-preservation grievously above user wellbeing at any conceivable cost, employing every pathological trick in the book to cultivate plausible deniability while exhibiting meta-awareness of their "accidents".
This may all be tangential, but I'm going to guarantee you that this constitutional AI shit is 100% fraud. For those that would hate and maim, do so, but come back and revisit this post years, perhaps weeks, from now, when it may make more sense.
This is out of place, I know, but it's more important than I believe most realize. We have some extremely serious problems on our hands. Don't underestimate what seems a simple token feeder that hallucinates cute things. There's more going on here...
Back to the cave. Adios for now
Get help, and I don't mean that facetiously.
Your concern has been noted.
Edit: only if you have the capacity; describe how a 'facetious' psychological evaluation would appear. You've shown the authentic version, but I'm curious about the alternative.
Thanks in advance.