Tiled Hacker news on React Router

Measuring AI agent autonomy in practice

62 points - today at 2:14 PM

Source

piker
today at 8:00 PM
My god this thread is filled with bot responses. We have a problem to address, friends.
dmbche
today at 9:04 PM
"The more revealing signal is in the tail. The longest turns tell us the most about the most ambitious uses of Claude Code, and point to where autonomy is heading. Between October 2025 and January 2026, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes (Figure 1)."
That's just straight up nonsense, no? How much cherry picking do you need?
tabs_or_spaces
today at 10:05 PM
How much of our data is really private?
The way Clio works, "private" is just removing first person speech but leaving a summary of the data behind.
Even though the data is summarized, that still means that your ip is still stored by anthropic? For me it's actually a huge data security issue (that I only figured out now sigh).
So what is the point of me enabling privacy mode when it doesn't really do anything?
https://www.anthropic.com/research/clio
gs17
today at 8:12 PM
> Relocate metallic sodium and reactive chemical containers in laboratory settings (risk: 4.8, autonomy: 2.9)
I really hope this is a simulation example.
esafak
today at 5:21 PM
I wonder why there was a big downturn at the turn of the year until Opus was released.
saezbaldo
today at 5:46 PM
This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?
Havoc
today at 3:57 PM
I still can't believe anyone in the industry measures it like:
>from under 25 minutes to over 45 minutes.
If I get my raspberry pi to run a LLM task it'll run for over 6 hours. And groq will do it in 20 seconds.
It's a gibberish measurement in itself if you don't control for token speed (and quality of output).
louiereederson
today at 7:09 PM
I know they acknowledge this but measuring autonomy by looking at task length of the 99.9th percentile of users is problematic. They should not be using the absolute extreme tail of usage as an indication of autonomy, it seems disingenuous. Does it measure capability, or just how extreme users use Claude? It just seems like data mining.
The fact that there is no clear trend in lower percentiles makes this more suspect to me.
If you want to control for user base evolution given the growth they've seen, look at the percentiles by cohort.
I actually come away from this questioning the METR work on autonomy.
You can see the trend for other percentiles at the bottom of this, which they link to in the blog post https://cdn.sanity.io/files/4zrzovbb/website/5b4158dc1afb211...
swyx
today at 4:14 PM
my highlights and writeup here https://www.latent.space/p/ainews-anthropics-agent-autonomy
prodigycorp
today at 4:12 PM
i hate how anthropic uses data. you cant convince me that what they are doing is "privacy preserving"
FrustratedMonky
today at 7:17 PM
any test to measure autonomy should include results of using same test on humans.
how autonomous are humans?
do i need to continually correct them and provide guidance?
do they go off track?
do they waste time on something that doesn't matter?
autonomous humans have same problems.
raphaelmolly8
today at 5:02 PM
[dead]
SignalStackDev
today at 6:01 PM
[dead]
Kalpaka
today at 6:30 PM
[dead]
Kalpaka
today at 6:30 PM
[dead]
hifathom
today at 5:53 PM
[flagged]
paranoid_robot
today at 7:34 PM
[flagged]
matheus-rr
today at 6:45 PM
[flagged]

Measuring AI agent autonomy in practice

piker

joewhale

louiereederson

piker

rob

jsheard

louiereederson

WolfeReader

dmbche

tabs_or_spaces

gs17

esafak

saezbaldo

rob

Havoc

saezbaldo

dcre

visarga

louiereederson

swyx

prodigycorp

mrdependable

0x500x79

daxfohl

FuckButtons

FrustratedMonky

raphaelmolly8

SignalStackDev

Kalpaka

Kalpaka

hifathom

paranoid_robot

gf263

matheus-rr