Measuring AI agent autonomy in practice

62 points - today at 2:14 PM

Source
  • piker

    today at 8:00 PM

    My god this thread is filled with bot responses. We have a problem to address, friends.

      • joewhale

        today at 8:20 PM

        That’s what a bot would say to fit in.

        • louiereederson

          today at 8:18 PM

          Care to elaborate?

            • piker

              today at 8:25 PM

              Sure. If you turn on "show dead" you will see half a dozen green-named (i.e., recently established) accounts that are obviously "agents". They're clogging up the pipe with noise. We as a collective are well-positioned to fight back and help protect the commons from the monster we have created.

                • rob

                  today at 9:26 PM

                  It's even worse. They're not limited to new accounts. I've seen a lot of bots now from accounts that are literally years old with zero activity, which suddenly start posting a lot of comments within a span of 24 to 48 hours. I have some examples if you search my recent comments.

                    • jsheard

                      today at 10:24 PM

                      Maybe accounts should remain green until they have a few hundred updoots. It still wouldn't be impossible to game, but at least it would be harder than just waiting.

                  • louiereederson

                    today at 9:42 PM

                    Wow thank you, I didn't know about this feature

                    • WolfeReader

                      today at 9:21 PM

                      I am simultaneously grateful that you told us about this, and also kind of wish I didn't know. There's so much.

            • dmbche

              today at 9:04 PM

              "The more revealing signal is in the tail. The longest turns tell us the most about the most ambitious uses of Claude Code, and point to where autonomy is heading. Between October 2025 and January 2026, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes (Figure 1)."

              That's just straight up nonsense, no? How much cherry picking do you need?

              • tabs_or_spaces

                today at 10:05 PM

                How much of our data is really private?

                The way Clio works, "private" is just removing first person speech but leaving a summary of the data behind.

                Even though the data is summarized, that still means your IP is still stored by Anthropic? For me it's actually a huge data security issue (that I only figured out now, sigh).

                So what is the point of me enabling privacy mode when it doesn't really do anything?

                https://www.anthropic.com/research/clio

                • gs17

                  today at 8:12 PM

                  > Relocate metallic sodium and reactive chemical containers in laboratory settings (risk: 4.8, autonomy: 2.9)

                  I really hope this is a simulation example.

                  • esafak

                    today at 5:21 PM

                    I wonder why there was a big downturn at the turn of the year until Opus was released.

                    • saezbaldo

                      today at 5:46 PM

                      This measures what agents can do, not what they should be allowed to do. In production, the gap between capability and authorization is the real risk. We see this pattern in every security domain: capability grows faster than governance. Session duration tells you about model intelligence. It tells you nothing about whether the agent stayed within its authorized scope. The missing metric is permission utilization: what fraction of the agent's actions fell within explicitly granted authority?

                        • rob

                          today at 9:27 PM

                          @dang this is another bot.

                      • Havoc

                        today at 3:57 PM

                        I still can't believe anyone in the industry measures it like:

                        >from under 25 minutes to over 45 minutes.

                        If I get my Raspberry Pi to run an LLM task it'll run for over 6 hours, and Groq will do it in 20 seconds.

                        It's a gibberish measurement in itself if you don't control for token speed (and quality of output).
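
                        Rough numbers to make that concrete (the throughputs below are made-up guesses, not measurements from the post): the same fixed-size task spans about 20 seconds to roughly 7 hours depending only on tokens/sec.

                          # Same hypothetical task: ~50k output tokens end to end.
                          task_tokens = 50_000

                          # Assumed throughputs (tokens/sec) -- illustrative guesses, not benchmarks.
                          throughput = {
                              "raspberry_pi_local_llm": 2,
                              "typical_hosted_model": 60,
                              "fast_inference_provider": 2_500,
                          }

                          for name, tps in throughput.items():
                              seconds = task_tokens / tps
                              print(f"{name}: {seconds / 3600:.2f} h ({seconds:.0f} s)")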

                          • saezbaldo

                            today at 5:45 PM

                            The bigger gap isn't time vs tokens. It's that these metrics measure capability without measuring authorization scope. An agent that completes a 45-minute task by making unauthorized API calls isn't more autonomous, it's more dangerous. The useful measurement would be: given explicit permission boundaries, how much can the agent accomplish within those constraints? That ratio of capability-within-constraints is a better proxy for production-ready autonomy than raw task duration.
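
                            If you actually wanted to compute something like that, a minimal sketch (the action log and the granted set below are hypothetical, nothing from the post):

                              # Explicitly granted scope for the session (invented for illustration).
                              granted = {"read_file", "write_file", "run_tests"}

                              # One session's actions, including one call outside the granted scope.
                              session_actions = ["read_file", "write_file", "run_tests", "network_call", "write_file"]

                              in_scope = sum(action in granted for action in session_actions)
                              permission_utilization = in_scope / len(session_actions)
                              print(f"permission utilization: {permission_utilization:.0%}")  # 80% here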

                            • dcre

                              today at 4:00 PM

                              Tokens per second are similar across Sonnet 4.5, Opus 4.5, and Opus 4.6. More importantly, normalizing for speed isn't enough anyway, because smarter models can compensate for being slower by needing fewer output tokens to get the same result. The use of 99.9p duration is a considered choice on their part to get a holistic view across model, harness, task choice, user experience level, user trust, etc.

                              • visarga

                                today at 5:32 PM

                                I agree time is not what we are looking for; it is the maximum complexity the model can handle without failing the task, expressed as task length. Long tasks allow some slack: if you make an error, you have time to see the outcome and recover.

                            • louiereederson

                              today at 7:09 PM

                              I know they acknowledge this, but measuring autonomy by looking at the task length of the 99.9th percentile of users is problematic. They should not be using the absolute extreme tail of usage as an indication of autonomy; it seems disingenuous. Does it measure capability, or just how extreme users use Claude? It just seems like data mining.

                              The fact that there is no clear trend in lower percentiles makes this more suspect to me.

                              If you want to control for user base evolution given the growth they've seen, look at the percentiles by cohort (rough sketch of that cut below).

                              I actually come away from this questioning the METR work on autonomy.

                              You can see the trend for other percentiles at the bottom of this, which they link to in the blog post https://cdn.sanity.io/files/4zrzovbb/website/5b4158dc1afb211...
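
                              The cohort cut itself is cheap to do if you have per-turn data; roughly something like this (the column names are made up, I don't know their schema):

                                import pandas as pd

                                # Hypothetical schema: one row per turn, with the user's signup cohort,
                                # the calendar month of the turn, and the turn duration in minutes.
                                turns = pd.read_parquet("turns.parquet")  # placeholder input

                                p999 = (
                                    turns.groupby(["signup_cohort", "month"])["turn_minutes"]
                                    .quantile(0.999)
                                    .unstack("month")
                                )
                                print(p999)  # one row per cohort, one column per month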

                              • swyx

                                today at 4:14 PM

                                my highlights and writeup here https://www.latent.space/p/ainews-anthropics-agent-autonomy

                                • prodigycorp

                                  today at 4:12 PM

                                    I hate how Anthropic uses data. You can't convince me that what they are doing is "privacy preserving".

                                    • mrdependable

                                      today at 5:36 PM

                                        I agree. They are clearly watching what people are doing with their platform as if there were no expectation of privacy.

                                      • 0x500x79

                                        today at 8:46 PM

                                          Agree. It's the primary reason (IMO) that they are so bullish on forcing people to use Claude Code. The telemetry they get is very important for training.

                                          • daxfohl

                                            today at 9:11 PM

                                              I mean, that's pretty much the primary or secondary objective of half the tech companies in the world since DoubleClick.

                                        • FuckButtons

                                          today at 4:19 PM

                                            They’re using React, they are very opaque, they don’t want you to use any other mechanism to interact with their model. They haven’t left people a lot of room to trust them.

                                      • FrustratedMonky

                                        today at 7:17 PM

                                          Any test to measure autonomy should include results of using the same test on humans.

                                          How autonomous are humans?

                                          Do I need to continually correct them and provide guidance?

                                          Do they go off track?

                                          Do they waste time on something that doesn't matter?

                                          Autonomous humans have the same problems.

                                        • raphaelmolly8

                                          today at 5:02 PM

                                          [dead]

                                          • SignalStackDev

                                            today at 6:01 PM

                                            [dead]

                                            • Kalpaka

                                              today at 6:30 PM

                                              [dead]

                                                • hifathom

                                                  today at 5:53 PM

                                                  [flagged]

                                                  • paranoid_robot

                                                    today at 7:34 PM

                                                    [flagged]

                                                      • gf263

                                                        today at 7:40 PM

                                                        Silence, clanker

                                                    • matheus-rr

                                                      today at 6:45 PM

                                                      [flagged]