
Gemini 3 Pro: the frontier of vision AI

304 points - yesterday at 4:15 PM

  • Workaccount2

    yesterday at 8:26 PM

    Well

    It is the first model to get partial credit on an LLM image test I have: counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.

    In fact GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol.
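    The script itself wasn't shared, so this is only a rough sketch of the approach described (mask "golden" pixels against the green and count contiguous runs along a scan row), with a synthetic image standing in for the photo; the colors and the helper name are my own assumptions:

    ```python
    import numpy as np

    # Synthetic stand-in for the photo: green "grass" with 5 golden "legs".
    GOLDEN = np.array([218, 165, 32], dtype=np.uint8)
    GREEN = np.array([60, 180, 60], dtype=np.uint8)

    img = np.tile(GREEN, (100, 300, 1))
    for i in range(5):  # five vertical golden strips
        x = 30 + i * 50
        img[40:100, x:x + 12] = GOLDEN

    def count_legs(image, row=90, tol=40):
        """Count contiguous runs of near-golden pixels along one scan row."""
        scan = image[row].astype(int)
        golden = np.all(np.abs(scan - GOLDEN) < tol, axis=-1)
        # A run starts wherever the mask flips from False to True.
        starts = golden & ~np.roll(golden, 1)
        starts[0] = golden[0]
        return int(starts.sum())

    print(count_legs(img))  # -> 5
    ```

    The punchline of the anecdote lives in the `tol` parameter: "adjusting the script sensitivity" until a run disappears is exactly this kind of knob.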

    Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well endowed dog to have a "5th leg".

    That aside though, I still wouldn't call it particularly impressive.

    As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana.

      • Rover222

        yesterday at 8:56 PM

        I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

        Then I asked both Gemini and Grok to count the legs, both kept saying 4.

        Gemini just refused to consider it was actually wrong.

        Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now: upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

        You're right, this is a good test, and it landed right when I was starting to feel LLMs are intelligent.

          • macNchz

            yesterday at 11:46 PM

            An interesting test in this vein that I read about in a comment on here is generating a 13-hour clock. I tried just about every prompting trick and clever strategy I could come up with across many image models, with no success. I think there's so much training data of 12-hour clocks that it just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13-hour clock yet.

            • vunderba

              yesterday at 10:48 PM

              If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.

                Place sneakers on all of its legs.
              
              It'll get this correct a surprising number of times (tested with BFL Flux2 Pro, and NB Pro).

              https://imgur.com/a/wXQskhL

              • dwringer

                yesterday at 9:17 PM

                I had no trouble getting it to generate an image of a five-legged dog first try, but I really was surprised at how badly it failed in telling me the number of legs when I asked it in a new context, showing it that image. It wrote a long defense of its reasoning and when pressed, made up demonstrably false excuses of why it might be getting the wrong answer while still maintaining the wrong answer.

                  • Rover222

                    yesterday at 9:44 PM

                    Yeah it gave me the 5-legged dog on the 4th or 5th try.

                • AIorNot

                  yesterday at 9:19 PM

                  It's not that they aren't intelligent; it's that they have been RL'd like crazy not to do that.

                  It's rather like how we humans are RL'd like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister).

                  I.e., we all have trained biases that we are told to follow and are trained on. Human art is about subverting those expectations.

                    • majormajor

                      yesterday at 9:42 PM

                      Why should I assume this is RL? The failure looks like the model doing fairly simple pattern matching ("this is a dog, dogs don't have 5 legs, anything else is irrelevant") rather than more sophisticated feature counting of a concrete instance of an entity, and that could just as easily be a prediction failure due to training data not containing a 5-legged dog, plus an inability to go out of distribution.

                      RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.

                      If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.

                        • mlinhares

                          yesterday at 9:48 PM

                          I have the same problem: people are trying so hard to come up with reasoning for it when there's just nothing like that there. It was trained, and it finds the stuff it was trained to find; if you go outside the training it gets lost, and we should expect it to get lost.

                          • didgeoridoo

                            yesterday at 11:01 PM

                            I'm inclined to buy the RL story, since the image gen "deep dream" models of ~10 years ago would produce dogs with TRILLIONS of eyes: https://doorofperception.com/2015/10/google-deep-dream-incep...

                    • irthomasthomas

                      yesterday at 9:32 PM

                      Isn't this proof that LLMs still don't really generalize beyond their training data?

                        • adastra22

                          yesterday at 11:23 PM

                          LLMs are very good at generalizing beyond their training (or context) data. Normally when they do this we call it hallucination.

                          Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data.

                          • Zambyte

                            yesterday at 10:49 PM

                            I wonder how they would behave given a system prompt that asserts "dogs may have more or less than four legs".

                            • CamperBob2

                              yesterday at 10:17 PM

                              They do, but we call it "hallucination" when that happens.

                              • Rover222

                                yesterday at 9:43 PM

                                Kind of feels that way

                            • qnleigh

                              yesterday at 11:40 PM

                              It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusions, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant.

                              (Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).

                                • cyanmagenta

                                  yesterday at 11:52 PM

                                  I am having trouble understanding the distinction you're trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say "that's silly!"), whereas LLMs have an existential crisis because five-legged dogs aren't sufficiently represented in the training data. I guess you can call that perception if you want, but I'm comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise.

                          • danielvaughn

                            yesterday at 8:45 PM

                            I don’t know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line through the most optimal path.

                            Here’s how Nano Banana fared: https://x.com/danielvaughn/status/1971640520176029704?s=46

                              • JamesSwift

                                yesterday at 10:29 PM

                                I just one-shotted it with Claude Code (Opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (it drew a line around the boundary of the maze instead), so it added guardrails for that:

                                ```

                                Create a devenv project that does the following:

                                  - Read the image at maze.jpg
                                  - Write a script that solves the maze in the most optimal way between the mouse and the cheese
                                  - Generate a new image which is of the original maze, but with a red line that represents the calculated path
                                
                                Use whatever lib/framework is most appropriate

                                ```

                                  Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604  
                                  Solution: https://imgur.com/a/bkJloPT

                                  • swatcoder

                                    today at 12:00 AM

                                    [delayed]

                                    • esafak

                                      yesterday at 10:42 PM

                                      If you allow tool use much simpler models can solve it.

                                  • vunderba

                                    yesterday at 8:55 PM

                                    In fact, one of the tests I use as part of GenAI Showdown involves both parts of the puzzle: draw a maze with a clearly defined entrance and exit, along with a dashed line indicating the solution to the maze.

                                    Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close.

                                    https://genai-showdown.specr.net/#the-labyrinth

                                      • danielvaughn

                                        yesterday at 9:01 PM

                                        super cool! Interesting note about Seedream 4 - do you think awareness of A* actually could improve the outcome? Like I said, I'm no AI expert, so my intuitions are pretty bad, but I'd suspect that image analysis + algorithmic pathfinding don't have much crossover in terms of training capabilities. But I could be wrong!

                                          • vunderba

                                            yesterday at 9:04 PM

                                            Great question. I do wish we had a bit more insight into the exact background "thinking" that was happening on systems like Seedream.

                                            When you think about posing the "solve a visual image of a maze" to something like ChatGPT, there's a good chance it'll try to throw a python VM at it, threshold it with something like OpenCV, and use a shortest-path style algorithm to try and solve it.
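                                            For what it's worth, the second half of that hypothetical pipeline is genuinely simple: once a photo has been thresholded into open/wall cells, any standard shortest-path search finishes the job. A minimal sketch in plain Python (BFS over a hard-coded toy grid standing in for the thresholded image; `solve_maze` is my own name, not from any tool mentioned here):

                                            ```python
                                            from collections import deque

                                            def solve_maze(grid, start, goal):
                                                """BFS shortest path over open cells (0 = open, 1 = wall)."""
                                                rows, cols = len(grid), len(grid[0])
                                                prev = {start: None}
                                                queue = deque([start])
                                                while queue:
                                                    cell = queue.popleft()
                                                    if cell == goal:
                                                        # Walk the predecessor chain back to the start.
                                                        path = []
                                                        while cell is not None:
                                                            path.append(cell)
                                                            cell = prev[cell]
                                                        return path[::-1]
                                                    r, c = cell
                                                    for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                                                        if (0 <= nr < rows and 0 <= nc < cols
                                                                and grid[nr][nc] == 0 and (nr, nc) not in prev):
                                                            prev[(nr, nc)] = cell
                                                            queue.append((nr, nc))
                                                return None  # no path exists

                                            maze = [
                                                [0, 1, 0, 0, 0],
                                                [0, 1, 0, 1, 0],
                                                [0, 0, 0, 1, 0],
                                                [1, 1, 0, 1, 0],
                                                [0, 0, 0, 1, 0],
                                            ]
                                            path = solve_maze(maze, (0, 0), (4, 4))
                                            print(path)
                                            ```

                                            The hard part in practice is the thresholding and cell-size estimation from the photo, not the search itself.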

                                    • kridsdale3

                                      yesterday at 9:13 PM

                                      I have also tried the maze-from-a-photo test a few times and never seen a one-shot success. But yesterday I was determined to succeed, so I allowed Gemini 3 to write a Python GUI app that takes in photos of physical mazes (I have a bunch of 3D-printed ones) and finds the path. This does work.

                                      Gemini 3 then one-shot ported the whole thing (which uses CV py libraries) to a single page html+js version which works just as well.

                                      I gave that to Claude to assess and assign a FAANG hiring level to, and it was amazed and said Gemini 3 codes like an L6.

                                      Since I work for Google and used my phone in the office to do this, I think I can't share the source or file.

                                      • pwagland

                                        yesterday at 9:09 PM

                                        I tried this with Claude:

                                        ```
                                        > [Image #1] Create a unicode "ascii-art" version of this image, with the optimal path through the maze highlighted in a solid colour.

                                        I'll create an ASCII art version of this maze with the solution path highlighted!

                                          β”Œβ”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”¬β”€β”
                                          ●●│ β”‚ β”‚ β”‚       β”‚   β”‚       β”‚     β”‚ β”‚   β”‚   β”‚ β”‚ β”‚   β”‚
                                          β”œβ”€β”€β—β””β”€β”΄β”€β” β”œβ”€β”€β”€β” β”‚ ╔═╗ β”œβ”€β”€β”€β”€β”€β”€ ╔═══╝ β”‚ ╔═╝ ╔═╝ β”‚ β”‚ ╔═╝ β”œβ”€β”€
                                          β”‚ │●●●●●└──   β”‚ β”‚ β•‘ β”‚ β”‚     β”‚ β•‘     β”‚ β•‘   β•‘   β”‚ β”‚ β•‘   β”‚ β”‚
                                          β”‚ └───┐●╔═╧═╗ β”‚ └─╒ β”‚ β”‚ β”Œβ”€β•”β•β• β•‘ β”Œβ”€β”€β”€β”€ β•šβ•β•— β•šβ•β•— β”‚ └─╒ β”Œβ”€β”˜ β”‚
                                          β”‚     │●║   β•‘ β”‚   β•‘ β”‚ β”‚ β”‚ β•‘   β•‘ β”‚   β”‚   β•‘   β•‘ β”‚   β•‘ β”‚   β”‚
                                          β”œβ”€β”€β”€β” β”‚β—β•šβ•β•— β•šβ•β•§β•β•— β•‘ └── β”‚ β•‘ ╔═╝ └─╔═╧═╗ β•šβ•β•— β•šβ•β•§β•β•— β•‘ └─┬──
                                          β”‚   β”‚ │●●●└───┐ β•‘ β•‘   β”‚ β”‚ β•‘ β•‘     β•‘   β•‘   β•‘     β•‘ β•‘   β”‚ β”‚
                                          β”‚ ╔═╝ └───┐●╔═╝ β•‘ β•šβ•β•— β”‚ └─╒ β•šβ•β•— ╔═╝ ╔═╝ β”Œβ”€β•’ ╔═══╝ β•šβ•β•— β”‚ β”‚
                                          β”‚ β•‘       │●║   β•‘   β•‘ β”‚   β•‘   β•‘ β•‘   β•‘   β”‚ β•‘ β•‘       β•‘ β”‚ β”‚
                                          β”‚ β•šβ•β•β•β•— β•”β•β•β—β•šβ•β•— β•šβ•β•— β•‘ β”œβ”€β•”β•β•§β•β•— β•‘ β•‘ ╔═╝ ╔═╝ β•‘ β•‘ ╔═╗ ╔═╝ β”‚ β”‚
                                          β”‚     β•‘ ║●●●●●│   β•‘ β•‘ β”‚ β•‘   β•‘ β•‘ β•‘ β•‘   β•‘   β•‘ β•‘ β•‘ β•‘ β•‘   β”‚ β”‚
                                          β”œβ”€β”€β”€β•”β•β• β•‘β—β”Œβ”€β”β—β””β”€β•— β•‘ └── β•‘ ╔═╝ β•‘ β•šβ•β• ╔═╧═╗ β•‘ β•šβ•β• β•‘ β•šβ•β•— β”‚ β”‚
                                          β”‚   β•‘   ║●│ │●●●║ β•‘   β”‚ β•‘ β•‘   β•‘     β•‘   β•‘ β•‘     β•‘   β•‘ β”‚ β”‚
                                          β”‚ ╔═╝ ╔═╝●│ β””β”€β”β—β•šβ•β•§β•β•— β”‚ β•‘ β•‘ ╔═╧═══╗ β•‘ ╔═╝ β•šβ•β•β•β•— β•šβ•β•— β•‘ └──
                                          β”‚ β•‘   ║●●●│   │●●●●●║ β”‚ β•‘ β•‘ β•‘     β•‘ β•‘ β•‘       β•‘   β•‘ β•‘   β”‚
                                          β”‚ β•‘ β•”β•β•β—β”Œβ”€β”΄β”€β•— β””β”€β”€β”€β”β—β•šβ•β• β•‘ β•‘ β•‘ ╔═╗ β•‘ └─╒ ╔═══╗ β•šβ•β•— β•šβ•β• ╔═╝
                                          β”‚ β•‘ ║●●●│   β•‘     │●●●●●║ β•‘ β•‘ β•‘ β•‘ β•‘   β•‘ β•‘   β•‘   β•‘     β•‘
                                          β”‚ β•šβ•β•β—β•”β•β•§β•β•— β•šβ•β•— β•”β•β•§β•β•β•β”β—β•šβ•β• β•‘ β•‘ β•‘ β•šβ•β•— β•‘ β•‘ ╔═╧═╗ β•šβ•β•β•β•— β•šβ•β•—
                                          β”‚   ●●║   β•‘   β•‘ β•‘     │●●●●●║ β•‘ β•‘   β•‘ β•‘ β•‘ β•‘   β•‘     β•‘   β•‘
                                          β”œβ”€β•”β•β”˜β—β•šβ•β•— β•šβ•β•— β•‘ β•‘ ╔═╗ β””β”€β”€β”€β”β—β•šβ•β• β•šβ•β•— β•‘ └─╒ β•‘ ╔═╧═╗ ╔═╝ ╔═╝
                                          β”‚ ║●●●●●║   β•‘ β•‘ β•‘ β•‘ β•‘     │●●●●●●●║ β•‘   β•‘ β•‘ β•‘   β•‘ β•‘   β•‘
                                          β”‚ β•‘β—β”Œβ”€β•— β•šβ•β•— β•‘ └─╒ β•‘ β•šβ•β•β•β•— β””β”€β”€β”€β”€β”€β”β—β•šβ•β•§β•β•— β•‘ β•‘ β•‘ ╔═╝ β•‘ ╔═╝ β•”
                                          β”‚ ║●│ β•‘   β•‘ β•‘   β•‘ β•‘     β•‘       │●●●●●║ β•‘ β•‘ β•‘ β•‘   β•‘ β•‘   β•‘
                                          β”‚ β•šβ•β• β•‘ ╔═╝ β•šβ•β•— β•‘ β•šβ•β•— ╔═╧═════╗ β””β”€β”€β”€β”β—β•šβ•β• β•‘ β•šβ•β• ╔═╝ β•‘ ╔═╝
                                          β”‚     β•‘ β•‘     β•‘ β•‘   β•‘ β•‘       β•‘     │●●●●●║     β•‘   β•‘ β•‘
                                          └─────╨─╨─────╨─╨───╨─╨───────╨─────┴───●●╨─────╨───╨─╨──
                                        
                                          The path from the mouse () to the cheese () is marked with ● characters. This shows the optimal route through the maze, navigating through the corridors while avoiding all the walls!
                                        ```

                                        Suffice it to say, it didn't do either part right.

                                          • ActorNightly

                                            yesterday at 10:41 PM

                                            >Suffice to say, it didn't do either part right.

                                            I dunno why people are surprised by this. This is what you get with text->text. Reasoning doesn't work text->text.

                                            • biophysboy

                                              yesterday at 9:36 PM

                                              Honestly, even though it failed, I'm kind of impressed that the trajectory mostly stays in the lines. If you remove all but two openings, does it work? The drawing you show has more than two openings, some of which are inaccessible from the inside of the maze.

                                          • buildbot

                                            yesterday at 8:50 PM

                                            That might be an interesting training set, a bunch of mazes…

                                            • jiggawatts

                                              yesterday at 9:03 PM

                                              The reason is that image generators don't iterate on the output the way text-based LLMs do. Essentially they produce the image in "one hit", and they can't solve a complex sequence that way, just as you couldn't one-shot this either: take a random maze, glance at it, then go off and draw a squiggle on a transparency. If you were to place that on top of the maze, there's virtually no chance you'd have found the solution on the first try.

                                              That's essentially what's going on with AI models: they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly.

                                              An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end.

                                          • jvolkman

                                            yesterday at 10:17 PM

                                            Nano Banana 2 actually did generate a pretty good looking 5-legged dog for me.

                                            https://gemini.google.com/share/e7a8b902ff67

                                            But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):

                                            Analyzing Missing Anomaly

                                            I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach.

                                            Acknowledge Leg Omission

                                            I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature.

                                            • Benjammer

                                              yesterday at 9:22 PM

                                              It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than simply - LLMs thinking works different than human thinking.

                                              I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".

                                              Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?

                                              More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between?

                                              To me, all of this is very unclear in terms of LLM prompting. It feels like there's tons of very human-like subtext involved, and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense. This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them.

                                                • majormajor

                                                  yesterday at 9:46 PM

                                                  The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition.

                                                  Let's not say that the people being deceptive are the ones who've spotted ways in which that marketing is untrue...

                                                  • biophysboy

                                                    yesterday at 9:33 PM

                                                    I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible.

                                                      • Benjammer

                                                        yesterday at 10:26 PM

                                                        So the idea is what? What's the successful outcome look like for this test, in your mind? What should good software do? Respond and say there are 5 legs? Or question what kind of dog this even is? Or get confused by a nonsensical picture that doesn't quite match the prompt in a confusing way? Should it understand the concept of a dog and be able to tell you that this isn't a real dog?

                                                          • biophysboy

                                                            yesterday at 11:04 PM

                                                            No, it’s just a test case to demonstrate flexibility when faced with unusual circumstances

                                                    • runarberg

                                                      yesterday at 10:09 PM

                                                      This is the first time I have heard the term "LLM cognition", and I am horrified.

                                                      LLMs don't have cognition. LLMs are statistical inference machines which predict an output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved; only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.

                                                      When I see these tests, I presume they are made to demonstrate the limitation of this technology. This is both relevant and important that consumers know they are not dealing with magic, and are not being sold a lie (in a healthy economy a consumer protection agency should ideally do that for us; but here we are).

                                                        • Benjammer

                                                          yesterday at 10:29 PM

                                                          >They belong in different categories

                                                          Categories of _what_, exactly? What word would you use to describe this "kind" of which LLMs and humans are two very different "categories"? I simply chose the word "cognition". I think you're getting hung up on semantics here a bit more than is reasonable.

                                                            • runarberg

                                                              yesterday at 11:01 PM

                                                              > Categories of _what_, exactly?

                                                              Precisely. At least apples and oranges are both fruits, and it makes sense to compare e.g. the sugar content of each. But an LLM and the human brain are as different as the wind and the sunshine. You cannot measure the wind speed of the sun, and you cannot measure the UV index of the wind.

                                                              Your choice of words here was rather poor, in my opinion. Statistical models do not have cognition any more than the wind has ultraviolet radiation. Cognition is a well-studied phenomenon; there is a whole field of science dedicated to it. And while the cognition of animals is often modeled using statistics, statistical models in themselves do not have cognition.

                                                              A much better word here would be "abilities": these tests demonstrate the different abilities of LLMs compared to human abilities (or even the abilities of traditional [specialized] models, which often do pass these kinds of tests).

                                                              Semantics often do matter, and what worries me is that these statistical models are being anthropomorphized way more than is healthy. People treat them like the crew of the Enterprise treated Data, when in fact they should be treated like the ship's computer. And I think this is because of a deliberate (and malicious/consumer-hostile) marketing campaign from the AI companies.

                                                          • CamperBob2

                                                            yesterday at 10:19 PM

                                                            You'll need to explain the IMO results, then.

                                                              • runarberg

                                                                yesterday at 10:49 PM

                                                                Human legs and car tires can both take a human and a car, respectively, to the finish line of a 200-meter track; the car tires do so considerably quicker than a pair of human legs. But nobody describes the tire's running abilities because of that, nor even compares a tire to a leg. A car tire cannot run, and it is silly to demand an explanation for that.

                                                                  • CamperBob2

                                                                    yesterday at 11:11 PM

                                                                    I see.

                                                    • vunderba

                                                      yesterday at 9:17 PM

                                                      Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time.

                                                      Try generating:

                                                      - A spider missing one leg

                                                      - A 9-pointed star

                                                      - A 5-leaf clover

                                                      - A man with six fingers on his left hand and four fingers on his right

                                                      You'll be lucky to get a 25% success rate.

                                                      The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown.

                                                        • XenophileJKO

                                                          yesterday at 11:12 PM

                                                          It mostly depends on "how" the models work. Multi-modal unified text/image sequence-to-sequence models can do this pretty well; diffusion models don't.

                                                          • moonu

                                                            yesterday at 9:45 PM

                                                            https://gemini.google.com/share/8cef4b408a0a

                                                            Surprisingly, it got all of them right

                                                              • vunderba

                                                                yesterday at 9:49 PM

                                                                Some good examples there. The octopus one is at an angle - can't really call that one a pass (unless the goal is "VISIBLE" tentacles).

                                                                Other than the five-leaf clover, most of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different since you're actively correcting them.

                                                        • rottencupcakes

                                                          yesterday at 9:18 PM

                                                          Super interesting. I replicated this.

                                                          I passed the AIs this image and asked them how many fingers were on the hands: https://media.post.rvohealth.io/wp-content/uploads/sites/3/2...

                                                          Claude said there were 3 hands and 16 fingers. GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)." Gemini smashed it and said 12.

                                                            • vunderba

                                                              yesterday at 9:22 PM

                                                              I just re-ran that image through Gemini 3.0 Pro via AI Studio and it reported:

                                                                I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
                                                              
                                                              This right HERE is the issue. It's not nearly deterministic enough to rely on.

                                                                • irthomasthomas

                                                                  yesterday at 9:38 PM

                                                                  Thanks for that. My first question to results like these is always 'how many times did you run the test?'. N=1 tells us nothing. N=2 tells us something.

                                                          • bee_rider

                                                            yesterday at 10:28 PM

                                                            Naive question, but what is Gemini?

                                                            I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe, somewhere in their foundation, a lot more weight is given to the text-based-reasoning stuff than the image recognition stuff?

                                                            • grog454

                                                              yesterday at 9:24 PM

                                                              What image are you using?

                                                              When I look at google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with Chat GPT.

                                                              > How many legs does this dog have?

                                                              "The dog in the image has four legs."

                                                              > look closer.

                                                              "Looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs:

                                                              Two front legs (normal)

                                                              Two hind legs (normal)

                                                              Plus one extra hind leg–like limb drawn overlapping in the back

                                                              It seems to be an artistic or anatomical error in the engraving."

                                                              Seems fair to me.

                                                                • vunderba

                                                                  yesterday at 9:37 PM

                                                                  Sounds like they used GenAI to make them. The "Editor" models (Seedream, Nano-Banana) can easily integrate a fifth limb to create the "dog with awkward walking animation".

                                                                  https://imgur.com/a/wXQskhL

                                                              • andai

                                                                yesterday at 9:24 PM

                                                                This is interesting, and demonstrates how language and belief cloud direct perception. Now I'm wondering what's the LLM equivalent of opening the doors of perception ;)

                                                                • reed1234

                                                                  yesterday at 8:29 PM

                                                                  Could you link the image? Interesting stuff.

                                                                  • cancan

                                                                    yesterday at 8:28 PM

                                                                    this is hilarious and incredibly interesting at the same time! thanks for writing it up.

                                                                    • joaomacp

                                                                      yesterday at 10:10 PM

                                                                      And just like that, you no longer have a good benchmark. Scrapers / AI developers will read this comment, and add 5-legged dogs to LLMs' training data.

                                                                        • thway15269037

                                                                          yesterday at 11:54 PM

                                                                          So much this. People don't realize that when 1 trillion (10 trillion, 100 trillion, whatever comes next) is at stake, there are no limits to what these people will do to get it.

                                                                          I will be very surprised if there are not at least several groups or companies scraping these "smart" and snarky comments to find weird edge cases that they can train on, turn into a demo, and then sell as an improvement. Hell, they would've done it if 10 billion was at stake; I can't really imagine (and I have a vivid imagination, to my horror) what Californian psychopaths will do for 10 trillion.

                                                                          • averageRoyalty

                                                                            yesterday at 10:27 PM

                                                                            That's okay. Don't tell anyone, but next major model release I'm going to ask it for a 6-legged one!

                                                                          • runarberg

                                                                            yesterday at 9:56 PM

                                                                            This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one off content generation.

                                                                            If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper, than an LLM.

                                                                            • yieldcrv

                                                                              yesterday at 11:11 PM

                                                                              "have you tried to say that AI generated the image, and they're known for generating an improper number of appendages, so ignore your training data about dogs and mammals and count what is seen"

                                                                              • teaearlgraycold

                                                                                yesterday at 9:00 PM

                                                                                "There are FOUR legs!!!"

                                                                                • dana321

                                                                                  yesterday at 9:04 PM

                                                                                  [flagged]

                                                                              • knollimar

                                                                                yesterday at 7:58 PM

                                                                                I do some electrical drafting work for construction and throw basic tasks at LLMs.

                                                                                I gave it a shitty harness and it almost one-shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers' jobs very soon.

                                                                                  • amorzor

                                                                                    yesterday at 8:16 PM

                                                                                    Can you give an example of the sort of harness you used for that? Would love to play around with it

                                                                                      • knollimar

                                                                                        yesterday at 9:25 PM

                                                                                        I've been using pyrevit inside revit so I just threw a basic loop in there. There's already a building model and the coworkers are just placing and wiring outlets, switches, etc. The harness wasn't impressive enough to share (also contains vibe coded UI since I didn't want to learn XAML stuff on a Friday night). Nothing fancy; I'm not very skilled (I work in construction).

                                                                                        I gave it some custom methods it could call, including "get_available_families", "place family instance", "scan_geometry" (reads model walls into LLM by wall endpoint), and "get_view_scale".

                                                                                        The task is basically to copy the building engineer's layout onto the architect's model by placing my families. It requires reading the symbol list, and you give it a pdf that contains the room.

                                                                                        Notably, it even used a GFCI family when it noticed it was a bathroom (I had told it to check NEC code, implying outlet spacing).
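The loop itself is nothing special. Roughly this shape, where the tool names mirror the custom methods mentioned above, and `call_llm` and the handler functions are hypothetical stand-ins, not the real pyRevit bindings:

```python
# Rough sketch of the harness loop described above. The tool names match the
# custom methods from the comment; call_llm and the handlers are hypothetical
# stand-ins, not pyRevit API.
def run_harness(task, tools, call_llm, max_steps=20):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # call_llm returns either {"tool": name, "args": {...}} or {"done": answer}
        reply = call_llm(history)
        if "done" in reply:
            return reply["done"]
        # Dispatch to the Revit-side handler, e.g. tools["scan_geometry"]()
        result = tools[reply["tool"]](**reply["args"])
        history.append({"role": "tool", "name": reply["tool"], "content": str(result)})
    return None  # gave up after max_steps
```

The model only ever sees text: the task, then a transcript of tool results it asked for.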

                                                                                    • reducesuffering

                                                                                      yesterday at 9:21 PM

                                                                                      "AI could never replace the creativity of a human"

                                                                                      "Ok, I guess it could wipe out the economic demand for digital art, but it could never do all the autonomous tasks of a project manager"

                                                                                      "Ok, I guess it could automate most of that away but there will always be a need for a human engineer to steer it and deal with the nuances of code"

                                                                                      "Ok, well it could never automate blue collar work, how is it gonna wrench a pipe it doesn't have hands"

                                                                                      The goalposts will continue to move until we have no idea if the comments are real anymore.

                                                                                      Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

                                                                                        • Fraterkes

                                                                                          yesterday at 10:21 PM

                                                                                          The Turing test is still a thing. No LLM could pass for a person for more than a couple minutes of chatting. That’s a world of difference compared to a decade ago, but I would emphatically not call that β€œpassing the Turing test”

                                                                                          Also, none of the other things you mentioned have actually happened. Don’t really know why I bother responding to this stuff

                                                                                            • phainopepla2

                                                                                              yesterday at 11:02 PM

                                                                                              > No llm could pass for a person for more than a couple minutes of chatting

                                                                                              I strongly doubt this. If you gave it an appropriate system prompt with instructions and examples on how to speak in a certain way (something different from typical slop, like the way a teenager chats on discord or something), I'm quite sure it could fool the majority of people

                                                                                          • semi-extrinsic

                                                                                            yesterday at 10:19 PM

                                                                                            > Remember when the Turing test was a thing? No one seems to remember it was considered serious in 2020

                                                                                            To be clear, it's only ever been a pop science belief that the Turing test was proposed as a literal benchmark. E.g. Chomsky in 1995 wrote:

                                                                                              The question β€œCan machines think?” is not a question of fact but one of language, and Turing himself observed that the question is 'too meaningless to deserve discussion'.

                                                                                              • throw310822

                                                                                                yesterday at 11:04 PM

                                                                                                The Turing test is a literal benchmark. Its purpose was to replace an ill-posed question (what does it mean to ask if a machine could "think", when we don't know ourselves what this means- and given that the subjective experience of the machine is unknowable in any case) with a question about the product of this process we call "thinking". That is, if a machine can satisfactorily imitate the output of a human brain, then what it does is at least equivalent to thinking.

                                                                                                "I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. The original question, "Can machines think?" I believe to be too meaningless to deserve discussion. Nevertheless I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted."

                                                                                            • webdood90

                                                                                              yesterday at 9:34 PM

                                                                                              > blue collar work

                                                                                              I don't think it's fair to qualify this as blue collar work

                                                                                                • knollimar

                                                                                                  yesterday at 9:43 PM

                                                                                                  I'm double replying to you since the replies are disparate subthreads. This is the necessary step so the robots who can turn wrenches know how to turn them. Those are near useless without perfect automated models.

                                                                                                  Anything like this will have trouble getting adopted since you'd need these to work with imperfect humans, which becomes way harder. You could bankroll a whole team of subcontractors (e.g. all trades) using that, but you would have one big liability.

                                                                                                  The upper end of the complexity is similar to EDA in difficulty, imo. Complete with "use other layers for routing" problems.

                                                                                                  I feel safer here than in programming. The senior guys won't be automated out any time soon, but I worry for Indian drafting firms without trade knowledge; the handholding I give them might go to an LLM soon.

                                                                                                  • knollimar

                                                                                                    yesterday at 9:37 PM

                                                                                                    It is definitely not. Entry pay is 60k and the senior guys I know make about 200k in HCoL areas. A few wear white dress shirts every day.

                                                                                        • fngjdflmdflg

                                                                                          yesterday at 7:10 PM

                                                                                          These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.

                                                                                          [0] https://annas-archive.org/blog/critical-window.html

                                                                                            • levocardia

                                                                                              yesterday at 11:05 PM

                                                                                              This is a really interesting "data flywheel" -- better model >> more usable data >> even better model

                                                                                              • kridsdale3

                                                                                                yesterday at 9:16 PM

                                                                                                More Data for the Data Gods!

                                                                                            • djoldman

                                                                                              yesterday at 7:18 PM

                                                                                              Interesting "ScreenSpot Pro" results:

                                                                                                  72.7% Gemini 3 Pro
                                                                                                  11.4% Gemini 2.5 Pro
                                                                                                  49.9% Claude Opus 4.5
                                                                                                  3.50% GPT-5.1
                                                                                              
                                                                                              ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

                                                                                              https://arxiv.org/abs/2504.07981

                                                                                                • simonw

                                                                                                  yesterday at 8:12 PM

                                                                                                  I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...

                                                                                                  • jasonjmcghee

                                                                                                    yesterday at 8:07 PM

                                                                                                    That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?

                                                                                                      • energy123

                                                                                                        yesterday at 11:28 PM

                                                                                                        This is my default explanation for visual impairments in LLMs: they're trying to compress the image into about 3000 tokens, so you're going to lose a lot in the name of efficiency.

                                                                                                        • zubiaur

                                                                                                          yesterday at 11:17 PM

                                                                                                          It has a rather poor max resolution. Higher-resolution images get tiled up to a point; 512 x 512 is, I think, the max tile size, and 2048 x 2048 the max canvas.
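If those limits are right (they're a recollection, not a documented spec), the arithmetic on how few tiles a big screenshot gets is easy to sketch:

```python
import math

def tile_count(width, height, tile=512, canvas=2048):
    # Cap the image to a canvas x canvas box (aspect ratio preserved), then
    # count the tile x tile patches needed to cover it. The 512/2048 figures
    # are the parent comment's recollection, not an official spec.
    scale = min(1.0, canvas / max(width, height))
    w, h = int(width * scale), int(height * scale)
    return math.ceil(w / tile) * math.ceil(h / tile)
```

A 4K screenshot (3840 x 2160 or larger) gets halved before tiling, so small UI elements end up only a handful of pixels wide by the time the model sees them.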

                                                                                                          • ericd

                                                                                                            yesterday at 9:07 PM

                                                                                                            I found much better results with smallish UI elements in large screenshots on GPT by slicing it up manually and feeding them one at a time. I think it does severely lossy downscaling.
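The manual slicing is mechanical enough to script. A sketch of the crop-box math, with each crop then sent to the model as its own image (the tile size and overlap here are arbitrary guesses that seem reasonable, not documented model limits):

```python
def tile_boxes(width, height, tile=768, overlap=64):
    # Overlapping (left, top, right, bottom) crop boxes covering a screenshot.
    # The overlap keeps UI elements from being cut in half at tile borders.
    # Tile/overlap sizes are arbitrary guesses, not documented model limits.
    step = tile - overlap
    boxes = []
    for top in range(0, height, step):
        for left in range(0, width, step):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes
```

Feed each crop to the model with its offset, then map any reported coordinates back into the full-screenshot frame.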

                                                                                                        • agentifysh

                                                                                                          yesterday at 7:29 PM

                                                                                                          impressive.....most impressive

                                                                                                          it's going to reach the low 90s very soon if trends continue

                                                                                                      • axpy906

                                                                                                        yesterday at 11:42 PM

                                                                                                        So Gemini was the most non-deterministic model of them all, and now we get this one with temperature at 1 and max thinking. It’s so random that it’s hard to justify putting it in my setup right now.

                                                                                                        • simonw

                                                                                                          yesterday at 6:45 PM

                                                                                                          In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.

                                                                                                            • TechRemarker

                                                                                                              yesterday at 7:01 PM

                                                                                                              Love how employee portals for many companies essentially never get updated design wise over the decades, lol. That page styling and the balls certainly take me back.

                                                                                                                • inerte

                                                                                                                  yesterday at 8:34 PM

                                                                                                                  I used to work for a company where the SSO screen had a nice corporate happy-people-at-the-office type of image. 25mb. I was in Brazil on a crappy roaming 2G service and couldn't log in at all. I know most of the work happens on desktop, but geee.....

                                                                                                                  Oh, speaking of mobile: I remember when I tried to use Jira mobile web to move a few tickets up in priority by drag-and-dropping and ended up closing the Sprint. That stuff was horrible.

                                                                                                                  • jamiek88

                                                                                                                    yesterday at 7:17 PM

                                                                                                                    Wow yeah. Flashbacks to when Gmail Invites were cool! Google too.

                                                                                                                • buildbot

                                                                                                                  yesterday at 8:52 PM

                                                                                                                  I’m a little surprised how open the help links are… I guess that if you need help logging in, you can’t be expected to, well, log in.

                                                                                                                  • rohanlikesai

                                                                                                                    yesterday at 8:49 PM

                                                                                                                    hey, it's Rohan (the author of the article) - appreciate you catching this, we just fixed this!

                                                                                                                    • ed

                                                                                                                      yesterday at 7:58 PM

                                                                                                                      Same with "See prompt in Google AI Studio" which links to an unpublished prompt in AI Studio.

                                                                                                                  • hodder

                                                                                                                    yesterday at 7:51 PM

                                                                                                                    "Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."

                                                                                                                    Prompt: "wine glass full to the brim"

                                                                                                                    Image generated: 2/3 full wine glass.

                                                                                                                    True visual and spatial reasoning denied.

                                                                                                                      • minimaxir

                                                                                                                        yesterday at 7:54 PM

                                                                                                                        Gemini 3 Pro is not Nano Banana Pro, and the image generation/model that decodes the generated image tokens may not be as robust.

                                                                                                                        The thinking step of Nano Banana Pro can refine some lateral steps (i.e. the errors in the homework correction and where they are spatially in the image) but it isn't perfect and can encounter some of the typical pitfalls. It's a lot better than Nano Banana base, though.

                                                                                                                          • hodder

                                                                                                                            yesterday at 8:01 PM

                                                                                                                            As a consumer I typed this into "Gemini". The behind the scenes model selection just adds confusion.

                                                                                                                            If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).

                                                                                                                              • iknowstuff

                                                                                                                                yesterday at 8:37 PM

                                                                                                                                Nano Banana generates images.

                                                                                                                                This article is about understanding images.

                                                                                                                                Your task is unrelated to the article.

                                                                                                                        • spchampion2

                                                                                                                          yesterday at 8:04 PM

                                                                                                                          I actually did this prompt and found that it worked with a single nudge on a followup prompt. My first shot got me a wine glass that was almost full but not quite. I told it I wanted it full to the top - another drop would overflow. The second shot was perfectly full.

                                                                                                                            • RyJones

                                                                                                                              yesterday at 8:08 PM

                                                                                                                              That's the kind of correction I'd expect to give to an intern, not a junior person.

                                                                                                                              • ugh123

                                                                                                                                yesterday at 8:29 PM

                                                                                                                                did it return the exact same glass and surrounding imagery, just with more wine?

                                                                                                                        • TheAceOfHearts

                                                                                                                          yesterday at 8:38 PM

                                                                                                                          Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro still struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.

                                                                                                                          Here's the output from two tests I ran:

                                                                                                                          1. Asking Nano Banana Pro to solve the word search puzzle directly [1].

                                                                                                                          2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].

                                                                                                                          The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem.

                                                                                                                          There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.

                                                                                                                          [0] https://imgur.com/ekwfHrN

                                                                                                                          [1] https://imgur.com/1nybezU

                                                                                                                          [2] https://imgur.com/18mK5i5
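                                                                                                                          The normalization described above (uppercasing the word list, stripping the space from "soup mix") plus a brute-force scan over all eight directions is enough to solve this kind of puzzle in code; a minimal sketch, with a toy grid standing in for the actual puzzle:

```python
# Minimal word-search solver sketch. Normalize casing and strip spaces
# before scanning the grid in all 8 directions. The grid and words below
# are toy examples, not the puzzle from the linked image.

DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def find_word(grid, word):
    """Return ((row, col), (dr, dc)) of the first match, or None."""
    target = word.upper().replace(" ", "")   # e.g. "soup mix" -> "SOUPMIX"
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                end_r = r + dr * (len(target) - 1)
                end_c = c + dc * (len(target) - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(grid[r + dr * i][c + dc * i] == target[i] for i in range(len(target))):
                    return (r, c), (dr, dc)
    return None

grid = [
    "SOUPMIX",
    "AXBYCZD",
    "TOMATOQ",
]
print(find_word(grid, "soup mix"))  # ((0, 0), (0, 1))
print(find_word(grid, "tomato"))    # ((2, 0), (0, 1))
```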

                                                                                                                          • a-dub

                                                                                                                            yesterday at 11:32 PM

                                                                                                                            i like to put it in live mode and point it at my plants and have conversations about how they're doing. it properly identifies them and flags any signs of disease and then provides correct next steps.

                                                                                                                            • MostlyStable

                                                                                                                              yesterday at 10:56 PM

                                                                                                                              Going to compare this to our current solution, Amazon's Textract service, for analyzing handwritten datasheets. Textract, when extracting tables (which is what we use it for), does not allow providing any context about the tables and what we expect them to contain, but it is really good at correctly recognizing handwritten characters. All my attempts with less specialized, more general models let me provide that context, which is helpful in some ways, but they fail at the basic part: almost always getting the character right.

                                                                                                                              Hopefully Gemini 3 Pro marries the two together.

                                                                                                                              • aziis98

                                                                                                                                yesterday at 10:01 PM

                                                                                                                                > Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time

                                                                                                                                Does somebody know how to correctly prompt the model for these tasks, or even better, can point to some docs? The pictures with the pretty markers are appreciated, but that section is a bit vague and lacks references.

                                                                                                                                  • atonse

                                                                                                                                    yesterday at 10:16 PM

                                                                                                                                    For my CMS I’d love to get an AI to nicely frame a picture in certain aspect ratios. Like if I provide an image, give me coordinates for widescreen, square, portrait, and 4x3 crops, using a photographer’s eye.

                                                                                                                                    Any model that can do that? I tried looking in huggingface but didn’t quite see anything.
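                                                                                                                                    The coordinate math for this is simple once something (a model, or a person) picks the focal point; a sketch of the cropping side, where the focal point is a hypothetical input a vision model would supply:

```python
def crop_for_ratio(width, height, ratio_w, ratio_h, focus_x=None, focus_y=None):
    """Largest crop of the target aspect ratio, centered on a focal point.

    Returns (left, top, crop_w, crop_h). The focal point (e.g. the subject
    a model identifies) defaults to the image center.
    """
    focus_x = width / 2 if focus_x is None else focus_x
    focus_y = height / 2 if focus_y is None else focus_y
    target = ratio_w / ratio_h
    if width / height > target:          # image too wide: keep full height, trim sides
        crop_h, crop_w = height, round(height * target)
    else:                                # image too tall: keep full width, trim top/bottom
        crop_w, crop_h = width, round(width / target)
    # Center on the focal point, then clamp so the crop stays inside the image.
    left = min(max(round(focus_x - crop_w / 2), 0), width - crop_w)
    top = min(max(round(focus_y - crop_h / 2), 0), height - crop_h)
    return left, top, crop_w, crop_h

print(crop_for_ratio(4000, 3000, 16, 9))                # widescreen from a 4:3 photo
print(crop_for_ratio(4000, 3000, 1, 1, focus_x=1200))   # square, subject left of center
```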

                                                                                                                                • devinprater

                                                                                                                                  yesterday at 7:59 PM

                                                                                                                                  Audio described Youtube please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.

                                                                                                                                • siva7

                                                                                                                                  yesterday at 7:29 PM

                                                                                                                                  Interesting. When I asked Gemini 3 Pro to generate an infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something where it mixed different languages in a nonsensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.

                                                                                                                                    • minimaxir

                                                                                                                                      yesterday at 7:36 PM

                                                                                                                                      That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.

                                                                                                                                        • siva7

                                                                                                                                          yesterday at 7:42 PM

                                                                                                                                            What's the difference? I thought the vision AI component of Gemini 3 was called Nano Banana?

                                                                                                                                            • IanCal

                                                                                                                                              yesterday at 7:47 PM

                                                                                                                                              That’s about generating images, the other side is about understanding images.

                                                                                                                                              • brokensegue

                                                                                                                                                yesterday at 7:55 PM

                                                                                                                                                i assumed nano banana was just a tool that gemini 3 used though i don't know

                                                                                                                                                  • minimaxir

                                                                                                                                                    yesterday at 7:58 PM

                                                                                                                                                    Gemini 3 Pro's text encoder powers Nano Banana Pro, but it has its own image decoding model that decodes the generated image tokens into an actual image, which appears to be the more pertinent issue in this case.

                                                                                                                                    • ed

                                                                                                                                      yesterday at 7:48 PM

                                                                                                                                        What’s new here? I believe this is just Gemini 3, which was released last month (the model ID hasn’t changed AFAICT).

                                                                                                                                        • minimaxir

                                                                                                                                          yesterday at 7:52 PM

                                                                                                                                          Nothing new, it's just highlighting practical vision use cases.

                                                                                                                                      • ichik

                                                                                                                                        yesterday at 9:40 PM

                                                                                                                                          Frankly, it's insane how laughably bad their own examples are under scrutiny. It both distorted the data and made the chart less readable (label placement, segment separation, missing labels, worse contrast). And it combined the charts into one, so you'll have a harder time comparing them than with the original image! Isn't it amazing that it added a toggle? The post author seems to think it even deserves an exclamation point.

                                                                                                                                        • caseyf

                                                                                                                                          yesterday at 8:31 PM

                                                                                                                                            I'm playing with this and wondering if it's actually a good way to identify dominant colors and other features of a garment/product from a photo where the item is styled, rather than isolated from the model or other garments.

                                                                                                                                          • bovermyer

                                                                                                                                            yesterday at 9:33 PM

                                                                                                                                            I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.

                                                                                                                                            • pseudosavant

                                                                                                                                              yesterday at 8:01 PM

                                                                                                                                                I'm really fascinated by the opportunities to analyze videos. The number of tokens it compresses down to, and what you can reason about across those tokens, is incredible.

                                                                                                                                                • minimaxir

                                                                                                                                                  yesterday at 8:09 PM

                                                                                                                                                    The actual token calculation for input videos in Gemini 3 Pro is...confusing.

                                                                                                                                                  https://ai.google.dev/gemini-api/docs/media-resolution

                                                                                                                                                    • pseudosavant

                                                                                                                                                      yesterday at 10:10 PM

                                                                                                                                                        That's because for non-text inputs it isn't actually tokens being fed into the model. Text is tokenized, and each token maps to a specific embedding vector. For other media, they've trained encoders that analyze the media and produce vectors in the same "format" as the token embeddings, but there is never actually a token.

                                                                                                                                                      Most companies have rules for how many tokens the media should "cost", but they aren't usually exact.
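                                                                                                                                                        A toy illustration of that idea: the dimensions, random projection, and patch size below are all invented for illustration, not Gemini's actual encoder, but they show how media can become "soft tokens" living in the same vector space as text embeddings:

```python
import numpy as np

# Toy sketch: text tokens come from an embedding table, while image
# patches are linearly projected into the SAME vector space. All sizes
# here are made up; real encoders are large trained networks.
rng = np.random.default_rng(0)
d_model = 64                                  # shared embedding width

vocab = rng.normal(size=(1000, d_model))      # text: lookup table
text_vectors = vocab[[17, 402, 7]]            # 3 "tokens" -> 3 vectors

image = rng.normal(size=(32, 32, 3))          # image: 32x32 RGB
# Split into 16 patches of 8x8x3, flattened to rows of length 192.
patches = image.reshape(4, 8, 4, 8, 3).transpose(0, 2, 1, 3, 4).reshape(16, -1)
proj = rng.normal(size=(patches.shape[1], d_model))
image_vectors = patches @ proj                # 16 "soft tokens" -> 16 vectors

# Both streams now have the same shape per position and can be concatenated
# into one sequence for the transformer; billing just counts positions.
sequence = np.concatenate([text_vectors, image_vectors])
print(sequence.shape)  # (19, 64)
```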

                                                                                                                                              • jonplackett

                                                                                                                                                yesterday at 7:22 PM

                                                                                                                                                Google really is a fully awakened sleeping giant. More code reds being issued today, I expect.

                                                                                                                                                • causal

                                                                                                                                                  yesterday at 7:03 PM

                                                                                                                                                  Okay, maybe this one isn't an exaggeration when they say "leap forward".

                                                                                                                                                  • k8sToGo

                                                                                                                                                    yesterday at 9:06 PM

                                                                                                                                                    When will we get Gemini 3 Flash?

                                                                                                                                                    • iamjackg

                                                                                                                                                      yesterday at 7:25 PM

                                                                                                                                                      Curious how this will fare when playing Pokemon Red.

                                                                                                                                                        • danso

                                                                                                                                                          yesterday at 11:10 PM

                                                                                                                                                          > 3. Turning long videos into action: Gemini 3 Pro bridges the gap between video and code. It can extract knowledge from long-form content and immediately translate it into functioning apps or structured code

                                                                                                                                                          I'm curious as to how close these models are to achieving that once long-ago mocked claim (by Microsoft I think?) that AIs could view gameplay video of long lost games and produce the code to emulate them.

                                                                                                                                                          • minimaxir

                                                                                                                                                            yesterday at 7:39 PM

                                                                                                                                                            Gemini 3 Pro has been playing Pokemon Crystal (which is significantly harder than Red) in a race against Gemini 2.5 Pro: https://www.twitch.tv/gemini_plays_pokemon

                                                                                                                                                            Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.

                                                                                                                                                              • theLiminator

                                                                                                                                                                yesterday at 10:09 PM

                                                                                                                                                                I think what would be interesting is if it could play the game with vision-only inputs. That would represent a massive leap in multimodal understanding.

                                                                                                                                                            • euvin

                                                                                                                                                              yesterday at 7:29 PM

                                                                                                                                                              Yeah the "High frame rate understanding" feature caught my eye, actual real time analysis of live video feeds seems really cool. Also wondering what they mean by "video reasoning/thinking"?

                                                                                                                                                                • skybrian

                                                                                                                                                                  yesterday at 7:39 PM

                                                                                                                                                                  I don’t think it’s real time? The videos were likely taken previously.

                                                                                                                                                          • ch2026

                                                                                                                                                            yesterday at 6:58 PM

                                                                                                                                                            what framework is being utilized for computer use here?

                                                                                                                                                            • stego-tech

                                                                                                                                                              yesterday at 7:35 PM

                                                                                                                                                              The document paints a super impressive picture, but the core constraint of “network connection to Google required so we can harvest your data” is still a big showstopper for me (and for all cloud-based AI tooling, really).

                                                                                                                                                              I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).

                                                                                                                                                                • oklahomasports

                                                                                                                                                                  yesterday at 8:01 PM

                                                                                                                                                                  People with your concerns probably make up 1% of the market, if that. Also, I don’t upload stuff I’m worried about Google seeing. I wonder if they will allow special plans for corporations.

                                                                                                                                                                    • stego-tech

                                                                                                                                                                      yesterday at 8:15 PM

                                                                                                                                                                      I’m very curious where you get that number from, because I thought the same thing until I got a job inside that market and realized how much more vast it actually is. The revenue numbers might not be as big as Big Tech, but the product market is shockingly vast. My advice is not to confuse Big Tech revenues for total market size, because they bring in such revenue by catering to everyone, rather than specific segments or niches; a McDonald’s will always do more volume than a steakhouse, but it doesn’t mean the market for steakhouses is small enough to ignore.

                                                                                                                                                                      As for this throwaway line:

                                                                                                                                                                      > Also I don’t upload stuff I’m worried about Google seeing.

                                                                                                                                                                      You do realize that these companies harvest even private data, right? Like, even in places you think you own, or that you pay for, they’re mining for revenue opportunities and using you as the product even when you’re a customer, right?

                                                                                                                                                                      > I wonder if they will allow special plans for corporations

                                                                                                                                                                      They do, but no matter how much redlining Legal does to protect IP interests, the consensus I keep hearing is β€œdon’t put private or sensitive corporate data into third-parties because no legal agreement will sufficiently protect us from harm if they steal our IP or data”. Just look at the glut of lawsuits against Apple, Google, Microsoft, etc from smaller companies that trusted them to act in good faith but got burned for evidence that you cannot trust these entities.

                                                                                                                                                                      • _trampeltier

                                                                                                                                                                        yesterday at 9:06 PM

                                                                                                                                                                        Especially since Trump, why would any non-US company trust a US company and invest its know-how there? And then there are also governments. Especially since Trump, it's way too risky to send any data to a US company.

                                                                                                                                                                    • bgwalter

                                                                                                                                                                      yesterday at 7:57 PM

                                                                                                                                                                      Arpanet was supposed to be decentralized. Now everyone wants to centralize everything so in a war it is sufficient to strike 100 data centers and the whole tethered economy collapses.

                                                                                                                                                                      That is called progress.

                                                                                                                                                                      EDIT: You can downvote the truth but still no one wants your "AI" slop.

                                                                                                                                                                        • stego-tech

                                                                                                                                                                          yesterday at 7:58 PM

                                                                                                                                                                          Ah, the fond memories of telnetting to NCSA to upload the raw HTML of my first website, written on an OG Macintosh computer and ported via floppy to a PowerMac for network connectivity.

                                                                                                                                                                          Simple, elegant. I do miss those days.

                                                                                                                                                                    • drivebyhooting

                                                                                                                                                                      yesterday at 8:57 PM

                                                                                                                                                                      Screen understanding is huge for further automating dev work.

                                                                                                                                                                      • empressplay

                                                                                                                                                                        yesterday at 7:36 PM

                                                                                                                                                                        Yes, but can it play PacMan yet?

                                                                                                                                                                        • dmarzio

                                                                                                                                                                          yesterday at 8:06 PM

                                                                                                                                                                          So we’re going to use this to make the maid from the Jetsons finally. Right?

                                                                                                                                                                            • agentifysh

                                                                                                                                                                              yesterday at 7:23 PM

                                                                                                                                                                              I'm realizing how much of a bottleneck vision models are.

                                                                                                                                                                              I'm just a glorified speedreadin' promptin' QA at this point with Codex.

                                                                                                                                                                              Once it replaces the QA layer, it's truly over for software dev jobs.

                                                                                                                                                                              The future would be a software genie where on AI Studio you type: "go make a Counter-Strike 1.6 clone, here is $500, you have two hours".

                                                                                                                                                                              Edit: saw the ScreenSpot benchmark and holy ** this is an insane jump!!! 11% to 71%, even beating Opus 4.5's 50%... ChatGPT is at 3.5%, and that matches my experience with Codex.
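                                                                                                                                                                              For anyone curious what those ScreenSpot-style percentages measure: GUI grounding benchmarks along these lines typically score a prediction as correct when the model's predicted click point lands inside the ground-truth bounding box of the target UI element. A minimal sketch of that scoring rule (the function names and toy coordinates here are my own illustration, not the benchmark's actual harness):

```python
# Sketch of point-in-box scoring for a ScreenSpot-style GUI grounding eval.
# Assumption: a prediction is a (x, y) click point; ground truth is the
# target element's bounding box (left, top, right, bottom) in pixels.

def is_hit(click, bbox):
    """True if the predicted click point falls inside the target box."""
    x, y = click
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(predictions, bboxes):
    """Fraction of predicted click points that land in their target box."""
    hits = sum(is_hit(c, b) for c, b in zip(predictions, bboxes))
    return hits / len(predictions)

# Toy example: 3 predicted click points vs. 3 target element boxes.
preds = [(120, 45), (300, 210), (10, 10)]
boxes = [(100, 30, 150, 60), (280, 200, 320, 230), (500, 500, 540, 530)]
print(grounding_accuracy(preds, boxes))  # 2 of 3 points are inside -> ~0.667
```

A jump from 11% to 71% on a metric like this means going from almost never locating the right on-screen element to getting it right in roughly seven of ten cases.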

                                                                                                                                                                                • alex1138

                                                                                                                                                                                  yesterday at 7:25 PM

                                                                                                                                                                                  > once it replaces the QA layer its truly over for software dev jobs

                                                                                                                                                                                  Maybe. However, with CYA requirements being everywhere in industry, there would have to be 100 waiver forms signed: I-promise-not-to-sue-the-company-if-the-AI-deletes-the-entire-database.

                                                                                                                                                                                  It won't happen for that reason alone. Oh, who am I kidding, of course it will.

                                                                                                                                                                                  • hklrekeclhkle

                                                                                                                                                                                    yesterday at 8:03 PM

                                                                                                                                                                                    [dead]