Tiled Hacker news on React Router

Gemini 3 Pro: the frontier of vision AI

566 points - 12/05/2025

Source

Workaccount2
12/05/2025
Well
It is the first model to get partial-credit on an LLM image test I have. Which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs.
In fact GPT5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol.
Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (it's own words) also visible in the picture. The 5th leg was approximately where you could expect a well endowed dog to have a "5th leg".
That aside though, I still wouldn't call it particularly impressive.
As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also the dog with many legs (I have a few of them) all had there extra legs added by nano-banana.
knollimar
12/05/2025
I do some electrical drafting work for construction and throw basic tasks at LLMs.
I gave it a shitty harness and it almost 1 shotted laying out outlets in a room based on a shitty pdf. I think if I gave it better control it could do a huge portion of my coworkers jobs very soon
fngjdflmdflg
12/05/2025
These OCR improvements will almost certainly be brought to google books, which is great. Long term it can enable compressing all non-digital rare books into a manageable size that can be stored for less than $5,000.[0] It would also be great for archive.org to move to this from Tesseract. I wonder what the cost would be, both in raw cost to run, and via a paid API, to do that.
[0] https://annas-archive.org/blog/critical-window.html
djoldman
12/05/2025
Interesting "ScreenSpot Pro" results:
```
    72.7% Gemini 3 Pro
    11.4% Gemini 2.5 Pro
    49.9% Claude Opus 4.5
    3.50% GPT-5.1
```
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
https://arxiv.org/abs/2504.07981
simonw
12/05/2025
In case the article author sees this, the "HTML transcription" link is broken - it goes to https://aistudio-preprod.corp.google.com/prompts/1GUEWbLIlpX... which is a Google-employee-only URL.
mhl47
12/06/2025
We are currently working on some christmas puzzle, that are - I would say - a bit more difficult from the visual side. GPT5.1 completely failed at all of them while Gemini 3 solved two till know that I would consider rather impressive.
One was two screenshots of a phone screen with chats that are timestamped and it had to take the nth letter of the mth word based on the timestamp. While the type of riddle could be in the training data the ability to OCR this that well and understand the spatial relation to each object perfectly is something I have not seen from other models yet.
TheAceOfHearts
12/05/2025
Since I think it's interesting to highlight the jagged intelligence, I have a simple word search puzzle [0] that Nano Banana Pro stills struggles to solve correctly. Gemini 3 Pro with Code Execution is able to one-shot the problem and find the positions of each word (this is super impressive! one year ago it wasn't possible), but Nano Banana Pro fails to highlight the words correctly.
Here's the output from two tests I ran:
1. Asking Nano Banana Pro to solve the word search puzzle directly [1].
2. Asking Nano Banana Pro to highlight each word on the grid, with the position of every word included as part of the prompt [2].
The fact that it gets 2 words correct demonstrates meaningful progress, and it seems like we're really close to having a model that can one-shot this problem soon.
There's actually a bit of nuance required to solve this puzzle correctly which an older Gemini model struggled to do without additional nudging. You have to convert the grid or word list to use matching casing (the grid uses uppercase, the word list uses lowercase), and you need to recognize that "soup mix" needs to have the space removed when doing the search.
[0] https://imgur.com/ekwfHrN
[1] https://imgur.com/1nybezU
[2] https://imgur.com/18mK5i5
hodder
12/05/2025
"Gemini 3 Pro represents a generational leap from simple recognition to true visual and spatial reasoning."
Prompt: "wine glass full to the brim"
Image generated: 2/3 full wine glass.
True visual and spatial reasoning denied.
aziis98
12/05/2025
> Pointing capability: Gemini 3 has the ability to point at specific locations in images by outputting pixel-precise coordinates. Sequences of 2D points can be strung together to perform complex tasks, such as estimating human poses or reflecting trajectories over time
Does somebody know how to correctly prompt the model for these tasks or even better provide some docs? The pictures with the pretty markers are appreciated but that section is a bit vague and without references
ed
12/05/2025
What’s new here? I believe this is just gemini 3 which was released last month (the model id hasn’t changed AFAICT)
siva7
12/05/2025
Interesting. When i asked Gemini 3 Pro to generate a Infographic from my personal accounting sheet, it first failed to generate anything except a black background, then it generated something where it mixed different languages in a non-sensical way, with obvious typos and irrelevant information grouping. It's certainly a leap forward in OCR, rendering classic OCR useless.
devinprater
12/05/2025
Audio described Youtube please? That'd be so amazing! Even if I couldn't play Zelda yet, I could listen to a playthrough with Gemini describing it.
MostlyStable
12/05/2025
Going to compare this to our current solution of Amazon's Textract service for analyzing handwritten datasheets. Textract, when extracting tables (which is what we use it for) does not allow for providing any context or information about the tables and what we expect them to contain, but it is really good at correctly recognizing hand written characters. All of my attempts at less specialized, more general models allow me to provide that context, which is helpful in some ways, but fail at the basic part of almost always correctly getting the character.
Hopefully Google pro marries the two together.
hackeruser741
12/06/2025
It's fascinating how these models struggle with simple counting or novel configurations like a 5-legged dog or a 13-hour clock, despite excelling at complex language tasks. It highlights the difference between learning patterns from vast datasets and true conceptual understanding.
axpy906
12/05/2025
So Gemini was the most non-deterministic model of them all and now we get this one with temperature at 1 and max thinking. It’s so random that it’s hard to justify putting in my setup right now.
iamjackg
12/05/2025
Curious how this will fare when playing Pokemon Red.
caseyf
12/05/2025
I'm playing with this and wondering if this is an actually good way to identify dominant colors and other features of a garment/product when using a photo where the item is styled and not isolated from the model or other garments
k8sToGo
12/05/2025
When will we get Gemini 3 Flash?
a-dub
12/05/2025
i like to put it in live mode and point it at my plants and have conversations about how they're doing. it properly identifies them and flags any signs of disease and then provides correct next steps.
jonplackett
12/05/2025
Google really are a fully woken sleeping giant. More code reds being issued today I expect.
causal
12/05/2025
Okay maybe this one isn't an exaggeration when they say leap forward
pseudosavant
12/05/2025
I'm really fascinate by the opportunities to analyze videos. The amount of tokens it compresses down to, and what you can reason across those tokens, is incredible.
drivebyhooting
12/05/2025
Screen understanding is huge for further automating dev work.
ch2026
12/05/2025
what framework is being utilized for computer use here?
12/05/2025
stego-tech
12/05/2025
The document is paints a super impressive picture, but the core constraint of “network connection to Google required so we can harvest your data” is still a big showstopper for me (and all cloud-based AI tooling, really).
I’d be curious to see how well something like this can be distilled down for isolated acceleration on SBCs or consumer kit, because that’s where the billions to be made reside (factories, remote sites, dangerous or sensitive facilities, etc).
12/05/2025
bovermyer
12/05/2025
I would be interested in seeing what G3P makes of the Dead Sea Scrolls or similarly old documents.
themafia
12/06/2025
"the frontier"
I've never hated industry infatuation with a buzzword more.
romanovcode
12/06/2025
I gotta say - processing video at 10fps is very impressive.
genrader
12/06/2025
This is an excellent short way to understand that what you give Gemini 3 Pro is substantial better in understanding the data.
Making sure you ask correctly how it should give you the info is still lacking in many people's ability
kkukshtel
12/06/2025
sounds awesome but too bad it is impossible to figure out how to actually use these models and what I have to pay for/where
Frannky
12/06/2025
It's a good model. I worry that they will be able to win the game by offering the best service for free, thanks to selling users' data—kind of like search, email, etc. It's sad. Not that the alternatives are better... You either trust synchopathic ChatGPT backed by Scama, go with woke Claude (they once banned my account for asking how some news was trying to influence me), Grok that feels like a 20-year-old sure about stuff that don't work, and Chinese models that are agenda-aligned...
empressplay
12/05/2025
Yes, but can it play PacMan yet?
dmarzio
12/05/2025
So we’re going to use this to make the maid from the Jetsons finally. Right?
ichik
12/05/2025
Frankly, it's insane how laughably bad under scrutiny their own examples are. It both distorted the data and made the chart less readable (labels placement, segments separation, missing labels, worse contrast). And it combined them into one, so you you'll have harder time comparing them compared to the original image! Isn't it amazing that it added a toggle? Post author seems to think it deserves an exclamation point even.
OBELISK_ASI
12/06/2025
[dead]
sora2video
12/06/2025
[dead]
agentifysh
12/05/2025
im realizing how much of a bottleneck vision models are
im just a glorified speedreadin' promptin' QA at this point with codex
once it replaces the QA layer its truly over for software dev jobs
future would be a software genie where on aistudio you type: "go make counterstrike 1.6 clone, here is $500, you have two hours"
edit: saw the Screenspot benchmark and holy ** this is an insane jump!!! 11% to 71% even beating Opus 4.5's 50%...chatgpt is at 3.5% and it matches my experience with codex

Gemini 3 Pro: the frontier of vision AI

Workaccount2

Rover222

theoa

vunderba

Rover222

ithkuil

macNchz

RestartKernel

mkl

petters

raw_anon_1111

vunderba

tensegrist

Lamprey

vunderba

dwringer

Rover222

AIorNot

majormajor

mlinhares

didgeoridoo

Lamprey

HardCodedBias

squigz

Rover222

tarsinge

irthomasthomas

adastra22

runarberg

adastra22

runarberg

adastra22

IsTom

adastra22

runarberg

adastra22

runarberg

adastra22

CamperBob2

Zambyte

irthomasthomas

CamperBob2

Rover222

visioninmyblood

qnleigh

cyanmagenta

FeepingCreature

qnleigh

nearbuy

SecretDreams

dostick

varispeed

criddell

DANmode

isodev

Benjammer

majormajor

biophysboy

Benjammer

menaerus

michaelmrose

menaerus

biophysboy

genrader

runarberg

Benjammer

runarberg

Workaccount2

runarberg

Benjammer

runarberg

Kiro

runarberg

Libidinalecon

Benjammer

runarberg

white_dragon88

CamperBob2

runarberg