
A new Google model is nearly perfect on automated handwriting recognition

384 points - last Tuesday at 1:52 PM

Source
  • lelanthran

    today at 10:38 AM

    > In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.

    I read the blog author's whole reasoning after that, but I still gotta know: how can we tell that this was not a hallucination and/or error? A guess among the three plausible readings (1 lb 45 oz, 14 lb 5 oz, or 145 lb) has a 1/3 chance of being correct, so why is the author so sure that this was deliberate?

    I feel a good way to test this would be to create an almost identical ledger entry, constructed so that the correct answer after reasoning (the way the author thinks the model reasoned) comes out with completely different digits.

    This way there'd be more confidence that the model itself reasoned and did not make an error.
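
    For what it's worth, the arithmetic behind the author's reading can be checked directly: in pre-decimal money (12 pence to the shilling, 20 shillings to the pound), 14 lb 5 oz at "1/4" (the usual reading: 1 shilling 4 pence, i.e. 16 pence, per lb) comes to 229 pence, exactly "0 19 1", while 145 lb would come to "9 13 4". A quick sanity check in Python; the 16d-per-lb reading follows the blog's interpretation and is an assumption here:

      # Pre-decimal British money: 12 pence (d) = 1 shilling (s); 20 s = 1 pound.
      PENCE_PER_LB = 12 + 4  # "@ 1/4" read as 1 shilling 4 pence per lb

      def price_lsd(weight_lb: float) -> str:
          """Total for `weight_lb` lb of sugar, formatted as 'pounds shillings pence'."""
          d = round(weight_lb * PENCE_PER_LB)
          return f"{d // 240} {(d % 240) // 12} {d % 12}"

      print(price_lsd(14 + 5 / 16))  # 14 lb 5 oz -> "0 19 1", matches the ledger total
      print(price_lsd(145))          # 145 lb     -> "9 13 4", does not match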

      • yomismoaqui

        today at 12:11 PM

        I implemented a receipt-scanner-to-Google-Sheets pipeline using Gemini Flash.

        The fact that it is “intelligent” is fine for some things.

        For example, I created a structured output schema that had a "currency" field in the three-letter format (USD, EUR, ...). I then scanned a receipt from some shop in Jakarta and it filled that field with IDR (Indonesian rupiah). It inferred that value from the city name on the receipt.

        Would it have been better for my use case if it had returned no data for the currency field? I don't think so.

        Note: if needed, I could probably have changed the prompt to not infer the currency when it isn't explicitly listed on the receipt.
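
        For the curious, here is roughly what that looks like; a minimal sketch assuming the google-genai Python SDK, with illustrative field names, prompt, and model choice rather than the actual pipeline code:

          from google import genai
          from pydantic import BaseModel

          class Receipt(BaseModel):
              merchant: str
              total: float
              currency: str  # three-letter ISO 4217 code, e.g. "USD", "IDR"

          client = genai.Client()  # reads GEMINI_API_KEY from the environment
          receipt = client.files.upload(file="receipt.jpg")  # hypothetical sample image
          response = client.models.generate_content(
              model="gemini-2.0-flash",
              contents=[receipt, "Extract the receipt fields."],
              config={
                  "response_mime_type": "application/json",
                  "response_schema": Receipt,
              },
          )
          print(response.parsed)  # a Receipt instance, ready to append as a Sheets row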

          • Someone

            today at 2:07 PM

            > Would it be better for my use case that it would have returned no data for the currency field? Don't think so.

            If there’s a decent chance it infers the wrong currency, potentially one where each unit is worth a few orders of magnitude more or less than the IDR, it might be better not to infer it.

        • YeGoblynQueenne

          today at 12:35 PM

          Don't you see? The hair stood up on the back of the author's neck! We're now at the point where writing about LLMs is like a bunch of teenagers sitting 'round a campfire taking turns to tell each other spooky stories with a flashlight held under their chin.

          ... and then do you know what the LLM did? It r e a s o n e d s p o n t a n e o u s l y!!!!

          Spoooookyyyy!!

            • nopinsight

              today at 1:18 PM

              The comment above seems to violate several HN guidelines. Curious, I asked GPT and Gemini which ones stood out. Both replied with the same top three:

              https://news.ycombinator.com/newsguidelines.html

              They are:

              1. “Be kind. Don't be snarky. … Edit out swipes.”

              2. “Please don't sneer, including at the rest of the community.”

              3. “Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.”

                • qchris

                  today at 2:17 PM

                  I'd be interested in seeing these guidelines updated to include "don't re-post the output of an LLM" to reduce comments of this sort.

                  I don't really feel like comments with LLM output as their primary substance meet the bar of "thoughtful and substantive", and (ironically, in this instance) this one could actually be used as a good example of shallow dismissal, since you, a human, didn't actually provide an opinion or take a stance either way that I could use to begin a good-faith engagement on the topic.

      • throwup238

        yesterday at 10:16 PM

        I really hope they have because I’ve also been experimenting with LLMs to automate searching through old archival handwritten documents. I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge. It doesn’t help that they were often written in the field by semi-literate people who misused lots of words. Even the simplest accounts require quite a lot of detective work to decipher with subtle signals like that pound sign for the sugar loaf.

        > Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

        This I’m a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?

          • viftodi

            today at 2:18 AM

            You are right to be skeptical.

            There are plenty of so-called Windows (or other) web 'OS' clones.

            A couple of these were actually posted on HN this very year.

            Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777

            This is not an OS in the sense of emulating a kernel in JavaScript or WASM; it is a web app that looks like the desktop of an OS.

            I have seen plenty of such projects; some mimic the Windows UI entirely, and you can find them via Google.

            So this was definitely in the training data, and it is not as impressive as the blog post or the Twitter thread make it out to be.

            The scary thing is that the replies in the Twitter thread show no critical thinking at all and are impressed beyond belief: they think it coded a whole kernel and OS, wrote an interpreter for it, ported games, etc.

            I think this is the reason some people are so impressed by AI: when you can only judge an app visually, or by how you interact with it, and you don't have the depth of knowledge to understand it, it all seems to work, and AI seems magical beyond comprehension.

            But all this is only superficial IMHO.

              • krackers

                today at 2:42 AM

                Every time a model is about to be released, a bunch of these hype accounts spin up. I don't know whether they get paid or whether they spring up organically to farm engagement. The last times there was hype like this were "strawberry" (o1) and then GPT-5, and both turned out to be meaningful improvements but nowhere near the hype.

                I don't doubt though that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks so all the labs have probably been optimizing for it.

                  • tyre

                    today at 12:37 PM

                    My guess is that there are insiders who know about the models and can’t keep their mouths shut. They like being on the inside and leaking.

                • risyachka

                  today at 10:28 AM

                  It's always amusing when "an app like Windows XP" is somehow considered hard or challenging.

                  It is literally the most basic HTML/CSS; not sure why it is even included in benchmarks.

                    • ACCount37

                      today at 10:40 AM

                      Those things are LLMs, with text and language at the core of their capabilities. UIs are, notably, not text.

                      An LLM being able to build up interfaces that look recognizably like a UI from a real OS? That sure suggests a degree of multimodal understanding.

              • snickerbockers

                yesterday at 10:25 PM

                I'm skeptical that they're actually capable of making something novel. There are thousands of hobby operating systems and video game emulators on GitHub for it to train off of, so it's not particularly surprising that it can copy somebody else's homework.

                  • jstummbillig

                    yesterday at 11:37 PM

                    I remain confused but still somewhat interested as to a definition of "novel", given how often this idea is wielded in the AI context. How is everyone so good at identifying "novel"?

                    For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

                        • terminalshort

                          today at 3:30 AM

                          If a LLM had written Linux, people would be saying that it isn't novel because it's just based on previous OS's. There is no standard here, only bias.

                            • veegee

                              today at 4:01 AM

                              [dead]

                          • snickerbockers

                            today at 12:30 AM

                            Generally, "novel" refers either to something that is new or to a certain type of literature. If the AI is generating something functionally equivalent to a program in its training set (in this case, dozens or even hundreds of such programs), then by definition it cannot be novel.

                              • brulard

                                today at 12:55 AM

                                This is quite a narrow view of how the generation works. AI can extrapolate from the training set and explore new directions. It's not just cutting pieces and gluing together.

                                  • throwaway173738

                                    today at 5:21 AM

                                    Calling it ā€œexploringā€ is anthropomorphising. The machine has weights that yield meaningful programs given specification-like language. It’s a useful phenomenon but it may be nothing like what we do.

                                      • grosswait

                                        today at 12:36 PM

                                        Or it may be remarkably similar to what we do.

                                    • beeflet

                                      today at 1:32 AM

                                      In practice, I find the ability for this new wave of AI to extrapolate very limited.

                                        • fragmede

                                          today at 1:43 AM

                                          Do you have any concrete examples you'd care to share? While this new wave of AI doesn't have unlimited powers of extrapolation, the post we're commenting on is asserting that this latest AI from Google was able to extrapolate solutions to two of AI's oldest problems, which would seem to contradict an assertion of "very limited".

                                      • kazinator

                                        today at 3:13 AM

                                        Positively not. It is pure interpolation and not extrapolation. The training set is vast and supports an even vaster set of possible traversal paths; but they are all interpolative.

                                        Same with diffusion and everything else. It is not extrapolation that you can transfer the style of Van Gogh onto a photograph; it is interpolation.

                                        Extrapolation might be something like inventing a style: how did Van Gogh do that?

                                        And, sure, the thing can invent a new style---as a mashup of existing styles. Give me a Picasso-like take on Van Gogh and apply it to this image ...

                                        Maybe the original thing there is the idea of doing that; but that came from me! The execution of it is just interpolation.

                                          • ozgrakkurt

                                            today at 4:56 AM

                                            This is how people do things as well, IMO. LLMs do the same thing on some level, but they're just not good enough for the majority of use cases.

                                            • BoorishBears

                                              today at 3:53 AM

                                              This is no knock against you at all, but in a naive attempt to spare someone else some time: remember that, based on this definition, it is impossible for an LLM to do novel things, and, more importantly, you're not going to change how this person defines a concept as integral to one's being as novelty.

                                              I personally think this definition is a bit tautological, but if you hold it, then yes, LLMs are not capable of anything novel.

                                                • Libidinalecon

                                                  today at 12:50 PM

                                                  I think you should reverse the question: why would we expect LLMs to even have the ability to do novel things?

                                                  It is like expecting a DJ remixing tracks to output original music. The confusion is that the DJ is not actually playing the instruments on the recorded music, so they can't do something new beyond the interpolation. I love DJ sets, but it wouldn't be fair to the DJ to expect them to know how to play the sitar because they open the set with a sitar sample interpolated with a kick drum.

                                                  • kazinator

                                                    today at 5:27 AM

                                                    That is not strictly true, because being able to transfer the style of Van Gogh onto an arbitrary photographic scene is novel in a sense, but it is interpolative.

                                                    Mashups are not purely derivative: the choice of what to mash up carries novelty; two (or more) representations are mashed together which hitherto have not been.

                                                    We cannot deny that something is new.

                                                      • regularfry

                                                        today at 8:07 AM

                                                        Innovation itself is frequently defined as the novel combination of pre-existing components. It's mashups all the way down.

                                            • snickerbockers

                                              today at 1:44 AM

                                              uhhh can it? I've certainly not seen any evidence of an AI generating something not based on its training set. It's certainly smart enough to shuffle code around and make superficial changes, and that's pretty impressive in its own way but not particularly useful unless your only goal is to just launder somebody else's code to get around a licensing problem (and even then it's questionable if that's a derived work or not).

                                              Honest question: if AI is actually capable of exploring new directions why does it have to train on what is effectively the sum total of all human knowledge? Shouldn't it be able to take in some basic concepts (language parsing, logic, etc) and bootstrap its way into new discoveries (not necessarily completely new but independently derived) from there? Nobody learns the way an LLM does.

                                              ChatGPT, to the extent that it is comparable to human cognition, is undoubtedly the most well-read person in all of history. When I want to learn something I look it up online or in the public library but I don't have to read the entire library to understand a concept.

                                                • BobbyTables2

                                                  today at 3:48 AM

                                                  You have to realize AI is trained the same way one would train an auto-completer.

                                                  There’s no cognition. It’s not taught language, grammar, etc. None of that!

                                                  It’s only seen a huge amount of text, which allows it to recognize answers to questions. Unfortunately, it appears to work, so people see it as the equivalent of sci-fi movie AI.

                                                  It’s really just a search engine.
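
                                                  To make the "autocomplete" framing concrete: the pretraining objective is plain next-token prediction. A toy sketch (an LSTM stand-in at toy scale, purely illustrative of the objective, nothing like a production transformer):

                                                    import torch
                                                    import torch.nn as nn

                                                    # Toy "autocomplete" objective: predict token t+1 from tokens up to t.
                                                    vocab_size = 50_000
                                                    embed = nn.Embedding(vocab_size, 64)
                                                    lstm = nn.LSTM(64, 64, batch_first=True)  # stand-in for a transformer decoder
                                                    head = nn.Linear(64, vocab_size)

                                                    tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token-id sequences
                                                    hidden, _ = lstm(embed(tokens[:, :-1]))          # encode context up to each position t
                                                    logits = head(hidden)                            # predicted distribution over token t+1
                                                    loss = nn.functional.cross_entropy(
                                                        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
                                                    )
                                                    loss.backward()  # next-token prediction is the entire pretraining signal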

                                                    • snickerbockers

                                                      today at 4:51 AM

                                                      I agree and that's the case I'm trying to make. The machine-learning community expects us to believe that it is somehow comparable to human cognition, yet the way it learns is inherently inhuman. If an LLM was in any way similar to a human I would expect that, like a human, it might require a little bit of guidance as it learns but ultimately it would be capable of understanding concepts well enough that it doesn't need to have memorized every book in the library just to perform simple tasks.

                                                      In fact, I would expect it to be able to reproduce past human discoveries it hasn't even been exposed to, and if the AI is actually capable of this then it should be possible for them to set up a controlled experiment wherein it is given a limited "education" and must discover something already known to the researchers but not the machine. That nobody has done this tells me that either they have low confidence in the AI despite their bravado, or that they already have tried it and the machine failed.

                                                        • throwaway173738

                                                          today at 5:25 AM

                                                          There’s a third possible reason which is that they’re taking it as a given that the machine is ā€œintelligentā€ as a sales tactic, and they’re not academic enough to want to test anything they believe.

                                                          • ezst

                                                            today at 5:20 AM

                                                            > The machine-learning community

                                                             Is it? I only see a few individuals, VCs, and tech giants overblowing LLMs' capabilities (and I'm still puzzled as to how the latter dragged themselves into a race to the bottom over it). I don't believe the academic field really is that impressed with LLMs.

                                                        • ninetyninenine

                                                          today at 5:13 AM

                                                           No, it's not. I work on AI, and what these things do is much, much more than a search engine or an autocomplete. If an autocomplete passed the Turing test, you'd dismiss it because it's still an autocomplete.

                                                           The characterization you are regurgitating here comes from laymen who do not understand AI. You are not just mildly wrong but wildly uninformed.

                                                            • versteegen

                                                              today at 1:23 PM

                                                              Well, I also work on AI, and I completely agree with you. But I've reached the point of thinking it's hopeless to argue with people about this: It seems that as LLMs become ever better people aren't going to change their opinions, as I had expected. If you don't have good awareness of how human cognition actually works, then it's not evidently contradictory to think that even a superintelligent LLM trained on all human knowledge is just pattern matching and that humans are not. Creativity, understanding, originality, intent, etc, can all be placed into a largely self-consistent framework of human specialness.

                                                              • MangoToupe

                                                                today at 1:08 PM

                                                                To be fair, it's not clear human intelligence is much more than search or autocomplete. The only thing that's clear here is that LLMs can't reproduce it.

                                                                  • ninetyninenine

                                                                    today at 1:13 PM

                                                                     Yes, but colloquially this characterization used by laymen is deliberately deployed to deride and dismiss AI. It is not honest about the on-the-ground progress AI has made, and it’s not intellectually honest about the capabilities and weaknesses of AI.

                                                                      • MangoToupe

                                                                        today at 1:35 PM

                                                                        I disagree. The actual capabilities of LLMs remain unclear, and there's a great deal of reasons to be suspicious of anyone whose paycheck relies on pimping them.

                                                                          • ninetyninenine

                                                                            today at 1:41 PM

                                                                            The capabilities of LLMs are unclear but it is clear that they are not just search engines or autocompletes or stochastic parrots.

                                                                            You can disagree. But this is not an opinion. You are factually wrong if you disagree. And by that I mean you don’t know what you’re talking about and you are completely misinformed and lack knowledge.

                                                                             The long-term outcome, if I’m right, is that AI abilities continue to grow and it basically destroys my career and yours completely. I stand to gain nothing from this reality, and I state it because it is reality. LLMs improve every month. It’s already at the point where, if you’re not vibe coding, you’re behind.

                                                        • BirAdam

                                                          today at 3:24 AM

                                                          You didn’t have to read the whole library because your brain has been absorbing knowledge from multiple inputs your entire life. AI systems are trying to temporally compress a lifetime into the time of training. Then, given that these systems have effectively a single input method of streams of bits, they need immense amounts of it to be knowledgeable at all.

                                                          • ninetyninenine

                                                            today at 5:14 AM

                                                            >I've certainly not seen any evidence of an AI generating something not based on its training set.

                                                             There is plenty of evidence for this. You have to be blind not to see it. Just ask the AI to generate something not in its training set.

                                                            • fragmede

                                                              today at 4:31 AM

                                                              Isn't that what's going on with synthetic data? The LLM is trained, then is used to generate data that gets put into the training set, and then gets further trained on that generated data?

                                                      • taneq

                                                        today at 12:27 PM

                                                        OK, but by that definition, how many human software developers ever develop something "novel"? Of course, the "functionally equivalent" term is doing a lot of heavy lifting here: How equivalent? How many differences are required to qualify as different? How many similarities are required to qualify as similar? Which one overrules the other? If I write an app that's identical to Excel in every single aspect, except that instead of a Microsoft Flight Simulator easter egg there's a different, unique, fully playable game that can't be summed up with any combination of genre labels, is that 'novel'?

                                                    • baq

                                                      today at 10:09 AM

                                                      If the model can map an unseen problem to something in its latent space, solve it there, map back, and deliver an ultimately correct solution, is it novel? Genuine question; ‘novel’ doesn’t seem to have a universally accepted definition here.

                                                      • visarga

                                                        today at 6:53 AM

                                                        > For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

                                                        The secret ingredient is the world outside, and past experiences from the world, which are unique for each human. We stumble onto novelty in the environment. But AI can do that too: AlphaGo's move 37 is an example; much stumbling around leads to discoveries even for AI. The environment is the key.

                                                        • QuadmasterXLII

                                                          today at 2:12 AM

                                                          A system of humans creates bona fide novel writing. We don’t know which human is responsible for the novelty in homoerotic fanfiction of the Odyssey, but it wasn’t a lizard. LLMs don’t have this system-of-thinkers bootstrapping effect yet, or if they do it requires an absolutely enormous boost to get going

                                                          • Workaccount2

                                                            today at 12:17 AM

                                                            [flagged]

                                                              • pinnochio

                                                                today at 12:51 AM

                                                                [flagged]

                                                            • testaccount28

                                                              yesterday at 11:43 PM

                                                              why would you admit on the internet that you fail the reverse Turing test?

                                                                • mikestorrent

                                                                  today at 2:00 AM

                                                                  Didn't some fake AI country song just get on the top 100? How novel is novel? A lot of human artists aren't producing anything _novel_.

                                                                    • magicalist

                                                                      today at 3:16 AM

                                                                      > Didn't some fake AI country song just get on the top 100?

                                                                      No

                                                                      Edit: to be less snarky, it topped the Billboard Country Digital Song Sales Chart, which is a measure of sales of the individual song, not streaming listens. It's estimated it takes a few thousand sales to top that particular chart and it's widely believed to be commonly manipulated by coordinated purchases.

                                                                      • terminalshort

                                                                        today at 3:31 AM

                                                                        It was a real AI country song, not a fake one, but yes.

                                                                    • CamperBob2

                                                                      today at 12:07 AM

                                                                      You have no idea if you're talking to an LLM or a human, yourself, so ... uh, wait, neither do I.

                                                                      • fragmede

                                                                        yesterday at 11:46 PM

                                                                        Because not everyone here has a raging ego and no humility?

                                                                        • greygoo222

                                                                          today at 12:42 AM

                                                                          Because I'm an LLM and you are too

                                                                      • kazinator

                                                                        today at 3:06 AM

                                                                        Because we know that the human only read, say, fifty books since they were born, and watched a few thousand videos, and there is nothing in them which resembles what they wrote.

                                                                    • sosuke

                                                                      today at 1:16 AM

                                                                      Doing something novel is incredibly difficult through LLM work alone. Dreaming, or hallucinating, might eventually make novelty possible, but it has to be backed up by rock-solid base work. We aren't there yet.

                                                                      The working memory it holds is still extremely small compared to what we would need for regular open ended tasks.

                                                                      Yes there are outliers and I'm not being specific enough but I can't type that much right now.

                                                                      • n8cpdx

                                                                        today at 12:37 AM

                                                                        The Windows (~2000) kernel itself is on GitHub, even exquisitely documented, if AI can read .doc files.

                                                                        https://github.com/ranni0225/WRK

                                                                        • flatline

                                                                          yesterday at 10:54 PM

                                                                          I believe they can create a novel instance of a system from a sufficient number of relevant references - i.e. implement a set of already-known features without (much) code duplication. LLMs are certainly capable of this level of generalization due to their huge non-relevant reference set. Whether they can expand beyond that into something truly novel from a feature/functionality standpoint is a whole other, and less well-defined, question. I tend to agree that they are closed systems relative to their corpus. But then, aren't we? I feel like the aperture for true novelty to enter is vanishingly small, and cultures put a premium on it vis-a-vis the arts, technological innovation, etc. Almost every human endeavor is just copying and iterating on prior examples.

                                                                            • beeflet

                                                                              today at 1:46 AM

                                                                              Almost all of the work in making a new operating system or a gameboy emulator or something is in characterizing the problem space and defining the solution. How do you know what such and such instruction does? What is the ideal way to handle this memory structure here? You know, knowledge you gain from spending time tracking down a specific bug or optimizing a subroutine.

                                                                              When I create something, it's an exploratory process. I don't just guess what I am going to do based on my previous step and hope it comes out good on the first try. Let's say I decide to make a car with 5 wheels. I would go through several chassis designs, different engine configurations until I eventually had something that works well. Maybe some are too weak, some too expensive, some are too complicated. Maybe some prototypes get to the physical testing stage while others don't. Finally, I publish this design for other people to work on.

                                                                              If you ask the LLM to work on a novel concept it hasn't been trained on, it will usually spit out some nonsense that either doesn't work or works poorly, or it will refuse to provide a specific enough solution. If it has been trained on previous work, it will spit out something that looks similar to the solved problem in its training set.

                                                                              These AI systems don't undergo the process of trial and error that suggests it is creating something novel. Its process of creation is not reactive with the environment. It is just cribbing off of extant solutions it's been trained on.

                                                                                • vidarh

                                                                                  today at 1:51 AM

                                                                                  I'm literally watching Claude Code "undergo the process of trial and error" in another window right now.

                                                                              • imiric

                                                                                today at 12:19 AM

                                                                                Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

                                                                                If yes, why aren't we seeing glimpses of such genius today? If we've truly invented artificial intelligence, and on our way to super and general intelligence, why aren't we seeing breakthroughs in all fields of science? Why are state of the art applications of this technology based on pattern recognition and applied statistics?

                                                                                Can we explain this by saying that we're only a few years into it, and that it's too early to expect fundamental breakthroughs? And that by 2027, or 2030, or surely by 2040, all of these things will suddenly materialize?

                                                                                I have my doubts.

                                                                                  • famouswaffles

                                                                                    today at 12:55 AM

                                                                                    >Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

                                                                                    Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

                                                                                    >If yes, why aren't we seeing glimpses of such genius today?

                                                                                    Again, most humans can't actually do any of the things you just listed. Only our most intelligent can. LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways lag behind the average human) in most respects, so why would you expect such genius now?

                                                                                      • lelanthran

                                                                                        today at 10:41 AM

                                                                                        > Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

                                                                                        Sure, agreed, but the difference between a small percentage and zero percentage is infinite.

                                                                                        • beeflet

                                                                                          today at 1:50 AM

                                                                                          Were they the best of the best? Or were they just at the right place and time to be exposed to a novel idea?

                                                                                          I am skeptical of the claim that you need a 140 IQ to make scientific breakthroughs, because you don't need a 140 IQ to understand special relativity. It is a matter of motivation and exposure to new information. The vast majority of the population doesn't benefit from working in some niche field of physics in the first place.

                                                                                          Perhaps LLMs will never be at the right place and the right time because they are only trained on ideas that already exist.

                                                                                            • famouswaffles

                                                                                              today at 2:06 AM

                                                                                              >Were they the best of the best? or were they just at the right place and time to be exposed to a novel idea?

                                                                                              It's not an "or" but an "and". Being at the right place and time is a necessary precondition, but it's not sufficient. Newton stood on the shoulders of giants like Kepler and Galileo, and Einstein built upon the work of Maxwell and Lorentz. The key question is, why did they see the next step when so many of their brilliant contemporaries, who had the exact same information and were in similar positions, did not? That's what separates the exceptional from the rest.

                                                                                              >I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity.

                                                                                              There is a pretty massive gap between understanding a revolutionary idea and originating it. It's the difference between being the first person to summit Everest without a map, and a tourist who takes a helicopter to the top to enjoy the view. One requires genius and immense effort; the other requires following instructions. Today, we have a century of explanations, analogies, and refined mathematics that make relativity understandable. Einstein had none of that.

                                                                                                • Kim_Bruning

                                                                                                  today at 6:11 AM

                                                                                                  It's entirely plausible that sometimes one genius sees the answer all alone (I'm sure it happens sometimes), but it's also definitely a common theme that many people, or a subset of society as a whole, may start having similar ideas all around the same time. In many cases where a breakthrough is attributed to one person, if you look more closely you'll often see some sort of team effort or societal groundswell.

                                                                                          • imiric

                                                                                            today at 3:05 AM

                                                                                            > LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways, lag behind the average human) in most respects, so why would you expect such genius now ?

                                                                                            I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius. Something that points in the direction that all tech CEOs are claiming we're headed in. So far I haven't seen any of this yet.

                                                                                                And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans. They contain the sum of human knowledge, far more than any individual human in history. Are they not intelligent, capable of thinking and reasoning? Are we not on the verge of superintelligence[1]?

                                                                                            > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

                                                                                            If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

                                                                                            And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

                                                                                            [1]: https://blog.samaltman.com/the-gentle-singularity

                                                                                              • famouswaffles

                                                                                                today at 3:49 AM

                                                                                                >I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius.

                                                                                                    Like I said, what exactly would you expect to see with the capabilities that exist today? It's not a gotcha; it's a genuine question.

                                                                                                >And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans.

                                                                                                There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do. You can't slot an LLM in place of most knowledge workers and expect everything to be fine and dandy. There's no ambiguity on that.

                                                                                                >They contain the sum of human knowledge, far more than any individual human in history.

                                                                                                    It's not really the total sum of human knowledge, but let's set that aside. Yeah, so? Einstein, Newton, von Neumann: none of these guys were privy to some super-secret knowledge their contemporaries weren't, so it's obviously not simply a matter of more knowledge.

                                                                                                >Are they not intelligent, capable of thinking and reasoning?

                                                                                                    Yeah, they are. And so are humans. So were the peers of all those guys. So why are only a few able to see the next step? It's not just about knowledge, and intelligence lives in degrees; it is a gradient.

                                                                                                >If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

                                                                                                Yeah and that exists. Terence Tao has shared a lot of his (and his peers) experiences on the matter.

                                                                                                https://mathstodon.xyz/@tao/115306424727150237

                                                                                                https://mathstodon.xyz/@tao/115420236285085121

                                                                                                https://mathstodon.xyz/@tao/115416208975810074

                                                                                                >And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

                                                                                                Well it's a good thing that's not true then

                                                                                                  • imiric

                                                                                                    today at 10:20 AM

                                                                                                    > Like I said, what exactly would you be expecting to see with the capabilities that exist today ?

                                                                                                    And like I said, "signs and hints" of superhuman intelligence. I don't know what that looks like since I'm merely human, but I sure know that I haven't seen it yet.

                                                                                                    > There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do.

                                                                                                    This claim is directly opposed to claims by Sam Altman and his cohort, which I'll repeat:

                                                                                                    > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

                                                                                                    So which is it? If they're "smarter than people in many ways", where is the product of that superhuman intelligence? If they're able to "significantly amplify the output of people using them", then all of humanity should be empowered to produce incredible results that were previously only achievable by a limited number of people. In hands of the best and brightest humans, it should empower them to produce results previously unreachable by humanity.

                                                                                                        Yet all positive applications of this technology show that it excels at finding and producing data patterns, and nothing more than that. Those experience reports by Terence Tao are prime examples of this. The system was fed a lot of contextual information and, after being coaxed by highly intelligent humans, was able to find and produce patterns that were difficult for humans to see. This is hardly the showcase of intelligence that you and others think it is, and those others include highly intelligent people, some of whom have a lot to gain from pushing this narrative.

                                                                                                    We have seen similar reports by programmers as well[1]. Yet I'm continually amazed that these highly intelligent people are surprised that a pattern finding and producing system was able to successfully find and produce useful patterns, and then interpret that as a showcase of intelligence. So much so that I start to feel suspicious about the intentions and biases of those people.

                                                                                                        To be clear: I'm not saying that these systems can't be very useful in the right hands, or that they won't potentially revolutionize many industries. Ultimately, many real-world problems can be modeled as statistical problems where a pattern recognition system can excel. What I am saying is that there's a very large gap between the utility of such tools and the extraordinary claims that they have intelligence, let alone superhuman and general intelligence. So far I have seen no evidence of the latter, despite the overwhelming marketing euphoria we're going through.

                                                                                                    > Well it's a good thing that's not true then

                                                                                                    In the world outside of the "AI" tech bubble, that is very much the reality.

                                                                                                    [1]: https://news.ycombinator.com/item?id=45784179

                                                                                        • tanseydavid

                                                                                          today at 12:32 AM

                                                                                          How about "Protein Folding"?

                                                                                            • imiric

                                                                                              today at 12:38 AM

                                                                                              A great use case for pattern recognition.

                                                                              • kace91

                                                                                yesterday at 11:56 PM

                                                                                >I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge

                                                                                Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.

                                                                                  • throwup238

                                                                                    today at 12:14 AM

                                                                                    I use the Portal de Archivos Españoles [1] for Spanish colonial documents. Each country has its own archive, but the Spanish one has the most content (35 million digitized pages).

                                                                                    The hard part is knowing where to look, since most of the images haven’t gone through HTR/OCR or indexing, so you have to understand Spanish colonial administration and go through the collections to find stuff.

                                                                                    [1] https://pares.cultura.gob.es/pares/en/inicio.html

                                                                                      • throwout4110

                                                                                        today at 12:28 AM

                                                                                        Want to collab on a database and some clustering and analysis? I’m a data scientist at FAIR with an interest in antiquarian docs and books.

                                                                                          • vintermann

                                                                                            today at 9:06 AM

                                                                                            You should maybe reach out to the author of this blog post, professor Mark Humphries. Or to the genealogy communities; we regularly struggle with handwritten historical texts that no public AI model can make a dent in.

                                                                                            • throwup238

                                                                                              today at 1:52 AM

                                                                                              Sadly I'm just an amateur armchair historian (at best), so I doubt I'd be of much help. I'm mostly only doing the translation for my own edification.

                                                                                                • cco

                                                                                                  today at 7:44 AM

                                                                                                  You may be surprised (or not?) at how many important scientific and historical works are done by armchair practitioners.

                                                                                              • rmonvfer

                                                                                                today at 12:39 AM

                                                                                                Spaniard here. Let me know if I can somehow help navigate all of that. I’m very interested in history and everything related to the 1400-1500 period (although I’m not an expert by any definition) and I’d love to see what modern technology could do here, especially OCR and VLMs.

                                                                                    • Aperocky

                                                                                      today at 1:43 AM

                                                                                      > This I’m a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. Whats the kernel look like?

                                                                                      Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.

                                                                                      • jvreeland

                                                                                        yesterday at 10:32 PM

                                                                                        I'd love to find more info on this, but from what I can find it seems to be making webpages that look like those products, and can seemingly "run python" or "emulate a game". But writing something that, based on all of GitHub, can approximate an iPhone or an emulator in JavaScript/CSS/HTML is very, very different from writing an OS.

                                                                                        • otherdave

                                                                                          today at 1:24 PM

                                                                                          Where can I find these Conquistador documents? Sounds like something I might like to read and explore.

                                                                                          • WhyOhWhyQ

                                                                                            yesterday at 10:22 PM

                                                                                            "> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts."

                                                                                            Wow I'm doing it way wrong. How do I get the good stuff?

                                                                                              • zer00eyz

                                                                                                yesterday at 10:29 PM

                                                                                                You're not.

                                                                                                I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.

                                                                                                You can make something that looks like a cake but would not be good to eat.

                                                                                                The cake, sometimes, is a lie. And in this case, so are likely most of these results... or they are the actual source code of some other project just regurgitated.

                                                                                                  • hinkley

                                                                                                    yesterday at 10:51 PM

                                                                                                    We got the results back. You are a horrible person. I’m serious, that’s what it says: ā€œHorrible person.ā€

                                                                                                    We weren’t even testing for that.

                                                                                                      • joshstrange

                                                                                                        yesterday at 11:01 PM

                                                                                                        Source: Portal 2, you can see the line and listen to it here (last one in section): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                          • chihuahua

                                                                                                            today at 6:52 AM

                                                                                                            I'd really like Alexa+ to have the voice of GLaDOS.

                                                                                                            • hinkley

                                                                                                              yesterday at 11:19 PM

                                                                                                              I figured it was appropriate given the context.

                                                                                                              I’m still amazed that game started as someone’s school project. Long live the Orange Box!

                                                                                                          • erulabs

                                                                                                            yesterday at 11:00 PM

                                                                                                            Well, what does a neck-bearded old engineer know about fashion? He probably - Oh, wait. It's a she. Still, what does she know? Oh wait, it says she has a medical degree. In fashion! From France!

                                                                                                              • joshstrange

                                                                                                                yesterday at 11:02 PM

                                                                                                                If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                                  • fragmede

                                                                                                                    yesterday at 11:49 PM

                                                                                                                    Just because "Die motherfucker die motherfucker die" appeared in a song once doesn't mean it's not also death threat when someone's pointing a gun at you and saying that.

                                                                                                                      • scubbo

                                                                                                                        today at 12:17 AM

                                                                                                                        ...what?

                                                                                                                          • fragmede

                                                                                                                            today at 1:38 AM

                                                                                                                            hinkley wrote:

                                                                                                                            > We got the results back. You are a horrible person. I’m serious, that’s what it says: ā€œHorrible person.ā€

                                                                                                                            > We weren’t even testing for that.

                                                                                                                            joshstrange then wrote:

                                                                                                                            > If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                                            as if the fact that the words that hinkley wrote are from a popular video game excuses the fact that hinkley just also called zer00eyz horrible.

                                                                                                                              • hinkley

                                                                                                                                today at 3:00 AM

                                                                                                                                So if two sentences that make no sense to you sandwich one that does, you should totally accept the middle one at face value.

                                                                                                                                K.

                                                                                                • smusamashah

                                                                                                  today at 1:38 AM

                                                                                                  > Whats the kernel look like?

                                                                                                  Those clones are all HTML/CSS, same for game clones made by Gemini.

                                                                                                  • nestorD

                                                                                                    yesterday at 10:27 PM

                                                                                                    Oh! That's a nice use-case and not too far from stuff I have been playing with! (happily I do not have to deal with handwriting, just bad scans of older newspapers and texts)

                                                                                                    I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.

                                                                                                    The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

                                                                                                      • throwup238

                                                                                                        yesterday at 11:21 PM

                                                                                                        > The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

                                                                                                        What does that look like? How well does it work?

                                                                                                        I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.
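
                                                                                                        The loop itself is nothing fancy; a simplified sketch of the idea (with `llm` and `tools` as stand-in callables, not my actual code):

                                                                                                            from dataclasses import dataclass, field
                                                                                                            from typing import Callable, Optional

                                                                                                            @dataclass
                                                                                                            class Step:
                                                                                                                text: str                       # the model's reply for this turn
                                                                                                                cost_usd: float                 # what this call cost
                                                                                                                tool: Optional[str] = None      # tool the model asked to run, if any
                                                                                                                tool_args: dict = field(default_factory=dict)

                                                                                                            def research_loop(task: str,
                                                                                                                              llm: Callable[[list[str]], Step],
                                                                                                                              tools: dict[str, Callable[..., str]],
                                                                                                                              budget_usd: float = 5.0) -> str:
                                                                                                                """Keep the agent working until it finishes or the budget runs out."""
                                                                                                                context, spent = [task], 0.0
                                                                                                                while spent < budget_usd:
                                                                                                                    step = llm(context)                          # one model call
                                                                                                                    spent += step.cost_usd
                                                                                                                    if step.tool is None:                        # the model says it is done
                                                                                                                        return step.text
                                                                                                                    result = tools[step.tool](**step.tool_args)  # e.g. search, fetch a document
                                                                                                                    context.append(result[:2000])                # crude truncation for now
                                                                                                                return "Budget exhausted.\n" + "\n".join(context[1:])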

                                                                                                          • nestorD

                                                                                                            today at 2:43 AM

                                                                                                            I started with a UI that sounded like it was built along the same lines as yours, which had the advantage of letting me enforce a pipeline and exhaustive search (I don't want the 10 most promising documents, I want all of them).

                                                                                                            But I realized I was not using it much because it was so big and inflexible (plus I keep wanting to stamp out all the bugs, which I do not have the time to do on a hobby project). So I ended up extracting it into MCPs (equipped to do full-text search and download OCR from the various databases I care about) and AGENTS.md files (defining pipelines, as well as patterns for both searching behavior and reporting of results). I also put together a sub-agent for translation (cutting away all tools besides reading and writing files, and giving it some document-specific contextual information).

                                                                                                            That lets me use Claude Code and Codex CLI (which, anecdotally, I have found to be the better of the two for that kind of work; it seems to deal better with longer inputs produced by searches) as the driver, telling them what I am researching and maybe how I would structure the search, then letting them run in the background before checking their report and steering the search based on that.

                                                                                                            It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context), but I now find myself reaching for it regularly, and I polish out problems one at a time. The next goal is to add more data sources and to maybe unify things further.
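
                                                                                                            For a sense of what those MCPs amount to, here is a stripped-down sketch of the search one (assuming the official Python mcp SDK; the tool names and the actual database calls are placeholders):

                                                                                                                from mcp.server.fastmcp import FastMCP

                                                                                                                mcp = FastMCP("archive-search")

                                                                                                                @mcp.tool()
                                                                                                                def full_text_search(query: str, max_results: int = 100) -> list[dict]:
                                                                                                                    """Search the archive's full-text index; return every hit, not a top-k."""
                                                                                                                    ...  # call whichever database API backs this server

                                                                                                                @mcp.tool()
                                                                                                                def download_ocr(document_id: str) -> str:
                                                                                                                    """Fetch the raw OCR text for a single document."""
                                                                                                                    ...  # download and return the OCR text

                                                                                                                if __name__ == "__main__":
                                                                                                                    mcp.run()  # stdio transport, so Claude Code / Codex CLI can spawn it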

                                                                                                              • throwup238

                                                                                                                today at 4:09 AM

                                                                                                                > It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context)

                                                                                                                This has been the biggest problem for me too. I jokingly call it the LLM halting problem because it never knows the proper time to stop working on something, finishing way too fast without going through each item in the list. That's why I've been doing my own custom orchestration, drip-feeding it results with a mix of summarization and content extraction to keep the context from different documents chained together.

                                                                                                                Especially working with unindexed content like colonial documents where I’m searching through thousands of pages spread (as JPEGs) over hundreds of documents for a single one that’s relevant to my research, but there are latent mentions of a name that ties them all together (like a minor member of an expedition giving relevant testimony in an unrelated case). It turns into a messy web of named entity recognition and a bunch of more classical NLU tasks, except done with an LLM because I’m lazy.
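
                                                                                                                The drip-feed part is conceptually the simplest piece; a toy sketch (with `llm` as a stand-in callable):

                                                                                                                    def drip_feed(documents: list[str], question: str, llm, batch_size: int = 3) -> str:
                                                                                                                        """Summarize a few documents at a time, carrying a compact running
                                                                                                                        digest forward so names and facts from earlier documents stay chained."""
                                                                                                                        digest = ""
                                                                                                                        for i in range(0, len(documents), batch_size):
                                                                                                                            batch = "\n---\n".join(documents[i:i + batch_size])
                                                                                                                            digest = llm(
                                                                                                                                f"Running digest so far:\n{digest}\n\n"
                                                                                                                                f"New documents:\n{batch}\n\n"
                                                                                                                                f"Update the digest, keeping every name, date, and fact relevant to: {question}"
                                                                                                                            )
                                                                                                                        return digest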

                                                                                                    • jchw

                                                                                                      today at 7:07 AM

                                                                                                      I'm surprised people didn't click through to the tweet.

                                                                                                      https://x.com/chetaslua/status/1977936585522847768

                                                                                                      > I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it

                                                                                                      And of course

                                                                                                      > 3D design software, Nintendo emulators

                                                                                                      No clue what these refer to, but to be honest it sounds like they've mostly made incremental improvements to one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Gameboy or NES emulator working well enough to boot Tetris or Mario: while it is a decent chunk of code to get things going, there's an absolute boatload of code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Gameboy emulators from scratch myself.)
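
                                                                                                      To make "lower than you might imagine" concrete: the heart of any such emulator is a fetch-decode-execute loop over an opcode table. A toy sketch (illustrative only, nowhere near a real NES):

                                                                                                          class CPU6502:
                                                                                                              """Toy 6502 core: fetch an opcode, decode it, execute it, repeat."""
                                                                                                              def __init__(self, memory: bytearray):
                                                                                                                  self.mem, self.pc, self.a = memory, 0x8000, 0  # PRG ROM commonly maps at $8000

                                                                                                              def step(self) -> None:
                                                                                                                  op = self.mem[self.pc]; self.pc += 1
                                                                                                                  if op == 0xA9:                   # LDA #imm: load accumulator
                                                                                                                      self.a = self.mem[self.pc]; self.pc += 1
                                                                                                                  elif op == 0x8D:                 # STA abs: store accumulator
                                                                                                                      addr = self.mem[self.pc] | (self.mem[self.pc + 1] << 8)  # little-endian
                                                                                                                      self.pc += 2
                                                                                                                      self.mem[addr] = self.a
                                                                                                                  elif op == 0xEA:                 # NOP
                                                                                                                      pass
                                                                                                                  else:
                                                                                                                      raise NotImplementedError(f"opcode {op:#04x}")

                                                                                                      The real work is filling in the other ~150 opcodes plus the PPU and timing, which is exactly the kind of well-documented grind there's a boatload of prior art for.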

                                                                                                      Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel, and if we can find a way to tame AI models to make them trustworthy enough for some tasks, it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening, though I think it still mostly isn't happening for code, at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.

                                                                                                      But I will say that people are a little crazy sometimes. Yes, it is very fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a web app that is mostly correct, apparently without any feedback, like being able to actually run the application or even see editor errors, at least as far as we know. This is genuinely impressive and interesting, and not the aspect that I think anyone seeks to downplay. However, consider this: as relatively simple as an NES is compared to even moderately newer machines, to make an NES emulator you have to know how an NES works and even have strategies for how to emulate it, which don't necessarily follow from just reading specifications or even NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation on the Internet about the NES hardware and inner workings, as well as the 6502, means that LLMs have a lot of training data to help them out.

                                                                                                      I think these tasks that are extremely well covered in the training data give people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely do it. Not sure what to pick, but let's say SEGA's VMU units for the Dreamcast: a very small, simple device, and I reckon there should be information about it online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. But since a very large portion of work in the world is not very novel, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.

                                                                                                      I believe many of the reasons people scoff at AI are fairly valid even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that users submitted, which made them rich in the first place, just to crap on those same users in the process.

                                                                                                      But I think that LLMs are useful, and useful LLMs could definitely be created ethically; it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well-covered the machine is in their training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple common file format so I could debug something... It probably only saved me about 15-30 minutes, but still, in that case I truly believe it did save me time, as I didn't spend any time tweaking the result: it did compile, and it did work correctly.

                                                                                                      It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from a PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge rather than very good intelligence. If we didn't have much written online about the NES or emulators at all, then it would be truly world-bending to have an AI model figure out everything it needs to know to write one on-the-fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show off that they are capable of novel thought because there are so many other humans that had to do novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.

                                                                                                        • stOneskull

                                                                                                          today at 10:34 AM

                                                                                                          I think Google is going to repeat history with Gemini... as in, ChatGPT, Grok, etc. will be like AltaVista, Lycos, etc.

                                                                                                      • Footprint0521

                                                                                                        today at 12:12 AM

                                                                                                        Bro, split that up: use LLMs for transcription first, then take that and translate it.
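
                                                                                                        Something like this, roughly (a sketch using the google-genai Python client; the model choice and prompts are just placeholders):

                                                                                                            # Two passes: transcribe the page image first, then translate the transcription.
                                                                                                            from google import genai
                                                                                                            from google.genai import types

                                                                                                            client = genai.Client()  # expects GEMINI_API_KEY in the environment
                                                                                                            page = types.Part.from_bytes(data=open("page.jpg", "rb").read(),
                                                                                                                                         mime_type="image/jpeg")

                                                                                                            transcript = client.models.generate_content(
                                                                                                                model="gemini-2.5-pro",
                                                                                                                contents=[page, "Transcribe this page exactly as written. Do not translate."],
                                                                                                            ).text

                                                                                                            translation = client.models.generate_content(
                                                                                                                model="gemini-2.5-pro",
                                                                                                                contents=[f"Translate this 16th-century Spanish transcription into English:\n\n{transcript}"],
                                                                                                            ).text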

                                                                                                        • ninetyninenine

                                                                                                          today at 7:30 AM

                                                                                                          I'm skeptical because my entire identity is basically built around being a software engineer and thinking my IQ and intelligence is higher than other people. If this AI stuff is real then it basically destroys my entire identity so I choose the most convenient conclusion.

                                                                                                          Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence and I feel the need to inform them of things that are obvious: AI is not a human, it does not have emotions. It's just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.

                                                                                                            • hju22_-3

                                                                                                              today at 8:17 AM

                                                                                                              > [...] my entire identity is basically built around [...] thinking my IQ and intelligence is higher than other people.

                                                                                                              Well, there's your first problem.

                                                                                                                • vintermann

                                                                                                                  today at 9:08 AM

                                                                                                                  I don't know, that's commendable self-insight, it's true of lots and lots of people but there are few who would admit it!

                                                                                                                    • ninetyninenine

                                                                                                                      today at 1:47 PM

                                                                                                                      I am unique. Totally. It is not like HN is flooded with cognition or psychology or IQ articles every other hour. Not at all. And whenever one shows up, you do not immediately get a parade of people diagnosing themselves with whatever the headline says. Never happens. You post something about slow thinking and suddenly half the thread whispers ā€œthat is literally me.ā€ You post something about fast thinking and the other half says ā€œfinally someone understands my brain.ā€ You post something about overthinking and everyone shows up with ā€œwow I feel so seen.ā€ You post something about attention and now the entire site has ADHD.

                                                                                                                      But yes. I am the unique one.

                                                                                                              • twoodfin

                                                                                                                today at 12:16 PM

                                                                                                                This kind of comment certainly shows that no organic stochastic parrots post to hn threads!

                                                                                                          • elphinstone

                                                                                                            today at 3:37 AM

                                                                                                            I read the whole article, but have never tried the model. Looking at the input document, I believe the model saw enough of a space between the 14 and 5 to simply treat it that way. I saw the space too. Impressive, but it's a leap to say it saw 145 then used higher order reasoning to correct 145 to 14 and 5.

                                                                                                              • Coeur

                                                                                                                today at 7:08 AM

                                                                                                                I also read the whole article, and this behaviour that the author is most excited about only happened once. For a process that inherently has some randomness about it, I feel it's too early to be this excited.

                                                                                                                  • afro88

                                                                                                                    today at 8:23 AM

                                                                                                                    Yep. A lot of things looked magical in the GPT-4 days. Eventually you realised it did them by chance and, more often than not, got them wrong.

                                                                                                            • roywiggins

                                                                                                              today at 3:00 AM

                                                                                                              My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more often than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, but wrong. Maybe Gemini 3 will pull it off.

                                                                                                              Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. That is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is (something a five year old could probably learn to do) but can vibe code something using traditional computer vision techniques to do it.

                                                                                                              I suppose it's not too surprising, a visually impaired programmer might find it impossible to do reliably themselves but would code up a solution, but still: it's weird!
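
                                                                                                              For the curious, one classical trick is shaped roughly like this (a sketch of the general idea, not the code Claude actually wrote): an axial slice is roughly left-right symmetric, so in the standard orientation the image should match its left-right mirror much better than its top-bottom mirror.

                                                                                                                  import cv2
                                                                                                                  import numpy as np

                                                                                                                  def midline_is_vertical(path: str) -> bool:
                                                                                                                      """True if the slice matches its left-right mirror better than its
                                                                                                                      top-bottom one, i.e. the interhemispheric midline runs vertically.
                                                                                                                      (A 180-degree flip also passes this test, so it needs a separate check.)"""
                                                                                                                      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
                                                                                                                      img = cv2.GaussianBlur(img, (5, 5), 0)  # tame noise before scoring
                                                                                                                      lr = np.corrcoef(img.ravel(), np.fliplr(img).ravel())[0, 1]
                                                                                                                      tb = np.corrcoef(img.ravel(), np.flipud(img).ravel())[0, 1]
                                                                                                                      return lr > tb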

                                                                                                                • IanCal

                                                                                                                  today at 2:03 PM

                                                                                                                  Most models don’t have good spatial information from the images. Gemini models do preprocessing and so are typically better for that. It depends a lot on how things get segmented though.

                                                                                                                  • chrischen

                                                                                                                    today at 3:46 AM

                                                                                                                    But these models are more like generalists no? Couldn’t they simply be hooked up to more specialized models and just defer to them the way coding agents now use tools to assist?

                                                                                                                    • moritonal

                                                                                                                      today at 11:23 AM

                                                                                                                      That's a fairly unfair comparison. Did you include in the prompt a basic set of instructions about which way is "correct" and what to look for?

                                                                                                                        • roywiggins

                                                                                                                          today at 2:17 PM

                                                                                                                          I didn't, but they all seemed to know what to look for, they wrote explanations of what they were looking for, which were generally correct enough. They still got the answer wrong.

                                                                                                                      • hopelite

                                                                                                                        today at 3:40 AM

                                                                                                                        What is the ā€œnormalā€ way? Is that defined in a technical specification? Did you provide the definition/description of what you mean by ā€œnormalā€?

                                                                                                                        I would not have expected a language model to perform well on what sounds like a computer vision problem. Even if it was agentic: just as you imply a five year old would need to learn how to do it, so too an AI system would need to be trained, or at the very least be provided with a description of what it is looking at.

                                                                                                                        Imagine you took an MRI brain scan back in time and showed it to a medical Doctor in even the 1950s or maybe 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?

                                                                                                                        I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.

                                                                                                                        I would suggest maybe you think of it as a kind of child, and with that, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.

                                                                                                                          • roywiggins

                                                                                                                            today at 2:17 PM

                                                                                                                            The thing is that the models absolutely do know what the standard orientation is for a scan. They respond extensively about what they're looking for and what the correct orientation would be, more or less accurately. They are aware.

                                                                                                                            They then give the wrong answer, hallucinating anatomical details in the wrong place, etc.

                                                                                                                    • efitz

                                                                                                                      yesterday at 10:50 PM

                                                                                                                      I haven’t seen this new Google model, but now I must try it out.

                                                                                                                      I will say that other frontier models are starting to surprise me with their reasoning/understanding- I really have a hard time making (or believing) the argument that they are just predicting the next word.

                                                                                                                      I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.

                                                                                                                      Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building, and create a wiki, focused on audience and task.

                                                                                                                      I'm hand reviewing the 50 wiki pages it created, but overall it did a great job.

                                                                                                                      I got frustrated about one issue: I have a GitHub issue to create a way to integrate with issue trackers (like Jira), but it's TODO, and the AI claimed on the wiki's home page that we had issue tracker integration. It created a page for it and everything; I figured it was hallucinating.

                                                                                                                      I went to edit the page and replace it with placeholder text and was shocked that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers, and wrote sample code for GitHub, Jira and Slack (notifications). That truly surprised me.

                                                                                                                        • schiffern

                                                                                                                          today at 2:36 AM

                                                                                                                          > I really have a hard time making (or believing) the argument that they are just predicting the next word.
                                                                                                                          It's true, but by the same token our brain is "just" thresholding spike rates.

                                                                                                                          • astrange

                                                                                                                            yesterday at 11:51 PM

                                                                                                                            Predicting the next word is the interface, not the implementation.

                                                                                                                            (It's a pretty constraining interface though - the model outputs an entire distribution and then we instantly lose it by only choosing one token from it.)

                                                                                                                            • energy123

                                                                                                                              yesterday at 11:19 PM

                                                                                                                              Predicting the next word requires understanding; they're not separate things. If you don't know what comes after the next word, then you don't know what the next word should be. So the task implicitly forces a longer-horizon understanding of the future sequence.

                                                                                                                                • IAmGraydon

                                                                                                                                  yesterday at 11:37 PM

                                                                                                                                  This is utterly wrong. Predicting the next word requires a large sample of data made into a statistical model. It has nothing to do with "understanding", which implies it knows why rather than what.

                                                                                                                                    • orionsbelt

                                                                                                                                      yesterday at 11:49 PM

                                                                                                                                      Ilya Sutskever was on a podcast, saying to imagine a mystery novel where at the end it says "and the killer is: (name)". If it's just a statistical model generating the next most likely word, how can it do that in this case without some understanding of all the clues, etc.? A specific name is not statistically likely to appear.

                                                                                                                                        • nicpottier

                                                                                                                                          today at 3:45 AM

                                                                                                                                          I was once chatting with an author of books (very much an amateur) and he said he enjoyed writing because he liked discovering where the story goes. I.e., he starts out building characters and creating scenarios for them, and at some point the story kind of takes over; there is only one way a character can act based on what was previously written, but it wasn't preordained. That's why he liked it: it was a discovery to him.

                                                                                                                                          I'm not saying this is the right way to write a book but it is a way some people write at least! And one LLMs seem capable of doing. (though isn't a book outline pretty much the same as a coding plan and well within their wheelhouse?)

                                                                                                                                          • shwaj

                                                                                                                                            yesterday at 11:55 PM

                                                                                                                                            Can current LLMs actually do that, though? What Ilya posed was a thought experiment: if it could do that, then we would say that it has understanding. But AFAIK that is beyond current capabilities.

                                                                                                                                              • krackers

                                                                                                                                                today at 2:37 AM

                                                                                                                                                Someone should try it and create a new "mysterybench". Find all mystery novels written after LLM training cutoff, and see how many models unravel the mystery

                                                                                                                                            • squigz

                                                                                                                                              today at 12:31 AM

                                                                                                                                              This implies understanding of preceding tokens, no? GP was saying they have understanding of future tokens.

                                                                                                                                              • IAmGraydon

                                                                                                                                                today at 12:28 AM

                                                                                                                                                It can't do that without the answer to who did it being in the training data. I think the reason people keep falling for this illusion is that they can't really imagine how vast the training dataset is. In all cases where it appears to answer a question like the one you posed, it's regurgitating the answer from its training data in a way that creates an illusion of using logic to answer it.

                                                                                                                                                  • dyauspitr

                                                                                                                                                    today at 10:02 AM

                                                                                                                                                    That’s not true, at all.

                                                                                                                                                      • IAmGraydon

                                                                                                                                                        today at 2:12 PM

                                                                                                                                                        Please…go on.

                                                                                                                                                    • CamperBob2

                                                                                                                                                      today at 1:06 AM

                                                                                                                                                      > It can't do that without the answer to who did it being in the training data.

                                                                                                                                                      Try it. Write a simple original mystery story, and then ask a good model to solve it.

                                                                                                                                                      This isn't your father's Chinese Room. It couldn't solve original brainteasers and puzzles if it were.

                                                                                                                                              • Workaccount2

                                                                                                                                                today at 12:27 AM

                                                                                                                                                "Understanding" is just a trap to get wrapped up in. A word with no definition and no test to prove it.

                                                                                                                                                Whether or not the model are "understanding" is ultimately immaterial, as their ability to do things is all that matters.

                                                                                                                                                  • pinnochio

                                                                                                                                                    today at 12:51 AM

                                                                                                                                                    If they can't do things that require understanding, it's material, bub.

                                                                                                                                                    And just because you have no understanding of what "understanding" means, doesn't mean nobody does.

                                                                                                                                                      • red75prime

                                                                                                                                                        today at 2:00 AM

                                                                                                                                                        > doesn't mean nobody does

                                                                                                                                                        If it's not a functional understanding that allows one to replicate the functionality of understanding, is it real understanding?

                                                                                                                                                • astrange

                                                                                                                                                  yesterday at 11:52 PM

                                                                                                                                                  If you're claiming a transformer model is a Markov chain, this is easily disprovable by, eg, asking the model why it isn't a Markov chain!

                                                                                                                                                  But here is a really big one of those if you want it: https://arxiv.org/abs/2401.17377

                                                                                                                                                  • nl

                                                                                                                                                    today at 12:07 AM

                                                                                                                                                    Modern LLMs are post-trained for tasks other than next-word prediction.

                                                                                                                                                    They still output words, though (except for multi-modal LLMs), so that does involve next-word generation.

                                                                                                                                                    • dyauspitr

                                                                                                                                                      today at 10:00 AM

                                                                                                                                                      The line between understanding and ā€œlarge sample of data made into a statistical modelā€ is kind of fuzzy.

                                                                                                                                                  • HarHarVeryFunny

                                                                                                                                                    today at 12:49 AM

                                                                                                                                                    > Predicting the next word requires understanding

                                                                                                                                                    If we were talking about humans trying to predict the next word, that would be true.

                                                                                                                                                    There is no reason to suppose that an LLM is doing anything other than deep pattern prediction, pursuant to, and no better than needed for, next-word prediction.

                                                                                                                                                      • famouswaffles

                                                                                                                                                        today at 1:28 AM

                                                                                                                                                        There is plenty reason. This article is just one example of many. People bring it up because LLMs routinely do things we call reasoning when we see them manifest in other humans. Brushing it off as 'deep pattern prediction' is genuinely meaningless. Nobody who uses that phrase in that way can actually explain what they are talking about in a way that can be falsified. It's just vibes. It's an unfalsifiable conversation-stopper, not a real explanation. You can replace "pattern matching" with "magic" and the argument is identical because the phrase isn't actually doing anything.

                                                                                                                                                        A - A force is required to lift a ball

                                                                                                                                                        B - I see Human-N lifting a ball

                                                                                                                                                        C - Obviously, Human-N cannot produce forces

                                                                                                                                                        D - Forces are not required to lift a ball

                                                                                                                                                        Well sir, why are you so sure Human-N cannot produce forces? How is she lifting the ball? Well, of course, Human-N is just using s̶t̶a̶t̶i̶s̶t̶i̶c̶s̶ magic.

                                                                                                                                                          • energy123

                                                                                                                                                            today at 1:44 AM

                                                                                                                                                            Anything can be euphemized. Human intelligence is atoms moving around the brain. General relativity is writing on a piece of paper.

                                                                                                                                                              • famouswaffles

                                                                                                                                                                today at 1:54 AM

                                                                                                                                                                If you want to say human and LLM intelligence are both 'deep pattern prediction', then sure. But mostly, and certainly in the case I was replying to, people just use the phrase to draw an imaginary, unfalsifiable distinction between what LLMs do and what the super-special humans do.

                                                                                                                                                        • CamperBob2

                                                                                                                                                          today at 1:05 AM

                                                                                                                                                          How'd you do at the International Math Olympiad this year?

                                                                                                                                                            • cxvrfr

                                                                                                                                                              today at 9:04 AM

                                                                                                                                                              How would you do at multiplying 10,000 pairs of 100-digit numbers in a limited amount of time? We don't anthropomorphize calculators, though...
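
                                                                                                                                                              (For scale, a quick sketch of that task in Python, whose integers are arbitrary-precision; the pair count and digit length come straight from the comment above:)

                                                                                                                                                                  import random
                                                                                                                                                                  import time

                                                                                                                                                                  # 10,000 pairs of random 100-digit numbers
                                                                                                                                                                  pairs = [(random.randrange(10**99, 10**100), random.randrange(10**99, 10**100))
                                                                                                                                                                           for _ in range(10_000)]

                                                                                                                                                                  start = time.perf_counter()
                                                                                                                                                                  products = [a * b for a, b in pairs]  # exact big-integer multiplication
                                                                                                                                                                  print(f"multiplied {len(products)} pairs in {time.perf_counter() - start:.3f} s")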

                                                                                                                                                              • HarHarVeryFunny

                                                                                                                                                                today at 2:59 AM

                                                                                                                                                                I hear the LLM was able to parrot fragments of the stuff it was trained to memorize, and did very well.

                                                                                                                                                                  • CamperBob2

                                                                                                                                                                    today at 3:16 AM

                                                                                                                                                                    Yeah, that must be it.

                                                                                                                                                                      • cxvrfr

                                                                                                                                                                        today at 9:05 AM

                                                                                                                                                                        Well, being able to extrapolate solutions to "novel" mathematical exercises based on a very large sample of similar tasks in your dataset seems like a reasonable explanation.

                                                                                                                                                                        The question is how well it would do if it were trained without those samples.

                                                                                                                                                    • charcircuit

                                                                                                                                                      today at 1:58 AM

                                                                                                                                                      It's trying to maximize a reward function. It's not just predicting the next word.

                                                                                                                                                  • conception

                                                                                                                                                    yesterday at 10:52 PM

                                                                                                                                                    I will note that the 2.5 Pro preview… March? was maybe the best model I’ve used yet. The actual release model was… less. I suspect Google found the preview too expensive and optimized it down, but it was interesting to see there was some hidden horsepower there. Google has always been poised to be the AI leader/winner - excited to see if this is fluff, the real deal, or another preview that gets nerfed.

                                                                                                                                                      • muixoozie

                                                                                                                                                        today at 12:22 AM

                                                                                                                                                        Dunno if you're right, but I'd like to point out that I've been reading comments like these about every model since GPT-3. It's just starting to seem more likely to me to be cognitive bias than not.

                                                                                                                                                          • conception

                                                                                                                                                            today at 12:24 AM

                                                                                                                                                            I haven’t noticed the effect of things getting worse after a release, but 2.5’s abilities definitely got worse. Or perhaps they optimized for something else? Beyond that I haven’t noticed the usual "things got worse after release!" - except for when Sonnet had a bug for a month and GPT-5’s autorouter broke.

                                                                                                                                                              • muixoozie

                                                                                                                                                                today at 2:18 AM

                                                                                                                                                                Yea I don't know. I didn't mean to sound accusatory. I might very well be wrong.

                                                                                                                                                            • KaoruAoiShiho

                                                                                                                                                              today at 12:28 AM

                                                                                                                                                              Sometimes it is just bias, but for 2.5 Pro there were benchmarks showing the degradation (plus they changed the name every time, so it was obviously a different checkpoint or model).

                                                                                                                                                              • colordrops

                                                                                                                                                                today at 12:28 AM

                                                                                                                                                                Why would you assume cognitive bias? Any evidence? These things are indeed very expensive to run, and are often run at a loss. Wouldn't quantization or other tuning be just as reasonable an answer as cognitive bias? It's not like we are talking about reptilian aliens running the White House.

                                                                                                                                                                  • muixoozie

                                                                                                                                                                    today at 2:07 AM

                                                                                                                                                                    I'm just pointing out a personal observation. Completely anecdotal. FWIW, I don't strongly believe this. I have at least noticed a selection bias (maybe) in myself too, as recently as yesterday after GPT-5.1 was released. I asked Codex to do a simple change (less than 50 LOC) and it made an unrelated change, an early return statement, breaking a very simple state machine that goes from waiting -> evaluate -> done. However, I have to remind myself how often LLMs make dumb mistakes despite often seeming impressive.

                                                                                                                                                                      • oasisbob

                                                                                                                                                                        today at 2:52 AM

                                                                                                                                                                        That sounds more like availability bias, not selection bias.

                                                                                                                                                            • oasisbob

                                                                                                                                                              today at 2:55 AM

                                                                                                                                                              I noticed the degradation when Gemini stopped being a good research tool and started making me want to strangle it on a daily basis.

                                                                                                                                                              It's incredibly frustrating to have a model start to hallucinate sources and be incapable of revisiting its behavior.

                                                                                                                                                              It couldn't even understand that it was making up nonsensical RFC references.

                                                                                                                                                          • xx_ns

                                                                                                                                                            yesterday at 11:47 PM

                                                                                                                                                            Am I missing something here? Colonial merchant ledgers and 18th-century accounting practices have been extensively digitized and discussed in academic literature. The model has almost certainly seen examples where these calculations are broken down or explained. It could be interpolating from similar training examples rather than "reasoning."

                                                                                                                                                              • ceroxylon

                                                                                                                                                                today at 2:23 AM

                                                                                                                                                                The author claims that they tried to avoid that: "[. . .] we had to choose them carefully and experiment to ensure that these documents were not already in the LLM training data (full disclosure: we can’t know for sure, but we took every reasonable precaution)."

                                                                                                                                                                  • blharr

                                                                                                                                                                    today at 2:57 AM

                                                                                                                                                                    Even if that specific document wasn't in the training data, there could be many similar documents from others at the time.

                                                                                                                                                              • MagicMoonlight

                                                                                                                                                                today at 3:24 AM

                                                                                                                                                                It seems like a leap to assume it has done all sorts of complex calculations implicitly.

                                                                                                                                                                I looked at the image and immediately noticed that it is written as "14 5" in the original text. It doesn’t require calculation to guess that it might be 14 pounds 5 ounces rather than 145, especially since, presumably, that notation was used elsewhere in the document.

                                                                                                                                                                • gcanyon

                                                                                                                                                                  today at 1:51 AM

                                                                                                                                                                  > So that is essentially the ceiling in terms of accuracy.

                                                                                                                                                                  I think this is mistaken. I remember, maybe ten years ago, when speech-to-text models came out that could deal with background noise which, to my ear, sounded very much like straight pink noise, yet the model was able to transcribe the speech hidden within at a reasonable accuracy rate.

                                                                                                                                                                  So with handwritten text, the only prediction that makes sense to me is that we will (potentially) reach a state where the machine is probably more accurate than humans, even though we wouldn't be able to confirm it ourselves.

                                                                                                                                                                  But if multiple independent models, say, Gemini 5 and Claude 7, both agree on the result, and a human can only shrug and say, "might be," then we're at a point where the machines are probably superior at the task.
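
                                                                                                                                                                  (A minimal sketch of that agreement check, assuming model_a and model_b are hypothetical wrappers around two independent transcription APIs returning plain text:)

                                                                                                                                                                      # Accept lines where two independent models agree; flag the rest for a human.
                                                                                                                                                                      def transcribe_with_agreement(image, model_a, model_b):
                                                                                                                                                                          lines_a = model_a(image).splitlines()
                                                                                                                                                                          lines_b = model_b(image).splitlines()
                                                                                                                                                                          accepted, needs_review = [], []
                                                                                                                                                                          for a, b in zip(lines_a, lines_b):
                                                                                                                                                                              if a.strip() == b.strip():
                                                                                                                                                                                  accepted.append(a)            # independent agreement
                                                                                                                                                                              else:
                                                                                                                                                                                  needs_review.append((a, b))   # a human adjudicates these
                                                                                                                                                                          return accepted, needs_review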

                                                                                                                                                                    • regularfry

                                                                                                                                                                      today at 8:00 AM

                                                                                                                                                                      That depends on how good we get at interpretability. If the models can not only do the job but also are structured to permit an explanation of how they did it, we get the confirmation. Or not, if it turns out that the explanation is faulty.

                                                                                                                                                                  • pavlov

                                                                                                                                                                    yesterday at 10:46 PM

                                                                                                                                                                    I’ve seen those A/B choices on Google AI Studio recently, and there wasn’t a substantial difference between the outputs. It felt more like a different random seed for the same model.

                                                                                                                                                                    Of course it’s very possible my use case wasn’t terribly interesting so it wouldn’t reveal model differences, or that it was a different A/B test.

                                                                                                                                                                      • jeffbee

                                                                                                                                                                        yesterday at 11:02 PM

                                                                                                                                                                        For me they've been very similar, except in one case where I corrected it and on one side it doubled down on being objectively wrong, and on the other side it took my feedback and started over with a new line of thinking.

                                                                                                                                                                    • neom

                                                                                                                                                                      today at 12:07 AM

                                                                                                                                                                      I've been complaining on HN for some time now that my only real test of an LLM is whether it can help my poor wife with her research; she spends all day, every day, in small-town archives poring over 18th-century American historical documents. I thought maybe that day had come. I showed her the article and she said "good for him, I'm still not transcribing important historical documents with a chat bot and nor should he" - ha. If you wanna play around with some difficult stuff, here are some images from her work I've posted before: https://s.h4x.club/bLuNed45

                                                                                                                                                                        • Huppie

                                                                                                                                                                          today at 6:53 AM

                                                                                                                                                                          While it's of course a good thing to be critical, the author did provide some more context on the why and how of doing it with LLMs on the Hard Fork podcast today [0]: mostly as a way to see how these models _can_ help them with these tasks.

                                                                                                                                                                          I would recommend listening to their explanation; maybe it'll give more insight.

                                                                                                                                                                          Disclosure: After listening to the podcast and looking up and reading the article, I emailed @dang to suggest it go into the HN second chance pool. I'm glad more people enjoyed it.

                                                                                                                                                                          [0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-ce...

                                                                                                                                                                          • Workaccount2

                                                                                                                                                                            today at 12:23 AM

                                                                                                                                                                            People have had spotty access to this model (Gemini 3 Pro) for brief periods for a few weeks now, but it's strongly expected to be released next week, and definitely by year end.

                                                                                                                                                                              • neom

                                                                                                                                                                                today at 12:27 AM

                                                                                                                                                                                Oh, I didn't realize this wasn't 2.5 Pro (I skimmed, sorry) - I also haven't had time to run some of her docs on 5.1 yet. I should.

                                                                                                                                                                            • HDThoreaun

                                                                                                                                                                              today at 12:39 AM

                                                                                                                                                                              It doesn't have to be perfect to be useful. If it does a decent job and your wife reviews and edits, that will be much faster than doing the whole thing by hand. The only question is whether she can stay committed to perfection. I don't see the downside of trying it unless she's worried about getting lazy.

                                                                                                                                                                                • neom

                                                                                                                                                                                  today at 12:54 AM

                                                                                                                                                                                  I raised this point with her, she said there are times it would be ambiguous for both her and the model, and she thinks it would be dangerous for her to be influenced by it. I'm not a professional historical researcher so I'm not sure if her concern is valid or not.

                                                                                                                                                                                    • fooker

                                                                                                                                                                                      today at 3:16 AM

                                                                                                                                                                                      As a scientist, I don't think this is valid or useful. It's very much a first-year-PhD line of thought that academia stamps out of you.

                                                                                                                                                                                      This is the 'RE' in research: you specifically want to know and understand what others think of something by reading their papers. Scientific training slowly, laboriously prepares you to reason about something without being too influenced by it.

                                                                                                                                                                                      • HDThoreaun

                                                                                                                                                                                        today at 1:08 AM

                                                                                                                                                                                        I think there's a lot of meta-thought that deserves to be done about where these new tools fit. It is easy to offhandedly reject change, especially as a subject-matter expert who can feel they worked so hard at this and are now being replaced, so the work was for nothing. I really don't want to say your wife is wrong; she almost assuredly is not. But it is important to have a curious mindset when confronted with ideas you may be biased against. Then she can rest easy knowing she is doing her best to perfect her craft, right? Otherwise she might wake up one day feeling like symbolic-NLP researchers trying LLMs for the first time. Certainly a lot to consider.

                                                                                                                                                                                          • neom

                                                                                                                                                                                            today at 1:36 AM

                                                                                                                                                                                            I really appreciate your thoughtful reply. I try my best to be encouraging and educating without being preachy or condescending with my wife on this subject. I read HN, and I see the posts of folks in, frankly, what reads like anguish about having a tool replace their expertise. I feel really, sad? about it. It's interesting to be confronted with it here (a place I love!) and at home (a place I love!) in quite different contexts. I've also never been particularly good at becoming good at something, I can't do very much, and genAI is really exciting for me; I'm both drawn to and have love for experts, so... This whole thing generally has been keeping me up at night a bit, because I feel anguish for the anguish.

                                                                                                                                                                            • AaronNewcomer

                                                                                                                                                                              today at 3:41 AM

                                                                                                                                                                              The thinking models (especially OpenAI's o3) still seem to do by far the best at this task, as they look across the document to see how the writer formed certain letters in places where the word is clearer, and use that when they run into confusing words.

                                                                                                                                                                              I built a whole product around this: https://DocumentTranscribe.com

                                                                                                                                                                              But I imagine this will keep getting better and that excites me since this was largely built for my own research!

                                                                                                                                                                                • _giorgio_

                                                                                                                                                                                  today at 7:41 AM

                                                                                                                                                                                  I find Gemini 2.5 Pro, not Flash, way better than the ChatGPT models. I don't remember testing o3, though. Maybe it's o3-pro, one of the old, costly thinking models?

                                                                                                                                                                              • Grimblewald

                                                                                                                                                                                today at 9:26 AM

                                                                                                                                                                                I dunno, man, looks like Goodhart's law in action to me. That isn't to say the models won't be good at what is stated, but it does mean it might not signal a general improvement in competence; rather a targeted gain, with more general deficits rising up in untested/ignored areas, some of which may or may not be catastrophic. I guess we will see, but for now Imma keep my hype in the box.

                                                                                                                                                                                • jumploops

                                                                                                                                                                                  yesterday at 11:53 PM

                                                                                                                                                                                  This is exciting news, as I have some elegantly scribed family diaries from the 1800s that I can barely read (:

                                                                                                                                                                                  With that said, the writing here is a bit hyperbolic, as the advances seem like standard improvements, rather than a huge leap or final solution.

                                                                                                                                                                                    • red75prime

                                                                                                                                                                                      today at 2:08 AM

                                                                                                                                                                                      The statistics in the article rest on too few samples for a definitive conclusion, but expert-level WER (word error rate) looks like a huge leap.
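
                                                                                                                                                                                      (For reference, WER is the word-level edit distance between a hypothesis and a reference transcription, divided by the reference length; a minimal sketch:)

                                                                                                                                                                                          # WER = (substitutions + insertions + deletions) / reference length,
                                                                                                                                                                                          # via a standard Levenshtein DP over words.
                                                                                                                                                                                          def wer(reference: str, hypothesis: str) -> float:
                                                                                                                                                                                              r, h = reference.split(), hypothesis.split()
                                                                                                                                                                                              d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
                                                                                                                                                                                              for i in range(len(r) + 1):
                                                                                                                                                                                                  d[i][0] = i
                                                                                                                                                                                              for j in range(len(h) + 1):
                                                                                                                                                                                                  d[0][j] = j
                                                                                                                                                                                              for i in range(1, len(r) + 1):
                                                                                                                                                                                                  for j in range(1, len(h) + 1):
                                                                                                                                                                                                      cost = 0 if r[i - 1] == h[j - 1] else 1
                                                                                                                                                                                                      d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                                                                                                                                                                                                    d[i][j - 1] + 1,         # insertion
                                                                                                                                                                                                                    d[i - 1][j - 1] + cost)  # substitution
                                                                                                                                                                                              return d[len(r)][len(h)] / max(len(r), 1)

                                                                                                                                                                                          print(wer("the quick brown fox", "the quick brow fox"))  # 0.25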

                                                                                                                                                                                  • observationist

                                                                                                                                                                                    yesterday at 11:14 PM

                                                                                                                                                                                    This might just be a handcrafted prompt framework for handwriting recognition tied in with reasoning: do a rough pass, make assumptions and predictions, check them, and if they pass, use the degree of confidence in their passing to inform what the other characters might be, gradually fleshing out an interpretation of what was intended to be communicated.

                                                                                                                                                                                    If they could get this to occur naturally - with no supporting prompts, only one-shot prompting or one-shot reasoning - then it could extend to complex composition generally, which would be cool.
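
                                                                                                                                                                                    (A minimal sketch of the hypothesized loop; call_model is a hypothetical wrapper around any vision-LLM API, and the prompts and pass count are illustrative:)

                                                                                                                                                                                        # Rough pass, then repeated self-checks that use the confident words
                                                                                                                                                                                        # to constrain the uncertain ones.
                                                                                                                                                                                        def transcribe_iteratively(image, call_model, max_passes=3):
                                                                                                                                                                                            draft = call_model(image, "Transcribe this handwriting exactly. "
                                                                                                                                                                                                                      "Mark unclear words as [?word?].")
                                                                                                                                                                                            for _ in range(max_passes):
                                                                                                                                                                                                revised = call_model(image,
                                                                                                                                                                                                    "Draft transcription:\n" + draft + "\n"
                                                                                                                                                                                                    "Re-examine each [?word?] against how the same letter shapes "
                                                                                                                                                                                                    "appear in the confident words and against the document's "
                                                                                                                                                                                                    "context. Return only the corrected transcription.")
                                                                                                                                                                                                if revised == draft:   # converged: no corrections left
                                                                                                                                                                                                    break
                                                                                                                                                                                                draft = revised
                                                                                                                                                                                            return draft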

                                                                                                                                                                                      • terminalshort

                                                                                                                                                                                        today at 3:34 AM

                                                                                                                                                                                        I don't see how this performance could be anything like that. There is no way that Google included specialized system prompts with anything to do with converting shillings to pounds in their model.

                                                                                                                                                                                    • sriku

                                                                                                                                                                                      today at 9:14 AM

                                                                                                                                                                                      Regarding the "14 lb 5 oz" point in the article: a simpler explanation than the hypothesis that it back-calculated the weight is that there seems to be a space between the 14 and the 5 - i.e., it reads more like "14 5" than "145".

                                                                                                                                                                                        • sriku

                                                                                                                                                                                          today at 9:14 AM

                                                                                                                                                                                          Impressive performance, yes but is the article giving more credit than due?

                                                                                                                                                                                      • netsharc

                                                                                                                                                                                        yesterday at 10:29 PM

                                                                                                                                                                                        Author says "It is the most amazing thing I have seen an LLM do, and it was unprompted, entirely accidental." and then jumps back to the "beginning of the story". Including talking about a trip to Canada.

                                                                                                                                                                                        Skip to the section headed "The Ultimate Test" for the resolution of the clickbait of "the most amazing thing...". (According to him, it correctly interpreted a line in an 18th century merchant ledger using maths and logic)

                                                                                                                                                                                          • appreciatorBus

                                                                                                                                                                                            yesterday at 11:11 PM

                                                                                                                                                                                            The new model may or may not be great at handwriting but I found the author's constant repetition about how amazing it was irritating enough to stop reading and to wonder if the article itself was slop-written.

                                                                                                                                                                                            "users have reported some truly wild things" "the results were shocking" "the most amazing thing I have seen an LLM do" "exciting and frightening all at once" "the most astounding result I have ever seen" "made the hair stand up on the back of my neck"

                                                                                                                                                                                              • bitwize

                                                                                                                                                                                                today at 12:27 AM

                                                                                                                                                                                                You're never gonna believe #6!

                                                                                                                                                                                        • ghm2199

                                                                                                                                                                                          today at 12:07 AM

                                                                                                                                                                                          I just used AI Studio to recognize text from a relative's 60-day log of food ingested three times a day. I think I am using models/gemini-flash-latest, and it was shockingly good at recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet (IIRC it's 4.5).

                                                                                                                                                                                          https://pasteboard.co/euHUz2ERKfHP.png

                                                                                                                                                                                          Its response, which I have captured here https://pasteboard.co/sbC7G9nuD9T9.png, is shockingly good. I could only spot 2 mistakes, and those seem to have been the ones that even I could not read, or found very difficult to make out.

                                                                                                                                                                                            • ghm2199

                                                                                                                                                                                              today at 12:08 AM

                                                                                                                                                                                              I basically fed it all 60 images 5 at a time and made a table out of them to correlate sugar levels <-> food and colocate it with the person's exercise routines. This is insane.
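
                                                                                                                                                                                              (A minimal sketch of that batching flow, assuming the google-generativeai Python SDK; the model name, file names, and prompt are illustrative:)

                                                                                                                                                                                                  import google.generativeai as genai
                                                                                                                                                                                                  from PIL import Image

                                                                                                                                                                                                  genai.configure(api_key="YOUR_API_KEY")          # assumed credential
                                                                                                                                                                                                  model = genai.GenerativeModel("gemini-flash-latest")

                                                                                                                                                                                                  paths = [f"log_{i:02d}.jpg" for i in range(60)]  # hypothetical file names
                                                                                                                                                                                                  rows = []
                                                                                                                                                                                                  for i in range(0, len(paths), 5):                # five images per request
                                                                                                                                                                                                      batch = [Image.open(p) for p in paths[i:i + 5]]
                                                                                                                                                                                                      resp = model.generate_content(
                                                                                                                                                                                                          batch + ["Transcribe each day's entries as CSV rows: "
                                                                                                                                                                                                                   "date,meal,food,sugar_level,exercise"])
                                                                                                                                                                                                      rows.append(resp.text)

                                                                                                                                                                                                  print("\n".join(rows))                           # paste into a sheet/table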

                                                                                                                                                                                          • koliber

                                                                                                                                                                                            today at 9:23 AM

                                                                                                                                                                                            It hasn't met my doctor.

                                                                                                                                                                                            • _giorgio_

                                                                                                                                                                                              today at 7:23 AM

                                                                                                                                                                                              Gemini 2.5 Pro is already incredibly good at handwriting recognition. It makes maybe one small mistake every 3 pages.

                                                                                                                                                                                              It has completely changed the way I work: it lets me write math and text by hand and then convert it with the Gemini app (or with a scanned PDF in the browser). You should really try it.

                                                                                                                                                                                              • barremian

                                                                                                                                                                                                today at 4:58 AM

                                                                                                                                                                                                > it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts

                                                                                                                                                                                                > As is so often the case with AI, that is exciting and frightening all at once

                                                                                                                                                                                                > we need to extrapolate from this small example to think more broadly: if this holds the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together required

                                                                                                                                                                                                > this will be a big deal when it’s released

                                                                                                                                                                                                > What appears to be happening here is a form of emergent, implicit reasoning, the spontaneous combination of perception, memory, and logic inside a statistical model

                                                                                                                                                                                                > model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself

                                                                                                                                                                                                Just another post with extreme, hyperbolic wording to blow up another model release. How many times have we seen this kind of unrealistic build-up in the past couple of years?

                                                                                                                                                                                                • greekrich92

                                                                                                                                                                                                  yesterday at 11:28 PM

                                                                                                                                                                                                  Pretty hyperbolic reaction to what seems like a fairly modest improvement

                                                                                                                                                                                                  • lproven

                                                                                                                                                                                                    yesterday at 11:20 PM

                                                                                                                                                                                                    Betteridge's law surely applies.

                                                                                                                                                                                                    • kittikitti

                                                                                                                                                                                                      yesterday at 11:26 PM

                                                                                                                                                                                                      I much prefer this tone about improvements in AI over the doomerism I constantly read. I was waiting for a twist where the author changed their minds and suddenly went "this is the devil's technology" or "THEY T00K OUR JOBS" but it never happened. Thank you for sharing, it felt like breathing for the first time in a long time.

                                                                                                                                                                                                      • thatoneengineer

                                                                                                                                                                                                        yesterday at 10:47 PM

                                                                                                                                                                                                        https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...

                                                                                                                                                                                                          • lproven

                                                                                                                                                                                                            yesterday at 11:22 PM

                                                                                                                                                                                                            You beat me to it.

                                                                                                                                                                                                        • bgwalter

                                                                                                                                                                                                          yesterday at 10:34 PM

                                                                                                                                                                                                          No, just another academic with the ominous handle @generativehistory who is beguiled by "AI". It is strange that others can never reproduce such amazing feats.

                                                                                                                                                                                                            • pksebben

                                                                                                                                                                                                              yesterday at 10:56 PM

                                                                                                                                                                                                              I don't know if I'd call it an 'amazing feat', but Claude made me pause for a moment recently.

                                                                                                                                                                                                              Some time ago, I'd been working on a framework that involved a series of servers (not the only one I've talked to claude about) that had to pass messages around in a particular fashion. Mostly technical implementation details and occasional questions about architecture.

                                                                                                                                                                                                              Fast forward a ways, and on a lark I decided to ask in the abstract about the best way to structure such an interaction. Mark that this was not in the same chat or project and didn't have any identifying information about the original, save for the structure of the abstraction (in this case, a message bus server and some translation and processing services, all accessed via client.)

                                                                                                                                                                                                              so:

                                                                                                                                                                                                              - we were far enough removed that the whole conversation pertaining to the original was for sure not in the context window

                                                                                                                                                                                                              - we only referred to the abstraction (with like a A=>B=>C=>B=>A kind of notation and a very brief question)

                                                                                                                                                                                                              - most of the work on the original was in claude code

                                                                                                                                                                                                              and it knew. In the answer it gave, it mentioned the project by name. I can think of only two ways this could have happened:

                                                                                                                                                                                                              - they are doing some real fancy tricks to cram your entire corpus of chat history into the current context somehow

                                                                                                                                                                                                              - the model has access to some kind of fact database where it was keeping an effective enough abstraction to make the connection

                                                                                                                                                                                                              I find either one mindblowing for different reasons.
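
For the curious, here's roughly the shape I mean: a minimal, runnable sketch of the A=>B=>C=>B=>A topology. Every name in it is made up for illustration; the real project looked nothing like this in detail.

    # A = client, B = translator, C = processor. The client sends a request
    # to the bus, the bus forwards it through the translator and the
    # processor, and the reply travels back along the same path.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Message:
        topic: str
        payload: str

    class Bus:
        """Routes each message to the handler registered for its topic."""
        def __init__(self) -> None:
            self.handlers: Dict[str, Callable[[Message], Message]] = {}

        def register(self, topic: str, handler: Callable[[Message], Message]) -> None:
            self.handlers[topic] = handler

        def send(self, msg: Message) -> Message:
            return self.handlers[msg.topic](msg)

    def make_translator(bus: Bus) -> Callable[[Message], Message]:
        # B: normalizes the payload, hands it to C via the bus, then
        # denormalizes the reply on the way back.
        def handle(msg: Message) -> Message:
            inner = bus.send(Message("process", msg.payload.lower()))
            return Message("reply", inner.payload.upper())
        return handle

    def processor(msg: Message) -> Message:
        # C: where the actual work would happen.
        return Message("processed", "handled(" + msg.payload + ")")

    bus = Bus()
    bus.register("translate", make_translator(bus))
    bus.register("process", processor)

    # A: the client only ever talks to the bus.
    print(bus.send(Message("translate", "Hello")).payload)  # HANDLED(HELLO)

The point of the shape is that the client never sees B or C directly, only the bus, which is also why my abstract question didn't need any identifying details to describe it.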

                                                                                                                                                                                                                • omega3

                                                                                                                                                                                                                  yesterday at 11:33 PM

                                                                                                                                                                                                                  Perhaps you have the memory feature enabled: https://support.claude.com/en/articles/11817273-using-claude...

                                                                                                                                                                                                                    • pksebben

                                                                                                                                                                                                                      today at 4:27 AM

I probably do, and this is what I think happened. Mind you, it's not magic, but holding that information with enough fidelity to pattern-match the structure of the underlying design is something I find remarkable. It's a leap from a lot of the patterns I'm used to.
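
To illustrate what I mean (and to be clear, I have no idea how Anthropic actually built this), even the simplest imaginable memory layer would produce the effect: summarize old chats, embed the summaries, and retrieve the nearest one at question time. A toy sketch, with a made-up project name and a bag-of-words stand-in for a real embedding model:

    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        # Stand-in for a real embedding model: bag of lowercased words.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Summaries a memory system might have distilled from old chats.
    # "project foo" is hypothetical, not my project's real name.
    memories = [
        "project foo: message bus server with translation and processing services",
        "helped debug a sql migration for a billing database",
    ]

    def recall(question: str) -> str:
        # Return the stored summary most similar to the new question.
        return max(memories, key=lambda m: cosine(embed(m), embed(question)))

    question = "best way to structure a message bus with translation and processing services"
    print(recall(question))  # the "project foo" summary wins

Even that crude a store is enough to connect an abstract A=>B=>C question back to a named project, which is the part I found remarkable.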

                                                                                                                                                                                                                  • zahlman

                                                                                                                                                                                                                    yesterday at 11:16 PM

                                                                                                                                                                                                                    Are you sure it isn't just a case of a write-up of the project appearing in the training data?

                                                                                                                                                                                                                      • pksebben

                                                                                                                                                                                                                        today at 4:24 AM

It was my own project, so I don't see how it could have been. Private repo, unfinished, and I gave it the name.

                                                                                                                                                                                                              • Legend2440

                                                                                                                                                                                                                yesterday at 11:06 PM

                                                                                                                                                                                                                What an unnecessarily wordy article. It could have been a fifth of the length. The actual point is buried under pages and pages of fluff and hyperbole.

                                                                                                                                                                                                                  • falcor84

                                                                                                                                                                                                                    today at 1:52 AM

I'd just suggest that if you want your comment to be more helpful than the article you're critiquing, you should actually quote the part you believe is "the actual point".

Otherwise people are likely to agree with you while having taken away a very different point.

                                                                                                                                                                                                                    • mmaunder

                                                                                                                                                                                                                      today at 1:32 AM

                                                                                                                                                                                                                      The author is far more fascinated with themselves than with AI.

                                                                                                                                                                                                                      • joshdifabio

                                                                                                                                                                                                                        yesterday at 11:32 PM

                                                                                                                                                                                                                        Yes. I left in frustration and came to the comments for a summary.

                                                                                                                                                                                                                        • ThrowawayTestr

                                                                                                                                                                                                                          today at 12:22 AM

                                                                                                                                                                                                                          I'd expect nothing less from a historian

                                                                                                                                                                                                                          • johnwheeler

                                                                                                                                                                                                                            yesterday at 11:08 PM

Yes, I agree, and it seems like the author is naĆÆve about LLMs, because what he's describing is kind of the bread and butter as far as I'm concerned.

                                                                                                                                                                                                                              • Al-Khwarizmi

                                                                                                                                                                                                                                yesterday at 11:45 PM

                                                                                                                                                                                                                                Indeed. To me, it has long been clear that LLMs do things that, at the very least, are indistinguishable from reasoning. The already classic examples where you make them do world modeling (I put an ice cube into a cup, put the cup in a black box, take it into the kitchen, etc... where is the ice cube now?) invalidate the stochastic parrot argument.

But many people in the humanities have read the stochastic parrot argument, and since it fits how they would prefer things to be, they take it as true without questioning it much.

                                                                                                                                                                                                                            • _giorgio_

                                                                                                                                                                                                                              today at 7:44 AM

I missed the point; please point me to it.

                                                                                                                                                                                                                              • turnsout

                                                                                                                                                                                                                                today at 12:01 AM

                                                                                                                                                                                                                                So, a Substack article then

                                                                                                                                                                                                                                • asimilator

                                                                                                                                                                                                                                  yesterday at 11:09 PM

                                                                                                                                                                                                                                  Summarize it with an LLM.

                                                                                                                                                                                                                              • mmaunder

                                                                                                                                                                                                                                today at 1:33 AM

                                                                                                                                                                                                                                Substack: When you have nothing to say and all day to say it.

                                                                                                                                                                                                                                  • mattmaroon

                                                                                                                                                                                                                                    today at 3:06 AM

                                                                                                                                                                                                                                    ā€œThis AI did something amazing but first I’m going to put in 72 paragraphs of details only I care about.ā€

                                                                                                                                                                                                                                    I was thinking as I skimmed this it needs a ā€œjump to recipeā€ button.

                                                                                                                                                                                                                                    • _giorgio_

                                                                                                                                                                                                                                      today at 7:43 AM

It was an embarrassing read. I should ask an LLM to read it, since he probably wrote it the same way.

                                                                                                                                                                                                                                        • phkahler

                                                                                                                                                                                                                                          yesterday at 11:57 PM

                                                                                                                                                                                                                                          It's a diffusion model, not autocomplete.

                                                                                                                                                                                                                                          • outside2344

                                                                                                                                                                                                                                            yesterday at 11:31 PM

                                                                                                                                                                                                                                            We are probably just a few weeks away from Google completely wiping OpenAI out.

                                                                                                                                                                                                                                            • cheevly

                                                                                                                                                                                                                                              today at 5:35 AM

                                                                                                                                                                                                                                              Reading HN comments just makes me realize how vastly LLMs exceed human intelligence.