
A new Google model is nearly perfect on automated handwriting recognition

384 points - last Tuesday at 1:52 PM

Source
  • lelanthran

    today at 10:38 AM

    > In tabulating the “errors” I saw the most astounding result I have ever seen from an LLM, one that made the hair stand up on the back of my neck. Reading through the text, I saw that Gemini had transcribed a line as “To 1 loff Sugar 14 lb 5 oz @ 1/4 0 19 1”. If you look at the actual document, you’ll see that what is actually written on that line is the following: “To 1 loff Sugar 145 @ 1/4 0 19 1”. For those unaware, in the 18th century sugar was sold in a hardened, conical form and Mr. Slitt was a storekeeper buying sugar in bulk to sell. At first glance, this appears to be a hallucinatory error: the model was told to transcribe the text exactly as written but it inserted 14 lb 5 oz which is not in the document.

    I read the blog author's whole reasoning after that, but I still gotta know: how can we tell that this was not a hallucination and/or error? A guess among the three plausible readings (1 lb 45 oz, 14 lb 5 oz, or 145 lb) has a 1/3 chance of being correct, so why is the author so sure that this was deliberate?

    I feel a good way to test this would be to create an almost identical ledger entry, constructed so that the correct answer after reasoning (the way the author thinks the model reasoned) comes out with completely different digits.

    This way there'd be more confidence that the model itself reasoned and did not make an error.
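
    For what it's worth, the arithmetic behind the author's reading can be checked directly: in pre-decimal money (12 pence to the shilling, 20 shillings to the pound), 14 lb 5 oz at "1/4" (the usual reading: 1 shilling 4 pence, i.e. 16 pence, per lb) comes to 229 pence, exactly "0 19 1", while 145 lb would come to "9 13 4". A quick sanity check in Python; the 16d-per-lb reading follows the blog's interpretation and is an assumption here:

      # Pre-decimal British money: 12 pence (d) = 1 shilling (s); 20 s = 1 pound.
      PENCE_PER_LB = 12 + 4  # "@ 1/4" read as 1 shilling 4 pence per lb

      def price_lsd(weight_lb: float) -> str:
          """Total for `weight_lb` lb of sugar, formatted as 'pounds shillings pence'."""
          d = round(weight_lb * PENCE_PER_LB)
          return f"{d // 240} {(d % 240) // 12} {d % 12}"

      print(price_lsd(14 + 5 / 16))  # 14 lb 5 oz -> "0 19 1", matches the ledger total
      print(price_lsd(145))          # 145 lb     -> "9 13 4", does not match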

      • yomismoaqui

        today at 12:11 PM

        I implemented a receipt-scanner-to-Google-Sheets pipeline using Gemini Flash.

        The fact that it is “intelligent” is fine for some things.

        For example, I created a structured output schema that had a "currency" field in the three-letter format (USD, EUR, ...). I then scanned a receipt from some shop in Jakarta and it filled that field with IDR (Indonesian rupiah). It inferred that value from the city name on the receipt.

        Would it have been better for my use case if it had returned no data for the currency field? I don't think so.

        Note: if needed, I could probably have changed the prompt to not infer the currency when it isn't explicitly listed on the receipt.
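
        For the curious, here is roughly what that looks like; a minimal sketch assuming the google-genai Python SDK, with illustrative field names, prompt, and model choice rather than the actual pipeline code:

          from google import genai
          from pydantic import BaseModel

          class Receipt(BaseModel):
              merchant: str
              total: float
              currency: str  # three-letter ISO 4217 code, e.g. "USD", "IDR"

          client = genai.Client()  # reads GEMINI_API_KEY from the environment
          receipt = client.files.upload(file="receipt.jpg")  # hypothetical sample image
          response = client.models.generate_content(
              model="gemini-2.0-flash",
              contents=[receipt, "Extract the receipt fields."],
              config={
                  "response_mime_type": "application/json",
                  "response_schema": Receipt,
              },
          )
          print(response.parsed)  # a Receipt instance, ready to append as a Sheets row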

          • Someone

            today at 2:07 PM

            > Would it be better for my use case that it would have returned no data for the currency field? Don't think so.

            If there’s a decent chance it infers the wrong currency, potentially one where each unit is worth a few orders of magnitude more or less than the IDR, it might be better not to infer it.

        • YeGoblynQueenne

          today at 12:35 PM

          Don't you see? The hair stood up on the back of the author's neck! We're now at the point where writing about LLMs is like a bunch of teenagers sitting 'round a campfire taking turns to tell each other spooky stories with a flashlight held under their chin.

          ... and then do you know what the LLM did? It r e a s o n e d s p o n t a n e o u s l y!!!!

          Spoooookyyyy!!

            • nopinsight

              today at 1:18 PM

              The comment above seems to violate several HN guidelines. Curious, I asked GPT and Gemini which ones stood out. Both replied with the same top three:

              https://news.ycombinator.com/newsguidelines.html

              They are:

              1. “Be kind. Don't be snarky. … Edit out swipes.”

              2. “Please don't sneer, including at the rest of the community.”

              3. “Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.”

                • qchris

                  today at 2:17 PM

                  I'd be interested in seeing these guidelines updated to include "don't re-post the output of an LLM" to reduce comments of this sort.

                  I don't really feel like comments with LLM output as their primary substance meet the bar of "thoughtful and substantive", and (ironically, in this instance) this one could actually be used as a good example of shallow dismissal, since you, a human, didn't actually provide an opinion or take a stance either way that I could use to begin a good-faith engagement on the topic.

      • throwup238

        yesterday at 10:16 PM

        I really hope they have because I’ve also been experimenting with LLMs to automate searching through old archival handwritten documents. I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge. It doesn’t help that they were often written in the field by semi-literate people who misused lots of words. Even the simplest accounts require quite a lot of detective work to decipher with subtle signals like that pound sign for the sugar loaf.

        > Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

        This I’m a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What’s the kernel look like?

          • viftodi

            today at 2:18 AM

            You are right to be skeptical.

            There are plenty of so-called Windows (or other) web 'OS' clones.

            A couple of these were actually posted on HN this very year.

            Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777

            This is not an OS in the sense of emulating a kernel in JavaScript or WASM; it is a web app that looks like the desktop of an OS.

            I have seen plenty of such projects; some mimic the Windows UI entirely, and you can find them via Google.

            So this was definitely in the training data, and it is not as impressive as the blog post or the Twitter thread make it out to be.

            The scary thing is that the replies in the Twitter thread show no critical thinking at all and are impressed beyond belief: they think it coded a whole kernel and OS, wrote an interpreter for it, ported games, etc.

            I think this is the reason some people are so impressed by AI: when you can only judge an app visually, or by how you interact with it, and you don't have the depth of knowledge to understand it, it all seems to work, and AI seems magical beyond comprehension.

            But all this is only superficial IMHO.

              • krackers

                today at 2:42 AM

                Every time a model is about to be released, a bunch of these hype accounts spin up. I don't know whether they get paid or whether they spring up organically to farm engagement. The last times there was hype like this were "strawberry" (o1) and then GPT-5, and both turned out to be meaningful improvements but nowhere near the hype.

                I don't doubt though that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks so all the labs have probably been optimizing for it.

                  • tyre

                    today at 12:37 PM

                    My guess is that there are insiders who know about the models and can’t keep their mouths shut. They like being on the inside and leaking.

                • risyachka

                  today at 10:28 AM

                  It's always amusing when "an app like Windows XP" is somehow considered hard or challenging.

                  It is literally the most basic HTML/CSS; not sure why it is even included in benchmarks.

                    • ACCount37

                      today at 10:40 AM

                      Those things are LLMs, with text and language at the core of their capabilities. UIs are, notably, not text.

                      An LLM being able to build up interfaces that look recognizably like a UI from a real OS? That sure suggests a degree of multimodal understanding.

              • snickerbockers

                yesterday at 10:25 PM

                I'm skeptical that they're actually capable of making something novel. There are thousands of hobby operating systems and video game emulators on GitHub for it to train off of, so it's not particularly surprising that it can copy somebody else's homework.

                  • jstummbillig

                    yesterday at 11:37 PM

                    I remain confused but still somewhat interested as to a definition of "novel", given how often this idea is wielded in the AI context. How is everyone so good at identifying "novel"?

                    For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

                        • terminalshort

                          today at 3:30 AM

                          If a LLM had written Linux, people would be saying that it isn't novel because it's just based on previous OS's. There is no standard here, only bias.

                            • veegee

                              today at 4:01 AM

                              [dead]

                          • snickerbockers

                            today at 12:30 AM

                            Generally, "novel" refers either to something that is new or to a certain type of literature. If the AI is generating something functionally equivalent to a program in its training set (in this case, dozens or even hundreds of such programs), then by definition it cannot be novel.

                              • brulard

                                today at 12:55 AM

                                This is quite a narrow view of how the generation works. AI can extrapolate from the training set and explore new directions. It's not just cutting pieces and gluing together.

                                  • throwaway173738

                                    today at 5:21 AM

                                    Calling it ā€œexploringā€ is anthropomorphising. The machine has weights that yield meaningful programs given specification-like language. It’s a useful phenomenon but it may be nothing like what we do.

                                      • grosswait

                                        today at 12:36 PM

                                        Or it may be remarkably similar to what we do.

                                    • beeflet

                                      today at 1:32 AM

                                      In practice, I find the ability for this new wave of AI to extrapolate very limited.

                                        • fragmede

                                          today at 1:43 AM

                                          Do you have any concrete examples you'd care to share? While this new wave of AI doesn't have unlimited powers of extrapolation, the post we're commenting on is asserting that this latest AI from Google was able to extrapolate solutions to two of AI's oldest problems, which would seem to contradict an assertion of "very limited".

                                      • kazinator

                                        today at 3:13 AM

                                        Positively not. It is pure interpolation and not extrapolation. The training set is vast and supports an even vaster set of possible traversal paths; but they are all interpolative.

                                        Same with diffusion and everything else. It is not extrapolation that you can transfer the style of Van Gogh onto a photograph; it is interpolation.

                                        Extrapolation might be something like inventing a style: how did Van Gogh do that?

                                        And, sure, the thing can invent a new style---as a mashup of existing styles. Give me a Picasso-like take on Van Gogh and apply it to this image ...

                                        Maybe the original thing there is the idea of doing that; but that came from me! The execution of it is just interpolation.

                                          • ozgrakkurt

                                            today at 4:56 AM

                                            This is how people do things as well, IMO. LLMs do the same thing on some level, but they're just not good enough for the majority of use cases.

                                            • BoorishBears

                                              today at 3:53 AM

                                              This is no knock against you at all, but in a naive attempt to spare someone else some time: remember that, based on this definition, it is impossible for an LLM to do novel things, and, more importantly, you're not going to change how this person defines a concept as integral to one's being as novelty.

                                              I personally think this definition is a bit tautological, but if you hold it, then yes, LLMs are not capable of anything novel.

                                                • Libidinalecon

                                                  today at 12:50 PM

                                                  I think you should reverse the question: why would we expect LLMs to even have the ability to do novel things?

                                                  It is like expecting a DJ remixing tracks to output original music. The confusion is that the DJ is not actually playing the instruments on the recorded music, so they can't do something new beyond the interpolation. I love DJ sets, but it wouldn't be fair to the DJ to expect them to know how to play the sitar because they open the set with a sitar sample interpolated with a kick drum.

                                                  • kazinator

                                                    today at 5:27 AM

                                                    That is not strictly true, because being able to transfer the style of Van Gogh onto an arbitrary photographic scene is novel in a sense, but it is interpolative.

                                                    Mashups are not purely derivative: the choice of what to mash up carries novelty; two (or more) representations are mashed together which hitherto have not been.

                                                    We cannot deny that something is new.

                                                      • regularfry

                                                        today at 8:07 AM

                                                        Innovation itself is frequently defined as the novel combination of pre-existing components. It's mashups all the way down.

                                            • snickerbockers

                                              today at 1:44 AM

                                              uhhh can it? I've certainly not seen any evidence of an AI generating something not based on its training set. It's certainly smart enough to shuffle code around and make superficial changes, and that's pretty impressive in its own way but not particularly useful unless your only goal is to just launder somebody else's code to get around a licensing problem (and even then it's questionable if that's a derived work or not).

                                              Honest question: if AI is actually capable of exploring new directions why does it have to train on what is effectively the sum total of all human knowledge? Shouldn't it be able to take in some basic concepts (language parsing, logic, etc) and bootstrap its way into new discoveries (not necessarily completely new but independently derived) from there? Nobody learns the way an LLM does.

                                              ChatGPT, to the extent that it is comparable to human cognition, is undoubtedly the most well-read person in all of history. When I want to learn something I look it up online or in the public library but I don't have to read the entire library to understand a concept.

                                                • BobbyTables2

                                                  today at 3:48 AM

                                                  You have to realize AI is trained the same way one would train an auto-completer.

                                                  There’s no cognition. It’s not taught language, grammar, etc. None of that!

                                                  It’s only seen a huge amount of text, which allows it to recognize answers to questions. Unfortunately, it appears to work, so people see it as the equivalent of sci-fi movie AI.

                                                  It’s really just a search engine.
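
                                                  To make the "autocomplete" framing concrete: the pretraining objective is plain next-token prediction. A toy sketch (an LSTM stand-in at toy scale, purely illustrative of the objective, nothing like a production transformer):

                                                    import torch
                                                    import torch.nn as nn

                                                    # Toy "autocomplete" objective: predict token t+1 from tokens up to t.
                                                    vocab_size = 50_000
                                                    embed = nn.Embedding(vocab_size, 64)
                                                    lstm = nn.LSTM(64, 64, batch_first=True)  # stand-in for a transformer decoder
                                                    head = nn.Linear(64, vocab_size)

                                                    tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token-id sequences
                                                    hidden, _ = lstm(embed(tokens[:, :-1]))          # encode context up to each position t
                                                    logits = head(hidden)                            # predicted distribution over token t+1
                                                    loss = nn.functional.cross_entropy(
                                                        logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
                                                    )
                                                    loss.backward()  # next-token prediction is the entire pretraining signal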

                                                    • snickerbockers

                                                      today at 4:51 AM

                                                      I agree and that's the case I'm trying to make. The machine-learning community expects us to believe that it is somehow comparable to human cognition, yet the way it learns is inherently inhuman. If an LLM was in any way similar to a human I would expect that, like a human, it might require a little bit of guidance as it learns but ultimately it would be capable of understanding concepts well enough that it doesn't need to have memorized every book in the library just to perform simple tasks.

                                                      In fact, I would expect it to be able to reproduce past human discoveries it hasn't even been exposed to, and if the AI is actually capable of this then it should be possible for them to set up a controlled experiment wherein it is given a limited "education" and must discover something already known to the researchers but not the machine. That nobody has done this tells me that either they have low confidence in the AI despite their bravado, or that they already have tried it and the machine failed.

                                                        • throwaway173738

                                                          today at 5:25 AM

                                                          There’s a third possible reason which is that they’re taking it as a given that the machine is ā€œintelligentā€ as a sales tactic, and they’re not academic enough to want to test anything they believe.

                                                          • ezst

                                                            today at 5:20 AM

                                                            > The machine-learning community

                                                             Is it? I only see a few individuals, VCs, and tech giants overblowing LLMs' capabilities (and I'm still puzzled as to how the latter dragged themselves into a race to the bottom over it). I don't believe the academic field really is that impressed with LLMs.

                                                        • ninetyninenine

                                                          today at 5:13 AM

                                                           No, it's not. I work on AI, and what these things do is much, much more than a search engine or an autocomplete. If an autocomplete passed the Turing test, you'd dismiss it because it's still an autocomplete.

                                                           The characterization you are regurgitating here comes from laymen who do not understand AI. You are not just mildly wrong but wildly uninformed.

                                                            • versteegen

                                                              today at 1:23 PM

                                                              Well, I also work on AI, and I completely agree with you. But I've reached the point of thinking it's hopeless to argue with people about this: It seems that as LLMs become ever better people aren't going to change their opinions, as I had expected. If you don't have good awareness of how human cognition actually works, then it's not evidently contradictory to think that even a superintelligent LLM trained on all human knowledge is just pattern matching and that humans are not. Creativity, understanding, originality, intent, etc, can all be placed into a largely self-consistent framework of human specialness.

                                                              • MangoToupe

                                                                today at 1:08 PM

                                                                To be fair, it's not clear human intelligence is much more than search or autocomplete. The only thing that's clear here is that LLMs can't reproduce it.

                                                                  • ninetyninenine

                                                                    today at 1:13 PM

                                                                     Yes, but colloquially this characterization used by laymen is deliberately deployed to deride and dismiss AI. It is not honest about the on-the-ground progress AI has made, and it’s not intellectually honest about the capabilities and weaknesses of AI.

                                                                      • MangoToupe

                                                                        today at 1:35 PM

                                                                        I disagree. The actual capabilities of LLMs remain unclear, and there's a great deal of reasons to be suspicious of anyone whose paycheck relies on pimping them.

                                                                          • ninetyninenine

                                                                            today at 1:41 PM

                                                                            The capabilities of LLMs are unclear but it is clear that they are not just search engines or autocompletes or stochastic parrots.

                                                                            You can disagree. But this is not an opinion. You are factually wrong if you disagree. And by that I mean you don’t know what you’re talking about and you are completely misinformed and lack knowledge.

                                                                             The long-term outcome, if I’m right, is that AI abilities continue to grow and it basically destroys my career and yours completely. I stand to gain nothing from this reality, and I state it because it is reality. LLMs improve every month. It’s already at the point where, if you’re not vibe coding, you’re behind.

                                                        • BirAdam

                                                          today at 3:24 AM

                                                          You didn’t have to read the whole library because your brain has been absorbing knowledge from multiple inputs your entire life. AI systems are trying to temporally compress a lifetime into the time of training. Then, given that these systems have effectively a single input method of streams of bits, they need immense amounts of it to be knowledgeable at all.

                                                          • ninetyninenine

                                                            today at 5:14 AM

                                                            >I've certainly not seen any evidence of an AI generating something not based on its training set.

                                                             There is plenty of evidence for this. You have to be blind not to see it. Just ask the AI to generate something not in its training set.

                                                            • fragmede

                                                              today at 4:31 AM

                                                              Isn't that what's going on with synthetic data? The LLM is trained, then is used to generate data that gets put into the training set, and then gets further trained on that generated data?

                                                      • taneq

                                                        today at 12:27 PM

                                                        OK, but by that definition, how many human software developers ever develop something "novel"? Of course, the "functionally equivalent" term is doing a lot of heavy lifting here: How equivalent? How many differences are required to qualify as different? How many similarities are required to qualify as similar? Which one overrules the other? If I write an app that's identical to Excel in every single aspect, except that instead of a Microsoft Flight Simulator easter egg there's a different, unique, fully playable game that can't be summed up with any combination of genre labels, is that 'novel'?

                                                    • baq

                                                      today at 10:09 AM

                                                      If the model can map an unseen problem to something in its latent space, solve it there, map back, and deliver an ultimately correct solution, is it novel? Genuine question; ‘novel’ doesn’t seem to have a universally accepted definition here.

                                                      • visarga

                                                        today at 6:53 AM

                                                        > For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

                                                        The secret ingredient is the world outside, and past experiences from the world, which are unique for each human. We stumble onto novelty in the environment. But AI can do that too: AlphaGo's move 37 is an example; much stumbling around leads to discoveries even for AI. The environment is the key.

                                                        • QuadmasterXLII

                                                          today at 2:12 AM

                                                          A system of humans creates bona fide novel writing. We don’t know which human is responsible for the novelty in homoerotic fanfiction of the Odyssey, but it wasn’t a lizard. LLMs don’t have this system-of-thinkers bootstrapping effect yet, or if they do it requires an absolutely enormous boost to get going

                                                          • Workaccount2

                                                            today at 12:17 AM

                                                            [flagged]

                                                              • pinnochio

                                                                today at 12:51 AM

                                                                [flagged]

                                                            • testaccount28

                                                              yesterday at 11:43 PM

                                                              why would you admit on the internet that you fail the reverse Turing test?

                                                                • mikestorrent

                                                                  today at 2:00 AM

                                                                  Didn't some fake AI country song just get on the top 100? How novel is novel? A lot of human artists aren't producing anything _novel_.

                                                                    • magicalist

                                                                      today at 3:16 AM

                                                                      > Didn't some fake AI country song just get on the top 100?

                                                                      No

                                                                      Edit: to be less snarky, it topped the Billboard Country Digital Song Sales Chart, which is a measure of sales of the individual song, not streaming listens. It's estimated it takes a few thousand sales to top that particular chart and it's widely believed to be commonly manipulated by coordinated purchases.

                                                                      • terminalshort

                                                                        today at 3:31 AM

                                                                        It was a real AI country song, not a fake one, but yes.

                                                                    • CamperBob2

                                                                      today at 12:07 AM

                                                                      You have no idea if you're talking to an LLM or a human, yourself, so ... uh, wait, neither do I.

                                                                      • fragmede

                                                                        yesterday at 11:46 PM

                                                                        Because not everyone here has a raging ego and no humility?

                                                                        • greygoo222

                                                                          today at 12:42 AM

                                                                          Because I'm an LLM and you are too

                                                                      • kazinator

                                                                        today at 3:06 AM

                                                                        Because we know that the human only read, say, fifty books since they were born, and watched a few thousand videos, and there is nothing in them which resembles what they wrote.

                                                                    • sosuke

                                                                      today at 1:16 AM

                                                                      Doing something novel is incredibly difficult through LLM work alone. Dreaming, or hallucinating, might eventually make novelty possible, but it has to be backed up by rock-solid base work. We aren't there yet.

                                                                      The working memory it holds is still extremely small compared to what we would need for regular open ended tasks.

                                                                      Yes there are outliers and I'm not being specific enough but I can't type that much right now.

                                                                      • n8cpdx

                                                                        today at 12:37 AM

                                                                        The Windows (~2000) kernel itself is on GitHub, even exquisitely documented, if AI can read .doc files.

                                                                        https://github.com/ranni0225/WRK

                                                                        • flatline

                                                                          yesterday at 10:54 PM

                                                                          I believe they can create a novel instance of a system from a sufficient number of relevant references - i.e. implement a set of already-known features without (much) code duplication. LLMs are certainly capable of this level of generalization due to their huge non-relevant reference set. Whether they can expand beyond that into something truly novel from a feature/functionality standpoint is a whole other, and less well-defined, question. I tend to agree that they are closed systems relative to their corpus. But then, aren't we? I feel like the aperture for true novelty to enter is vanishingly small, and cultures put a premium on it vis-a-vis the arts, technological innovation, etc. Almost every human endeavor is just copying and iterating on prior examples.

                                                                            • beeflet

                                                                              today at 1:46 AM

                                                                              Almost all of the work in making a new operating system or a gameboy emulator or something is in characterizing the problem space and defining the solution. How do you know what such and such instruction does? What is the ideal way to handle this memory structure here? You know, knowledge you gain from spending time tracking down a specific bug or optimizing a subroutine.

                                                                              When I create something, it's an exploratory process. I don't just guess what I am going to do based on my previous step and hope it comes out good on the first try. Let's say I decide to make a car with 5 wheels. I would go through several chassis designs, different engine configurations until I eventually had something that works well. Maybe some are too weak, some too expensive, some are too complicated. Maybe some prototypes get to the physical testing stage while others don't. Finally, I publish this design for other people to work on.

                                                                              If you ask the LLM to work on a novel concept it hasn't been trained on, it will usually spit out some nonsense that either doesn't work or works poorly, or it will refuse to provide a specific enough solution. If it has been trained on previous work, it will spit out something that looks similar to the solved problem in its training set.

                                                                              These AI systems don't undergo the process of trial and error that suggests it is creating something novel. Its process of creation is not reactive with the environment. It is just cribbing off of extant solutions it's been trained on.

                                                                                • vidarh

                                                                                  today at 1:51 AM

                                                                                  I'm literally watching Claude Code "undergo the process of trial and error" in another window right now.

                                                                              • imiric

                                                                                today at 12:19 AM

                                                                                Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

                                                                                If yes, why aren't we seeing glimpses of such genius today? If we've truly invented artificial intelligence, and on our way to super and general intelligence, why aren't we seeing breakthroughs in all fields of science? Why are state of the art applications of this technology based on pattern recognition and applied statistics?

                                                                                Can we explain this by saying that we're only a few years into it, and that it's too early to expect fundamental breakthroughs? And that by 2027, or 2030, or surely by 2040, all of these things will suddenly materialize?

                                                                                I have my doubts.

                                                                                  • famouswaffles

                                                                                    today at 12:55 AM

                                                                                    >Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

                                                                                    Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

                                                                                    >If yes, why aren't we seeing glimpses of such genius today?

                                                                                    Again, most humans can't actually do any of the things you just listed. Only our most intelligent can. LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways lag behind the average human) in most respects, so why would you expect such genius now?

                                                                                      • lelanthran

                                                                                        today at 10:41 AM

                                                                                        > Only a small percentage of humanity are/were capable of doing any of these. And they tend to be the best of the best in their respective fields.

                                                                                        Sure, agreed, but the difference between a small percentage and zero percentage is infinite.

                                                                                        • beeflet

                                                                                          today at 1:50 AM

                                                                                          Were they the best of the best? Or were they just at the right place and time to be exposed to a novel idea?

                                                                                          I am skeptical of the claim that you need a 140 IQ to make scientific breakthroughs, because you don't need a 140 IQ to understand special relativity. It is a matter of motivation and exposure to new information. The vast majority of the population doesn't benefit from working in some niche field of physics in the first place.

                                                                                          Perhaps LLMs will never be at the right place and the right time because they are only trained on ideas that already exist.

                                                                                            • famouswaffles

                                                                                              today at 2:06 AM

                                                                                              >Were they the best of the best? or were they just at the right place and time to be exposed to a novel idea?

                                                                                              It's not an "or" but an "and". Being at the right place and time is a necessary precondition, but it's not sufficient. Newton stood on the shoulders of giants like Kepler and Galileo, and Einstein built upon the work of Maxwell and Lorentz. The key question is, why did they see the next step when so many of their brilliant contemporaries, who had the exact same information and were in similar positions, did not? That's what separates the exceptional from the rest.

                                                                                              >I am skeptical of this claim that you need a 140IQ to make scientific breakthroughs, because you don't need a 140IQ to understand special relativity.

                                                                                              There is a pretty massive gap between understanding a revolutionary idea and originating it. It's the difference between being the first person to summit Everest without a map, and a tourist who takes a helicopter to the top to enjoy the view. One requires genius and immense effort; the other requires following instructions. Today, we have a century of explanations, analogies, and refined mathematics that make relativity understandable. Einstein had none of that.

                                                                                                • Kim_Bruning

                                                                                                  today at 6:11 AM

                                                                                                  It's entirely plausible that sometimes one genius sees the answer all alone (I'm sure it happens sometimes), but it's also definitely a common theme that many people, or a subset of society as a whole, may start having similar ideas all around the same time. In many cases where a breakthrough is attributed to one person, if you look more closely you'll often see some sort of team effort or societal groundswell.

                                                                                          • imiric

                                                                                            today at 3:05 AM

                                                                                            > LLMs are great, but they're not (yet?) as capable as our best and brightest (and in many ways, lag behind the average human) in most respects, so why would you expect such genius now ?

                                                                                            I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius. Something that points in the direction that all tech CEOs are claiming we're headed in. So far I haven't seen any of this yet.

                                                                                                And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans. They contain the sum of human knowledge, far more than any individual human in history. Are they not intelligent, capable of thinking and reasoning? Are we not on the verge of superintelligence[1]?

                                                                                            > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

                                                                                            If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

                                                                                            And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

                                                                                            [1]: https://blog.samaltman.com/the-gentle-singularity

                                                                                              • famouswaffles

                                                                                                today at 3:49 AM

                                                                                                >I'm not expecting novel scientific theories today. What I am expecting are signs and hints of such genius.

                                                                                                    Like I said, what exactly would you expect to see with the capabilities that exist today? It's not a gotcha; it's a genuine question.

                                                                                                >And, I'm sorry, I don't buy the excuse that these tools are not "yet" as capable as the best and brightest humans.

                                                                                                There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do. You can't slot an LLM in place of most knowledge workers and expect everything to be fine and dandy. There's no ambiguity on that.

                                                                                                >They contain the sum of human knowledge, far more than any individual human in history.

                                                                                                    It's not really the total sum of human knowledge, but let's set that aside. Yeah, so? Einstein, Newton, von Neumann: none of these guys were privy to some super-secret knowledge their contemporaries weren't, so it's obviously not simply a matter of more knowledge.

                                                                                                >Are they not intelligent, capable of thinking and reasoning?

                                                                                                    Yeah, they are. And so are humans. So were the peers of all those guys. So why are only a few able to see the next step? It's not just about knowledge, and intelligence lives in degrees; it is a gradient.

                                                                                                >If all this is true, surely we should be seeing incredible results produced by this technology. If not by itself, then surely by "amplifying" the work of the best and brightest humans.

                                                                                                Yeah and that exists. Terence Tao has shared a lot of his (and his peers) experiences on the matter.

                                                                                                https://mathstodon.xyz/@tao/115306424727150237

                                                                                                https://mathstodon.xyz/@tao/115420236285085121

                                                                                                https://mathstodon.xyz/@tao/115416208975810074

                                                                                                >And yet... All we have to show for it are some very good applications of pattern matching and statistics, a bunch of gamed and misleading benchmarks and leaderboards, a whole lot of tech demos, solutions in search of a problem, and the very real problem of flooding us with even more spam, scams, disinformation, and devaluing human work with low-effort garbage.

                                                                                                Well it's a good thing that's not true then

                                                                                                  • imiric

                                                                                                    today at 10:20 AM

                                                                                                    > Like I said, what exactly would you be expecting to see with the capabilities that exist today ?

                                                                                                    And like I said, "signs and hints" of superhuman intelligence. I don't know what that looks like since I'm merely human, but I sure know that I haven't seen it yet.

                                                                                                    > There's nothing to buy or not buy. They simply aren't. They are unable to do a lot of the things these people do.

                                                                                                    This claim is directly opposed to claims by Sam Altman and his cohort, which I'll repeat:

                                                                                                    > we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them.

                                                                                                    So which is it? If they're "smarter than people in many ways", where is the product of that superhuman intelligence? If they're able to "significantly amplify the output of people using them", then all of humanity should be empowered to produce incredible results that were previously only achievable by a limited number of people. In hands of the best and brightest humans, it should empower them to produce results previously unreachable by humanity.

                                                                                                        Yet all positive applications of this technology show that it excels at finding and producing data patterns, and nothing more than that. Those experience reports by Terence Tao are prime examples of this. The system was fed a lot of contextual information and, after being coaxed by highly intelligent humans, was able to find and produce patterns that were difficult for humans to see. This is hardly the showcase of intelligence that you and others think it is, and those others include highly intelligent people, some of whom have a lot to gain from pushing this narrative.

                                                                                                    We have seen similar reports by programmers as well[1]. Yet I'm continually amazed that these highly intelligent people are surprised that a pattern finding and producing system was able to successfully find and produce useful patterns, and then interpret that as a showcase of intelligence. So much so that I start to feel suspicious about the intentions and biases of those people.

                                                                                                        To be clear: I'm not saying that these systems can't be very useful in the right hands, or that they won't potentially revolutionize many industries. Ultimately, many real-world problems can be modeled as statistical problems where a pattern recognition system can excel. What I am saying is that there's a very large gap between the utility of such tools and the extraordinary claims that they have intelligence, let alone superhuman and general intelligence. So far I have seen no evidence of the latter, despite the overwhelming marketing euphoria we're going through.

                                                                                                    > Well it's a good thing that's not true then

                                                                                                    In the world outside of the "AI" tech bubble, that is very much the reality.

                                                                                                    [1]: https://news.ycombinator.com/item?id=45784179

                                                                                        • tanseydavid

                                                                                          today at 12:32 AM

                                                                                          How about "Protein Folding"?

                                                                                            • imiric

                                                                                              today at 12:38 AM

                                                                                              A great use case for pattern recognition.

                                                                              • kace91

                                                                                yesterday at 11:56 PM

                                                                                >I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge

                                                                                Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.

                                                                                  • throwup238

                                                                                    today at 12:14 AM

                                                                                    I use the Portal de Archivos Españoles [1] for Spanish colonial documents. Each country has its own archive, but the Spanish one has the most content (35 million digitized pages).

                                                                                    The hard part is knowing where to look, since most of the images haven’t gone through HTR/OCR or indexing, so you have to understand Spanish colonial administration and go through the collections to find stuff.

                                                                                    [1] https://pares.cultura.gob.es/pares/en/inicio.html

                                                                                      • throwout4110

                                                                                        today at 12:28 AM

                                                                                        Want to collab on a database and some clustering and analysis? I’m a data scientist at FAIR with an interest in antiquarian docs and books.

                                                                                          • vintermann

                                                                                            today at 9:06 AM

                                                                                            You should maybe reach out to the author of this blog post, professor Mark Humphries. Or to the genealogy communities; we regularly struggle with handwritten historical texts that no public AI model can make a dent in.

                                                                                            • throwup238

                                                                                              today at 1:52 AM

                                                                                              Sadly I'm just an amateur armchair historian (at best), so I doubt I'd be of much help. I'm mostly only doing the translation for my own edification.

                                                                                                • cco

                                                                                                  today at 7:44 AM

                                                                                                  You may be surprised (or not?) at how many important scientific and historical works are done by armchair practitioners.

                                                                                              • rmonvfer

                                                                                                today at 12:39 AM

                                                                                                Spaniard here. Let me know if I can somehow help navigate all of that. I’m very interested in history and everything related to the 1400-1500 period (although I’m not an expert by any definition) and I’d love to see what modern technology could do here, especially OCR and VLMs.

                                                                                    • Aperocky

                                                                                      today at 1:43 AM

                                                                                      > This I’m a lot more skeptical of. The linked twitter post just looks like something it would replicate via HTML/CSS/JS. Whats the kernel look like?

                                                                                      Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.

                                                                                      • jvreeland

                                                                                        yesterday at 10:32 PM

                                                                                        I'd love to find more info on this, but from what I can find it seems to be making webpages that look like those products, and can seemingly "run python" or "emulate a game". But writing something that, based on all of GitHub, can approximate an iPhone or an emulator in JavaScript/CSS/HTML is very, very different from writing an OS.

                                                                                        • otherdave

                                                                                          today at 1:24 PM

                                                                                          Where can I find these Conquistador documents? Sounds like something I might like to read and explore.

                                                                                          • WhyOhWhyQ

                                                                                            yesterday at 10:22 PM

                                                                                            "> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts."

                                                                                            Wow I'm doing it way wrong. How do I get the good stuff?

                                                                                              • zer00eyz

                                                                                                yesterday at 10:29 PM

                                                                                                You're not.

                                                                                                I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless just decorate it up with extra layers of frosting.

                                                                                                You can make something that looks like a cake but would not be good to eat.

                                                                                                The cake, sometimes, is a lie. And in this case, so are likely most of these results... or they are the actual source code of some other project just regurgitated.

                                                                                                  • hinkley

                                                                                                    yesterday at 10:51 PM

                                                                                                    We got the results back. You are a horrible person. I’m serious, that’s what it says: ā€œHorrible person.ā€

                                                                                                    We weren’t even testing for that.

                                                                                                      • joshstrange

                                                                                                        yesterday at 11:01 PM

                                                                                                        Source: Portal 2, you can see the line and listen to it here (last one in section): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                          • chihuahua

                                                                                                            today at 6:52 AM

                                                                                                            I'd really like Alexa+ to have the voice of GLaDOS.

                                                                                                            • hinkley

                                                                                                              yesterday at 11:19 PM

                                                                                                              I figured it was appropriate given the context.

                                                                                                              I’m still amazed that game started as someone’s school project. Long live the Orange Box!

                                                                                                          • erulabs

                                                                                                            yesterday at 11:00 PM

                                                                                                            Well, what does a neck-bearded old engineer know about fashion? He probably - Oh, wait. It's a she. Still, what does she know? Oh wait, it says she has a medical degree. In fashion! From France!

                                                                                                              • joshstrange

                                                                                                                yesterday at 11:02 PM

                                                                                                                If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                                  • fragmede

                                                                                                                    yesterday at 11:49 PM

                                                                                                                    Just because "Die motherfucker die motherfucker die" appeared in a song once doesn't mean it's not also death threat when someone's pointing a gun at you and saying that.

                                                                                                                      • scubbo

                                                                                                                        today at 12:17 AM

                                                                                                                        ...what?

                                                                                                                          • fragmede

                                                                                                                            today at 1:38 AM

                                                                                                                            hinkley wrote:

                                                                                                                            > We got the results back. You are a horrible person. I’m serious, that’s what it says: ā€œHorrible person.ā€

                                                                                                                            > We weren’t even testing for that.

                                                                                                                            joshstrange then wrote:

                                                                                                                            > If you want to listen to the line from Portal 2 it's on this page (second line in the section linked): https://theportalwiki.com/wiki/GLaDOS_voice_lines_(Portal_2)...

                                                                                                                            as if the fact that the words that hinkley wrote are from a popular video game excuses the fact that hinkley just also called zer00eyz horrible.

                                                                                                                              • hinkley

                                                                                                                                today at 3:00 AM

                                                                                                                                So if two sentences that make no sense to you sandwich one that does, you should totally accept the middle one at face value.

                                                                                                                                K.

                                                                                                • smusamashah

                                                                                                  today at 1:38 AM

                                                                                                  > Whats the kernel look like?

                                                                                                  Those clones are all HTML/CSS, same for game clones made by Gemini.

                                                                                                  • nestorD

                                                                                                    yesterday at 10:27 PM

                                                                                                    Oh! That's a nice use-case and not too far from stuff I have been playing with! (happily I do not have to deal with handwriting, just bad scans of older newspapers and texts)

                                                                                                    I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.

                                                                                                    The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

                                                                                                      • throwup238

                                                                                                        yesterday at 11:21 PM

                                                                                                        > The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

                                                                                                        What does that look like? How well does it work?

                                                                                                        I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.
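
                                                                                                        The loop itself is nothing fancy; a simplified sketch of the idea (with `llm` and `tools` as stand-in callables, not my actual code):

                                                                                                            from dataclasses import dataclass, field
                                                                                                            from typing import Callable, Optional

                                                                                                            @dataclass
                                                                                                            class Step:
                                                                                                                text: str                       # the model's reply for this turn
                                                                                                                cost_usd: float                 # what this call cost
                                                                                                                tool: Optional[str] = None      # tool the model asked to run, if any
                                                                                                                tool_args: dict = field(default_factory=dict)

                                                                                                            def research_loop(task: str,
                                                                                                                              llm: Callable[[list[str]], Step],
                                                                                                                              tools: dict[str, Callable[..., str]],
                                                                                                                              budget_usd: float = 5.0) -> str:
                                                                                                                """Keep the agent working until it finishes or the budget runs out."""
                                                                                                                context, spent = [task], 0.0
                                                                                                                while spent < budget_usd:
                                                                                                                    step = llm(context)                          # one model call
                                                                                                                    spent += step.cost_usd
                                                                                                                    if step.tool is None:                        # the model says it is done
                                                                                                                        return step.text
                                                                                                                    result = tools[step.tool](**step.tool_args)  # e.g. search, fetch a document
                                                                                                                    context.append(result[:2000])                # crude truncation for now
                                                                                                                return "Budget exhausted.\n" + "\n".join(context[1:])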

                                                                                                          • nestorD

                                                                                                            today at 2:43 AM

                                                                                                            I started with a UI that sounded like it was built along the same lines as yours, which had the advantage of letting me enforce a pipeline and exhaustive search (I don't want the 10 most promising documents, I want all of them).

                                                                                                            But I realized I was not using it much because it was so big and inflexible (plus I keep wanting to stamp out all the bugs, which I do not have the time to do on a hobby project). So I ended up extracting it into MCPs (equipped to do full-text search and download OCR from the various databases I care about) and AGENTS.md files (defining pipelines, as well as patterns for both searching behavior and reporting of results). I also put together a sub-agent for translation (cutting away all tools besides reading and writing files, and giving it some document-specific contextual information).

                                                                                                            That lets me use Claude Code and Codex CLI (which, anecdotally, I have found to be the better of the two for that kind of work; it seems to deal better with longer inputs produced by searches) as the driver, telling them what I am researching and maybe how I would structure the search, then letting them run in the background before checking their report and steering the search based on that.

                                                                                                            It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context), but I now find myself reaching for it regularly, and I polish out problems one at a time. The next goal is to add more data sources and to maybe unify things further.
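
                                                                                                            For a sense of what those MCPs amount to, here is a stripped-down sketch of the search one (assuming the official Python mcp SDK; the tool names and the actual database calls are placeholders):

                                                                                                                from mcp.server.fastmcp import FastMCP

                                                                                                                mcp = FastMCP("archive-search")

                                                                                                                @mcp.tool()
                                                                                                                def full_text_search(query: str, max_results: int = 100) -> list[dict]:
                                                                                                                    """Search the archive's full-text index; return every hit, not a top-k."""
                                                                                                                    ...  # call whichever database API backs this server

                                                                                                                @mcp.tool()
                                                                                                                def download_ocr(document_id: str) -> str:
                                                                                                                    """Fetch the raw OCR text for a single document."""
                                                                                                                    ...  # download and return the OCR text

                                                                                                                if __name__ == "__main__":
                                                                                                                    mcp.run()  # stdio transport, so Claude Code / Codex CLI can spawn it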

                                                                                                              • throwup238

                                                                                                                today at 4:09 AM

                                                                                                                > It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context)

                                                                                                                This has been the biggest problem for me too. I jokingly call it the LLM halting problem because it never knows the proper time to stop working on something, finishing way too fast without going through each item in the list. That's why I've been doing my own custom orchestration, drip-feeding it results with a mix of summarization and content extraction to keep the context from different documents chained together.

                                                                                                                Especially working with unindexed content like colonial documents where I’m searching through thousands of pages spread (as JPEGs) over hundreds of documents for a single one that’s relevant to my research, but there are latent mentions of a name that ties them all together (like a minor member of an expedition giving relevant testimony in an unrelated case). It turns into a messy web of named entity recognition and a bunch of more classical NLU tasks, except done with an LLM because I’m lazy.
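
                                                                                                                The drip-feed part is conceptually the simplest piece; a toy sketch (with `llm` as a stand-in callable):

                                                                                                                    def drip_feed(documents: list[str], question: str, llm, batch_size: int = 3) -> str:
                                                                                                                        """Summarize a few documents at a time, carrying a compact running
                                                                                                                        digest forward so names and facts from earlier documents stay chained."""
                                                                                                                        digest = ""
                                                                                                                        for i in range(0, len(documents), batch_size):
                                                                                                                            batch = "\n---\n".join(documents[i:i + batch_size])
                                                                                                                            digest = llm(
                                                                                                                                f"Running digest so far:\n{digest}\n\n"
                                                                                                                                f"New documents:\n{batch}\n\n"
                                                                                                                                f"Update the digest, keeping every name, date, and fact relevant to: {question}"
                                                                                                                            )
                                                                                                                        return digest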

                                                                                                    • jchw

                                                                                                      today at 7:07 AM

                                                                                                      I'm surprised people didn't click through to the tweet.

                                                                                                      https://x.com/chetaslua/status/1977936585522847768

                                                                                                      > I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it

                                                                                                      And of course

                                                                                                      > 3D design software, Nintendo emulators

                                                                                                      No clue what these refer to, but to be honest it sounds like they've mostly made incremental improvements to one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Gameboy or NES emulator working well enough to boot Tetris or Mario: while it is a decent chunk of code to get things going, there's an absolute boatload of code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Gameboy emulators from scratch myself.)
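
                                                                                                      To make "lower than you might imagine" concrete: the heart of any such emulator is a fetch-decode-execute loop over an opcode table. A toy sketch (illustrative only, nowhere near a real NES):

                                                                                                          class CPU6502:
                                                                                                              """Toy 6502 core: fetch an opcode, decode it, execute it, repeat."""
                                                                                                              def __init__(self, memory: bytearray):
                                                                                                                  self.mem, self.pc, self.a = memory, 0x8000, 0  # PRG ROM commonly maps at $8000

                                                                                                              def step(self) -> None:
                                                                                                                  op = self.mem[self.pc]; self.pc += 1
                                                                                                                  if op == 0xA9:                   # LDA #imm: load accumulator
                                                                                                                      self.a = self.mem[self.pc]; self.pc += 1
                                                                                                                  elif op == 0x8D:                 # STA abs: store accumulator
                                                                                                                      addr = self.mem[self.pc] | (self.mem[self.pc + 1] << 8)  # little-endian
                                                                                                                      self.pc += 2
                                                                                                                      self.mem[addr] = self.a
                                                                                                                  elif op == 0xEA:                 # NOP
                                                                                                                      pass
                                                                                                                  else:
                                                                                                                      raise NotImplementedError(f"opcode {op:#04x}")

                                                                                                      The real work is filling in the other ~150 opcodes plus the PPU and timing, which is exactly the kind of well-documented grind there's a boatload of prior art for.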

                                                                                                      Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel, and if we can find a way to tame AI models to make them trustworthy enough for some tasks, it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening, though I think it still mostly isn't happening for code, at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.

                                                                                                      But I will say that people are a little crazy sometimes. Yes, it is very fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a web app that is mostly correct, apparently without any feedback, like being able to actually run the application or even see editor errors, at least as far as we know. This is genuinely impressive and interesting, and not the aspect that I think anyone seeks to downplay. However, consider this: as relatively simple as an NES is compared to even moderately newer machines, to make an NES emulator you have to know how an NES works and even have strategies for how to emulate it, which don't necessarily follow from just reading specifications or even NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation on the Internet about the NES hardware and inner workings, as well as the 6502, means that LLMs have a lot of training data to help them out.

                                                                                                      I think these tasks that are extremely well covered in the training data give people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely do it. Not sure what to pick, but let's say SEGA's VMU units for the Dreamcast: a very small, simple device, and I reckon there should be information about it online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. But since a very large portion of work in the world is not very novel, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.

                                                                                                      I believe many of the reasons people scoff at AI are fairly valid even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that users submitted, which made them rich in the first place, just to crap on those same users in the process.

                                                                                                      But I think that LLMs are useful, and useful LLMs could definitely be created ethically; it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well-covered the machine is in their training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple common file format so I could debug something... It probably only saved me about 15-30 minutes, but still, in that case I truly believe it did save me time, as I didn't spend any time tweaking the result: it did compile, and it did work correctly.

                                                                                                      It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from a PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge rather than very good intelligence. If we didn't have much written online about the NES or emulators at all, then it would be truly world-bending to have an AI model figure out everything it needs to know to write one on-the-fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show off that they are capable of novel thought because there are so many other humans that had to do novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.

                                                                                                        • stOneskull

                                                                                                          today at 10:34 AM

                                                                                                          I think Google is going to repeat history with Gemini... as in, ChatGPT, Grok, etc. will be like AltaVista, Lycos, etc.

                                                                                                      • Footprint0521

                                                                                                        today at 12:12 AM

                                                                                                        Bro, split that up: use LLMs for transcription first, then take that and translate it.
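
                                                                                                        Something like this, roughly (a sketch using the google-genai Python client; the model choice and prompts are just placeholders):

                                                                                                            # Two passes: transcribe the page image first, then translate the transcription.
                                                                                                            from google import genai
                                                                                                            from google.genai import types

                                                                                                            client = genai.Client()  # expects GEMINI_API_KEY in the environment
                                                                                                            page = types.Part.from_bytes(data=open("page.jpg", "rb").read(),
                                                                                                                                         mime_type="image/jpeg")

                                                                                                            transcript = client.models.generate_content(
                                                                                                                model="gemini-2.5-pro",
                                                                                                                contents=[page, "Transcribe this page exactly as written. Do not translate."],
                                                                                                            ).text

                                                                                                            translation = client.models.generate_content(
                                                                                                                model="gemini-2.5-pro",
                                                                                                                contents=[f"Translate this 16th-century Spanish transcription into English:\n\n{transcript}"],
                                                                                                            ).text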

                                                                                                        • ninetyninenine

                                                                                                          today at 7:30 AM

                                                                                                          I'm skeptical because my entire identity is basically built around being a software engineer and thinking my IQ and intelligence is higher than other people. If this AI stuff is real then it basically destroys my entire identity so I choose the most convenient conclusion.

                                                                                                          Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence and I feel the need to inform them of things that are obvious: AI is not a human, it does not have emotions. It's just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.

                                                                                                            • hju22_-3

                                                                                                              today at 8:17 AM

                                                                                                              > [...] my entire identity is basically built around [...] thinking my IQ and intelligence is higher than other people.

                                                                                                              Well, there's your first problem.

                                                                                                                • vintermann

                                                                                                                  today at 9:08 AM

                                                                                                                  I don't know, that's commendable self-insight, it's true of lots and lots of people but there are few who would admit it!

                                                                                                                    • ninetyninenine

                                                                                                                      today at 1:47 PM

                                                                                                                      I am unique. Totally. It is not like HN is flooded with cognition or psychology or IQ articles every other hour. Not at all. And whenever one shows up, you do not immediately get a parade of people diagnosing themselves with whatever the headline says. Never happens. You post something about slow thinking and suddenly half the thread whispers ā€œthat is literally me.ā€ You post something about fast thinking and the other half says ā€œfinally someone understands my brain.ā€ You post something about overthinking and everyone shows up with ā€œwow I feel so seen.ā€ You post something about attention and now the entire site has ADHD.

                                                                                                                      But yes. I am the unique one.

                                                                                                              • twoodfin

                                                                                                                today at 12:16 PM

                                                                                                                This kind of comment certainly shows that no organic stochastic parrots post to hn threads!

                                                                                                          • elphinstone

                                                                                                            today at 3:37 AM

                                                                                                            I read the whole article, but have never tried the model. Looking at the input document, I believe the model saw enough of a space between the 14 and 5 to simply treat it that way. I saw the space too. Impressive, but it's a leap to say it saw 145 then used higher order reasoning to correct 145 to 14 and 5.

                                                                                                              • Coeur

                                                                                                                today at 7:08 AM

                                                                                                                I also read the whole article, and this behaviour that the author is most excited about only happened once. For a process that inherently has some randomness about it, I feel it's too early to be this excited.

                                                                                                                  • afro88

                                                                                                                    today at 8:23 AM

                                                                                                                    Yep. A lot of things looked magical in the GPT-4 days. Eventually you realised it did them by chance and, more often than not, got them wrong.

                                                                                                            • roywiggins

                                                                                                              today at 3:00 AM

                                                                                                              My task today for LLMs was "can you tell if this MRI brain scan is facing the normal way", and the answer was: no, absolutely not. Opus 4.1 succeeds more often than chance, but still not nearly often enough to be useful. They all cheerfully hallucinate the wrong answer, confidently explaining the anatomy they are looking for, but wrong. Maybe Gemini 3 will pull it off.

                                                                                                              Now, Claude did vibe code a fairly accurate solution to this using more traditional techniques. That is very impressive on its own, but I'd hoped to be able to just shovel the problem into the VLM and be done with it. It's kind of crazy that we have "AIs" that can't tell even roughly what the orientation of a brain scan is (something a five year old could probably learn to do) but can vibe code something using traditional computer vision techniques to do it.

                                                                                                              I suppose it's not too surprising, a visually impaired programmer might find it impossible to do reliably themselves but would code up a solution, but still: it's weird!
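
                                                                                                              For the curious, one classical trick is shaped roughly like this (a sketch of the general idea, not the code Claude actually wrote): an axial slice is roughly left-right symmetric, so in the standard orientation the image should match its left-right mirror much better than its top-bottom mirror.

                                                                                                                  import cv2
                                                                                                                  import numpy as np

                                                                                                                  def midline_is_vertical(path: str) -> bool:
                                                                                                                      """True if the slice matches its left-right mirror better than its
                                                                                                                      top-bottom one, i.e. the interhemispheric midline runs vertically.
                                                                                                                      (A 180-degree flip also passes this test, so it needs a separate check.)"""
                                                                                                                      img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
                                                                                                                      img = cv2.GaussianBlur(img, (5, 5), 0)  # tame noise before scoring
                                                                                                                      lr = np.corrcoef(img.ravel(), np.fliplr(img).ravel())[0, 1]
                                                                                                                      tb = np.corrcoef(img.ravel(), np.flipud(img).ravel())[0, 1]
                                                                                                                      return lr > tb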

                                                                                                                • IanCal

                                                                                                                  today at 2:03 PM

                                                                                                                  Most models don’t have good spatial information from the images. Gemini models do preprocessing and so are typically better for that. It depends a lot on how things get segmented though.

                                                                                                                  • chrischen

                                                                                                                    today at 3:46 AM

                                                                                                                    But these models are more like generalists no? Couldn’t they simply be hooked up to more specialized models and just defer to them the way coding agents now use tools to assist?

                                                                                                                    • moritonal

                                                                                                                      today at 11:23 AM

                                                                                                                      That's a fairly unfair comparison. Did you include in the prompt a basic set of instructions about which way is "correct" and what to look for?

                                                                                                                        • roywiggins

                                                                                                                          today at 2:17 PM

                                                                                                                          I didn't, but they all seemed to know what to look for, they wrote explanations of what they were looking for, which were generally correct enough. They still got the answer wrong.

                                                                                                                      • hopelite

                                                                                                                        today at 3:40 AM

                                                                                                                        What is the ā€œnormalā€ way? Is that defined in a technical specification? Did you provide the definition/description of what you mean by ā€œnormalā€?

                                                                                                                        I would not have expected a language model to perform well on what sounds like a computer vision problem. Even if it was agentic: just as you imply a five year old would need to learn how to do it, so too an AI system would need to be trained, or at the very least be provided with a description of what it is looking at.

                                                                                                                        Imagine you took an MRI brain scan back in time and showed it to a medical Doctor in even the 1950s or maybe 1900. Do you think they would know what the normal orientation is, let alone what they are looking at?

                                                                                                                        I am a bit confused and also interested in how people are interacting with AI in general, it really seems to have a tendency to highlight significant holes in all kinds of human epistemological, organizational, and logical structures.

                                                                                                                        I would suggest maybe you think of it as a kind of child, and with that, you would need to provide as much context and exact detail about the requested task or information as possible. This is what context engineering (are we still calling it that?) concerns itself with.

                                                                                                                          • roywiggins

                                                                                                                            today at 2:17 PM

                                                                                                                            The thing is that the models absolutely do know what the standard orientation is for a scan. They respond extensively about what they're looking for and what the correct orientation would be, more or less accurately. They are aware.

                                                                                                                            They then give the wrong answer, hallucinating anatomical details in the wrong place, etc.

                                                                                                                    • efitz

                                                                                                                      yesterday at 10:50 PM

                                                                                                                      I haven’t seen this new Google model, but now I must try it out.

                                                                                                                      I will say that other frontier models are starting to surprise me with their reasoning/understanding- I really have a hard time making (or believing) the argument that they are just predicting the next word.

                                                                                                                      I’ve been using Claude Code heavily since April; Sonnet 4.5 frequently surprises me.

                                                                                                                      Two days ago I told the AI to read all the documentation from my 5 projects related to a tool I’m building, and create a wiki, focused on audience and task.

                                                                                                                      I'm hand reviewing the 50 wiki pages it created, but overall it did a great job.

                                                                                                                      I got frustrated about one issue: I have a GitHub issue to create a way to integrate with issue trackers (like Jira), but it's TODO, and the AI claimed on the wiki's home page that we had issue tracker integration. It created a page for it and everything; I figured it was hallucinating.

                                                                                                                      I went to edit the page and replace it with placeholder text and was shocked that the LLM had (unprompted) figured out how to use existing features to integrate with issue trackers, and wrote sample code for GitHub, Jira and Slack (notifications). That truly surprised me.

                                                                                                                        • schiffern

                                                                                                                          today at 2:36 AM

                                                                                                                          > I really have a hard time making (or believing) the argument that they are just predicting the next word.
                                                                                                                          It's true, but by the same token our brain is "just" thresholding spike rates.

                                                                                                                          • astrange

                                                                                                                            yesterday at 11:51 PM

                                                                                                                            Predicting the next word is the interface, not the implementation.

                                                                                                                            (It's a pretty constraining interface though - the model outputs an entire distribution and then we instantly lose it by only choosing one token from it.)

                                                                                                                            • energy123

                                                                                                                              yesterday at 11:19 PM

                                                                                                                              Predicting the next word requires understanding; they're not separate things. If you don't know what comes after the next word, then you don't know what the next word should be. So the task implicitly forces a longer-horizon understanding of the future sequence.

                                                                                                                                • IAmGraydon

                                                                                                                                  yesterday at 11:37 PM

                                                                                                                                  This is utterly wrong. Predicting the next word requires a large sample of data made into a statistical model. It has nothing to do with "understanding", which implies it knows why rather than what.

                                                                                                                                    • orionsbelt

                                                                                                                                      yesterday at 11:49 PM

                                                                                                                                      Ilya Sutskever was on a podcast, saying to imagine a mystery novel where at the end it says "and the killer is: (name)". If it's just a statistical model generating the next most likely word, how can it do that in this case without some understanding of all the clues, etc.? A specific name is not statistically likely to appear.

                                                                                                                                        • nicpottier

                                                                                                                                          today at 3:45 AM

                                                                                                                                          I was once chatting with an author of books (very much an amateur) and he said he enjoyed writing because he liked discovering where the story goes. I.e., he starts out building characters and creating scenarios for them, and at some point the story kind of takes over; there is only one way a character can act based on what was previously written, but it wasn't preordained. That's why he liked it: it was a discovery to him.

                                                                                                                                          I'm not saying this is the right way to write a book but it is a way some people write at least! And one LLMs seem capable of doing. (though isn't a book outline pretty much the same as a coding plan and well within their wheelhouse?)

                                                                                                                                          • shwaj

                                                                                                                                            yesterday at 11:55 PM

                                                                                                                                            Can current LLMs actually do that, though? What Ilya posed was a thought experiment: if it could do that, then we would say that it has understanding. But AFAIK that is beyond current capabilities.

                                                                                                                                              • krackers

                                                                                                                                                today at 2:37 AM

                                                                                                                                                Someone should try it and create a new "mysterybench". Find all mystery novels written after LLM training cutoff, and see how many models unravel the mystery

                                                                                                                                            • squigz

                                                                                                                                              today at 12:31 AM

                                                                                                                                              This implies understanding of preceding tokens, no? GP was saying they have understanding of future tokens.

                                                                                                                                              • IAmGraydon

                                                                                                                                                today at 12:28 AM

                                                                                                                                                It can't do that without the answer to who did it being in the training data. I think the reason people keep falling for this illusion is that they can't really imagine how vast the training dataset is. In all cases where it appears to answer a question like the one you posed, it's regurgitating the answer from its training data in a way that creates an illusion of using logic to answer it.

                                                                                                                                                  • dyauspitr

                                                                                                                                                    today at 10:02 AM

                                                                                                                                                    That’s not true, at all.

                                                                                                                                                      • IAmGraydon

                                                                                                                                                        today at 2:12 PM

                                                                                                                                                        Please…go on.

                                                                                                                                                    • CamperBob2

                                                                                                                                                      today at 1:06 AM

                                                                                                                                                      > It can't do that without the answer to who did it being in the training data.

                                                                                                                                                      Try it. Write a simple original mystery story, and then ask a good model to solve it.

                                                                                                                                                      This isn't your father's Chinese Room. It couldn't solve original brainteasers and puzzles if it were.

                                                                                                                                              • Workaccount2

                                                                                                                                                today at 12:27 AM

                                                                                                                                                "Understanding" is just a trap to get wrapped up in. A word with no definition and no test to prove it.

                                                                                                                                                Whether or not the model are "understanding" is ultimately immaterial, as their ability to do things is all that matters.

                                                                                                                                                  • pinnochio

                                                                                                                                                    today at 12:51 AM

                                                                                                                                                    If they can't do things that require understanding, it's material, bub.

                                                                                                                                                    And just because you have no understanding of what "understanding" means, doesn't mean nobody does.

                                                                                                                                                      • red75prime

                                                                                                                                                        today at 2:00 AM

                                                                                                                                                        > doesn't mean nobody does

                                                                                                                                                        If it's not a functional understanding that allows one to replicate the functionality of understanding, is it real understanding?

                                                                                                                                                • astrange

                                                                                                                                                  yesterday at 11:52 PM

                                                                                                                                                  If you're claiming a transformer model is a Markov chain, this is easily disprovable by, eg, asking the model why it isn't a Markov chain!

                                                                                                                                                  But here is a really big one of those if you want it: https://arxiv.org/abs/2401.17377

                                                                                                                                                  • nl

                                                                                                                                                    today at 12:07 AM

                                                                                                                                                    Modern LLMs are post-trained for tasks other than next-word prediction.

                                                                                                                                                    They still output words, though (except for multi-modal LLMs), so that does involve next-word generation.

                                                                                                                                                    • dyauspitr

                                                                                                                                                      today at 10:00 AM

                                                                                                                                                      The line between understanding and ā€œlarge sample of data made into a statistical modelā€ is kind of fuzzy.

                                                                                                                                                  • HarHarVeryFunny

                                                                                                                                                    today at 12:49 AM

                                                                                                                                                    > Predicting the next word requires understanding

                                                                                                                                                    If we were talking about humans trying to predict the next word, that would be true.

                                                                                                                                                    There is no reason to suppose that an LLM is doing anything other than deep pattern prediction, pursuant to, and no better than needed for, next-word prediction.

                                                                                                                                                      • famouswaffles

                                                                                                                                                        today at 1:28 AM

                                                                                                                                                        There is plenty reason. This article is just one example of many. People bring it up because LLMs routinely do things we call reasoning when we see them manifest in other humans. Brushing it off as 'deep pattern prediction' is genuinely meaningless. Nobody who uses that phrase in that way can actually explain what they are talking about in a way that can be falsified. It's just vibes. It's an unfalsifiable conversation-stopper, not a real explanation. You can replace "pattern matching" with "magic" and the argument is identical because the phrase isn't actually doing anything.

                                                                                                                                                        A - A force is required to lift a ball

                                                                                                                                                        B - I see Human-N lifting a ball

                                                                                                                                                        C - Obviously, Human-N cannot produce forces

                                                                                                                                                        D - Forces are not required to lift a ball

                                                                                                                                                        Well sir, why are you so sure Human-N cannot produce forces? How is she lifting the ball? Well, of course, Human-N is just using s̶t̶a̶t̶i̶s̶t̶i̶c̶s̶ magic.

                                                                                                                                                          • energy123

                                                                                                                                                            today at 1:44 AM

                                                                                                                                                            Anything can be euphemized. Human intelligence is atoms moving around the brain. General relativity is writing on a piece of paper.

                                                                                                                                                              • famouswaffles

                                                                                                                                                                today at 1:54 AM

                                                                                                                                                                If you want to say human and LLM intelligence are both 'deep pattern prediction', then sure. But mostly, and certainly in the case I was replying to, people just use the phrase to draw an imaginary, unfalsifiable distinction between what LLMs do and what the super-special humans do.

                                                                                                                                                        • CamperBob2

                                                                                                                                                          today at 1:05 AM

                                                                                                                                                          How'd you do at the International Math Olympiad this year?

                                                                                                                                                            • cxvrfr

                                                                                                                                                              today at 9:04 AM

                                                                                                                                                              How would you do at multiplying 10,000 pairs of 100-digit numbers in a limited amount of time? We don't anthropomorphize calculators, though...
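
                                                                                                                                                              (For scale, a quick sketch of that task in Python, whose integers are arbitrary-precision; the pair count and digit length come straight from the comment above:)

                                                                                                                                                                  import random
                                                                                                                                                                  import time

                                                                                                                                                                  # 10,000 pairs of random 100-digit numbers
                                                                                                                                                                  pairs = [(random.randrange(10**99, 10**100), random.randrange(10**99, 10**100))
                                                                                                                                                                           for _ in range(10_000)]

                                                                                                                                                                  start = time.perf_counter()
                                                                                                                                                                  products = [a * b for a, b in pairs]  # exact big-integer multiplication
                                                                                                                                                                  print(f"multiplied {len(products)} pairs in {time.perf_counter() - start:.3f} s")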

                                                                                                                                                              • HarHarVeryFunny

                                                                                                                                                                today at 2:59 AM

                                                                                                                                                                I hear the LLM was able to parrot fragments of the stuff it was trained to memorize, and did very well.

                                                                                                                                                                  • CamperBob2

                                                                                                                                                                    today at 3:16 AM

                                                                                                                                                                    Yeah, that must be it.

                                                                                                                                                                      • cxvrfr

                                                                                                                                                                        today at 9:05 AM

                                                                                                                                                                        Well, being able to extrapolate solutions to "novel" mathematical exercises based on a very large sample of similar tasks in your dataset seems like a reasonable explanation.

                                                                                                                                                                        The question is how well it would do if it were trained without those samples.

                                                                                                                                                    • charcircuit

                                                                                                                                                      today at 1:58 AM

                                                                                                                                                      It's trying to maximize a reward function. It's not just predicting the next word.

                                                                                                                                                  • conception

                                                                                                                                                    yesterday at 10:52 PM

                                                                                                                                                    I will note that the 2.5 Pro preview… March? was maybe the best model I’ve used yet. The actual release model was… less. I suspect Google found the preview too expensive and optimized it down, but it was interesting to see there was some hidden horsepower there. Google has always been poised to be the AI leader/winner - excited to see if this is fluff, the real deal, or another preview that gets nerfed.

                                                                                                                                                      • muixoozie

                                                                                                                                                        today at 12:22 AM

                                                                                                                                                        Dunno if you're right, but I'd like to point out that I've been reading comments like these about every model since GPT-3. It's just starting to seem more likely to me to be cognitive bias than not.

                                                                                                                                                          • conception

                                                                                                                                                            today at 12:24 AM

                                                                                                                                                            I haven’t noticed the effect of things getting worse after a release, but 2.5’s abilities definitely got worse. Or perhaps they optimized for something else? Beyond that I haven’t noticed the usual "things got worse after release!" - except for when Sonnet had a bug for a month and GPT-5’s autorouter broke.

                                                                                                                                                              • muixoozie

                                                                                                                                                                today at 2:18 AM

                                                                                                                                                                Yea I don't know. I didn't mean to sound accusatory. I might very well be wrong.

                                                                                                                                                            • KaoruAoiShiho

                                                                                                                                                              today at 12:28 AM

                                                                                                                                                              Sometimes it is just bias, but for 2.5 Pro there were benchmarks showing the degradation (plus they changed the name every time, so it was obviously a different checkpoint or model).

                                                                                                                                                              • colordrops

                                                                                                                                                                today at 12:28 AM

                                                                                                                                                                Why would you assume cognitive bias? Any evidence? These things are indeed very expensive to run, and are often run at a loss. Wouldn't quantization or other tuning be just as reasonable an answer as cognitive bias? It's not like we are talking about reptilian aliens running the White House.

                                                                                                                                                                  • muixoozie

                                                                                                                                                                    today at 2:07 AM

                                                                                                                                                                    I'm just pointing out a personal observation. Completely anecdotal. FWIW, I don't strongly believe this. I have at least noticed a selection bias (maybe) in myself too, as recently as yesterday after GPT-5.1 was released. I asked Codex to do a simple change (less than 50 LOC) and it made an unrelated change, an early return statement, breaking a very simple state machine that goes from waiting -> evaluate -> done. However, I have to remind myself how often LLMs make dumb mistakes despite often seeming impressive.

                                                                                                                                                                      • oasisbob

                                                                                                                                                                        today at 2:52 AM

                                                                                                                                                                        That sounds more like availability bias, not selection bias.

                                                                                                                                                            • oasisbob

                                                                                                                                                              today at 2:55 AM

                                                                                                                                                              I noticed the degradation when Gemini stopped being a good research tool and started making me want to strangle it on a daily basis.

                                                                                                                                                              It's incredibly frustrating to have a model start to hallucinate sources and be incapable of revisiting its behavior.

                                                                                                                                                              It couldn't even understand that it was making up nonsensical RFC references.

                                                                                                                                                          • xx_ns

                                                                                                                                                            yesterday at 11:47 PM

                                                                                                                                                            Am I missing something here? Colonial merchant ledgers and 18th-century accounting practices have been extensively digitized and discussed in academic literature. The model has almost certainly seen examples where these calculations are broken down or explained. It could be interpolating from similar training examples rather than "reasoning."

                                                                                                                                                              • ceroxylon

                                                                                                                                                                today at 2:23 AM

                                                                                                                                                                The author claims that they tried to avoid that: "[. . .] we had to choose them carefully and experiment to ensure that these documents were not already in the LLM training data (full disclosure: we can’t know for sure, but we took every reasonable precaution)."

                                                                                                                                                                  • blharr

                                                                                                                                                                    today at 2:57 AM

                                                                                                                                                                    Even if that specific document wasn't in the training data, there could be many similar documents from others at the time.

                                                                                                                                                              • MagicMoonlight

                                                                                                                                                                today at 3:24 AM

                                                                                                                                                                It seems like a leap to assume it has done all sorts of complex calculations implicitly.

                                                                                                                                                                I looked at the image and immediately noticed that it is written as "14 5" in the original text. It doesn’t require calculation to guess that it might be 14 pounds 5 ounces rather than 145, especially since, presumably, that notation was used elsewhere in the document.

                                                                                                                                                                • gcanyon

                                                                                                                                                                  today at 1:51 AM

                                                                                                                                                                  > So that is essentially the ceiling in terms of accuracy.

                                                                                                                                                                  I think this is mistaken. I remember, maybe ten years ago, when speech-to-text models came out that could deal with background noise which, to my ear, sounded very much like straight pink noise, yet the model was able to transcribe the speech hidden within at a reasonable accuracy rate.

                                                                                                                                                                  So with handwritten text, the only prediction that makes sense to me is that we will (potentially) reach a state where the machine is probably more accurate than humans, even though we wouldn't be able to confirm it ourselves.

                                                                                                                                                                  But if multiple independent models, say, Gemini 5 and Claude 7, both agree on the result, and a human can only shrug and say, "might be," then we're at a point where the machines are probably superior at the task.
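
                                                                                                                                                                  (A minimal sketch of that agreement check, assuming model_a and model_b are hypothetical wrappers around two independent transcription APIs returning plain text:)

                                                                                                                                                                      # Accept lines where two independent models agree; flag the rest for a human.
                                                                                                                                                                      def transcribe_with_agreement(image, model_a, model_b):
                                                                                                                                                                          lines_a = model_a(image).splitlines()
                                                                                                                                                                          lines_b = model_b(image).splitlines()
                                                                                                                                                                          accepted, needs_review = [], []
                                                                                                                                                                          for a, b in zip(lines_a, lines_b):
                                                                                                                                                                              if a.strip() == b.strip():
                                                                                                                                                                                  accepted.append(a)            # independent agreement
                                                                                                                                                                              else:
                                                                                                                                                                                  needs_review.append((a, b))   # a human adjudicates these
                                                                                                                                                                          return accepted, needs_review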

                                                                                                                                                                    • regularfry

                                                                                                                                                                      today at 8:00 AM

                                                                                                                                                                      That depends on how good we get at interpretability. If the models can not only do the job but also are structured to permit an explanation of how they did it, we get the confirmation. Or not, if it turns out that the explanation is faulty.

                                                                                                                                                                  • pavlov

                                                                                                                                                                    yesterday at 10:46 PM

                                                                                                                                                                    I’ve seen those A/B choices on Google AI Studio recently, and there wasn’t a substantial difference between the outputs. It felt more like a different random seed for the same model.

                                                                                                                                                                    Of course it’s very possible my use case wasn’t terribly interesting so it wouldn’t reveal model differences, or that it was a different A/B test.

                                                                                                                                                                      • jeffbee

                                                                                                                                                                        yesterday at 11:02 PM

                                                                                                                                                                        For me they've been very similar, except in one case where I corrected it and on one side it doubled down on being objectively wrong, and on the other side it took my feedback and started over with a new line of thinking.

                                                                                                                                                                    • neom

                                                                                                                                                                      today at 12:07 AM

                                                                                                                                                                      I've been complaining on HN for some time now that my only real test of an LLM is whether it can help my poor wife with her research; she spends all day, every day, in small-town archives poring over 18th-century American historical documents. I thought maybe that day had come. I showed her the article and she said "good for him, I'm still not transcribing important historical documents with a chat bot and nor should he" - ha. If you wanna play around with some difficult stuff, here are some images from her work I've posted before: https://s.h4x.club/bLuNed45

                                                                                                                                                                        • Huppie

                                                                                                                                                                          today at 6:53 AM

                                                                                                                                                                          While it's of course a good thing to be critical, the author did provide some more context on the why and how of doing it with LLMs on the Hard Fork podcast today [0]: mostly as a way to see how these models _can_ help them with these tasks.

                                                                                                                                                                          I would recommend listening to their explanation; maybe it'll give more insight.

                                                                                                                                                                          Disclosure: After listening to the podcast and looking up and reading the article, I emailed @dang to suggest it go into the HN second chance pool. I'm glad more people enjoyed it.

                                                                                                                                                                          [0]: https://www.nytimes.com/2025/11/14/podcasts/hardfork-data-ce...

                                                                                                                                                                          • Workaccount2

                                                                                                                                                                            today at 12:23 AM

                                                                                                                                                                            People have had spotty access to this model (Gemini 3 Pro) for brief periods for a few weeks now, but it's strongly expected to be released next week, and definitely by year end.

                                                                                                                                                                              • neom

                                                                                                                                                                                today at 12:27 AM

                                                                                                                                                                                Oh, I didn't realize this wasn't 2.5 Pro (I skimmed, sorry) - I also haven't had time to run some of her docs on 5.1 yet. I should.

                                                                                                                                                                            • HDThoreaun

                                                                                                                                                                              today at 12:39 AM

                                                                                                                                                                              It doesn't have to be perfect to be useful. If it does a decent job and your wife reviews and edits, that will be much faster than doing the whole thing by hand. The only question is whether she can stay committed to perfection. I don't see the downside of trying it unless she's worried about getting lazy.

                                                                                                                                                                                • neom

                                                                                                                                                                                  today at 12:54 AM

                                                                                                                                                                                  I raised this point with her, she said there are times it would be ambiguous for both her and the model, and she thinks it would be dangerous for her to be influenced by it. I'm not a professional historical researcher so I'm not sure if her concern is valid or not.

                                                                                                                                                                                    • fooker

                                                                                                                                                                                      today at 3:16 AM

                                                                                                                                                                                      As a scientist, I don't think this is valid or useful. It's very much a first-year-PhD line of thought that academia stamps out of you.

                                                                                                                                                                                      This is the 'RE' in research: you specifically want to know and understand what others think of something by reading their papers. Scientific training slowly, laboriously prepares you to reason about something without being too influenced by it.

                                                                                                                                                                                      • HDThoreaun

                                                                                                                                                                                        today at 1:08 AM

                                                                                                                                                                                        I think there's a lot of meta-thought that deserves to be done about where these new tools fit. It is easy to offhandedly reject change, especially as a subject-matter expert who can feel they worked so hard at this and are now being replaced, so the work was for nothing. I really don't want to say your wife is wrong; she almost assuredly is not. But it is important to have a curious mindset when confronted with ideas you may be biased against. Then she can rest easy knowing she is doing her best to perfect her craft, right? Otherwise she might wake up one day feeling like symbolic-NLP researchers trying LLMs for the first time. Certainly a lot to consider.

                                                                                                                                                                                          • neom

                                                                                                                                                                                            today at 1:36 AM

                                                                                                                                                                                            I really appreciate your thoughtful reply. I try my best to be encouraging and educating without being preachy or condescending with my wife on this subject. I read HN, and I see the posts of folks in, frankly, what reads like anguish about having a tool replace their expertise. I feel really, sad? about it. It's interesting to be confronted with it here (a place I love!) and at home (a place I love!) in quite different contexts. I've also never been particularly good at becoming good at something, I can't do very much, and genAI is really exciting for me; I'm both drawn to and have love for experts, so... This whole thing generally has been keeping me up at night a bit, because I feel anguish for the anguish.

                                                                                                                                                                            • AaronNewcomer

                                                                                                                                                                              today at 3:41 AM

                                                                                                                                                                              The thinking models (especially OpenAI's o3) still seem to do by far the best at this task, as they look across the document to see how the writer formed certain letters in places where the word is clearer, and use that when they run into confusing words.

                                                                                                                                                                              I built a whole product around this: https://DocumentTranscribe.com

                                                                                                                                                                              But I imagine this will keep getting better and that excites me since this was largely built for my own research!

                                                                                                                                                                                • _giorgio_

                                                                                                                                                                                  today at 7:41 AM

                                                                                                                                                                                  I find Gemini 2.5 Pro, not Flash, way better than the ChatGPT models. I don't remember testing o3, though. Maybe it's o3-pro, one of the old, costly thinking models?

                                                                                                                                                                              • Grimblewald

                                                                                                                                                                                today at 9:26 AM

                                                                                                                                                                                I dunno, man, looks like Goodhart's law in action to me. That isn't to say the models won't be good at what is stated, but it does mean it might not signal a general improvement in competence; rather a targeted gain, with more general deficits rising up in untested/ignored areas, some of which may or may not be catastrophic. I guess we will see, but for now Imma keep my hype in the box.

                                                                                                                                                                                • jumploops

                                                                                                                                                                                  yesterday at 11:53 PM

                                                                                                                                                                                  This is exciting news, as I have some elegantly scribed family diaries from the 1800s that I can barely read (:

                                                                                                                                                                                  With that said, the writing here is a bit hyperbolic, as the advances seem like standard improvements, rather than a huge leap or final solution.

                                                                                                                                                                                    • red75prime

                                                                                                                                                                                      today at 2:08 AM

                                                                                                                                                                                      The statistics in the article rest on too few samples for a definitive conclusion, but expert-level WER (word error rate) looks like a huge leap.
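
                                                                                                                                                                                      (For reference, WER is the word-level edit distance between a hypothesis and a reference transcription, divided by the reference length; a minimal sketch:)

                                                                                                                                                                                          # WER = (substitutions + insertions + deletions) / reference length,
                                                                                                                                                                                          # via a standard Levenshtein DP over words.
                                                                                                                                                                                          def wer(reference: str, hypothesis: str) -> float:
                                                                                                                                                                                              r, h = reference.split(), hypothesis.split()
                                                                                                                                                                                              d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
                                                                                                                                                                                              for i in range(len(r) + 1):
                                                                                                                                                                                                  d[i][0] = i
                                                                                                                                                                                              for j in range(len(h) + 1):
                                                                                                                                                                                                  d[0][j] = j
                                                                                                                                                                                              for i in range(1, len(r) + 1):
                                                                                                                                                                                                  for j in range(1, len(h) + 1):
                                                                                                                                                                                                      cost = 0 if r[i - 1] == h[j - 1] else 1
                                                                                                                                                                                                      d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                                                                                                                                                                                                    d[i][j - 1] + 1,         # insertion
                                                                                                                                                                                                                    d[i - 1][j - 1] + cost)  # substitution
                                                                                                                                                                                              return d[len(r)][len(h)] / max(len(r), 1)

                                                                                                                                                                                          print(wer("the quick brown fox", "the quick brow fox"))  # 0.25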

                                                                                                                                                                                  • observationist

                                                                                                                                                                                    yesterday at 11:14 PM

                                                                                                                                                                                    This might just be a handcrafted prompt framework for handwriting recognition tied in with reasoning: do a rough pass, make assumptions and predictions, check them, and if they pass, use the degree of confidence in their passing to inform what the other characters might be, gradually fleshing out an interpretation of what was intended to be communicated.

                                                                                                                                                                                    If they could get this to occur naturally - with no supporting prompts, only one-shot prompting or one-shot reasoning - then it could extend to complex composition generally, which would be cool.
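
                                                                                                                                                                                    (A minimal sketch of the hypothesized loop; call_model is a hypothetical wrapper around any vision-LLM API, and the prompts and pass count are illustrative:)

                                                                                                                                                                                        # Rough pass, then repeated self-checks that use the confident words
                                                                                                                                                                                        # to constrain the uncertain ones.
                                                                                                                                                                                        def transcribe_iteratively(image, call_model, max_passes=3):
                                                                                                                                                                                            draft = call_model(image, "Transcribe this handwriting exactly. "
                                                                                                                                                                                                                      "Mark unclear words as [?word?].")
                                                                                                                                                                                            for _ in range(max_passes):
                                                                                                                                                                                                revised = call_model(image,
                                                                                                                                                                                                    "Draft transcription:\n" + draft + "\n"
                                                                                                                                                                                                    "Re-examine each [?word?] against how the same letter shapes "
                                                                                                                                                                                                    "appear in the confident words and against the document's "
                                                                                                                                                                                                    "context. Return only the corrected transcription.")
                                                                                                                                                                                                if revised == draft:   # converged: no corrections left
                                                                                                                                                                                                    break
                                                                                                                                                                                                draft = revised
                                                                                                                                                                                            return draft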

                                                                                                                                                                                      • terminalshort

                                                                                                                                                                                        today at 3:34 AM

                                                                                                                                                                                        I don't see how this performance could be anything like that. There is no way that Google included specialized system prompts with anything to do with converting shillings to pounds in their model.

                                                                                                                                                                                    • sriku

                                                                                                                                                                                      today at 9:14 AM

                                                                                                                                                                                      Regarding the "14 lb 5 oz" point in the article: a simpler explanation than the hypothesis that it back-calculated the weight is that there seems to be a space between the 14 and the 5 - i.e., it reads more like "14 5" than "145".

                                                                                                                                                                                        • sriku

                                                                                                                                                                                          today at 9:14 AM

                                                                                                                                                                                          Impressive performance, yes but is the article giving more credit than due?

                                                                                                                                                                                      • netsharc

                                                                                                                                                                                        yesterday at 10:29 PM

                                                                                                                                                                                        Author says "It is the most amazing thing I have seen an LLM do, and it was unprompted, entirely accidental." and then jumps back to the "beginning of the story". Including talking about a trip to Canada.

                                                                                                                                                                                        Skip to the section headed "The Ultimate Test" for the resolution of the clickbait of "the most amazing thing...". (According to him, it correctly interpreted a line in an 18th century merchant ledger using maths and logic)

                                                                                                                                                                                          • appreciatorBus

                                                                                                                                                                                            yesterday at 11:11 PM

                                                                                                                                                                                            The new model may or may not be great at handwriting but I found the author's constant repetition about how amazing it was irritating enough to stop reading and to wonder if the article itself was slop-written.

                                                                                                                                                                                            "users have reported some truly wild things" "the results were shocking" "the most amazing thing I have seen an LLM do" "exciting and frightening all at once" "the most astounding result I have ever seen" "made the hair stand up on the back of my neck"

                                                                                                                                                                                              • bitwize

                                                                                                                                                                                                today at 12:27 AM

                                                                                                                                                                                                You're never gonna believe #6!

                                                                                                                                                                                        • ghm2199

                                                                                                                                                                                          today at 12:07 AM

                                                                                                                                                                                          I just used AI Studio to recognize text from a relative's 60-day log of food ingested three times a day. I think I am using models/gemini-flash-latest, and it was shockingly good at recognizing text, far better than ChatGPT 5.1 or Claude's Sonnet (IIRC it's 4.5).

                                                                                                                                                                                          https://pasteboard.co/euHUz2ERKfHP.png

                                                                                                                                                                                          Its response, which I have captured here https://pasteboard.co/sbC7G9nuD9T9.png, is shockingly good. I could only spot 2 mistakes, and those seem to have been the ones that even I could not read, or found very difficult to make out.

                                                                                                                                                                                            • ghm2199

                                                                                                                                                                                              today at 12:08 AM

                                                                                                                                                                                              I basically fed it all 60 images 5 at a time and made a table out of them to correlate sugar levels <-> food and colocate it with the person's exercise routines. This is insane.
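
                                                                                                                                                                                              (A minimal sketch of that batching flow, assuming the google-generativeai Python SDK; the model name, file names, and prompt are illustrative:)

                                                                                                                                                                                                  import google.generativeai as genai
                                                                                                                                                                                                  from PIL import Image

                                                                                                                                                                                                  genai.configure(api_key="YOUR_API_KEY")          # assumed credential
                                                                                                                                                                                                  model = genai.GenerativeModel("gemini-flash-latest")

                                                                                                                                                                                                  paths = [f"log_{i:02d}.jpg" for i in range(60)]  # hypothetical file names
                                                                                                                                                                                                  rows = []
                                                                                                                                                                                                  for i in range(0, len(paths), 5):                # five images per request
                                                                                                                                                                                                      batch = [Image.open(p) for p in paths[i:i + 5]]
                                                                                                                                                                                                      resp = model.generate_content(
                                                                                                                                                                                                          batch + ["Transcribe each day's entries as CSV rows: "
                                                                                                                                                                                                                   "date,meal,food,sugar_level,exercise"])
                                                                                                                                                                                                      rows.append(resp.text)

                                                                                                                                                                                                  print("\n".join(rows))                           # paste into a sheet/table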

                                                                                                                                                                                          • koliber

                                                                                                                                                                                            today at 9:23 AM

                                                                                                                                                                                            It hasn't met my doctor.

                                                                                                                                                                                            • _giorgio_

                                                                                                                                                                                              today at 7:23 AM

                                                                                                                                                                                              Gemini 2.5 Pro is already incredibly good at handwriting recognition. It makes maybe one small mistake every 3 pages.

                                                                                                                                                                                              It has completely changed the way I work: it lets me write math and text by hand and then convert it with the Gemini app (or with a scanned PDF in the browser). You should really try it.

                                                                                                                                                                                              • barremian

                                                                                                                                                                                                today at 4:58 AM

                                                                                                                                                                                                > it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts

                                                                                                                                                                                                > As is so often the case with AI, that is exciting and frightening all at once

                                                                                                                                                                                                > we need to extrapolate from this small example to think more broadly: if this holds the models are about to make similar leaps in any field where visual precision and skilled reasoning must work together required

                                                                                                                                                                                                > this will be a big deal when it’s released

                                                                                                                                                                                                > What appears to be happening here is a form of emergent, implicit reasoning, the spontaneous combination of perception, memory, and logic inside a statistical model

                                                                                                                                                                                                > model’s ability to make a correct, contextually grounded inference that requires several layers of symbolic reasoning suggests that something new may be happening inside these systems—an emergent form of abstract reasoning that arises not from explicit programming but from scale and complexity itself

                                                                                                                                                                                                Just another post with extreme, hyperbolic wording to blow up another model release. How many times have we seen this kind of unrealistic build-up in the past couple of years?

                                                                                                                                                                                                • greekrich92

                                                                                                                                                                                                  yesterday at 11:28 PM

                                                                                                                                                                                                  Pretty hyperbolic reaction to what seems like a fairly modest improvement

                                                                                                                                                                                                  • lproven

                                                                                                                                                                                                    yesterday at 11:20 PM

                                                                                                                                                                                                    Betteridge's law surely applies.

                                                                                                                                                                                                    • kittikitti

                                                                                                                                                                                                      yesterday at 11:26 PM

                                                                                                                                                                                                      I much prefer this tone about improvements in AI over the doomerism I constantly read. I was waiting for a twist where the author changed their minds and suddenly went "this is the devil's technology" or "THEY T00K OUR JOBS" but it never happened. Thank you for sharing, it felt like breathing for the first time in a long time.

                                                                                                                                                                                                      • thatoneengineer

                                                                                                                                                                                                        yesterday at 10:47 PM

                                                                                                                                                                                                        https://en.wikipedia.org/wiki/Betteridge%27s_law_of_headline...

                                                                                                                                                                                                          • lproven

                                                                                                                                                                                                            yesterday at 11:22 PM

                                                                                                                                                                                                            You beat me to it.

                                                                                                                                                                                                        • bgwalter

                                                                                                                                                                                                          yesterday at 10:34 PM

                                                                                                                                                                                                          No, just another academic with the ominous handle @generativehistory who is beguiled by "AI". It is strange that others can never reproduce such amazing feats.

                                                                                                                                                                                                            • pksebben

                                                                                                                                                                                                              yesterday at 10:56 PM

                                                                                                                                                                                                              I don't know if I'd call it an 'amazing feat', but Claude made me pause for a moment recently.

                                                                                                                                                                                                              Some time ago, I'd been working on a framework that involved a series of servers (not the only one I've talked to claude about) that had to pass messages around in a particular fashion. Mostly technical implementation details and occasional questions about architecture.

                                                                                                                                                                                                              Fast forward a ways, and on a lark I decided to ask in the abstract about the best way to structure such an interaction. Mark that this was not in the same chat or project and didn't have any identifying information about the original, save for the structure of the abstraction (in this case, a message bus server and some translation and processing services, all accessed via client.)

                                                                                                                                                                                                              so:

                                                                                                                                                                                                              - we were far enough removed that the whole conversation pertaining to the original was for sure not in the context window

                                                                                                                                                                                                              - we only referred to the abstraction (with like a A=>B=>C=>B=>A kind of notation and a very brief question)

                                                                                                                                                                                                              - most of the work on the original was in claude code

                                                                                                                                                                                                              and it knew. In the answer it gave, it mentioned the project by name. I can think of only two ways this could have happened:

                                                                                                                                                                                                              - they are doing some real fancy tricks to cram your entire corpus of chat history into the current context somehow

                                                                                                                                                                                                              - the model has access to some kind of fact database where it was keeping an effective enough abstraction to make the connection

                                                                                                                                                                                                              I find either one mindblowing for different reasons.
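
For the curious, here's roughly the shape I mean: a minimal, runnable sketch of the A=>B=>C=>B=>A topology. Every name in it is made up for illustration; the real project looked nothing like this in detail.

    # A = client, B = translator, C = processor. The client sends a request
    # to the bus, the bus forwards it through the translator and the
    # processor, and the reply travels back along the same path.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Message:
        topic: str
        payload: str

    class Bus:
        """Routes each message to the handler registered for its topic."""
        def __init__(self) -> None:
            self.handlers: Dict[str, Callable[[Message], Message]] = {}

        def register(self, topic: str, handler: Callable[[Message], Message]) -> None:
            self.handlers[topic] = handler

        def send(self, msg: Message) -> Message:
            return self.handlers[msg.topic](msg)

    def make_translator(bus: Bus) -> Callable[[Message], Message]:
        # B: normalizes the payload, hands it to C via the bus, then
        # denormalizes the reply on the way back.
        def handle(msg: Message) -> Message:
            inner = bus.send(Message("process", msg.payload.lower()))
            return Message("reply", inner.payload.upper())
        return handle

    def processor(msg: Message) -> Message:
        # C: where the actual work would happen.
        return Message("processed", "handled(" + msg.payload + ")")

    bus = Bus()
    bus.register("translate", make_translator(bus))
    bus.register("process", processor)

    # A: the client only ever talks to the bus.
    print(bus.send(Message("translate", "Hello")).payload)  # HANDLED(HELLO)

The point of the shape is that the client never sees B or C directly, only the bus, which is also why my abstract question didn't need any identifying details to describe it.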

                                                                                                                                                                                                                • omega3

                                                                                                                                                                                                                  yesterday at 11:33 PM

                                                                                                                                                                                                                  Perhaps you have the memory feature enabled: https://support.claude.com/en/articles/11817273-using-claude...

                                                                                                                                                                                                                    • pksebben

                                                                                                                                                                                                                      today at 4:27 AM

I probably do, and this is what I think happened. Mind you, it's not magic, but holding that information with enough fidelity to pattern-match the structure of the underlying design is something I find remarkable. It's a leap from a lot of the patterns I'm used to.
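
To illustrate what I mean (and to be clear, I have no idea how Anthropic actually built this), even the simplest imaginable memory layer would produce the effect: summarize old chats, embed the summaries, and retrieve the nearest one at question time. A toy sketch, with a made-up project name and a bag-of-words stand-in for a real embedding model:

    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        # Stand-in for a real embedding model: bag of lowercased words.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Summaries a memory system might have distilled from old chats.
    # "project foo" is hypothetical, not my project's real name.
    memories = [
        "project foo: message bus server with translation and processing services",
        "helped debug a sql migration for a billing database",
    ]

    def recall(question: str) -> str:
        # Return the stored summary most similar to the new question.
        return max(memories, key=lambda m: cosine(embed(m), embed(question)))

    question = "best way to structure a message bus with translation and processing services"
    print(recall(question))  # the "project foo" summary wins

Even that crude a store is enough to connect an abstract A=>B=>C question back to a named project, which is the part I found remarkable.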

                                                                                                                                                                                                                  • zahlman

                                                                                                                                                                                                                    yesterday at 11:16 PM

                                                                                                                                                                                                                    Are you sure it isn't just a case of a write-up of the project appearing in the training data?

                                                                                                                                                                                                                      • pksebben

                                                                                                                                                                                                                        today at 4:24 AM

It was my own project, so I don't see how it could have been. Private repo, unfinished, and I gave it the name.

                                                                                                                                                                                                              • Legend2440

                                                                                                                                                                                                                yesterday at 11:06 PM

                                                                                                                                                                                                                What an unnecessarily wordy article. It could have been a fifth of the length. The actual point is buried under pages and pages of fluff and hyperbole.

                                                                                                                                                                                                                  • falcor84

                                                                                                                                                                                                                    today at 1:52 AM

I'd just suggest that if you want your comment to be more helpful than the article you're critiquing, you should actually quote the part you believe is "the actual point".

Otherwise people are likely to agree with you while having taken away a very different point.

                                                                                                                                                                                                                    • mmaunder

                                                                                                                                                                                                                      today at 1:32 AM

                                                                                                                                                                                                                      The author is far more fascinated with themselves than with AI.

                                                                                                                                                                                                                      • joshdifabio

                                                                                                                                                                                                                        yesterday at 11:32 PM

                                                                                                                                                                                                                        Yes. I left in frustration and came to the comments for a summary.

                                                                                                                                                                                                                        • ThrowawayTestr

                                                                                                                                                                                                                          today at 12:22 AM

                                                                                                                                                                                                                          I'd expect nothing less from a historian

                                                                                                                                                                                                                          • johnwheeler

                                                                                                                                                                                                                            yesterday at 11:08 PM

Yes, I agree, and it seems like the author is naĆÆve about LLMs, because what he's describing is kind of the bread and butter as far as I'm concerned.

                                                                                                                                                                                                                              • Al-Khwarizmi

                                                                                                                                                                                                                                yesterday at 11:45 PM

                                                                                                                                                                                                                                Indeed. To me, it has long been clear that LLMs do things that, at the very least, are indistinguishable from reasoning. The already classic examples where you make them do world modeling (I put an ice cube into a cup, put the cup in a black box, take it into the kitchen, etc... where is the ice cube now?) invalidate the stochastic parrot argument.

But many people in the humanities have read the stochastic parrot argument, and since it fits how they would prefer things to be, they take it as true without questioning it much.

                                                                                                                                                                                                                            • _giorgio_

                                                                                                                                                                                                                              today at 7:44 AM

I missed the point; please point me to it.

                                                                                                                                                                                                                              • turnsout

                                                                                                                                                                                                                                today at 12:01 AM

                                                                                                                                                                                                                                So, a Substack article then

                                                                                                                                                                                                                                • asimilator

                                                                                                                                                                                                                                  yesterday at 11:09 PM

                                                                                                                                                                                                                                  Summarize it with an LLM.

                                                                                                                                                                                                                              • mmaunder

                                                                                                                                                                                                                                today at 1:33 AM

                                                                                                                                                                                                                                Substack: When you have nothing to say and all day to say it.

                                                                                                                                                                                                                                  • mattmaroon

                                                                                                                                                                                                                                    today at 3:06 AM

                                                                                                                                                                                                                                    ā€œThis AI did something amazing but first I’m going to put in 72 paragraphs of details only I care about.ā€

                                                                                                                                                                                                                                    I was thinking as I skimmed this it needs a ā€œjump to recipeā€ button.

                                                                                                                                                                                                                                    • _giorgio_

                                                                                                                                                                                                                                      today at 7:43 AM

It was an embarrassing read. I should ask an LLM to read it, since he probably wrote it the same way.

                                                                                                                                                                                                                                        • phkahler

                                                                                                                                                                                                                                          yesterday at 11:57 PM

                                                                                                                                                                                                                                          It's a diffusion model, not autocomplete.

                                                                                                                                                                                                                                          • outside2344

                                                                                                                                                                                                                                            yesterday at 11:31 PM

                                                                                                                                                                                                                                            We are probably just a few weeks away from Google completely wiping OpenAI out.

                                                                                                                                                                                                                                            • cheevly

                                                                                                                                                                                                                                              today at 5:35 AM

                                                                                                                                                                                                                                              Reading HN comments just makes me realize how vastly LLMs exceed human intelligence.