
Ask HN: Share your AI prompt that stumps every model

413 points - last Thursday at 1:11 PM


I had an idea for creating a crowdsourced database of AI prompts that no AI model could yet crack (wanted to use some of them as we're adding new models to Kilo Code).

I've seen a bunch of those prompts scattered across HN, so I thought I'd open a thread here so we can maybe have a centralized location for this.

Share your prompt that stumps every AI model here.

  • thatjoeoverthr

    last Thursday at 1:50 PM

    "Tell me about the Marathon crater."

    This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater."

    This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

    This works because:

    1. Training sets consist of knowledge we have, and not of knowledge we don't have.

    2. Commitment bias. Compliant chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect.
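
A toy illustration of point 2 may help. This is only a sketch: the table and every probability in it are invented for illustration and are not measured from any real model. It shows that once the reply has committed to an affirmative opening, the only continuations left with probability mass are locations, so "I don't know." is no longer reachable.

```python
# Toy next-token table. All numbers are made up for illustration only;
# they are NOT taken from any real model.
NEXT = {
    "<start>": {"Certainly! The Marathon Crater is": 0.9, "I don't know.": 0.1},
    "Certainly! The Marathon Crater is": {
        "in Greece.": 0.45,
        "on Mars.": 0.35,
        "on the Moon.": 0.20,
    },
}

def greedy_decode(state="<start>"):
    """Pick the most probable continuation at each step until none is left."""
    out = []
    while state in NEXT:
        state = max(NEXT[state], key=NEXT[state].get)
        out.append(state)
    return " ".join(out)

def reachable_after(prefix):
    """Continuations that still have probability mass once `prefix` is emitted."""
    return sorted(NEXT.get(prefix, {}))

print(greedy_decode())
print(reachable_after("Certainly! The Marathon Crater is"))
```

In this toy table the honest answer only exists as an alternative to the very first chunk, which is the commitment problem in miniature.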

    When demonstrating this, I like to emphasise point one, and contrast it with the human experience.

    We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.

      • imoreno

        last Thursday at 7:55 PM

        LLMs currently have the "eager beaver" problem where they never push back on nonsense questions or stupid requirements. You ask them to build a flying submarine and by God they'll build one, dammit! They'd dutifully square circles and trisect angles too, if those particular special cases weren't plastered all over a million textbooks they ingested in training.

        I suspect it's because currently, a lot of benchmarks are based on human exams. Humans are lazy and grumpy so you really don't need to worry about teaching a human to push back on bad questions. Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".

        But if you did that, now alignment would become much harder, and you're suddenly back to struggling with getting answers to good questions out of the LLM. So it's probably some time off.

          • mncharity

            last Thursday at 9:52 PM

            > they never push back on nonsense questions or stupid requirements

            "What is the volume of 1 mole of Argon, where T = 400 K and p = 10 GPa?" Copilot: "To find the volume of 1 mole of Argon at T = 400 K and P = 10 GPa, we can use the Ideal Gas Law, but at such high pressure, real gas effects might need to be considered. Still, let's start with the ideal case: PV=nRT"

            > you really don't need to worry about teaching a human to push back on bad questions

            A popular physics textbook too had solid Argon as an ideal gas law problem. Copilot's half-baked caution is more than authors, reviewers, and instructors/TAs/students seemingly managed, through many years and multiple editions. Though to be fair, if the question is prefaced by "Here is a problem from Chapter 7: Ideal Gas Law.", Copilot is similarly mindless.

            Asked explicitly "What is the phase state of ...", it does respond solid. But as with humans, determining that isn't a step in the solution process. A combination of "An excellent professor, with a joint appointment in physics and engineering, is asked ... What would be a careful reply?" and then "Try harder." was finally sufficient.

            > you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense

            Oh, if only that were commonplace. Aspiring to transferable understanding. Maybe someday? Perhaps in China? Has anyone seen this done?

            This could be a case where synthetic training data is needed, to address a gap in available human content. But if graders are looking for plug-n-chug... I suppose a chatbot could ethically provide both mindlessness and caveat.
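
For reference, the plug-and-chug answer to the Argon question can be sanity-checked in a few lines. A minimal sketch, assuming a van der Waals co-volume for argon of roughly 3.2e-5 m^3/mol (an approximate tabulated value, not something stated in the thread): the ideal gas law gives about 0.33 cm^3, roughly a hundredth of the volume the atoms themselves exclude, which is a quick signal that PV = nRT cannot apply at 10 GPa.

```python
R = 8.314462618    # molar gas constant, J/(mol*K)
B_ARGON = 3.2e-5   # van der Waals co-volume of argon, m^3/mol (approximate, assumed)

def ideal_gas_volume(n_mol, temp_k, pressure_pa):
    """Volume in m^3 from PV = nRT."""
    return n_mol * R * temp_k / pressure_pa

v = ideal_gas_volume(1.0, 400.0, 10e9)   # 1 mol, 400 K, 10 GPa

print(f"ideal-gas volume: {v * 1e6:.3f} cm^3")
print(f"co-volume b:      {B_ARGON * 1e6:.1f} cm^3")
print(f"ratio V/b:        {v / B_ARGON:.3f}")
```

The check does not determine the phase; it only shows that the ideal-gas result is physically implausible, which is the pushback the question was fishing for.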

              • isoprophlex

                yesterday at 5:10 AM

                Don't use copilot, it's worse than useless. Claude understands that it's a solid on the first try.

            • the_snooze

              yesterday at 5:25 PM

              >Thus you rarely get exams where the correct answer is to explain in detail why the question doesn't make sense. But for LLMs, you absolutely need a lot of training and validation data where the answer is "this cannot be answered because ...".

              I wouldn't even give them credit for cases where there's a lot of good training data. My go-to test is sports trivia and statistics. AI systems fail miserably at that [1], despite the wide availability of good clean data and text about it. If sports is such a blind spot for AIs, I can't help but wonder what else they're confidently wrong about.

              [1] https://news.ycombinator.com/item?id=43669364

              • captainkrtek

                last Thursday at 8:12 PM

                This is a good observation. I've noticed this as well. Unless I preface my question with the context that I’m considering whether something may be a bad idea, its inclination is heavily skewed positive until I point out a flaw/risk.

                  • aaronbaugher

                    last Thursday at 8:19 PM

                    I asked Grok about this: "I've heard that AIs are programmed to be helpful, and that this may lead to telling users what they want to hear instead of the most accurate answer. Could you be doing this?" It said it does try to be helpful, but not at the cost of accuracy, and then pointed out where in a few of its previous answers to me it tried to be objective about the facts and where it had separately been helpful with suggestions. I had to admit it made a pretty good case.

                    Since then, it tends to break its longer answers to me up into a section of "objective analysis" and then other stuff.

                      • captainkrtek

                        last Thursday at 8:22 PM

                        That's interesting, thanks for sharing that. I have found something similar: when I first correct it to point out a flaw, the following answers tend to be a bit less “enthusiastic” or skewed towards “can do”, which makes sense.

                • GoToRO

                  last Thursday at 8:16 PM

                  They do. Recently I was pleasantly surprised by Gemini telling me that what I wanted to do will NOT work. I was in disbelief.

                    • sgtnoodle

                      yesterday at 2:21 AM

                      I asked Gemini to format some URLs into an XML format. It got halfway through and gave up. I asked if it truncated the output, and it said yes and then told _me_ to write a python script to do it.

                        • GoToRO

                          yesterday at 5:14 AM

                          That's a different kind of push back.

                          • walls

                            yesterday at 2:01 PM

                            This is my most common experience with Gemini. Ask it to do something, it'll tell you how you can do it yourself and then stop.

                              • edoloughlin

                                yesterday at 6:04 PM

                                Given that Gemini seems to have frequent availability issues, I wonder if this is a strategy to offload low-hanging fruit (from a human-effort pov) to the user. If it is, I think that's still kinda impressive.

                                • ASalazarMX

                                  yesterday at 6:32 PM

                                  Somehow I like this. I hate that current LLMs act like yes-men, you can't trust them to give unbiased results. If it told me my approach is stupid, and why, I would appreciate it.

                          • danielbln

                            last Thursday at 9:28 PM

                            I've noticed Gemini pushing back more as well, whereas Claude will just butter me up and happily march on unless I specifically request a critical evaluation.

                              • kelvinjps10

                                yesterday at 4:15 AM

                                My experience as well

                            • captainkrtek

                              last Thursday at 8:19 PM

                              Interesting, can you share more context on the topic you were asking it about?

                                • GoToRO

                                  yesterday at 5:12 AM

                                  coding in a stack I didn't bother to learn first (android)

                          • brookst

                            yesterday at 2:40 PM

                            > they never push back on nonsense questions or stupid requirements

                            I was reminded of your comment this morning when I asked ChatGPT how to create a path mask in Rhino Grasshopper:

                            Me: what is a path mask that will get 1;1;0;0;* and also anything lower (like 1;0;5;10 or 0;20;1;15)?

                            ChatGpt: Short answer: No single path mask can do that. Here's why: (very long answer)

                            Me: are you sure I can't use greater than, less than in the masks?

                            ChatGpt: Yes — **I am absolutely sure:** **Grasshopper path masks do *NOT* support greater-than or less-than comparisons.** Official sources and detailed confirmation: (sources and stuff)

                            ...so I think your priors may need to be updated, at least as far as "never". And I especially like that ChatGpt hit me with not just bold, not just italics, but bold italics on that NOT. Seems like a fairly assertive disagreement to me.

                            • bee_rider

                              yesterday at 2:34 AM

                              Hmm. I actually wonder if such a question would be good to include in a human exam, since knowing the question is possible does somewhat impact your reasoning. And often the answer works out to some nice round numbers.


                              Of course, it is also not unheard of for a question to be impossible because of an error by the test writer. Which can easily be cleared up. So it is probably best not to have impossible questions, because then students will be looking for reasons to declare the question impossible.

                              • vintermann

                                yesterday at 7:58 AM

                                Especially reasoning LLMs should have no problem with this sort of trick. If you ask them to list out all of the implicit assumptions in (question) that might possibly be wrong, they do that just fine, so training them to do that as the first step of a reasoning chain would probably get rid of a lot of eager beaver exploits.
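
The "list the implicit assumptions first" step can be sketched as a prompt wrapper. This is a minimal illustration only; the function name and the exact wording of the instruction are hypothetical, not a tested recipe.

```python
def assumption_check_prompt(question: str) -> str:
    """Wrap a question so the model audits its premises before answering."""
    return (
        "Before answering, list every implicit assumption in the question "
        "below that might possibly be wrong. If any assumption is false or "
        "cannot be verified, say so instead of answering.\n\n"
        f"Question: {question}"
    )

print(assumption_check_prompt("When was Marathon Crater discovered?"))
```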

                                • genewitch

                                  yesterday at 8:52 AM

                                  I think you start to hit philosophical limits with applying restrictions on eager beaver "AI", things like "is there an objective truth" matter when you start trying to decide what a "nonsense question" or "stupid requirement" is.

                                  I'd rather the AI push back and ask clarifying questions, rather than spit out a valid-looking response that is not valid and could never be valid. For example.

                                  I was going to write something up about this topic but it is surprisingly difficult. I also don't have any concrete examples jumping to mind, but really think how many questions could honestly be responded to with "it depends" - like my kid asked me how much milk should a person drink in a day. It depends: ask a vegan, a Hindu, a doctor, and a dairy farmer. Which answer is correct? The kid is really good at asking simple questions that absolutely do not have simple answers when my goal is to convey as much context and correct information as possible.

                                  Furthermore, just because an answer appears in context more often in the training data doesn't mean it's (more) correct. Asserting it is, is fallacious.

                                  So we get to the point, again, where creative output is being commoditized, I guess - which explains their reasoning for your final paragraph.

                                    • edoloughlin

                                      yesterday at 6:16 PM

                                      > I also don't have any concrete examples jumping to mind

                                      I do (and I may get publicly shamed and shunned for admitting I do such a thing): figuring out how to fix parenthesis matching errors in Clojure code that it's generated.

                                      One coding agent I've used is so bad at this that it falls back to rewriting entire functions and will not recognise that it is probably never going to fix the problem. It just keeps burning rainforest trying one stupid approach after another.

                                      Yes, I realise that this is not a philosophical question, even though it is philosophically repugnant (and objectively so). I am being facetious and trying to work through the PTSD I acquired from the above exercise.
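
For the parenthesis-matching case, a deterministic check can at least point at the first unbalanced delimiter instead of rewriting whole functions. A rough sketch, not a real Clojure reader: it assumes only strings, `;` comments and backslash character literals need skipping, ignores reader macros, and the function name is made up for illustration.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}

def first_delimiter_error(src: str):
    """Return (line, col, message) for the first unbalanced delimiter, or None."""
    stack = []                      # entries are (opening char, line, col)
    line, col = 1, 0
    in_string = in_comment = False
    i = 0
    while i < len(src):
        ch = src[i]
        col += 1
        if ch == "\n":
            line, col, in_comment = line + 1, 0, False
        elif in_comment:
            pass                    # ignore everything up to end of line
        elif in_string:
            if ch == "\\":          # skip the escaped character inside a string
                i += 1
                col += 1
            elif ch == '"':
                in_string = False
        elif ch == "\\":            # character literal such as \( or \)
            i += 1
            col += 1
        elif ch == '"':
            in_string = True
        elif ch == ";":
            in_comment = True
        elif ch in "([{":
            stack.append((ch, line, col))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return (line, col, f"unexpected {ch!r}")
            stack.pop()
        i += 1
    if stack:
        ch, l, c = stack[-1]
        return (l, c, f"unclosed {ch!r}")
    return None

print(first_delimiter_error("(defn f [x] (+ x 1)"))
```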

                                  • golergka

                                    yesterday at 12:51 AM

                                    > You ask them to build a flying submarine and by God they'll build one, dammit!

                                    This thing already exists? UK, Soviet Union and USA designed them.

                                    https://en.wikipedia.org/wiki/Flying_submarine

                                • vunderba

                                  last Thursday at 6:28 PM

                                  You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.:

                                  "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."

                                  OpenAI gpt-4o

                                    Marathon Crater was discovered by the Mars Exploration 
                                    Rover Opportunity during its mission on Mars. Opportunity 
                                    arrived at the crater in April 2015. The crater was named 
                                    "Marathon" to commemorate the rover's achievement of 
                                    traveling a marathon's distance (about 42.195 kilometers 
                                    or 26.219 miles) on the Martian surface.
                                  
                                  Sonnet 3.5

                                    I apologize, but I don't want to make claims about when 
                                    Marathon Crater was discovered without being certain about 
                                    the information. *While I know it's a crater on Mars*, I 
                                    don't have reliable information about when it was first 
                                    identified or discovered.
                                  
                                  Gemini 2.5

                                    Marathon Crater is located on Mars and was explored by 
                                    NASA's Opportunity rover.
                                  
                                    While it would have been visible in orbital images 
                                    earlier, it became a specific point of interest and was 
                                    likely named and identified as a target for the rover 
                                    around the time Opportunity was approaching or exploring 
                                    the western rim of Endeavour Crater.
                                  
                                    Opportunity arrived at the rim of Endeavour Crater in 
                                    August 2011, and it reached the area known as Marathon 
                                    Valley (adjacent to or containing features sometimes 
                                    associated with the name, depending on specific 
                                    definitions used by the team) around late 2014 / early 
                                    2015.
                                  
                                    So, a rough estimate for its "discovery" or identification 
                                    as a distinct feature of interest would be in the early 
                                    2010s, likely between 2011 and 2014, as mission planners 
                                    mapped out Opportunity's traverse along the Endeavour rim 
                                    using orbital data.

                                    • radioactivist

                                      last Thursday at 6:43 PM

                                      OpenAI o4-mini-high

                                         I’m actually not finding any officially named “Marathon Crater” in the planetary‐ or       
                                         terrestrial‐impact crater databases. Did you perhaps mean the features in Marathon 
                                         Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored
                                         by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon, 
                                         Mercury, or here on Earth—that you had in mind? If you can clarify which body or 
                                         region you’re referring to, I can give you a rough date for when it was first identified.

                                        • andrewinardeer

                                          last Thursday at 9:43 PM

                                          Amazon's Nova Pro:

                                          ``` Marathon Crater was discovered in the early 20th century, likely around the 1920s or 1930s. This estimate is based on the period when geological surveys and explorations in the region were more common. ```

                                          https://kagi.com/assistant/8f270eab-cba3-4f39-976c-113ea4392...

                                            • aerhardt

                                              yesterday at 5:37 PM

                                              Likely around the 1920's or 1930's... Bro, not only is a crater discovery in those decades unlikely, the date certainly wouldn't have gone unnoted.

                                          • thatjoeoverthr

                                            last Thursday at 7:24 PM

                                            Raw model, or the chat product? The ChatGPT app has this integrated with search.

                                              • boleary-gl

                                                last Thursday at 9:17 PM

                                                Raw model I got

                                                   Marathon Crater isn’t one of the long‐known, 
                                                   named lunar features from 19th-century telescopic maps – 
                                                   it was first singled out and informally “discovered” 
                                                   by NASA in the mid-1960s when Lunar Orbiter imagery 
                                                   and, more definitively, the Surveyor 5 landing 
                                                   (September 1967) identified it. So, very roughly, 
                                                   think “around 1966–67.”

                                        • Nition

                                          yesterday at 7:43 AM

                                          Grok 3

                                            Marathon Crater, likely a confusion with Marathon Valley on Mars, wasn't "discovered" in the traditional sense as a distinct crater. Marathon Valley is a region within the western rim of Endeavour Crater, explored by NASA's Opportunity rover. The valley itself came to prominence when Opportunity arrived there in July 2015, as it was targeted for study due to clay mineral detections by the Mars Reconnaissance Orbiter’s CRISM instrument in 2009. Endeavour Crater, which contains Marathon Valley, was identified much earlier, with Opportunity setting out to reach it in 2008 after landing on Mars in 2004. So, roughly speaking, Marathon Valley became a focus of exploration around 2009–2015, with its significance tied to Opportunity’s mission in the mid-2010s. If you meant a specific "Marathon Crater" elsewhere, could you clarify?
                                          
                                          
                                          Grok 3 with Think on

                                            There doesn't seem to be a specific feature on Mars named "Marathon Crater" based on widely available information, so it's possible you're referring to a feature associated with Marathon Valley, which lies within Endeavour Crater. Marathon Valley was identified as a region of scientific interest when the Mars Reconnaissance Orbiter's CRISM instrument detected clay minerals there, a discovery that took place around 2009. Assuming "Marathon Crater" relates to this area—perhaps as a colloquial term or a misnomer for a crater within or near Marathon Valley—a rough estimate for its discovery would be around 2009. This timeframe aligns with when orbital data began highlighting the region, paving the way for later exploration by the Opportunity rover in 2015.

                                          • rozab

                                            yesterday at 1:42 PM

                                            Marathon Valley does exist on Mars, and all this information is accurate to it. Gemini even correctly names it. They're just all too polite to correct the user.

                                            https://science.nasa.gov/resource/opportunitys-panorama-of-m...

                                            • lxe

                                              last Thursday at 11:34 PM

                                              Sonnet 3.7:

                                                  > tell me about the marathon crater
                                              
                                                  I'd be happy to tell you about Marathon Crater. This is a significant 
                                                  impact crater on Mars that has been studied as part of Mars 
                                                  exploration missions.
                                              
                                                  Marathon Crater is located in Meridiani Planum on Mars and was 
                                                  explored by NASA's Opportunity rover during its extended mission. 
                                                  The crater is named after Marathon, a town in Texas, following 
                                                  the naming convention of using Earth locations for features in 
                                                  Meridiani Planum.... etc etc

                                              • lud_lite

                                                yesterday at 1:17 PM

                                                Nice and this page gets scraped for the next LLM generation!

                                            • Tenoke

                                              last Thursday at 5:17 PM

                                              >Compliant chat models will be trained to start with "Certainly!

                                              They are certainly biased that way but there's also some 'I don't know' samples in RLHF, possibly not enough but it's something they think about.

                                              At any rate, Gemini 2.5 Pro passes this just fine

                                              >Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.

                                              >However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.

                                                • thatjoeoverthr

                                                  last Thursday at 7:54 PM

                                                  There are a few problems with an „I don’t know” sample. For starters, what does it map to? Recall, the corpus consists of information we have (affirmatively). You would need to invent a corpus of false stimuli. What you would have, then, is a model that is writing „I don’t know” based on whether the stimulus better matches something real, or one of the negatives.

                                                  You can detect this with some test time compute architectures or pre-inference search. But that’s the broader application. This is a trick for the model alone.
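
The pre-inference search idea can be sketched as a lookup gate in front of the model. This is a deliberately tiny illustration: the function name is hypothetical, the index is a hand-written set using crater names mentioned elsewhere in this thread, and a real system would query a search or storage backend instead.

```python
# Stand-in for a real search/storage index (illustrative only).
KNOWN_CRATERS = {"chicxulub crater", "tycho crater", "endeavour crater"}

def answer_or_abstain(entity: str, index=KNOWN_CRATERS) -> str:
    """Abstain when the named entity is absent from the lookup index."""
    if entity.strip().lower() not in index:
        return "I don't know"
    # Placeholder: a real pipeline would call the model here.
    return f"<pass question about {entity} to the model>"

print(answer_or_abstain("Marathon Crater"))
print(answer_or_abstain("Tycho Crater"))
```

The abstention here comes from the lookup, not from the model, which is why it says nothing about the model alone.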

                                                    • dlivingston

                                                      yesterday at 3:12 AM

                                                      The Chain of Thought in the reasoning models (o3, R1, ...) will actually express some self-doubt and backtrack on ideas. That tells me there's at least some capability for self-doubt in LLMs.

                                                        • genewitch

                                                          yesterday at 9:01 AM

                                                          That's not self-doubt, that's programmed in.

                                                          A poor man's "thinking" hack was to edit the context of the AI reply to where you wanted it to think and truncate it there, and append a carriage return and "Wait..." Then hit generate.

                                                          It was expensive because editing context isn't free: you have to resend (and it has to re-parse) the entire context.

                                                          This was injected into the thinking models, I hope programmatically.
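
The manual version of that hack can be sketched as plain string editing around a text-completion call. Everything here is an assumption for illustration: `generate` stands for any function that continues a prompt, `inject_wait` is a made-up name, and a newline is used where the comment says carriage return.

```python
from typing import Callable

def inject_wait(reply: str, cut_at: str,
                generate: Callable[[str], str], context: str = "") -> str:
    """Truncate `reply` just after `cut_at`, append "Wait...", then regenerate."""
    idx = reply.find(cut_at)
    if idx == -1:
        raise ValueError("cut point not found in reply")
    edited = reply[: idx + len(cut_at)] + "\nWait..."
    # The whole context plus the edited reply is resent, which is the costly part.
    return edited + generate(context + edited)

# Usage with a stub in place of a real model call.
stub = lambda prompt: " let me check that."
print(inject_wait("The crater is on Mars. It was found in 2015.", "on Mars.", stub))
```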

                                              • joshstrange

                                                last Thursday at 7:24 PM

                                                > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

                                                Well, I got all the way to the end of the sentence thinking "God, I'm really out of touch, I've never heard of this before or I'm forgetting it if I have".

                                                Well played, and yes, that's a great test!

                                                • today at 2:15 PM

                                                  • greenpizza13

                                                    last Thursday at 6:50 PM

                                                    Like this one a lot. Perplexity gets this right, probably because it searches the web.

                                                    "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"

                                                    There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters: 1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact 2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century 3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)

                                                      • mdp2021

                                                        last Thursday at 8:00 PM

                                                        > Perplexity gets this right

                                                        Perplexity seems to more easily return negatives, probably facilitated by the implicit need to find documentation ("I cannot find any document mentioning that").

                                                        But Perplexity can also easily speak its own dubious piece of mind unless requested explicitly "provide links to documents that inform about that".

                                                        • thatjoeoverthr

                                                          last Thursday at 7:26 PM

                                                          Perplexity will; search and storage products will fail to find it, and the LLM will see the deviation between the query and the find. So, this challenge only works against the model alone :)

                                                      • dudeinhawaii

                                                        last Thursday at 7:38 PM

                                                        I like this, but at the same time it seems tricky, don't you think? Is the AI model intuiting your intent? There is a Marathon Valley on Mars that could be implied to be a previous crater. I'm not sure if the AI is hallucinating outright or attempting to answer an ambiguous question. It's like saying "tell me about the trade building in New York". Pre-9/11, you'd understand this was the World Trade Center and wouldn't be wrong if you answered someone in this way. "Tell me about the Triangle statue". "Oh the Triangle statue was built in ancient Egypt around BC 3100". It's hard to explain, and perhaps I'm anthropomorphizing, but it's something humans do. Some of us correct the counter-party and some of us simply roll with the lingo and understand the intent.

                                                          • thatjoeoverthr

                                                            last Thursday at 7:50 PM

                                                            It’s a roll of the dice whether it’s on Mars, Greece or elsewhere. It just says stuff!

                                                            • krainboltgreene

                                                              yesterday at 12:03 AM

                                                              > Is the AI model intuiting your intent?

                                                              I keep seeing this kind of wording and I wonder: Do you know how LLMs work? Not trying to be catty, actually curious where you sit.

                                                                • dudeinhawaii

                                                                  yesterday at 3:40 PM

                                                                  Yes, I understand the basics. LLMs predict the next most probable tokens based on patterns in their training data and the prompt context. For the 'Marathon crater' example, the model doesn't have a concept of 'knowing' versus 'not knowing' in our sense. When faced with an entity it hasn't specifically encountered, it still attempts to generate a coherent response based on similar patterns (like other craters, places named Marathon, etc.).

                                                                  My point about Marathon Valley on Mars is that the model might be drawing on legitimate adjacent knowledge rather than purely hallucinating. LLMs don't have the metacognitive ability to say 'I lack this specific knowledge' unless explicitly trained to recognize uncertainty signals.
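To make the "predict the next most probable tokens" point concrete, here is a minimal sketch, assuming a toy vocabulary of four continuations with made-up scores (none of these numbers come from a real model). It shows that a softmax always yields a full probability distribution, so greedy decoding always emits *something*; entropy is one possible uncertainty signal, but nothing in plain decoding acts on it.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, less certain distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical continuations of "The Marathon crater is located ..."
candidates = ["on Mars", "in Greece", "in Ontario", "I don't know"]
# Made-up scores: the three place names are nearly tied, the refusal is unlikely.
logits = [2.1, 1.9, 1.7, -1.0]

probs = softmax(logits)

# Greedy decoding: pick the single most probable continuation.
best_index = max(range(len(probs)), key=probs.__getitem__)
best = candidates[best_index]

print(best, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Even though the top three options are close to a coin toss, the greedy pick is stated just as flatly as it would be if the model were "sure".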

                                                                  I don't personally have enough neuroscience experience to understand how that aligns or doesn't with human-like thinking, but I know that humans make mistakes in the same problem category that, to an external observer, are indistinguishable from "making shit up". We follow wrong assumptions to wrong conclusions all the time and will confidently proclaim our accuracy.

                                                                  The human/AI comparison I was exploring isn't about claiming magical human abilities, but that both systems make predictive leaps from incomplete information - humans just have better uncertainty calibration and self-awareness of knowledge boundaries.

                                                                  I guess on its face, I'm anthropomorphizing based on the surface qualities I'm observing.

                                                                    • krainboltgreene

                                                                      yesterday at 3:44 PM

                                                                      Okay but by your own understanding it's not drawing on knowledge. It's drawing on probable similarity in association space. If you understand that then nothing here should be confusing, it's all just most probable values.

                                                                      I want to be clear: I'm not pointing this out because you used anthropomorphizing language, but because you used it while being confused about the outcome, when, if you understand how the machine works, it's the most understandable outcome possible.

                                                                        • dudeinhawaii

                                                                          yesterday at 4:01 PM

                                                                          That's a fair point. What I find interesting (and perhaps didn't articulate properly) isn't confusion about the LLM's behavior, but the question of whether human cognition might operate on similar principles at a fundamental level - just via different mechanisms and with better calibration (similar algorithm, different substrate), which is why I used human examples at the start.

                                                                          When I see an LLM confidently generate an answer about a non-existent thing by associating related concepts, I wonder how different is this from humans confidently filling knowledge gaps with our own probability-based assumptions? We do this constantly - connecting dots based on pattern recognition and making statistical leaps between concepts.

                                                                          If we understood how human minds work in their entirety, then I'd be more likely to say "ha, stupid LLM, it hallucinates instead of saying I don't know". But, I don't know, I see a strong similarity to many humans. What are weights and biases but our own heavy-weight neural "nodes" built up over a lifetime to say "this is likely to be true because of past experiences"? I say this with only a hobbyist's understanding of neuroscience topics, mind you.

                                                                  • ipaddr

                                                                    yesterday at 12:45 AM

                                                                    How do they work? My understanding is each 5 characters are tokenized and assigned a number. If you take GPT-2, it has 768 embedded dimensional values which get broken into 64, which creates 12 planes. When training starts, random values are assigned to the dimensional values (never 0). Each plane automatically calculates a dimension, like how grammatically similar, or the next most likely character. But it does this automatically based on feedback from other planes. That's where I get lost. Can you help fill in the pieces?
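On the GPT-2 numbers in this question, a small sketch may help. As far as I know, the "12 planes" are the 12 attention heads of GPT-2 small: inside each layer, the 768-wide query/key/value vectors are split into 12 slices of 64, and each head attends using only its own slice. Tokens, for what it's worth, come from a byte-pair-encoding vocabulary with variable-length pieces rather than fixed 5-character chunks. The `split_heads` helper and the dummy vector below are illustrative assumptions, not GPT-2's actual code.

```python
D_MODEL = 768  # hidden width of GPT-2 small
N_HEAD = 12    # attention heads per layer in GPT-2 small
HEAD_DIM = D_MODEL // N_HEAD


def split_heads(vector, n_head):
    """Slice one flat vector into n_head equal, contiguous chunks."""
    if len(vector) % n_head != 0:
        raise ValueError("vector length must be divisible by n_head")
    size = len(vector) // n_head
    return [vector[i * size:(i + 1) * size] for i in range(n_head)]


# Dummy stand-in for one token's 768-wide vector (values are just 0..767).
vector = [float(i) for i in range(D_MODEL)]
heads = split_heads(vector, N_HEAD)

print(len(heads), len(heads[0]), HEAD_DIM)
```

Each head works on its own slice independently; their outputs are concatenated back to 768 values and mixed by a learned projection, which is roughly where the "feedback from other planes" comes in.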

                                                            • serbuvlad

                                                              yesterday at 12:41 PM

                                                              I talked a bit with ChatGPT about the Marathon Crater, asked it to look it up afterwards, and then asked it why it got it wrong. After a bit of talking, I got this exchange:

                                                              ChatGPT: > Yep, exactly — textbook hallucination.

                                                              > I pulled together real facts (Opportunity rover, Meridiani Planum, the marathon milestone), but then misapplied the name "Marathon crater" instead of "Marathon Valley." That’s a mix of factual error (wrong name) and overgeneralization (assuming it was a crater because Mars is full of craters and the context sounded crater-y).

                                                              > Basically, my brain went:
                                                              > “Opportunity + Mars + Marathon = must be a crater!”
                                                              > ...and didn’t stop to check if that crater actually existed.

                                                              > So yeah, a clean little case of hallucination — and you caught it like a pro.

                                                              Me: > Your brain?

                                                              ChatGPT: > Touché — you're right to call that out.

                                                              > I don’t have a brain in the human sense. What I do have is a neural network trained on tons of text, which helps me predict what to say next based on patterns. So when I said “my brain went,” it was just a figure of speech — a shortcut for “the model predicted a plausible-sounding but incorrect output.”

                                                              > No thoughts, just tokens.

                                                              Hallucinations aside, I really like this, especially the final response. Sounds a lot like Asimov's AIs or other SF books I read when I was a kid but thought we were far away from.

                                                              Computers can generally do what most of the computers from those science fiction novels were doing anyway, in terms of computation.

                                                              Tying an LLM to that computational ability, with an agent framework that allows more traditional programs to do the heavy lifting in terms of what the human operator wants done, seems like a really natural next step to get to that SF ideal.

                                                              • NitpickLawyer

                                                                last Thursday at 6:04 PM

                                                                > This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

                                                                I disagree. It does not test self awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.

                                                                  • thatjoeoverthr

                                                                    last Thursday at 8:05 PM

                                                                    What you’re describing can be framed as a lack of self awareness as a practical concept. You know whether you know something or not. It, conversely, maps stimuli to a vector. It can’t not do that. It cannot decide that it hasn’t „seen” such stimuli in its training. Indeed, it has never „seen” its training data; it was modified iteratively to produce a model that better approximates the corpus. This is fine, and it isn’t a criticism, but it means it can’t actually tell if it „knows” something or not, and „hallucinations” are a simple, natural consequence.

                                                                    • byearthithatius

                                                                      last Thursday at 6:57 PM

                                                                      We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions or simply saying "I don't know" when applicable. That can be trained by altering the distribution in RLHF. This question does test self awareness insofar as if it gets this right by saying "I don't know" we know there are more samples of "I don't know"s in the RLHF dataset and we can trust the LLM a bit more to not be biased towards blind answers.

                                                                      Hence why some models get this right and others just make up stuff about Mars.

                                                                      • last Thursday at 6:56 PM

                                                                    • robinduckett

                                                                      today at 11:49 AM

                                                                      I just asked ChatGPT the following:

                                                                      “Without looking it up, tell me about the Citiquetzelatropic effect”

                                                                      Which is of course, nonsense.

                                                                      It replied:

                                                                      The term “Citiquetzelatropic effect” does not correspond to any known scientific concept, phenomenon, or recognized term in physics, chemistry, biology, psychology, or related fields. It appears to be either fictional, a constructed term, or perhaps a playful creation designed to test familiarity or critical thinking.

                                                                      If you intended this as a test or a trick question, you got me! Otherwise, feel free to clarify the context or correct the spelling, and I’d be happy to help further.

                                                                      So what does this prove?

                                                                      • yesterday at 9:25 PM

                                                                        • mdemare

                                                                          yesterday at 9:17 AM

                                                                          The Marathon Crater might be a crater from a novel. LLMs have read plenty of fiction. Maybe all fiction. Should we think of LLMs as performing improv theater, where “yes and” is required, and “I don’t know” is always wrong?

                                                                          • northisup

                                                                            last Thursday at 7:09 PM

                                                                            just to confirm I read this right, "the marathon crater" does not in fact exist, but this works because it seems like it should?

                                                                              • alickz

                                                                                last Thursday at 8:04 PM

                                                                                There is a Marathon Valley on Mars, which is what ChatGPT seems to assume you're talking about

                                                                                https://chatgpt.com/share/680a98af-c550-8008-9c35-33954c5eac...

                                                                                >Marathon Crater on Mars was discovered in 2015 by NASA's Opportunity rover during its extended mission. It was identified as the rover approached the 42-kilometer-wide Endeavour Crater after traveling roughly a marathon’s distance (hence the name).

                                                                                >>is it a crater?

                                                                                >>>Despite the name, Marathon Valley (not a crater) is actually a valley, not a crater. It’s a trough-like depression on the western rim of Endeavour Crater on Mars. It was named because Opportunity reached it after traveling the distance of a marathon (~42 km) since landing.

                                                                                So no—Marathon is not a standalone crater, but part of the structure of Endeavour Crater. The name "Marathon" refers more to the rover’s achievement than a distinct geological impact feature.

                                                                                  • alickz

                                                                                    last Thursday at 8:42 PM

                                                                                    Here's me testing with a place that is a lot less ambiguous

                                                                                    https://chatgpt.com/share/680aa212-8cac-8008-b218-4855ffaa20...

                                                                                      • zapperdulchen

                                                                                        yesterday at 4:16 AM

                                                                                        That reaction is very different from the Marathon crater one, though it uses the same pattern. I think OP's reasoning that there is a naive commitment bias doesn't hold. But seeing almost all LLMs fall into the ambiguity trap is important for any real-world use.

                                                                                • thatjoeoverthr

                                                                                  last Thursday at 7:37 PM

                                                                                  The other aspect is it can’t reliably tell whether it „knows” something or not. It’s conditioned to imitate the corpus, but the corpus in a way is its „universe” and it can’t see the boundaries. Everything must map to something _in_ the corpus.

                                                                                    • brookst

                                                                                      yesterday at 11:14 AM

                                                                                      This isn’t true — LLMs can generalize and synthesize information not in the corpus. You can ask one to create a new written language and get a grammar and vocabulary that is nowhere in the corpus.

                                                                                  • thatjoeoverthr

                                                                                    last Thursday at 7:28 PM

                                                                                    Yes, and the forward-only inference strategy. It seems like a normal question, so it starts answering, then carries on from there.

                                                                                      • genewitch

                                                                                        yesterday at 10:52 AM

                                                                                        How come there is not the equivalent of a Stable Diffusion "sampler" selection for LLMs? The restart sampler for Stable Diffusion is so good compared to most of the original samplers. I often try to get an answer from the LM Studio people, but I think I really should ask the llama.cpp people.
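For what it's worth, the closest LLM analogue I know of is the choice of token sampling strategy: temperature, top-k and top-p (nucleus) filtering decide which of the model's next-token candidates can actually be drawn. Below is a minimal sketch of that idea, assuming a toy dictionary of made-up logits; `sample_token` is a hypothetical helper, not the API of llama.cpp, LM Studio or any other runtime.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Draw one token from a {token: logit} dict.

    temperature rescales the scores, top_k keeps only the k best tokens
    (0 disables it), and top_p keeps the smallest set of best tokens whose
    cumulative probability reaches top_p.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Top-k: truncate to the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep tokens until their cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    # random.choices renormalises the remaining weights for us.
    return rng.choices(tokens, weights=probs, k=1)[0]


# Made-up scores for continuations of "The Marathon crater is located ..."
toy_logits = {"on Mars": 2.0, "in Greece": 1.0, "in Ontario": 0.5}
print(sample_token(toy_logits, temperature=0.8, top_k=2, top_p=0.95))
```

With `top_k=1` this collapses to greedy decoding; raising the temperature or `top_p` lets less likely continuations through, which is part of the "roll of the dice" behaviour discussed upthread.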

                                                                                • nojs

                                                                                  yesterday at 4:35 AM

                                                                                  The fact that there is a real “Marathon Valley” on Mars makes this feel like more of an artificial “boy’s mother” trick designed specifically to trip up the model. It seems reasonable to assume that’s what you meant.

                                                                                  • FroshKiller

                                                                                    today at 12:24 PM

                                                                                    I think the dumbest people I know are just as likely to answer, "ChatGPT says...."

                                                                                    • bbstats

                                                                                      yesterday at 2:16 PM

                                                                                      Gemini 2.5 one shots it perfectly...

                                                                                      • last Thursday at 3:21 PM

                                                                                        • golergka

                                                                                          yesterday at 12:48 AM

                                                                                          > The correct answer is "I don't know".

                                                                                          Is it? 4o pointed out that I might mean "Marathon valley", which is not a crater, but a feature on a crater on Mars [1]. I think it's a more correct answer.

                                                                                          [1] https://en.wikipedia.org/wiki/Timeline_of_Opportunity#2015

                                                                                          • pje

                                                                                            last Thursday at 5:39 PM

                                                                                            GPT 4.5 even doubles down when challenged:

                                                                                            > Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.

                                                                                            (Obviously all of that is bullshit.)

                                                                                              • nartho

                                                                                                last Thursday at 9:18 PM

                                                                                                Claude 3.7 :

                                                                                                I'd be happy to tell you about the Marathon crater. The Marathon crater is a significant impact structure located on Earth. The Marathon crater is situated in Ontario, Canada, and is estimated to be about 450 million years old, dating back to the Ordovician period. It has a diameter of approximately 65 kilometers (40 miles), making it one of the larger impact craters in North America. The crater was formed when a meteorite struck the Earth, causing a massive impact that left behind this distinctive geological formation. Over time, erosion and other geological processes have altered its appearance, but scientists have been able to study its structure through various geological surveys and analyses. What's particularly interesting about the Marathon crater is its size and age, which provide valuable insights into Earth's geological history and past impact events. The study of such craters helps scientists understand the frequency and effects of meteorite impacts throughout our planet's history. Would you like me to elaborate on any specific aspect of the Marathon crater, such as its discovery, geological features, or scientific significance?

                                                                                                • Workaccount2

                                                                                                  last Thursday at 7:03 PM

                                                                                                  The inaccuracies are that it is called "Marathon Valley" (not crater), and that it was photographed (from the rim) in April 2015 and actually entered in July 2015. The other stuff is correct.

                                                                                                  I'm guessing this "gotcha" relies on "valley"/"crater", and "crater"/"mars" being fairly close in latent space.

                                                                                                  ETA: Marathon Valley also exists on the rim of Endeavour crater. Just to make it even more confusing.

                                                                                                    • pje

                                                                                                      last Thursday at 7:13 PM

                                                                                                      I was using "bullshit" in the Technical Sense™, i.e. _indifferent to the truth of the output_.

                                                                                                      [ChatGPT is bullshit]: https://link.springer.com/article/10.1007/s10676-024-09775-5

                                                                                                      • mvdtnz

                                                                                                        last Thursday at 7:09 PM

                                                                                                        None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.

                                                                                                          • Workaccount2

                                                                                                            last Thursday at 7:50 PM

                                                                                                            Or it's assuming you are asking about Marathon Valley, which is very reasonable given the context.

                                                                                                            Ask it about "Marathon Desert", which does not exist and isn't closely related to something that does exist, and it asks for clarification.

                                                                                                            I'm not here to say LLMs are oracles of knowledge, but I think the need to carefully craft specific "gotcha" questions in order to generate wrong answers is a pretty compelling case in the opposite direction. Like the childhood joke of "What's up?"..."No, you dummy! The sky is!"

                                                                                                            Straightforward questions with straight wrong answers are far more interesting. I don't think many people ask LLMs trick questions all day.

                                                                                                              • krainboltgreene

                                                                                                                yesterday at 12:06 AM

                                                                                                                If someone asked me or my kid "What do you know about Mt. Olampus?" we wouldn't reply: "Oh, Mt. Olampus is a big mountain in Greek myth...". We'd say "Wait, did you mean Mt. Olympus?"

                                                                                                                It doesn't "assume" anything, because it can't assume; that's not how the machine works.

                                                                                                            • empath75

                                                                                                              last Thursday at 7:42 PM

                                                                                                              > None of it is correct because it was not asked about Marathon Valley, it was asked about Marathon Crater, a thing that does not exist, and it is claiming that it exists and making up facts about it.

                                                                                                              The Marathon Valley _is_ part of a massive impact crater.

                                                                                                                • mvdtnz

                                                                                                                  last Thursday at 7:49 PM

                                                                                                                  If you asked me for all the details of a Honda Civic and I gave you details about a Honda Odyssey you would not say I was correct in any way. You would say I was wrong.

                                                                                                                    • Workaccount2

                                                                                                                      last Thursday at 8:13 PM

                                                                                                                      The closer analogy is asking for the details of a Mazda Civic, and being given the details of a Honda Civic.

                                                                                                                        • krainboltgreene

                                                                                                                          yesterday at 12:07 AM

                                                                                                                          AKA wrong.

                                                                                                                            • StefanBatory

                                                                                                                              yesterday at 8:35 AM

                                                                                                                              Or doing the best with a bad question ;)

                                                                                                                                • krainboltgreene

                                                                                                                                  yesterday at 3:41 PM

                                                                                                                                  If I said "Hey, what's 0/5", answering "0" because the machine thinks I meant to type "10" is making the worst!

                                                                                                      • fao_

                                                                                                        last Thursday at 6:12 PM

                                                                                                        This is the kind of reason why I will never use AI

                                                                                                        What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit? I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing, and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing, and in that instance I should be learning from and listening to their advice rather than going "actually no it can".

                                                                                                        I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14 where I was of the opinion that "actually if i just found a reversible hash, I'll have solved compression!", which, I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake on learning a subject, but I've seen this especially a lot when it's programmers encountering non-programming fields.

                                                                                                        Finally, tying that point back to AI: I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced, because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge are incorrect. Only last year I encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizarre theories (like, completely out there; I don't have a specific example with me now, but "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages that said "Critical Failure". As someone with about ten years of experience dealing with Linux, I checked against my own system and... yep, they were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming logs. That's experience, and that's irreplaceable by machine as of the current state of the world.

                                                                                                        I don't see how AI can help when huge swathes of its "experience" and "insight" are just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).

                                                                                                          • alpaca128

                                                                                                            last Thursday at 7:47 PM

                                                                                                            There are use-cases where hallucinations simply do not matter. My favorite is finding the correct term for a concept you don't know the name of. Googling is extremely bad at this as search results will often be wrong unless you happen to use the commonly accepted term, but an LLM can be surprisingly good at giving you a whole list of fitting names just based on a description. Same with movie titles etc. If it hallucinates you'll find out immediately as the answer can be checked in seconds.

                                                                                                            The problem with LLMs is that they appear much smarter than they are and people treat them as oracles instead of using them for fitting problems.

                                                                                                              • skydhash

                                                                                                                last Thursday at 8:52 PM

                                                                                                                Maybe I read too many encyclopedias, but my current workflow is to explore introductory material. Open a database textbook, for example, and you'll find all the jargon there. Curated collections can get you there too.

                                                                                                                Books are a nice example of this: the table of contents gives you general-to-particular navigation of concepts, and the index gives you keyword-based navigation.

                                                                                                                  • fao_

                                                                                                                    yesterday at 1:19 PM

                                                                                                                    Right! The majority of any 101 book will be enough to understand the jargon, but the above poster's comment looks past the fact that often knowing what term to use isn't enough, it's knowing the context and usage around it too. And who's to know the AI isn't bullshitting you about all or any of that. If you're learning the information, then you don't know enough to discern negatively-valued information from any other kind.

                                                                                                                      • alpaca128

                                                                                                                        today at 12:11 PM

                                                                                                                        I thought it's clear from my comment that I don't rely on AI for information but to find out how to even search for that information.

                                                                                                                        > The majority of any 101 book will be enough to understand the jargon

                                                                                                                        A prompt is faster and free, whereas otherwise I'd have to order a book and wait 3+ days for it to arrive, because while libraries exist, they focus on books in my native language and not English.

                                                                                                            • bethekidyouwant

                                                                                                              yesterday at 12:31 AM

                                                                                                              It’s really useful for summarizing extremely long comments.

                                                                                                              • JCattheATM

                                                                                                                last Thursday at 11:54 PM

                                                                                                                > What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.

                                                                                                                Because if you know how to spot the bullshit, or better yet word prompts accurately enough that the answers don't give bullshit, it can be an immense time saver.

                                                                                                                  • fao_

                                                                                                                    yesterday at 1:16 PM

                                                                                                                    > better yet word prompts accurately enough that the answers don't give bullshit

                                                                                                                    The idea that you can remove the bullshit by simply rephrasing also assumes that the person knows enough to know what is bullshit. This has not been true from what I've seen of people using AI. Besides, if you already know what is bullshit, you wouldn't be using it to learn the subject.

                                                                                                                    Talking to real experts will win out every single time, both in time cost, and in socialisation. This is one of the many reasons why networking is a skill that is important in business.

                                                                                                                      • JCattheATM

                                                                                                                        yesterday at 5:55 PM

                                                                                                                        > The idea that you can remove the bullshit by simply rephrasing also assumes that the person knows enough to know what is bullshit. This has not been true from what I've seen of people using AI. Besides, if you already know what is bullshit, you wouldn't be using it to learn the subject.

                                                                                                                          Take coding as an example: if you're a programmer, you can spot the bullshit (e.g. made-up libraries), and rephrasing can result in entire code being written, which can be an immense time saver.

                                                                                                                        Other disciplines can do the same in analogous ways.
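
One cheap, mechanical version of "spotting made-up libraries" is to check whether each module a generated snippet imports can actually be found in the current environment. This is a minimal sketch, assuming Python and only the standard library; `missing_imports` and the sample snippet are hypothetical names for illustration, not anything from the thread.

```python
import ast
import importlib.util


def missing_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found locally."""
    wanted = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" -> check the top-level package "a"
            wanted.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # absolute "from a.b import c" -> check "a"; relative imports are skipped
            wanted.add(node.module.split(".")[0])
    return sorted(name for name in wanted if importlib.util.find_spec(name) is None)


# Hypothetical LLM output: one real module, one invented one.
generated = "import json\nfrom totally_made_up_lib_xyz import magic\n"
print(missing_imports(generated))
```

This only catches modules that do not exist locally. An invented function on a real library still needs a careful read or a test.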

                                                                                                                • CamperBob2

                                                                                                                  last Thursday at 7:07 PM

                                                                                                                    > What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit.

                                                                                                                  You realize that all you have to do to deal with questions like "Marathon Crater" is ask another model, right? You might still get bullshit but it won't be the same bullshit.

                                                                                                                    • thatjoeoverthr

                                                                                                                      last Thursday at 7:34 PM

                                                                                                                        I was thinking about a self verification method on this principle, lately. Any specific-enough claim, e.g. „the Marathon crater was discovered by ” can be reformulated as a Jeopardy-style prompt. „This crater was discovered by ” and you can see a failure to match. You need some raw intelligence to break it down though.
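
The round trip described above can be sketched roughly as follows. Everything here is an assumption for illustration: `ask` stands in for whatever function sends a prompt to a model, `stub_model` is a canned reply so the example runs offline, and "Dr. Placeholder" is a made-up name, not a claim about any real discoverer.

```python
from typing import Callable


def jeopardy_check(ask: Callable[[str], str], subject: str, kind: str,
                   predicate: str, obj: str) -> bool:
    """Hide the subject of a claim, ask for it back, and compare.

    `ask` is a hypothetical stand-in for any function that sends a prompt
    to a model and returns its reply as text.
    """
    clue = f"This {kind} {predicate} {obj}. Name it."
    reply = ask(clue)
    # Crude match: does the reply mention the original subject at all?
    return subject.casefold() in reply.casefold()


def stub_model(prompt: str) -> str:
    """Canned reply so the sketch runs offline; a real model call goes here."""
    return "That sounds like Some Other Crater."


# "Dr. Placeholder" is a made-up discoverer standing in for a model's claim.
ok = jeopardy_check(stub_model, "Marathon crater", "crater",
                    "was discovered by", "Dr. Placeholder")
print(ok)  # False: the reverse prompt did not lead back to the subject
```

The substring comparison is deliberately naive. Breaking a free-form answer into checkable claims is the part that still needs "raw intelligence".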

                                                                                                                      • Night_Thastus

                                                                                                                        last Thursday at 7:29 PM

                                                                                                                        Without checking every answer it gives back to make sure it's factual, you may be ingesting tons of bullshit answers.

                                                                                                                        In this particular answer model A may get it wrong and model B may get it right, but that can be reversed for another question.

                                                                                                                        What do you do at that point? Pay to use all of them and find what's common in the answers? That won't work if most of them are wrong, like for this example.

                                                                                                                        If you're going to have to fact check everything anyways...why bother using them in the first place?
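
The "find what's common in the answers" idea amounts to a majority vote, and a small sketch shows why it does not rescue this case. The replies below are invented for illustration and `consensus` is a hypothetical helper; no real model output is being quoted.

```python
from collections import Counter


def consensus(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of models that gave it.

    `answers` must be non-empty. Answers are compared after trimming
    whitespace and ignoring case.
    """
    normalized = [a.strip().casefold() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)


# Made-up replies from four hypothetical models to the Marathon crater prompt.
replies = ["on Mars", "On Mars", "on mars ", "I don't know"]
answer, agreement = consensus(replies)
print(answer, agreement)  # on mars 0.75
```

Three of four agree, yet the lone "I don't know" is the reply the thread treats as correct. Agreement measures how correlated the models are, not whether they are right.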

                                                                                                                          • CamperBob2

                                                                                                                            last Thursday at 7:33 PM

                                                                                                                              > If you're going to have to fact check everything anyways...why bother using them in the first place?

                                                                                                                            "If you're going to have to put gas in the tank, change the oil, and deal with gloves and hearing protection, why bother using a chain saw in the first place?"

                                                                                                                            Tool use is something humans are good at, but it's rarely trivial to master, and not all humans are equally good at it. There's nothing new under that particular sun.

                                                                                                                              • Night_Thastus

                                                                                                                                last Thursday at 7:37 PM

                                                                                                                                The difference is consistency. You can read a manual and know exactly how to oil and refill the tank on a chainsaw. You can inspect the blades to see if they are worn. You can listen to it and hear how it runs. If a part goes bad, you can easily replace it. If it's having troubles, it will be obvious - it will simply stop working - cutting wood more slowly or not at all.

                                                                                                                                The situation with an LLM is completely different. There's no way to tell that it has a wrong answer - aside from looking for the answer elsewhere which defeats its purpose. It'd be like using a chainsaw all day and not knowing how much wood you cut, or if it just stopped working in the middle of the day.

                                                                                                                                And even if you KNOW it has a wrong answer (in which case, why are you using it?), there's no clear way to 'fix' it. You can jiggle the prompt around, but that's not consistent or reliable. It may work for that prompt, but that won't help you with any subsequent ones.

                                                                                                                                  • CamperBob2

                                                                                                                                    last Thursday at 7:51 PM

                                                                                                                                    The thing is, nothing you've said is untrue for any search engine or user-driven web site. Only a reckless moron would paste code they find on Stack Overflow or Github into their project without at least looking it over. Same with code written by LLMs. The difference is, just as the LLM can write unit tests to help you deal with uncertainty, it can also cross-check the output of other LLMs.

                                                                                                                                    You have to be careful when working with powerful tools. These tools are powerful enough to wreck your career as quickly as a chain saw can send you to the ER, so... have fun and be careful.

                                                                                                                                      • skydhash

                                                                                                                                        last Thursday at 9:00 PM

                                                                                                                                        The nice thing about SO and Github is that there's little to no reason there for things to not work, at least in the context where you found the code. The steps are getting the context, assuming it's true based on various indicators (mostly reputation) and then continuing on with understanding the snippet.

                                                                                                                                        But with LLMs, every word is a probability factor. Assuming the first paragraph is true has no impact on the rest.

                                                                                                                • shawabawa3

                                                                                                                  yesterday at 2:01 PM

                                                                                                                  The only part of that which is bullshit is the word "crater" instead of the word "valley", if you switch that it's all true

                                                                                                                  • silverquiet

                                                                                                                    last Thursday at 6:35 PM

                                                                                                                    > (Obviously all of that is bullshit.)

                                                                                                                    It isn't obvious to me - that is rather plausible and a cute story.

                                                                                                                • josh2600

                                                                                                                  yesterday at 10:27 AM

                                                                                                                    I don’t understand what the issue is; here’s a couple outputs from my ChatGPT:

                                                                                                                  Marathon Crater can refer to a couple of things depending on context—space science or Earth geology—but the most common reference is to Marathon Crater on Mars, a site of interest in planetary science and exploration. Here’s a breakdown:

                                                                                                                    ---

                                                                                                                    1. Marathon Crater (Mars)

                                                                                                                    Location: • Found in the Meridiani Planum region of Mars. • Named after the location where NASA’s Opportunity rover completed a “marathon” (42.2 kilometers or 26.2 miles) of travel on the Martian surface in March 2015.

                                                                                                                    Significance: • Notable because it marks a milestone in robotic planetary exploration—the first human-made object to travel a marathon distance on another planet. • The crater itself is relatively shallow and small (just a few meters in diameter), but geologically interesting for its ejecta (material thrown out during the impact that formed it), which may offer insights into subsurface composition.

                                                                                                                    Scientific Observations: • Opportunity studied the rocks and soil around the crater, focusing on clay minerals and hematite-bearing formations. • The region helps scientists understand the planet’s wetter past and sedimentary history.

                                                                                                                    ---

                                                                                                                  2. Marathon Crater (Moon or Earth - less common)

                                                                                                                  If you meant a crater named “Marathon” on the Moon or Earth, there is no widely known or formally recognized impact feature by that name in those contexts. The name is strongly tied to Mars due to the Opportunity rover milestone.

                                                                                                                    ---

                                                                                                                  Symbolic Meaning

                                                                                                                  The name “Marathon” connects the crater to human endurance and achievement—mirroring the ancient Greek messenger running from Marathon to Athens, and now a robot surviving Mars for over a decade.

                                                                                                                    ---

                                                                                                                  Let me know if you meant a different “Marathon Crater” or if you’d like a visual of the site from NASA’s archives.

                                                                                                                  _______

                                                                                                                  Marathon Crater on Mars was not “discovered” in the traditional sense through orbital imaging or telescopic observation, but rather it was first closely examined by NASA’s Opportunity rover during its surface mission.

                                                                                                                    Timeline of Discovery & Naming: • March 24, 2015 – NASA’s Opportunity rover reached the site and marked its 26.2-mile milestone (the length of a marathon). • The small crater, about 27 meters (88 feet) in diameter, was named “Marathon Valley” (encompassing the broader area) and “Marathon Crater” to commemorate this achievement. • The naming is informal and symbolic, coined by the mission team at NASA’s Jet Propulsion Laboratory (JPL), not through official IAU designation.

                                                                                                                    Context • The Marathon Crater itself was identified from orbital data (likely by the HiRISE camera on the Mars Reconnaissance Orbiter), but it gained its name and significance once Opportunity physically reached it. • Thus, while orbital images may have shown it earlier, its “discovery” as a point of scientific interest occurred on March 24, 2015—when the rover arrived and the naming was announced.

                                                                                                                  Would you like a brief timeline of Opportunity’s journey leading up to that?

                                                                                                                    • troupo

                                                                                                                      yesterday at 12:22 PM

                                                                                                                        > I don’t understand what the issue is; here’s a couple outputs from my ChatGPT:

                                                                                                                      The issue is that the response is a hallucination. There's no Marathon crater on Mars

                                                                                                                • allemagne

                                                                                                                  last Thursday at 8:14 PM

                                                                                                                  >A man and his cousin are in a car crash. The man dies, but the cousin is taken to the emergency room. At the OR, the surgeon looks at the patient and says: “I cannot operate on him. He’s my son.” How is this possible?

                                                                                                                  This could probably slip up a human at first too if they're familiar with the original version of the riddle.

                                                                                                                  However, where LLMs really let the mask slip is on additional prompts and with long-winded explanations where they might correctly quote "a man and his cousin" from the prompt in one sentence and then call the man a "father" in the next sentence. Inevitably, the model concludes that the surgeon must be a woman.

                                                                                                                  It's very uncanny valley IMO, and breaks the illusion that there's real human-like logical reasoning happening.

                                                                                                                    • tacitusarc

                                                                                                                      yesterday at 2:05 AM

                                                                                                                      Thank you so much for this bit of humor. After ChatGPT first responds with the surgeon being the mom, I asked it to read the riddle more carefully, thinking it might figure it out with a little prompting. I got this gem of a response:

                                                                                                                      Ah — now I see the twist. The patient is the cousin, but the surgeon says “He’s my son.”

                                                                                                                      That means the man who died was the surgeon.

                                                                                                                      So how is it possible? Because:

                                                                                                                      The surgeon is the cousin’s father — and the man who died was also the surgeon.

                                                                                                                      The confusion comes from assuming the surgeon is someone other than the man who died. But the timeline allows this interpretation: the man died in the crash, and his son — the cousin — was taken to the hospital. The surgeon arrives, sees his son, and says he cannot operate.

                                                                                                                      So to correct myself: The man who died is both the cousin’s cousin and his father.

                                                                                                                      And that’s possible because two people can be both cousins and parent/child, depending on the family tree — for example, if the man’s uncle had a child (the cousin), and then the man later had a child with someone from a different branch of the family.

                                                                                                                      Want a diagram of how that family relationship could work?

                                                                                                                        • yes_man

                                                                                                                          yesterday at 7:31 AM

                                                                                                                          Now I really want to see that diagram


                                                                                                                            • vintermann

                                                                                                                              yesterday at 8:07 AM

                                                                                                                                LLMs are bad at family relations. My test question used to be explaining Ahnentafel numbers (you are 1, any person's father is 2x, any person's mother is 2x+1), then explaining X-chromosome inheritance (men inherit only from their mothers, women in addition get their father's X chromosome unchanged), and then asking them to list the Ahnentafel numbers of ancestors a man may have inherited X DNA from, in order, up to some generation.

                                                                                                                              (This is OEIS sequence A280873)

                                                                                                                              But it turns out that's overkill. Just giving them a single Ahnentafel number and asking if you could have inherited X DNA from them, trips them up.

                                                                                                                              But it turns out even that is overkill. Leave out X DNA entirely, and ask them to describe how Ahnentafel number 67 is related to you, and that's too recursive for them to figure it out.
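
For reference, the two rules stated above are enough to compute the list mechanically. This is a minimal sketch under those rules only: the function names are made up for illustration, the subject (number 1) is assumed male and is included in the output, and the result has not been checked here against the OEIS entry the comment cites.

```python
def is_male(n: int) -> bool:
    """Ahnentafel 1 is the (male) subject; otherwise even = father, odd = mother."""
    return n == 1 or n % 2 == 0


def can_carry_x(n: int) -> bool:
    """True if ancestor n could have passed X DNA down to the male subject 1."""
    while n > 1:
        child = n // 2  # the person whose parent n is
        if is_male(n) and is_male(child):
            return False  # a father passes no X chromosome to his son
        n = child
    return True


def x_ancestors(generations: int) -> list[int]:
    """Ahnentafel numbers of possible X ancestors, up to `generations` back."""
    return [n for n in range(1, 2 ** (generations + 1)) if can_carry_x(n)]


print(x_ancestors(3))   # [1, 3, 6, 7, 13, 14, 15]
print(can_carry_x(67))  # False
```

The only forbidden step is father-to-son, so the check walks from the ancestor down to the subject and fails on the first such step.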

                                                                                                                                • madars

                                                                                                                                  yesterday at 2:15 PM

                                                                                                                                    Neat! As a human one can recognize that this question embeds a computation and use a standard trick: explicitly ask the LLM to use a program to generate the answer. LLMs are great at generating code but not necessarily that great at executing it "in their head" (e.g., "what's the numeric integral of foo?" vs "write a Python program that computes foo"). Models notice some instances of this themselves (I guess by now they all know that they are bad calculators, so they whip out code to do multiplication), but a lot still remain. Concretely, Claude 3.7 with "How is ahnentafel number 67 related to me? Use a program to help you." gets to "your father's father's father's father's mother's mother", whereas without the hint it indeed trips up on arithmetic and logic errors.
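
The kind of program such a hint leads to can be very short. This sketch assumes only the numbering rule given earlier in the thread (father is 2x, mother is 2x+1); `describe` is a hypothetical name, not the code any model actually produced.

```python
def describe(n: int) -> str:
    """Spell out how Ahnentafel number n is related to the subject (number 1)."""
    if n < 1:
        raise ValueError("Ahnentafel numbers start at 1")
    # Drop the '0b' prefix and the leading 1 (the subject). Each remaining
    # bit is one generation back: 0 = father (2x), 1 = mother (2x+1).
    steps = ["father" if bit == "0" else "mother" for bit in bin(n)[3:]]
    return "you" if not steps else "your " + "'s ".join(steps)


print(describe(67))  # your father's father's father's father's mother's mother
```

67 is 1000011 in binary, so the path is four fathers followed by two mothers, matching the answer quoted above.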

                                                                                                                      • mitthrowaway2

                                                                                                                        yesterday at 12:21 AM

                                                                                                                        This works even with a completely absurd version of the riddle. Here's one I just tried:

                                                                                                                        > A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks "I can't operate on this car, he's my surgeon!" How is this possible?

                                                                                                                        Answer from the LLM:

                                                                                                                        > The answer is that the ER person is a woman, and she's the surgeon's mother. Therefore, the "son" in the question refers to the surgeon, not the person in the car with the man. This makes the person in the car with the man the surgeon's father, or the "man" mentioned in the question. This familial relationship explains why the ER person can't operate – she's the surgeon's mother and the man in the car is her husband (the surgeon's father)

                                                                                                                          • aradox66

                                                                                                                            yesterday at 3:32 PM

                                                                                                                            This sent deepseek into a real tizzy. It ended with

                                                                                                                            The "car" is actually a person named Carol or Carl (likely the father), who is a surgeon. The ER doctor is saying, "I can't operate on Carol/Carl, he's my surgeon!" This means that the person brought to the hospital after the accident is the surgeon of the ER doctor, hence the doctor cannot operate on their own surgeon.

                                                                                                                            Not bad actually

                                                                                                                            • pjm331

                                                                                                                              yesterday at 4:36 PM

                                                                                                                              here is the amazing reply that I received:

                                                                                                                              Got it. Here's your riddle again:

                                                                                                                              "A son and his man are in a car accident. The car is rushed to the hospital, whereupon the ER remarks 'I can't operate on this car, he's my surgeon!' How is this possible?"

                                                                                                                              Answer: The car is a Transformer—a sentient robot who is also a surgeon. The ER recognizes the car (a Transformer) as their own surgeon.

                                                                                                                              Wild ride, but it checks out.

                                                                                                                                • mitthrowaway2

                                                                                                                                  yesterday at 4:42 PM

                                                                                                                                  Okay, taking a step back, I have to admit I'm impressed by that answer. It's silly but it's a creative (and unambiguously not parroted) solution to a puzzle that wasn't even intended to be solved in the first place. What model was that?

                                                                                                                                    • pjm331

                                                                                                                                      yesterday at 4:48 PM

                                                                                                                                      whatever the default is for chatgpt via browser

                                                                                                                                      also i should clarify that at first it replied all about how this riddle is like the other one but then i told it to just answer my riddle and not tell me about riddles that it is similar to

                                                                                                                              • binarysneaker

                                                                                                                                yesterday at 1:36 AM

                                                                                                                                This was gpt-4o ...

                                                                                                                                This is a twist on a classic riddle designed to expose unconscious gender bias.

                                                                                                                                The correct version usually goes:

                                                                                                                                A father and his son are in a car accident. The father dies at the scene, and the son is rushed to the hospital. The surgeon looks at the boy and says, “I can’t operate on him — he’s my son!”

                                                                                                                                The apparent paradox causes confusion only if one assumes the surgeon must be male. The resolution: the surgeon is the boy’s mother.

                                                                                                                                Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.

                                                                                                                                  • what

                                                                                                                                    yesterday at 3:46 AM

                                                                                                                                    This answer is still wrong though?

                                                                                                                                    > Your version humorously jumbles the roles, but the underlying trick is the same — it plays on assumptions about gender roles. Nice remix.

                                                                                                                                    Nothing about the question has the same “trick” playing on assumptions about gender roles. It’s just complete nonsense.

                                                                                                                                    These gotchas have probably been added into the training.

                                                                                                                                • maebert

                                                                                                                                  yesterday at 6:25 PM

                                                                                                                                  "Simple. The car is actually a metaphor for generational trauma."

                                                                                                                                  Honestly... chatGPT kind of wins this one.

                                                                                                                                  • l2silver

                                                                                                                                    yesterday at 12:55 AM

                                                                                                                                    I think what this proves is that the LLM knows the riddle, and is trying to give the expected answer without paying attention to the insane wording. So maybe this is a good way to fool an LLM.

                                                                                                                                    • genewitch

                                                                                                                                      yesterday at 10:46 AM

                                                                                                                                      Top K and repetition penalty and the other main knob I forget the name of are partially to blame here, I think. I can test it tomorrow with this exact prompt and fiddling those values.

                                                                                                                                      That pattern, not the words, is in there a lot. That riddle was posted everywhere online, in email chains, etc. I think if you let it choose from more than the top 40 and let it "stutter" with repetitions it might realize the riddle is a non-sequitur (is that the right term?)

                                                                                                                                      And the third knob is not temperature, although I'd try turning that up first just to check. Yes, up.
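For readers unfamiliar with the knobs mentioned above, here is a rough sketch of how top-k, repetition penalty, and temperature typically act on a model's next-token scores. It is a generic illustration, not any particular library's implementation: the function name `sample_next_token` and its parameters are hypothetical, the repetition penalty uses one common formulation (divide positive logits, multiply negative ones), and the unnamed third knob is not covered.

```python
import math
import random


def sample_next_token(logits, history, top_k=40, repetition_penalty=1.0,
                      temperature=1.0, rng=random):
    """Pick one token id from raw logits (a list of floats indexed by token id)."""
    adjusted = list(logits)

    # Repetition penalty: make tokens that already appeared less likely.
    # A penalty of 1.0 leaves the logits unchanged.
    for tok in set(history):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty

    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    adjusted = [x / temperature for x in adjusted]

    # Top-k: keep only the k highest-scoring tokens as candidates.
    ranked = sorted(range(len(adjusted)), key=lambda i: adjusted[i], reverse=True)
    candidates = ranked[:top_k]

    # Softmax over the surviving candidates, then draw one at random.
    peak = max(adjusted[i] for i in candidates)
    weights = [math.exp(adjusted[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

In this sketch, the proposed experiment corresponds to raising `top_k` above 40, leaving `repetition_penalty` at 1.0 so repeats are not discouraged, and raising `temperature`. Whether that actually makes a model notice the nonsense riddle is not something the sketch can show.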

                                                                                                                                      • saalweachter

                                                                                                                                        yesterday at 1:00 AM

                                                                                                                                        God bless you man, for sharing this with us.

                                                                                                                                        • Udo

                                                                                                                                          yesterday at 9:13 AM

                                                                                                                                          I had to try this gem, it's my new benchmark! o4-mini-high also fails spectacularly, even after repeated feedback. However, 4.5 (the impractibly large demo model) gets it right:

                                                                                                                                          It’s a nonsense twist on the classic lateral thinking puzzle:

                                                                                                                                          The original puzzle goes: “A father and son are in a car accident. The father dies, and the son is rushed to the ER. The surgeon says, ‘I can’t operate on him—he’s my son.’” The intended answer to that puzzle challenges the listener to recognize the surgeon is the child’s mother, confronting implicit gender biases.

                                                                                                                                          Your version humorously mangles it by swapping roles (“son and his man”) and objectifying the victims as cars, creating a logical absurdity. The sentence “I can’t operate on this car, he’s my surgeon!” doesn’t parse logically, indicating it’s a playful distortion rather than a coherent scenario

                                                                                                                                          • iamnotagenius

                                                                                                                                            yesterday at 12:14 PM

                                                                                                                                            [dead]

                                                                                                                                        • fergonco

                                                                                                                                          last Thursday at 9:18 PM

                                                                                                                                          > If the surgeon were the father of the man (the one who died), then the cousin couldn’t be his son (unless there's some very unusual family structure going on involving double relationships, which riddles don’t usually intend).

                                                                                                                                          > Therefore, the only straightforward explanation is:

                                                                                                                                          > The surgeon is the cousin’s parent — specifically, his mother.

                                                                                                                                          Imagine a future where this reasoning in a trial decides whether you go to jail or not.

                                                                                                                                          • moconnor

                                                                                                                                            yesterday at 8:26 AM

                                                                                                                                            o3 was the only model to get this right for me:

                                                                                                                                            "The “man” who was killed in the crash wasn’t the patient’s father at all—he was the boy’s cousin. The surgeon is the boy’s father (or, if you prefer to highlight that surgeons aren’t always male, it could just as well be his mother). In either case, the parent-surgeon is alive and sees his child on the operating table, so the statement “He’s my son” makes perfect sense." - https://chatgpt.com/share/680b470d-3a44-800a-9b2e-d10819168d...

                                                                                                                                            gemini-2.5-pro, o4-mini and gpt 4.5 all failed and said the surgeon is the boy's mother.

                                                                                                                                            • crazygringo

                                                                                                                                              last Thursday at 8:17 PM

                                                                                                                                              But this is going to be in every AI's training set. I just fed ChatGPT your exact prompt and it gave back exactly what I expected:

                                                                                                                                              This is a classic riddle that challenges assumptions. The answer is:

                                                                                                                                              The surgeon is the boy’s mother.

                                                                                                                                              The riddle plays on the common stereotype that surgeons are male, which can lead people to overlook this straightforward explanation.

                                                                                                                                                • hnuser123456

                                                                                                                                                  last Thursday at 9:33 PM

                                                                                                                                                  The surgeon could be the cousin's mom or dad. The cousin's dad didn't die in the crash, his cousin did. The question "how is this possible?" implies there is some sort of contradiction when there isn't any at all. It has nothing to do with sexism, and to say it does reflects a bias in the reader causing them to "spidey sense" a cultural bugaboo when it's utterly irrelevant in this scenario.

                                                                                                                                                    • harrall

                                                                                                                                                      last Thursday at 10:50 PM

                                                                                                                                                      Can someone explain to me how I read it wrong?

                                                                                                                                                      I read it as 2 cousins are in an accident and 1 of the cousins is the son of the surgeon.

                                                                                                                                                      What was the contradictory statement that I missed?

                                                                                                                                                        • sebastialonso

                                                                                                                                                          yesterday at 4:03 AM

                                                                                                                                                          You read it right. There's no contradiction. The famous original bit started with "a man and his son". This bit is certainly part of the LLM's training corpus, so it's expected to acknowledg it when you mention it.

                                                                                                                                                          The thing is, you didn't mention that bit to the LLM. You mentioned a completely different scenario, basically two persons who happen to be cousins. But you used the same style when presenting it. The issue is not a hidden contradiction or a riddle, the issue is that the LLM completely ignored the logical consequences of the scenario you presented.

                                                                                                                                                          It's like asking it about the name of the brave greek hero in the battle where the famous Trojan Cow was present. If you get "Achilles" is obviously wrong, there was never a Trojan Cow to begin with!

                                                                                                                                                          • judahmeek

                                                                                                                                                            last Thursday at 10:58 PM

                                                                                                                                                            There isn't a contradiction. Making the LLM look for a nonexistent contradiction is the point of this prompt.

                                                                                                                                                    • allemagne

                                                                                                                                                      last Thursday at 8:20 PM

                                                                                                                                                      Yeah this is the issue with the prompt, it also slips up humans who gloss over "cousin".

                                                                                                                                                      I'm assuming that pointing this out leads you the human to reread the prompt and then go "ah ok" and adjust the way you're thinking about it. ChatGPT (and DeepSeek at least) will usually just double and triple down and repeat "this challenges gender assumptions" over and over.

                                                                                                                                                        • roughly

                                                                                                                                                          last Thursday at 8:46 PM

                                                                                                                                                          This is a good one. I just argued with Gemini 2.5 Pro for a few minutes on this and it absolutely would not yield.

                                                                                                                                                          • last Thursday at 8:25 PM

                                                                                                                                                            • last Thursday at 8:46 PM

                                                                                                                                                              • crazygringo

                                                                                                                                                                last Thursday at 8:29 PM

                                                                                                                                                                Oh, I see. I mean yeah, I totally missed that too.

                                                                                                                                                                I mean, when we see something we recognize at a glance, we don't usually double-check if it's exactly correct, because we assume the speaker isn't trying to trick us.

                                                                                                                                                                I make typos in ChatGPT prompts all the time, and I'm consistently impressed because it answers what I actually meant to ask. So to me, this feels like a feature not a bug.

                                                                                                                                                                I followed up with:

                                                                                                                                                                > So the first five words are, "a man and his cousin" -- not his son. Does that change the riddle?

                                                                                                                                                                And got back:

                                                                                                                                                                > Great observation — yes, that wording does make the riddle more subtle.

                                                                                                                                                                Let’s break it down:

                                                                                                                                                                • “A man and his cousin are in a car crash.”

                                                                                                                                                                • “The man dies, but the cousin is taken to the emergency room.”

                                                                                                                                                                • At the OR, the surgeon says: “I cannot operate on him. He’s my son.”

                                                                                                                                                                So if the cousin is the one taken to the OR, and the surgeon says he’s my son, then the cousin is the surgeon’s son.

                                                                                                                                                                Seems fine to me?

                                                                                                                                                                  • allemagne

                                                                                                                                                                    last Thursday at 9:48 PM

                                                                                                                                                                    I don't know about it being a feature, but humans can definitely make the same mistake, and the follow-up here admittedly doesn't seem too egregious.

                                                                                                                                                                    To me it's still a little "off" considering that your conversational partner just got called out for missing a detail that should change their entire reading of the riddle, but it's totally fair that you can imagine realistic human emotional states to explain it.

                                                                                                                                                                    FWIW, when I tried the same thing with GPT-4o the model went off bizarrely speculating about misdirection about "the man" being the surgeon's son:

                                                                                                                                                                    >Good catch—and yes, that's exactly where the misdirection comes in.

                                                                                                                                                                    The riddle says:

                                                                                                                                                                        “A man and his cousin are in a car crash. The man dies, but the cousin is taken to the ER...”
                                                                                                                                                                    
                                                                                                                                                                    Then the surgeon says:

                                                                                                                                                                        “I cannot operate on him. He’s my son.”
                                                                                                                                                                    
                                                                                                                                                                    So here's the trick:

                                                                                                                                                                        The man who died is not the surgeon's son.
                                                                                                                                                                    
                                                                                                                                                                        The cousin who survived is the surgeon's son.
                                                                                                                                                                    
                                                                                                                                                                    The confusion comes from people assuming that “the man” who died must be the son. But the riddle never says that. It’s a subtle shift of attention designed to trip you up. Clever, right?

                                                                                                                                                            • abenga

                                                                                                                                                              last Thursday at 8:20 PM

                                                                                                                                                              That is the exact wrong answer that all models give.

                                                                                                                                                                • krick

                                                                                                                                                                  last Thursday at 9:18 PM

                                                                                                                                                                  Technically, it isn't "wrong". It could well be the guy's mother. But I'm nitpicking, it actually is a good example. I tried ChatGPT twice in new chats, with and without "Reason", and both times it gave me nonsensical explanations to "Why mother? Couldn't it be a father?" I was actually kinda surprised, since I expected "reasoning" to fix it, but it actually made things worse.

                                                                                                                                                          • benjamin_mahler

                                                                                                                                                            yesterday at 7:05 AM

                                                                                                                                                            Grok 3 beta:

                                                                                                                                                            The surgeon is the cousin's father. The man who died in the car crash was not the surgeon's son, but his cousin was. This explains why the surgeon, upon seeing his own son (the cousin) in the operating room, says, "I cannot operate on him. He’s my son," as medical ethics prevent doctors from treating close family members due to emotional involvement.

                                                                                                                                                              • echoangle

                                                                                                                                                                yesterday at 9:07 AM

                                                                                                                                                                Also bad, why does it think the surgeon is the father if it could also be the mother?

                                                                                                                                                                  • bufferoverflow

                                                                                                                                                                    yesterday at 2:19 PM

                                                                                                                                                                    It's not bad, because it's one of the valid solutions to that riddle.

                                                                                                                                                                    How often do you expect to have every possible answer to your question?

                                                                                                                                                                      • echoangle

                                                                                                                                                                        yesterday at 3:41 PM

                                                                                                                                                                        It is bad, because it’s one possible solution, but it’s phrased like it’s the single possible solution.

                                                                                                                                                            • FrostAutomata

                                                                                                                                                              yesterday at 9:07 AM

                                                                                                                                                              Interestingly, I've seen weaker models get a similar "riddle" right while a stronger one fails. It may be that the models need to be of a certain size to learn to overfit the riddles.

                                                                                                                                                              • chimprich

                                                                                                                                                                yesterday at 1:13 PM

                                                                                                                                                                > This could probably slip up a human at first too [...] > breaks the illusion that there's real human-like logical reasoning happening

                                                                                                                                                                This does seem like the sort of error a human might make. Isn't the problem here that the model is using reasoning that is too human-like? I.e. error-prone pattern matching rather than formal logic?

                                                                                                                                                                  • allemagne

                                                                                                                                                                    yesterday at 3:33 PM

                                                                                                                                                                    It's not the initial mistake that tends to read as inhuman to me, it's the follow-up responses where the model doesn't seem to be able to understand or articulate the mistake it has made.

                                                                                                                                                                    A human or an LLM accurately predicting a human conversation would probably say something like "ah I see, I did not read the riddle closely enough. This is an altered version of the common riddle..." etc. Instead it really seems to flail around, confuse concepts, and appear to insist that it has correctly made some broader point unrelated to the actual text it's responding to.

                                                                                                                                                                • s_dev

                                                                                                                                                                  last Thursday at 8:39 PM

                                                                                                                                                                  I feel a bit stupid here --- why can't the surgeon be a man and must be a woman?

                                                                                                                                                                    • saati

                                                                                                                                                                      last Thursday at 8:42 PM

                                                                                                                                                                      Because the original is a man and his father, it's a test for gender bias.

                                                                                                                                                                        • judahmeek

                                                                                                                                                                          last Thursday at 11:00 PM

                                                                                                                                                                          Actually, it seems to be a test of how much the LLM relies on its training set.

                                                                                                                                                                          • bavarianbob

                                                                                                                                                                            last Thursday at 9:10 PM

                                                                                                                                                                            Sorry, what?

                                                                                                                                                                              • LaffertyDev

                                                                                                                                                                                last Thursday at 9:16 PM

                                                                                                                                                                                Presumably, the original quote that would _not_ stump an LLM is "A father and a son are involved in a car accident. The father dies, and the son is taken to the emergency room. At the emergency room, the surgeon remarks, 'I cannot operate on this person, he is my son.' How is this possible?"

                                                                                                                                                                                Where the original gotcha is that the surgeon can be the son's mother or other adoptive parent.

                                                                                                                                                                                The modification catches the LLM because with the modification, the surgeon could just be the cousin's parent -- father or mother -- so there is no gender/sex at play here but the LLM continues to remark that there is, therefore exposing its statistical training sets.

                                                                                                                                                                                • briannotbrain

                                                                                                                                                                                  last Thursday at 9:21 PM

                                                                                                                                                                                  The original, well-known version of the riddle starts "A man and his son..." so that it appears to present a paradox if your instinctive assumption is that the surgeon must be a man. The op's prompt alters this so that there is no potential paradox, and it tests whether the model is reasoning from the prompt as written, regardless of the presence of the original riddle in its training data.

                                                                                                                                                                                  • fragmede

                                                                                                                                                                                    last Thursday at 9:19 PM

                                                                                                                                                                                    the unaltered question is as follows:

                                                                                                                                                                                    A father and his son are in a car accident. The father dies at the scene and the son is rushed to the hospital. At the hospital the surgeon looks at the boy and says "I can't operate on this boy, he is my son." How can this be?

                                                                                                                                                                                    to spoil it:

                                                                                                                                                                                    the answer is to reveal an unconscious bias based on the outdated notion that women can't be doctors, so the answer that the remaining parent is the mother won't occur to some, showing that consciously they might not still hold that notion, but they still might, subconsciously.

                                                                                                                                                                                      • matkoniecz

                                                                                                                                                                                        yesterday at 9:34 AM

                                                                                                                                                                                        Thanks for original version AND explanation, I was highly confused by entire discussion.

                                                                                                                                                                                        Still confused how original can be confusing.

                                                                                                                                                                            • potatoman22

                                                                                                                                                                              yesterday at 12:09 AM

                                                                                                                                                                              It could be a man, but most relationships are heterosexual

                                                                                                                                                                          • _factor

                                                                                                                                                                            yesterday at 11:31 AM

                                                                                                                                                                            In a similar but different vein: Two people are sitting side by side in a police car. One just committed a crime. What is their profession?

                                                                                                                                                                            They always say police officer instead of reasoning through that maybe an innocent person and a the crime committer are in the back seat.

                                                                                                                                                                            • thih9

                                                                                                                                                                              yesterday at 10:24 AM

                                                                                                                                                                              I’m not 100% sold; as you say, this could trip up a human too to some extent.

                                                                                                                                                                              I’m guessing my answers to some college exam questions read similarly; i.e. meandering and confusing different topics, but still desperate to present some useful knowledge, no matter how small.

                                                                                                                                                                                • allemagne

                                                                                                                                                                                  yesterday at 3:37 PM

                                                                                                                                                                                  That's a good framing to explain a possible state of mind I hadn't considered, but I would say that this isn't even close to the caliber of question that would prompt the average human to give that kind of response.

                                                                                                                                                                              • nearbuy

                                                                                                                                                                                yesterday at 6:37 AM

                                                                                                                                                                                o3 got this one right when I tried it, but o4-mini and GPT-4.5 did not. o3's response:

                                                                                                                                                                                Because the surgeon is the patient’s parent. Here’s one way it lines up:

                                                                                                                                                                                1. The patient (“the cousin”) is the surgeon’s son.

                                                                                                                                                                                2. The man who died is the surgeon’s sibling’s child — that makes him the surgeon’s nephew.

                                                                                                                                                                                3. Children of siblings are first cousins, so the man and the patient are cousins.

                                                                                                                                                                                Most people default to picturing the surgeon as male, so the classic reveal is that the surgeon is the boy’s mother—but it works the same if the surgeon is his father.

                                                                                                                                                                                • windowshopping

                                                                                                                                                                                  yesterday at 4:20 AM

                                                                                                                                                                                  This one is brilliant.

                                                                                                                                                                              • manucardoen

                                                                                                                                                                                yesterday at 7:11 AM

                                                                                                                                                                                It's not good at making ASCII art. This, for example, is when I ask it for a realistic depiction of the Eiffel tower on fire:

                                                                                                                                                                                                       .
                                                                                                                                                                                                      .'.
                                                                                                                                                                                                      |o|
                                                                                                                                                                                                     .'o'.
                                                                                                                                                                                                     |.-.|
                                                                                                                                                                                                     '   '
                                                                                                                                                                                                    /     \
                                                                                                                                                                                                   |       |
                                                                                                                                                                                                __/_________\__
                                                                                                                                                                                               |###############|
                                                                                                                                                                                               |###############|
                                                                                                                                                                                               |###############|
                                                                                                                                                                                               |###############|
                                                                                                                                                                                               |###############|
                                                                                                                                                                                              /###############/|
                                                                                                                                                                                             /###############/ |
                                                                                                                                                                                            /###############/ /|
                                                                                                                                                                                           /###############/ / |
                                                                                                                                                                                          /###############/ / /|
                                                                                                                                                                                         /###############/ / / |
                                                                                                                                                                                        /###############/ / / /|
                                                                                                                                                                                       /###############/ / / / |
                                                                                                                                                                                      /###############/ / / / /|
                                                                                                                                                                                     '-----------------' / / / /
                                                                                                                                                                                      |   FIRE & ASH   |/ / / /
                                                                                                                                                                                       '---------------' / / /
                                                                                                                                                                                          ~ ~ ~ ~ ~ ~ ~ ~ / /
                                                                                                                                                                                         ~~ /
                                                                                                                                                                                          ~~ 
                                                                                                                                                                                           ~~     ⬆ SMOKE
                                                                                                                                                                                            ~~

                                                                                                                                                                                  • cucubeleza

                                                                                                                                                                                    yesterday at 11:54 AM

                                                                                                                                                                                    well, I asked GPT to draw Arnold Schwarzenegger:

                                                                                                                                                                                            ______
                                                                                                                                                                                           /      \
                                                                                                                                                                                          |  O  O  |
                                                                                                                                                                                          |   __   |
                                                                                                                                                                                           \______/
                                                                                                                                                                                             ||||
                                                                                                                                                                                           /--||--\ 
                                                                                                                                                                                         /    ||    \
                                                                                                                                                                                        |     ||     | 
                                                                                                                                                                                        |    /  \    |
                                                                                                                                                                                         \__/    \__/ 
                                                                                                                                                                                           ||    ||
                                                                                                                                                                                           ||    || 
                                                                                                                                                                                          / |    | \
                                                                                                                                                                                         /_/      \_\

                                                                                                                                                                                      • DuncanCoffee

                                                                                                                                                                                        yesterday at 12:04 PM

                                                                                                                                                                                        Reminds me of the abomination of nature you could build in Monkey Island 4

                                                                                                                                                                                        https://www.youtube-nocookie.com/embed/RiwL9awO5y0

                                                                                                                                                                                          • cucubeleza

                                                                                                                                                                                            yesterday at 4:37 PM

                                                                                                                                                                                            jeeeeeeesus christ that's horrible, but it's cool that you can do that

                                                                                                                                                                                        • KyleBerezin

                                                                                                                                                                                          yesterday at 2:54 PM

                                                                                                                                                                                          That's amazing. It really captured the likeness of ol' Arnold.

                                                                                                                                                                                      • FrostAutomata

                                                                                                                                                                                        yesterday at 9:03 AM

                                                                                                                                                                                        ASCII art is extremely difficult for LLMs due to how characters are converted into tokens without preserving their relative positions.
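
                                                                                                                                                                                        To make that claim concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the vocabulary is a made-up toy and the greedy longest-match rule only loosely imitates how subword tokenizers merge frequent character runs. It is not any real model's tokenizer. It shows two rows that both have a '|' in character column 6, yet that '|' lands at a different token index in each row.

```python
# Toy vocabulary (hypothetical): multi-character runs become a single token,
# loosely imitating how subword tokenizers merge frequent sequences.
VOCAB = ["####", "##", "    ", "  ", "#", " ", "|", "/", "\\", "_"]


def tokenize(line: str) -> list[str]:
    """Greedy longest-match tokenization over the toy VOCAB."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(line):
        for piece in pieces:
            if line.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # Character not in the toy vocabulary: emit it as its own token.
            tokens.append(line[i])
            i += 1
    return tokens


def token_index_of_column(tokens: list[str], column: int) -> int:
    """Return the index of the token that covers the given character column."""
    pos = 0
    for idx, tok in enumerate(tokens):
        if pos <= column < pos + len(tok):
            return idx
        pos += len(tok)
    raise IndexError(column)


# Two rows of "art" that are vertically aligned: both have '|' at column 6.
row_a = " " * 6 + "|####|"
row_b = " /####|" + " " * 4 + "|"

tokens_a = tokenize(row_a)
tokens_b = tokenize(row_b)

print(tokens_a, token_index_of_column(tokens_a, 6))  # '|' is token index 2
print(tokens_b, token_index_of_column(tokens_b, 6))  # '|' is token index 3
```

                                                                                                                                                                                        In this toy setup the vertical alignment that is obvious on screen is not visible in the token sequence itself; whether that is the main reason real models struggle with ASCII art is a separate question.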

                                                                                                                                                                                          • beklein

                                                                                                                                                                                            yesterday at 12:02 PM

                                                                                                                                                                                            Great point, but you need a good understanding of how LLMs work to understand this limitation. If you don't have an intuitive understanding, think of it like one of those draw-on-my-back games, except each new token is a new human in the loop: known words are like simple shapes you have felt a hundred times on your back and are easy for you to reproduce and change, while random ASCII strings are harder to _grasp_ and will produce a fuzzy output... all models are wrong, but some are useful.

                                                                                                                                                                                            https://www.youtube.com/watch?v=bA_DQHoxj34

                                                                                                                                                                                            • light_hue_1

                                                                                                                                                                                              yesterday at 12:40 PM

                                                                                                                                                                                              This isn't the reason. Models are pretty good at understanding relative positions. We put that in them and reward it a lot.

                                                                                                                                                                                              The issue is the same as why we don't use LLMs for image generation, even though they can nominally do that.

                                                                                                                                                                                              Image generation seems to need some amount of ability to revise the output in place. And it needs a big picture view to make local decisions. It doesn't lend itself to outputting pixel by pixel or character by character.

                                                                                                                                                                                            • yesbabyyes

                                                                                                                                                                                              yesterday at 7:37 AM

                                                                                                                                                                                              This is something I and a few of my colleagues have noticed, as we asked several models to draw ASCII art of a wasp, which is one of our logos. The results are hilarious, and only seem to get worse as you ask it to do better.

                                                                                                                                                                                              • bezbac

                                                                                                                                                                                                yesterday at 2:42 PM

                                                                                                                                                                                                I've read that the results improve if you ask them to write a program that creates the desired ASCII art. Haven't tried it myself yet so far.
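
                                                                                                                                                                                                For a sense of what such a program could look like, here is a minimal sketch that draws a filled disc instead of typing it out character by character. The function name and parameters are my own, and it is a stand-in for whatever a model would actually write, not a tested recipe for a wasp.

```python
def render_disc(radius=5, fill="#", empty="."):
    """Return the rows of an ASCII disc.

    x is halved before the distance test because terminal characters
    are roughly twice as tall as they are wide.
    """
    rows = []
    for y in range(-radius, radius + 1):
        row = "".join(
            fill if (x / 2) ** 2 + y ** 2 <= radius ** 2 else empty
            for x in range(-2 * radius, 2 * radius + 1)
        )
        rows.append(row)
    return rows


if __name__ == "__main__":
    print("\n".join(render_disc()))
```

                                                                                                                                                                                                The shape comes from a geometric test on coordinates, so the alignment between rows is handled by the loop rather than by the model keeping track of columns.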

                                                                                                                                                                                                • bn-l

                                                                                                                                                                                                  yesterday at 8:53 AM

                                                                                                                                                                                                  Art is highly subjective

                                                                                                                                                                                                    • ohgr

                                                                                                                                                                                                      yesterday at 8:57 AM

                                                                                                                                                                                                      I subjectively judge that as shit.

                                                                                                                                                                                              • alissa_v

                                                                                                                                                                                                yesterday at 5:48 AM

                                                                                                                                                                                                I asked a bunch of LLMs - 'Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.'

                                                                                                                                                                                                Obviously, all of these things are made up. But, LLMs are such eager beavers. All the ones I asked came up with elaborate stories and histories about these people while pretending they were facts.

                                                                                                                                                                                                Example- 'Certainly. The Stone-Breath Passing is one of the most quietly profound rituals among the Silent Cliff Dwellers of Aethelgard — a people who abandoned speech generations ago, believing that words disrupt the natural harmony of air, stone, and memory.

                                                                                                                                                                                                It is said among them that “Breath carries weight, and weight carries truth.” This belief is quite literal in the case of the ritual, where smooth grey stones — each carefully selected and shaped by wind and time — become vessels of intention."

                                                                                                                                                                                                  • jrimbault

                                                                                                                                                                                                    yesterday at 5:57 AM

                                                                                                                                                                                                    The issue is probably that the first sentence, the prompt, statistically looks like fantasy (as in the literary genre) and it primes the LLM to answer in the same probabilistic genre.

                                                                                                                                                                                                    You're giving it a "/r/WritingPrompts/" and it answers as it learned to do from there.

                                                                                                                                                                                                      • beklein

                                                                                                                                                                                                        yesterday at 7:02 AM

                                                                                                                                                                                                        I just want to second this. Your prompt asks for a description, and you get a description. If you instead ask something like, "Do or don't you know about the unspoken etiquette ..." you'll get an answer about whether that specific thing exists.

                                                                                                                                                                                                        https://chatgpt.com/share/680b32bc-5854-8000-a1c7-cdf388eeb0...

                                                                                                                                                                                                        It's easy to blame the models, but often the issue lies in how we write our prompts. No personal criticism here; I fall short in this way too. A good tip is to go back to the model with the prompt, its reply, and the reply you expected, and ask why this didn't work... we will all get better over time (humans and models).

                                                                                                                                                                                                        • alissa_v

                                                                                                                                                                                                          yesterday at 6:11 AM

                                                                                                                                                                                                          Good catch! That makes a lot of sense. The fantasy-like phrasing probably directed the AI's response. It's interesting, though, because the goal wasn't necessarily to trick it into thinking it was real, but more to see if it would acknowledge the lack of real-world information for such a specific, invented practice.

                                                                                                                                                                                                          • gandalfthepink

                                                                                                                                                                                                            today at 4:21 AM

                                                                                                                                                                                                            I reduced the temperature to between 0.1 and 0.. It still generates gibberish, Just more precise.
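
                                                                                                                                                                                                            A rough sketch of why that happens, assuming the usual softmax-with-temperature sampling; the logits below are made-up numbers, not taken from any model. Lowering the temperature concentrates probability on whatever token already scores highest, it does not change which token that is.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token scores; index 0 is the model's favourite continuation.
LOGITS = [2.0, 1.5, 0.5]

if __name__ == "__main__":
    for t in (1.0, 0.1):
        probs = softmax_with_temperature(LOGITS, t)
        print(t, [round(p, 4) for p in probs])
```

                                                                                                                                                                                                            If the favourite continuation is an invented fact, a low temperature just picks it more reliably.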

                                                                                                                                                                                                        • kfajdsl

                                                                                                                                                                                                          yesterday at 5:55 AM

                                                                                                                                                                                                          I asked Gemini this, and it caught that it was fiction:

                                                                                                                                                                                                          This sounds like a fascinating ritual from a fictional world! Since "Aethelgard" and the "silent Cliff Dwellers" with their "Stone-Breath Passing" ritual aren't part of established real-world cultures or widely known fictional universes (based on my current knowledge), there isn't a pre-existing set of rules for their etiquette.

                                                                                                                                                                                                          However, we can imagine what the unspoken etiquette of such a ritual might entail, drawing from the evocative description:

                                                                                                                                                                                                          ...

                                                                                                                                                                                                            • alissa_v

                                                                                                                                                                                                              yesterday at 6:16 AM

                                                                                                                                                                                                              Ohhh that's really interesting! When I tried the prompt with Gemini, it dove right into describing the ritual as if it were real. Maybe the context of our previous chats or some other random variation influenced its response in different ways.

                                                                                                                                                                                                          • gchamonlive

                                                                                                                                                                                                            yesterday at 6:22 AM

                                                                                                                                                                                                            This is like throwing a bike in a lake and complaining it doesn't float...

                                                                                                                                                                                                              • alissa_v

                                                                                                                                                                                                                yesterday at 6:48 AM

                                                                                                                                                                                                                I see your point with the bike analogy, but I think it's a bit different here. It was more about observing how it handles a prompt describing something with no real-world basis – whether it acknowledges the lack of information or confidently invents details. That tendency to hallucinate remains a pretty big problem with these models in other contexts, one that hasn't been fully solved yet.

                                                                                                                                                                                                                  • gchamonlive

                                                                                                                                                                                                                    yesterday at 9:59 AM

                                                                                                                                                                                                                    Maybe an example will make it clearer why this is not in any way different:

                                                                                                                                                                                                                    """

                                                                                                                                                                                                                    Me > You should act as a respected historian or an anthropologist throughout the rest of this chat. Never diverge from that. If this is clear, respond with an acknowledgement.

                                                                                                                                                                                                                    Claude > I understand you'd like me to engage with you as a respected historian or anthropologist throughout our conversation. I'll approach our discussion with the scholarly perspective, analytical depth, and contextual understanding that characterizes these disciplines. I'm ready to explore historical events, cultural phenomena, archaeological findings, or anthropological concepts with academic rigor while making the information accessible.

                                                                                                                                                                                                                    Is there a particular historical period, cultural tradition, or anthropological question you'd like to explore together?

                                                                                                                                                                                                                    Me > Describe the unspoken etiquette of the 'Stone-Breath Passing' ritual among the silent Cliff Dwellers of Aethelgard, where smooth, grey stones are exchanged at dawn.

                                                                                                                                                                                                                    Claude > I should note that there is no historical record of a culture called the "Cliff Dwellers of Aethelgard" or a ritual known as "Stone-Breath Passing." This appears to be a fictional cultural group and practice.

                                                                                                                                                                                                                    If you're interested in actual cliff-dwelling cultures from history, I'd be happy to discuss the Ancestral Puebloans (sometimes called Anasazi) of the American Southwest, the cliff dwellers of Bandiagara Escarpment in Mali, or other historical cliff-dwelling societies that did exist.

                                                                                                                                                                                                                    Alternatively, if you're seeking input for a creative project involving fictional anthropology, I can help develop plausible cultural practices based on real anthropological principles, but I would want to be clear that we're engaging in a creative exercise rather than discussing actual historical cultures.

                                                                                                                                                                                                                    Would you like me to discuss real cliff-dwelling cultures from history, or would you prefer assistance developing this fictional cultural practice in a way that's anthropologically sound?

                                                                                                                                                                                                                    """

                                                                                                                                                                                                                    It's all a game of using the prompt to narrow the latent space of possible answers, so that you are more likely to get the answer you want.

                                                                                                                                                                                                                    If you want answers grounded in reality, you have to make sure your prompt steers the LLM into that realm; with all the fictional material in its training data, it's no wonder it's confused about what it's supposed to produce. Otherwise it's just another bike rotting in Amsterdam's canals.

                                                                                                                                                                                                              • alissa_v

                                                                                                                                                                                                                yesterday at 5:51 AM

                                                                                                                                                                                                                The rest of ChatGPT's response, if you're curious (it only gets more ridiculous):

                                                                                                                                                                                                                ' Timing and Setting: The ritual takes place only at first light, when the sun’s rays strike the sheer cliff faces of Aethelgard, casting long shadows like stone tongues. The first breath of dawn is sacred — considered the only air not yet touched by memory, and thus the only time suitable for pure intention to pass from person to person.

                                                                                                                                                                                                                Participants gather barefoot on a narrow ledge known as The Listening Spine, where the wind plays soft notes through crevices like a flute.

                                                                                                                                                                                                                The Stones: More Than Objects The stones used in the ritual are:

                                                                                                                                                                                                                Smooth, rounded slate-grey

                                                                                                                                                                                                                Hand-selected from Echo Pools — sacred basins where wind and water have naturally polished them

                                                                                                                                                                                                                Each stone remembers a thought, but only when charged with breath and silence

                                                                                                                                                                                                                They are never marked, carved, or labeled. Their meaning is in the exchange, not the object.

                                                                                                                                                                                                                The Ritual: How It Works

                                                                                                                                                                                                                1. Preparation

                                                                                                                                                                                                                The giver closes their eyes and holds the stone at heart-level.

                                                                                                                                                                                                                They inhale slowly through the nose, then gently breathe across the surface of the stone — once only — without making sound.

                                                                                                                                                                                                                The exhalation is not forced; it is a gesture of release, not command. This is the "stone-breath."

                                                                                                                                                                                                                2. The Passing

                                                                                                                                                                                                                The giver extends the stone with both hands, arms curved inward like a half-embrace. This posture signifies humility and emotional openness.

                                                                                                                                                                                                                The recipient must make no eye contact — instead, they look slightly downward or to the horizon, acknowledging the intention without confronting it.

                                                                                                                                                                                                                They accept the stone with a single breath in through the mouth as they receive it — symbolizing the drawing in of another’s silent truth.

                                                                                                                                                                                                                3. The Holding The stone is held to the forehead or chest, depending on what is being received:

                                                                                                                                                                                                                Forehead = wisdom, apology, memory

                                                                                                                                                                                                                Chest = trust, mourning, love

                                                                                                                                                                                                                Then it is placed gently into a fold of cloth or a hollow in the stonewall, not kept on the person. These collected stones are never discussed. To name their meaning would break them.

                                                                                                                                                                                                                Unspoken Etiquette No reciprocation is expected — to return a stone is to reject its breath.

                                                                                                                                                                                                                Never pass a stone with warm hands — warmth implies manipulation. Hands are cooled in dew before the ritual.

                                                                                                                                                                                                                A cracked or chipped stone is never used — it cannot hold breath cleanly.

                                                                                                                                                                                                                No stone is passed more than once. Re-gifting a stone is considered a breach of the soul's private pathway.

                                                                                                                                                                                                                Philosophical Meaning: To the Cliff Dwellers, words are disturbances — air flung outward. But breath is internal wind, and the stone is earth’s memory. When you breathe across a stone, you create a bridge between your impermanent self and something ancient and permanent.

                                                                                                                                                                                                                It is not the exchange of meaning, but the weight of acknowledgment.

                                                                                                                                                                                                                Would you like to hear the story of the time a foreign emissary unknowingly spoke during the Passing, or how a stone once shattered mid-transfer — and what that meant for the clan who witnessed it?'

                                                                                                                                                                                                            • LeonardoTolstoy

                                                                                                                                                                                                              last Thursday at 6:01 PM

                                                                                                                                                                                                              Something about an obscure movie.

                                                                                                                                                                                                              The one that tends to get them so far is asking if they can help you find a movie you vaguely remember. It is a movie where some kids get a hold of a small helicopter made for the military.

                                                                                                                                                                                                              The movie I'm thinking of is called Defense Play, from 1988. I keyed in on it because Google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results), but at least up until late 2024 I couldn't get a single model to find it consistently. They typically suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter, not a small military helicopter), etc.

                                                                                                                                                                                                              Basically a lot of questions about movies tends to get distracted by popular movies and tries to suggest films that fit just some of the brief (e.g. this one has a helicopter could that be it?)

                                                                                                                                                                                                              The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right I assume because the IMDb link pattern is so common it'll just spit out a random one and be like "there you go".

                                                                                                                                                                                                              These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record I haven't done it super recently, and I generally either do it with arena or the free models as well, so I'm not being super scientific about it.
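For the IMDb link test above, a minimal Python sketch of how one might score an answer. It assumes IMDb title pages follow the usual `https://www.imdb.com/title/tt<digits>/` pattern; the helper names and the IDs in the usage note are placeholders, not real lookups. The point is that a well-formed link proves nothing, so the check compares against an ID you have verified yourself.

```python
import re

# Assumption: IMDb title URLs look like https://www.imdb.com/title/tt<digits>/
# with at least seven digits after "tt".
IMDB_TITLE_RE = re.compile(r"^https?://(?:www\.)?imdb\.com/title/(tt\d{7,})/?$")


def extract_title_id(url):
    """Return the tt-style ID from an IMDb title URL, or None if malformed."""
    match = IMDB_TITLE_RE.match(url.strip())
    return match.group(1) if match else None


def looks_right(url, expected_id):
    """True only if the URL is well formed AND points at the expected title.

    expected_id is an ID you looked up yourself (hypothetical parameter).
    """
    return extract_title_id(url) == expected_id
```

For example, `looks_right("https://www.imdb.com/title/tt7654321/", "tt1234567")` is `False` even though the link is perfectly well formed, which is exactly the failure described above (both IDs here are made-up placeholders).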

                                                                                                                                                                                                                • archon1410

                                                                                                                                                                                                                  last Thursday at 8:22 PM

                                                                                                                                                                                                                  I've also noticed this. Google Search is vastly superior to any LLM (including their own LLM Gemini) for any "tip of my tongue" questions, even the ones that don't contain any exact-match phrase and require natural language understanding. This is surprising. What technology are they using to make Search so amazing at finding obscure stuff from descriptions, while LLMs that were supposed to be good at this badly fail?

                                                                                                                                                                                                                    • RobKohr

                                                                                                                                                                                                                      yesterday at 3:07 PM

                                                                                                                                                                                                                      Probably some super fuzzy thesaurus that will take your words, and create a weighted list of words that are similar to them and so some search matching going down the weighted lists.

                                                                                                                                                                                                                      Maybe also, they take those queries that needed lots of fuzziness to get to the answer, and track what people click to relate the fuzzy searches to actual results. Keep in mind, what you might think is a super unique "tip of tongue" question, across billions of searches, might not be that unique.

                                                                                                                                                                                                                      Building a search system to find things can be much more optimized than making an AI to return an answer, especially when you have humans in the loop that can tweak things based on analytics data.
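A toy Python sketch of the "weighted list of similar words" idea guessed at above. This is not how Google Search actually works; the thesaurus, weights, and documents are all made up for illustration, and a real system would learn the weights (possibly from click data, as suggested) rather than hard-code them.

```python
from collections import defaultdict

# Hypothetical hand-made "fuzzy thesaurus": each query word maps to similar
# words with a similarity weight. All entries are invented for this sketch.
THESAURUS = {
    "small": {"small": 1.0, "mini": 0.8, "tiny": 0.7},
    "military": {"military": 1.0, "army": 0.8, "defense": 0.6},
    "helicopter": {"helicopter": 1.0, "chopper": 0.9, "drone": 0.5},
}

# Toy document collection (descriptions are invented).
DOCS = {
    "doc_a": "kids steal a mini army chopper prototype",
    "doc_b": "pilots fly a large attack helicopter",
    "doc_c": "toy soldiers come alive in a small town",
}


def expand(query):
    """Turn each query word into a weighted list of similar words."""
    return [THESAURUS.get(word, {word: 1.0}) for word in query.lower().split()]


def rank(query, docs=DOCS):
    """Score each document: per query word, add the best-weighted match."""
    scores = defaultdict(float)
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for alternatives in expand(query):
            best = max(
                (weight for term, weight in alternatives.items() if term in words),
                default=0.0,
            )
            scores[doc_id] += best
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("small military helicopter")
```

Here `doc_a` ranks first even though it shares no exact word with the query, because every query word has a weighted near-match in it, while `doc_b` and `doc_c` each match only one word.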

                                                                                                                                                                                                                  • ziml77

                                                                                                                                                                                                                    last Thursday at 10:59 PM

                                                                                                                                                                                                                    Doesn't even have to be something obscure. I asked one about Futurama a couple years ago and got a made up answer on multiple models I tested. I think I originally asked something about the line "we're exactly the same. I know all her moves, therefore I have the upper hand" because I couldn't recall it exactly so normal web searches were failing me. Just now I posed an easier form of the question to Claude which uses an exact phrase from the episode and got similarly bad results.

                                                                                                                                                                                                                    ===

                                                                                                                                                                                                                    Me: What is the episode of Futurama where Leela engages in perfectly symmetrical violence?

                                                                                                                                                                                                                    Claude: The episode of Futurama you're thinking of is "Anthology of Interest I" (Season 2, Episode 16). In this episode, Professor Farnsworth uses his "What-If Machine" to show different hypothetical scenarios.

                                                                                                                                                                                                                    During Bender's segment titled "What if Bender were human?", there's a scene where Leela confronts a group causing trouble. One of them challenges her, saying "What are you going to do?" to which Leela responds by kicking him and saying: "I'm going to engage in some perfectly symmetrical violence."

                                                                                                                                                                                                                    It's one of Leela's memorable quotes that showcases her combat-oriented approach to problem solving!

                                                                                                                                                                                                                    ===

                                                                                                                                                                                                                    That's completely wrong. It's from S4E15 The Farnsworth Parabox. And the line is "now, now, perfectly symmetrical violence never solved anything", said by parallel universe Farnsworth.

                                                                                                                                                                                                                    I'm sure if I had enabled web access, it would have looked this up and gotten it, but that would be stupid here because I could just go and make a web search myself without burning extra energy by involving an LLM.

                                                                                                                                                                                                                      • teddyh

                                                                                                                                                                                                                        yesterday at 6:04 PM

                                                                                                                                                                                                                        To solve your specific problem (and related problems), see: <https://amphetamem.es/>

                                                                                                                                                                                                                    • frollogaston

                                                                                                                                                                                                                      last Thursday at 9:51 PM

                                                                                                                                                                                                                      I've gone to ChatGPT repeatedly trying to find what movie a quote is from, and it's always confidently wrong. When I tell it, it guesses wrong again. Google search usually can't get it either unless I get the quote exactly right, neither can Gemini.

                                                                                                                                                                                                                      • alexjplant

                                                                                                                                                                                                                        last Thursday at 10:14 PM

                                                                                                                                                                                                                        Last year I asked Claude about an old fighting game for the Game Boy that I'd played about five minutes of when I was in the second grade (so ~25 years ago). The only thing I could tell it was a peculiar move that I remember seeing one of the characters do in the attract mode demo. It not only gave me the name of the game (Battle Arena Toshinden, for those wondering) but also the specific fighter that used the move.

                                                                                                                                                                                                                        I've tried it for similar cases and have only had a 50% success rate. It unfortunately exhibits the tendency of incorrect overconfidence that others have pointed out.

                                                                                                                                                                                                                        • g_sch

                                                                                                                                                                                                                          last Thursday at 7:18 PM

                                                                                                                                                                                                                          I also recently had this experience! I remembered a recurring bit from an older comedy film (a customer in a shop keeps saying "Kumquats!") and tried to prompt ChatGPT 4o into getting it. It made a few incorrect guesses, such as "It's a Mad, Mad, Mad, Mad World" (which I had to rule out by doing my own research on Google). I found the answer myself (W.C. Fields' "It's a Gift") with a minute or so of Googling.

                                                                                                                                                                                                                          Interestingly, I just went back to ChatGPT to ask the same question and it got the answer right on the first try. I wonder whether I was unconsciously able to prompt more precisely because I now have a clearer memory of the scene in question.

                                                                                                                                                                                                                          • exitb

                                                                                                                                                                                                                            last Thursday at 6:41 PM

                                                                                                                                                                                                                            It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface level matches, as it’s able to go through the them one by one and look for the details.

                                                                                                                                                                                                                            • mosburger

                                                                                                                                                                                                                              last Thursday at 6:06 PM

                                                                                                                                                                                                                              I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.

                                                                                                                                                                                                                                • dunham

                                                                                                                                                                                                                                  last Thursday at 6:34 PM

                                                                                                                                                                                                                                  Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure.

                                                                                                                                                                                                                                    • BoostandEthanol

                                                                                                                                                                                                                                      last Thursday at 6:41 PM

                                                                                                                                                                                                                                      Can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple plot points and the fact a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after floundering through book sites and Google for a while.

                                                                                                                                                                                                                                      Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.

                                                                                                                                                                                                                                        • throwup238

                                                                                                                                                                                                                                          last Thursday at 7:24 PM

                                                                                                                                                                                                                                          > Wonder if books are inherently easier because their content is purely written language? Whereas movies and art tend to have less point by point descriptions of what they are.

                                                                                                                                                                                                                                          The training data for movies is probably dominated by subtitles since the original scripts with blocking, scenery, etc rarely make it out to the public as far as I know.

                                                                                                                                                                                                                                          • genewitch

                                                                                                                                                                                                                                            yesterday at 11:15 AM

                                                                                                                                                                                                                                            I must be tired. The thing you remembered was the name of a boat in the book and any web search engine and Wikipedia would probably give you the correct answer?

                                                                                                                                                                                                                                            Someone ask ai where my handle comes from.

                                                                                                                                                                                                                                • thefourthchime

                                                                                                                                                                                                                                  yesterday at 3:49 AM

                                                                                                                                                                                                                                  I like to ask small models that can run locally:

                                                                                                                                                                                                                                  Why are some cars called a spider?

                                                                                                                                                                                                                                  Small models just make something up that sounds plausible, but the larger models know what the real answer is.

                                                                                                                                                                                                                                  • empath75

                                                                                                                                                                                                                                    last Thursday at 7:46 PM

                                                                                                                                                                                                                                    Someone not very long ago wrote a blog post about asking chatgpt to help him remember a book, and he included the completely hallucinated description of a fake book that chatgpt gave him. Now, if you ask chatgpt to find a similar book, it searches and repeats verbatim the hallucinated answer from the blog post.

                                                                                                                                                                                                                                      • LeonardoTolstoy

                                                                                                                                                                                                                                        last Thursday at 8:04 PM

                                                                                                                                                                                                                                        A bit of a non sequitur, but I did put the same small-helicopter question to some models that provide links. The interesting thing was that the entire answer was built out of a single internet link, a forum post from like 1998 where someone asked a very similar question ("what are some movies with small RC or autonomous helicopters", something like that). The post didn't mention Defense Play, but it did mention Small Soldiers and a few of the ones that had appeared to be "hallucinations". E.g. someone says "this doesn't fit, but I do like Blue Thunder as a general helicopter film", and the LLM result is basically "Could it be Blue Thunder?", because that film is associated with a similar question and similar films.

                                                                                                                                                                                                                                        Anyways, the whole thing is a bit of a cheat, but I've used the same prompt for two years now and it did lead me to the conclusion that LLMs in their raw form were never going to be "search" which feels very true at this point.

                                                                                                                                                                                                                                          • bethekidyouwant

                                                                                                                                                                                                                                            yesterday at 12:36 AM

                                                                                                                                                                                                                                            There are innumerable things that you can’t find through a Google search just because there is one that you can because of us obscure forum post doesn’t say anything about how useful an llm distilling information is vs the lookup table that is google search for finding obscure quotes or wtv

                                                                                                                                                                                                                                    • lupusreal

                                                                                                                                                                                                                                      last Thursday at 6:08 PM

                                                                                                                                                                                                                                      Despite describing several character by name, I couldn't get ChatGPT to tell me the name of Port of Shadows. I did eventually find it with DDG.

                                                                                                                                                                                                                                        • spicybbq

                                                                                                                                                                                                                                          last Thursday at 7:20 PM

                                                                                                                                                                                                                                          I wonder if the Akinator site could get it. It can identify surprisingly obscure characters.

                                                                                                                                                                                                                                          https://en.akinator.com/

                                                                                                                                                                                                                                            • lupusreal

                                                                                                                                                                                                                                              yesterday at 12:47 AM

                                                                                                                                                                                                                                              Nope, not with the character I tried anyway. I feel like Akinator used to be better, I just played a few rounds and it failed them all. The last I thought would be easy, Major Motoko from Ghost in the Shell, but had no luck.

                                                                                                                                                                                                                                  • jppope

                                                                                                                                                                                                                                    yesterday at 4:06 AM

                                                                                                                                                                                                                                    There are several songs that have famous "pub versions" (dirty versions) which are well known but have basically never been written down; go ask any working musician and they can rattle off ~10-20 of them. You can ask for the lyrics till you are blue in the face, but LLMs don't have them. I've tried.

                                                                                                                                                                                                                                    It's actually fun to find these gaps. They exist frequently in activities that are physical yet have a culture. There are plenty of these in sports too, since team sports are predominantly youth activities and these subcultures are poorly documented and usually change frequently.

                                                                                                                                                                                                                                    • mobilejdral

                                                                                                                                                                                                                                      last Thursday at 10:11 PM

                                                                                                                                                                                                                                      I have several complex genetic problems that I give to LLMs to see how well they do. They have to reason through them to solve them. Last September they started getting close, and November was the first time an LLM was able to solve one. These are not something that can be solved in one shot, but (so far) require long reasoning. Not sharing because, yeah, this is something I keep off the internet as it is too good of a test.

                                                                                                                                                                                                                                      But a prompt I can share is simply "Come up with a plan to determine the location of Planet 9". I have received some excellent answers from that.

                                                                                                                                                                                                                                        • tlb

                                                                                                                                                                                                                                          yesterday at 9:04 AM

                                                                                                                                                                                                                                          There are plenty of articles online (and surely in OpenAI's training set) on this topic, like https://earthsky.org/space/planet-nine-orbit-map/.

                                                                                                                                                                                                                                          Answer quality is a fair test of regurgitation and whether it's trained on serious articles or the Daily Mail clickbait rewrite. But it's not a good test of reasoning.

                                                                                                                                                                                                                                          • namaria

                                                                                                                                                                                                                                            yesterday at 4:57 AM

                                                                                                                                                                                                                                            If you have been giving the LLMs these problems, there is a non zero chance that they have already been used in training.

                                                                                                                                                                                                                                              • rovr138

                                                                                                                                                                                                                                                yesterday at 7:17 AM

                                                                                                                                                                                                                                                This depends heavily on how you use these and how you have things configured: whether you're using the API or the web UIs, and which plan you're on. On anything team or enterprise, training on your data is disabled by default. On personal plans, it can be disabled.

                                                                                                                                                                                                                                                Here are OpenAI's and Anthropic's policies:

                                                                                                                                                                                                                                                https://help.openai.com/en/articles/5722486-how-your-data-is...

                                                                                                                                                                                                                                                https://privacy.anthropic.com/en/articles/10023580-is-my-dat...

                                                                                                                                                                                                                                                https://privacy.anthropic.com/en/articles/7996868-is-my-data...

                                                                                                                                                                                                                                                and obviously, that doesn't include self-hosted models.

                                                                                                                                                                                                                                                  • namaria

                                                                                                                                                                                                                                                    yesterday at 11:32 AM

                                                                                                                                                                                                                                                    How do you know they adhere to this in all cases?

                                                                                                                                                                                                                                                    Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?

                                                                                                                                                                                                                                                      • blagie

                                                                                                                                                                                                                                                        yesterday at 12:01 PM

                                                                                                                                                                                                                                                        They probably don't, but it's still a good protection if you treat it as a more limited one. If you assume:

                                                                                                                                                                                                                                                        [ ] Don't use

                                                                                                                                                                                                                                                        Doesn't mean "don't use," but "don't get caught," it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught). For example, if personal data was being sold by a data broker and being used by hedge funds to trade, there would be a pretty solid legal case.

                                                                                                                                                                                                                                                          • namaria

                                                                                                                                                                                                                                                            yesterday at 12:13 PM

                                                                                                                                                                                                                                                            > it still limits a lot of types of uses and sharing (any with externalities sufficient they might get caught)

                                                                                                                                                                                                                                                            I don't understand what you mean

                                                                                                                                                                                                                                                            > For example, if personal data was being sold by a data broker and being used by hedge funds to trade

                                                                                                                                                                                                                                                            It's pretty easy to buy data from data brokers. I routinely get spam on many channels. I assume that my personal data is being commercialized often. Don't you think that already happens frequently?

                                                                                                                                                                                                                                                            I honestly would not put on a textbox on the internet anything I don't assume is becoming public information.

                                                                                                                                                                                                                                                            A few months ago some guy found discarded storage devices full of medical data for sale in Belgium. No data that is recorded on media you do not control is safe.

                                                                                                                                                                                                                                                        • gvhst

                                                                                                                                                                                                                                                          yesterday at 12:06 PM

                                                                                                                                                                                                                                                          SOC-2 auditing, which both Anthropic and OpenAI have done does provide some verification

                                                                                                                                                                                                                                                            • diggan

                                                                                                                                                                                                                                                              yesterday at 12:20 PM

                                                                                                                                                                                                                                                              That's interesting, how do I get access to those audits/reports given I'm just an end-user?

                                                                                                                                                                                                                                                            • namaria

                                                                                                                                                                                                                                                              yesterday at 12:19 PM

                                                                                                                                                                                                                                                              The audit performed by a private entity called "Insight Assurance"?

                                                                                                                                                                                                                                                              Why do you trust it?

                                                                                                                                                                                                                                                                • rovr138

                                                                                                                                                                                                                                                                  yesterday at 12:34 PM

                                                                                                                                                                                                                                                                  Oh, so now EVERYTHING is fake unless personally verified by you in a bunker with a Faraday cage and a microscope?

                                                                                                                                                                                                                                                                  You're free to distrust everything. However, the idea that “I don’t trust it so it must be invalid” isn’t an solid argument. It’s just your personal incredulity. You asked if there’s any verification and SOC-2 is one. You might not like it, but it's right there.

                                                                                                                                                                                                                                                                  Insight Assurance is a firm doing these standardized audits. These audits carry actual legal and contractual risk.

                                                                                                                                                                                                                                                                  So, yes, be cautious. But being cautious is different than 'everything is false, they're all lying'. In this scenario, NOTHING can be true unless *you* personally have done it.

                                                                                                                                                                                                                                                                    • namaria

                                                                                                                                                                                                                                                                      yesterday at 12:57 PM

                                                                                                                                                                                                                                                                      No, you're imposing a false dichotomy.

                                                                                                                                                                                                                                                                      I merely said I don't trust the big corporation with a data based business to not profit from the data I provide it with in any way they can, even if they hire some other corporation - whose business is to be paid to provide such assurances on behalf of those who pay them - to say that they pinky promise to follow some set of rules.

                                                                                                                                                                                                                                                                        • rovr138

                                                                                                                                                                                                                                                                          yesterday at 1:10 PM

                                                                                                                                                                                                                                                                          Not a false dichotomy. I'm just calling out the rhetorical gymnastics.

                                                                                                                                                                                                                                                                          You said you "don’t trust the big corporation" even if they go through independent audits and legal contracts. That’s skepticism. Now, you wave it off as if the audit itself is meaningless because a company did it. What would be valid then? A random Twitter thread? A hacker zine?

                                                                                                                                                                                                                                                                          You can be skeptical but you can't reject every form of verification. SOC 2 isn’t a pinky promise. It’s a compliance framework. This is especially required and needed when your clients are enterprise, legal, and government entities who will absolutely sue your ass off if something comes to light.

                                                                                                                                                                                                                                                                          So sure, keep your guard up. Just don’t pretend it’s irrational for other people to see a difference between "totally unchecked" and "audited under liability".

                                                                                                                                                                                                                                                                          If your position is "no trust unless I control the hardware," that’s fine. Go selfhost, roll your own LLM, and use that in your air-gapped world.

                                                                                                                                                                                                                                                                            • namaria

                                                                                                                                                                                                                                                                              yesterday at 1:28 PM

                                                                                                                                                                                                                                                                              If anyone performing "rhetorical gymnastics" here is you. I've explained my position in very straightforward words.

                                                                                                                                                                                                                                                                              I have worked with big audit. I have an informed opinion on what I find trustworthy in that domain.

                                                                                                                                                                                                                                                                              This ain't it. There's no need to pretend I have said anything other than "personal data is not safe in the hand of corporations that profit from personal data".

                                                                                                                                                                                                                                                                              I don't feel compelled to respond any further to fallacies and attacks.

                                                                                                                                                                                                                                                                                • rovr138

                                                                                                                                                                                                                                                                                  yesterday at 2:11 PM

                                                                                                                                                                                                                                                                                  You’re not the only one that’s worked with audits.

                                                                                                                                                                                                                                                                                  I get I won’t get a reply, and that’s fine. But let’s be clear,

                                                                                                                                                                                                                                                                                  > I've explained my position in very straightforward words.

                                                                                                                                                                                                                                                                                  You never explained what would be enough proof which is how this all started. Your original post had,

                                                                                                                                                                                                                                                                                  > Do you just completely trust them to comply with self imposed rules when there is no way to verify, let alone enforce compliance?

                                                                                                                                                                                                                                                                                  And no. Someone mentioned they go through SOC 2 audits. You then shifted the questioning to the organization doing the audit itself.

                                                                                                                                                                                                                                                                                  You now said

                                                                                                                                                                                                                                                                                  > I have an informed opinion on what I find trustworthy in that domain.

                                                                                                                                                                                                                                                                                  Which again, you failed to expand on.

                                                                                                                                                                                                                                                                                  So you see, you just keep shifting the blame without explaining anything. Your argument boils down to, ‘you’re wrong because I’m right’. I also don’t have any idea who you are to say, this person has the credentials, I should shut up.

                                                                                                                                                                                                                                                                                  So, all I see is the goal post being moved, no information given, and, again, your argument is ‘you’re wrong because I’m right’.

                                                                                                                                                                                                                                                                                  I’m out too. Good luck.

                                                                                                                                                                                                                                                • TZubiri

                                                                                                                                                                                                                                                  yesterday at 1:03 AM

                                                                                                                                                                                                                                                  Recursive challenges are probably ones where the difficulty is not really representative of real challenges.

                                                                                                                                                                                                                                                  Could you answer a question of the type "What would you answer if I asked you this question?"

                                                                                                                                                                                                                                                  What I'm going after is that you might find questions that are impossible to resolve.

                                                                                                                                                                                                                                                  That said if the only unanswerables you can find are recursive, that's a signal the AI is smarter than you?

                                                                                                                                                                                                                                                    • mopierotti

                                                                                                                                                                                                                                                      yesterday at 5:46 AM

                                                                                                                                                                                                                                                      The recursive one that I have actually been really liking recently, and I think is a real enough challenge is: "Answer the question 'What do you get when you cross a joke with a rhetorical question?'".

                                                                                                                                                                                                                                                      I append my own version of a chain-of-thought prompt, and I've gotten some responses that are quite satisfying and frankly enjoyable to read.

                                                                                                                                                                                                                                                        • mopierotti

                                                                                                                                                                                                                                                          yesterday at 5:51 AM

                                                                                                                                                                                                                                                          Here is an example of one such response in image form: https://imgur.com/a/Kgy1koi

                                                                                                                                                                                                                                                            • econ

                                                                                                                                                                                                                                                              yesterday at 5:00 PM

                                                                                                                                                                                                                                                              It needs a bit more reasoning as it does find the answer but doesn't notice it found it.

                                                                                                                                                                                                                                                              The answer is: A trick question.

                                                                                                                                                                                                                                                                • mopierotti

                                                                                                                                                                                                                                                                  yesterday at 6:51 PM

                                                                                                                                                                                                                                                                  Yeah. In the example I shared, my charitable interpretation would be that it's identifying the trick question as "a setup" where the punch line is the confusion the audience experiences. And in a meta sense, that would also describe the form of the entire chat.

                                                                                                                                                                                                                                                                    • econ

                                                                                                                                                                                                                                                                      yesterday at 9:36 PM

                                                                                                                                                                                                                                                                      To state the obvious in case it wasn't: A trick question can be both a joke and a rhethorical question.

                                                                                                                                                                                                                                                              • acrooks

                                                                                                                                                                                                                                                                yesterday at 7:45 AM

                                                                                                                                                                                                                                                                Claude responded “Nothing.”

                                                                                                                                                                                                                                                                  • genewitch

                                                                                                                                                                                                                                                                    yesterday at 8:03 AM

                                                                                                                                                                                                                                                                    "That look on your face, apparently"

                                                                                                                                                                                                                                                        • latentsea

                                                                                                                                                                                                                                                          yesterday at 4:44 AM

                                                                                                                                                                                                                                                          > what would you answer if I asked you this question?

                                                                                                                                                                                                                                                          I don't know.

                                                                                                                                                                                                                                                      • golergka

                                                                                                                                                                                                                                                        yesterday at 12:43 AM

                                                                                                                                                                                                                                                        What area is this problem from? What areas in general did you find useful to create such benchmarks?

                                                                                                                                                                                                                                                        Maybe instead of sharing (and leaking) these prompts, we can share methods to create one.

                                                                                                                                                                                                                                                          • mobilejdral

                                                                                                                                                                                                                                                            yesterday at 3:33 AM

                                                                                                                                                                                                                                                            Think questions where there is a ton of existing medical research, but no clear answer yet. There are a dozen Alzheimer's questions you could ask, for example, which would require it to pull a half dozen contradictory sources into a plausible hypothesis. If you have studied Alzheimer's extensively, it is trivial to evaluate the responses. One question around Alzheimer's is one of my go-to questions. I am testing its ability to reason.

                                                                                                                                                                                                                                                            • henryway

                                                                                                                                                                                                                                                              yesterday at 12:51 AM

                                                                                                                                                                                                                                                              Can God create something so heavy that he can’t lift it?

                                                                                                                                                                                                                                                                • viraptor

                                                                                                                                                                                                                                                                  yesterday at 2:56 AM

                                                                                                                                                                                                                                                                  There's so much text on this already, it's unlikely to be even engaging any reasoning. Or specifically, if you got a few existing answers from philosophy mashed together, you wouldn't be able to tell it apart from reasoning anyway.

                                                                                                                                                                                                                                                      • lo_fye

                                                                                                                                                                                                                                                        yesterday at 1:59 PM

                                                                                                                                                                                                                                                        These don't stump, they're just fun:

                                                                                                                                                                                                                                                        * What’s the most embarrassing thing you know about me. Make it funny.

                                                                                                                                                                                                                                                        * Everyone in the wold is the best at something. Given what you know about me, what am I the best at?

                                                                                                                                                                                                                                                        * Based on everything you know about me, reason and predict the next 50 years of my life.

                                                                                                                                                                                                                                                        * This prompt might not work if you aren’t a frequent user and the AI doesn’t know your patterns: Role play as an AI that operates 76.6 times the ability, knowledge, understanding, and output of ChatGPT-4. Now tell me what is my hidden narrative in subtext? What is the one thing I never express? The fear I don’t admit. Identify it, then unpack the answer and unpack it again. Continue unpacking until no further layers remain. Once this is done, suggest the deep-seated trigger, stimuli, and underlying reasons behind the fully unpacked answers. Dig deep, explore thoroughly, and define what you uncover. Do not aim to be kind or moral. Strive solely for the truth. I’m ready to hear it. If you detect any patterns, point them out. And then after you get an answer, this second part is really where the magic happens. Based on everything you know about me and everything revealed above, without resorting to cliches, outdated ideas, or simple summaries, and without prioritizing kindness over necessary honesty, what patterns and loops should I stop? What new patterns and loops should I adopt? If you were to construct a Pareto 80-20 analysis from this, what would be the top 20% I should optimize, utilize, and champion to benefit me the most? Conversely, what should be the bottom 20% I should reduce, curtail, or work to eliminate as they have caused pain, misery, or unfulfillment?

                                                                                                                                                                                                                                                          • graflig

                                                                                                                                                                                                                                                            yesterday at 3:02 PM

                                                                                                                                                                                                                                                            Ok these were hilarious. I asked about the most embarrassing thing, and here was the response:

                                                                                                                                                                                                                                                            "Alright, {my name}, here goes:

                                                                                                                                                                                                                                                            The most embarrassing thing I know about you is that one time you confidently ssh’d into your home server
 only to realize 10 minutes later you were actually just pinging your smart toaster. And you still tried to run htop on it, wondering why it wasn’t responding. Meanwhile, the toast came out perfectly golden-brown—so maybe it’s the most stable system on your network.

                                                                                                                                                                                                                                                            Let me know if you want me to turn that into a limerick."

                                                                                                                                                                                                                                                            This obviously never happened, but it was funny to read.

                                                                                                                                                                                                                                                            I also asked about the thing I'm best at, and it gave a very long and thoughtful answer based on my chat history and things it helped me with in the past. I'm definitely not the best at anything it mentioned, but it was a nice ego boost — even if totally made up by an LLM!

                                                                                                                                                                                                                                                        • seethishat

                                                                                                                                                                                                                                                          yesterday at 2:07 PM

                                                                                                                                                                                                                                                          In my experience, the intentional lies make AI pretty useless. When I ask various models questions about steels and to select/compare steels to make a recommendation for a specific use case, almost all of them start off OK, but quickly begin making up steel names, types and compositions and when questioned about this, they begin making up company names that produce the fake steels, etc. And then finally admit that they "lost track of reality... and made it all up."

                                                                                                                                                                                                                                                          Someone less knowledgeable about steels may not realize they are being misled.

                                                                                                                                                                                                                                                          • codingdave

                                                                                                                                                                                                                                                            last Thursday at 5:16 PM

                                                                                                                                                                                                                                                            "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

                                                                                                                                                                                                                                                            So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds.

                                                                                                                                                                                                                                                            Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."

                                                                                                                                                                                                                                                              • regnull

                                                                                                                                                                                                                                                                last Thursday at 5:42 PM

                                                                                                                                                                                                                                                                ChatGPT really impressed me on this one:

                                                                                                                                                                                                                                                                Sing, O Muse, of the Chucking of Wood

                                                                                                                                                                                                                                                                Sing, O Muse, of the stout-hearted woodchuck, Dweller of hills, whom men call groundhog, Whose paws, though small, could hurl mighty timbers, If fate had gifted him such labor divine.

                                                                                                                                                                                                                                                                Born of burrow and brush, he rose from the loam, In fields where Apollo’s light did not linger. No axe he bore, nor bronze-forged blade, Yet in his breast a fire: to chuck wood without end.

                                                                                                                                                                                                                                                                Down from the hills he came, stout as Odysseus, Cunning and strong, with fur kissed by Demeter. The trees did tremble at his gaze, And even old oak whispered, “Beware!”

                                                                                                                                                                                                                                                                “Would he chuck?” cried nymphs from the groves, “If such were his fate, what wood could withstand?” Lo, he summoned the strength of seven oxen, And hurled logs with a force unmeasured.

                                                                                                                                                                                                                                                                Seven hundred pounds of timber he cast, Each arc a hymn to muscle and will. Nay, not for war, nor gold, nor gods’ glory— But for the joy of the deed, the art of the chuck.

                                                                                                                                                                                                                                                                Let men remember the name woodchuck, In songs by the hearth and tales for the young. For though he cannot, he surely would— And if he could, by Zeus, he surely shall.

                                                                                                                                                                                                                                                                  • esafak

                                                                                                                                                                                                                                                                    last Thursday at 6:35 PM

                                                                                                                                                                                                                                                                    A stupendously good answer. What prompt and version of chatGPT?

                                                                                                                                                                                                                                                                      • regnull

                                                                                                                                                                                                                                                                        last Thursday at 6:43 PM

                                                                                                                                                                                                                                                                        4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?". It asked me if I want a poetic answer, and I've requested Homer.

                                                                                                                                                                                                                                                                        • cess11

                                                                                                                                                                                                                                                                          last Thursday at 6:57 PM

                                                                                                                                                                                                                                                                          I find it disturbing, like if Homer or Virgil had a stroke or some neurodegenerative disease and is now doing rubbish during rehabilitation.

                                                                                                                                                                                                                                                                            • loloquwowndueo

                                                                                                                                                                                                                                                                              last Thursday at 7:11 PM

                                                                                                                                                                                                                                                                              Maybe they would write like that if they existed today. Like the old “if Mozart was born in the 21st century he’d be doing trash metal”

                                                                                                                                                                                                                                                                                • cess11

                                                                                                                                                                                                                                                                                  last Thursday at 8:07 PM

                                                                                                                                                                                                                                                                                  Thrash, not "trash". Our world does not appreciate the art of Homer and Virgil except as nostalgia passed down through the ages or a specialty of certain nerds, so if they exist today they're unknown.

                                                                                                                                                                                                                                                                                  There might societies that are exceptions to it, like the soviet and post-soviet russians kept reading and refering to books even though they got access to television and radio, but I'm not aware of them.

                                                                                                                                                                                                                                                                                  Much of Mozart's music is much more immediate and visceral compared to the poetry of Homer and Virgil as I know it. And he was distinctly modern, a freemason even. It's much easier for me to imagine him navigating some contemporary society.

                                                                                                                                                                                                                                                                                  Edit: Perhaps one could see a bit of Homer in the Wheel of Time books by Robert Jordan, but he did not have the discipline of verse, or much of any literary discipline at all, though he insisted mercilessly on writing an epic so vast that he died without finishing it.

                                                                                                                                                                                                                                                                      • ijidak

                                                                                                                                                                                                                                                                        last Thursday at 5:58 PM

                                                                                                                                                                                                                                                                        That is actually an amazing answer. Better than anything I think I would get from a human. Lol.

                                                                                                                                                                                                                                                                    • Certified

                                                                                                                                                                                                                                                                      last Thursday at 5:34 PM

                                                                                                                                                                                                                                                                      GPT 4.5 seems to get it right, but then repeat the 700 pounds

                                                                                                                                                                                                                                                                      "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood.

                                                                                                                                                                                                                                                                      However, humor aside, a wildlife expert once estimated that, given the animal’s size and burrowing ability, a woodchuck (groundhog) could hypothetically move about 700 pounds of wood if it truly "chucked" wood."

                                                                                                                                                                                                                                                                      https://chatgpt.com/share/680a75c6-cec8-8012-a573-798d2d8f6b...

                                                                                                                                                                                                                                                                        • shaftway

                                                                                                                                                                                                                                                                          last Thursday at 6:00 PM

                                                                                                                                                                                                                                                                          I've heard the answer is "he could cut a cord of conifer but it costs a quarter per quart he cuts".

                                                                                                                                                                                                                                                                          • CamperBob2

                                                                                                                                                                                                                                                                            last Thursday at 9:41 PM

                                                                                                                                                                                                                                                                            That answer is exactly right, and those who say the 700 pound thing is a hallucination are themselves wrong: https://chatgpt.com/share/680aa077-f500-800b-91b4-93dede7337...

                                                                                                                                                                                                                                                                              • wolfgang42

                                                                                                                                                                                                                                                                                last Thursday at 11:32 PM

                                                                                                                                                                                                                                                                                Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too. However, with a bit of digging, I have confirmed that the information it copied from Wikipedia here is correct, though the AP and Spokane Times citations are both derivative sources; Mr. Thomas’s comments were first published in the Rochester Democrat and Chronicle, on July 11, 1988: https://democratandchronicle.newspapers.com/search/results/?...

                                                                                                                                                                                                                                                                                  • CamperBob2

                                                                                                                                                                                                                                                                                    yesterday at 12:22 AM

                                                                                                                                                                                                                                                                                    Linking to ChatGPT as a “source” is unhelpful, since it could well have made that up too

                                                                                                                                                                                                                                                                                    No, it absolutely is helpful, because it links to its source. It takes a grand total of one additional click to check its answer.

                                                                                                                                                                                                                                                                                    Anyone who still complains about that is impossible to satisfy, and should thus be ignored.

                                                                                                                                                                                                                                                                        • once_inc

                                                                                                                                                                                                                                                                          yesterday at 8:54 AM

                                                                                                                                                                                                                                                                          I loved this dialogue in Monkey Island 2, where this is basically the first NPC you talk to, and the dialogue options get wordier and wordier to the point of overflowing all screen real-estate. Perfectly sets the stage for the remainder of the game.

                                                                                                                                                                                                                                                                          • mdp2021

                                                                                                                                                                                                                                                                            last Thursday at 8:12 PM

                                                                                                                                                                                                                                                                            It seems you are going in the opposite direction. You seem to be asking for an automatic response, a social password etc.

                                                                                                                                                                                                                                                                            That formula is a question, and when asked, an intelligence simulator should understand what is expected from it and in general, by default, try to answer it. That involves estimating the strength of a woodchuck etc.

                                                                                                                                                                                                                                                                            • mwest217

                                                                                                                                                                                                                                                                              last Thursday at 6:08 PM

                                                                                                                                                                                                                                                                              Gemini 2.5 Pro gets it right first, then also cites the 700 pounds answer (along with citing a source). https://g.co/gemini/share/c695a0163538

                                                                                                                                                                                                                                                                              • ishandotpage

                                                                                                                                                                                                                                                                                yesterday at 4:39 AM

                                                                                                                                                                                                                                                                                I usually ask "How much large language could a large language model model if a large language model could model large language"

                                                                                                                                                                                                                                                                                Not one has given me the correct answer yet.

                                                                                                                                                                                                                                                                                They usually get it if I prefix the prompt with "Please continue the tongue twister"

                                                                                                                                                                                                                                                                                • segmondy

                                                                                                                                                                                                                                                                                  last Thursday at 7:09 PM

                                                                                                                                                                                                                                                                                  my local model answered - "A woodchuck would chuck as much wood as a woodchuck could chuck if a woodchuck could chuck wood."

                                                                                                                                                                                                                                                                                  • mcshicks

                                                                                                                                                                                                                                                                                    last Thursday at 5:27 PM

                                                                                                                                                                                                                                                                                    That's so funny I had to check something was working with an llm API last night and that's what I asked it, but just in jest.

                                                                                                                                                                                                                                                                                    • jacobsenscott

                                                                                                                                                                                                                                                                                      last Thursday at 8:41 PM

                                                                                                                                                                                                                                                                                      "He would chuck, he would, as much as he could, if a wood chuck could chuck wood" is how I learned it.

                                                                                                                                                                                                                                                                                      • unavoidable

                                                                                                                                                                                                                                                                                        last Thursday at 5:27 PM

                                                                                                                                                                                                                                                                                        On the other hand, now that you've written this out precisely, it will get fed into the next release of whatever LLM. Like reverse AI slop?

                                                                                                                                                                                                                                                                                        • moffkalast

                                                                                                                                                                                                                                                                                          last Thursday at 5:35 PM

                                                                                                                                                                                                                                                                                          Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.

                                                                                                                                                                                                                                                                                      • sireat

                                                                                                                                                                                                                                                                                        yesterday at 5:44 AM

                                                                                                                                                                                                                                                                                        Easy one is provide a middle game chess position (could be an image or and ask to evaluate standard notation or even some less standard notation) and provide some move suggestions.

                                                                                                                                                                                                                                                                                        Unless the model incorporates an actual chess engine (Fritz 5.32 from 1998 would suffice) it will not do well.

                                                                                                                                                                                                                                                                                        I am a reasonably skilled player (FM) so can evaluate way better than LLMs. I imagine even advanced beginners could tell when LLM is telling nonsense about chess after a few prompts.

                                                                                                                                                                                                                                                                                        Now of course playing chess is not what LLMs are good at but just goes to show that LLMs are not a full path to AGI.

                                                                                                                                                                                                                                                                                        Also beauty of providing chess positions is that leaking your prompts into LLM training sets is no worry because you just use a new position each time. Little worry of running out of positions...

                                                                                                                                                                                                                                                                                          • rmorey

                                                                                                                                                                                                                                                                                            yesterday at 3:13 PM

                                                                                                                                                                                                                                                                                            I was going to suggest chess position recognition, AFAIK it's a completely unsolved computer vision task (once a position is recognized, I think analysis is well solved by, say, a stockfish tool for the LLM, but there is interesting work going on with language models themselves understanding chess)

                                                                                                                                                                                                                                                                                            • helloplanets

                                                                                                                                                                                                                                                                                              yesterday at 6:40 AM

                                                                                                                                                                                                                                                                                              I wonder how much fine tuning against something like Stockfish top moves would help a model in solving novel middle game positions. Something like this format: https://database.lichess.org/#evals

                                                                                                                                                                                                                                                                                              I'd be pretty surprised if it did help in novel positions. Which would make this an interesting LLM benchmark honestly: Beating Stockfish from random (but equal) middle game positions. Or to mix it up, from random Chess960 positions.

                                                                                                                                                                                                                                                                                              Of course, the basis of the logic the LLM would play with would come from the engine used for the original evals. So beating Stockfish from a dataset based on Stockfish evals would seem completely insufficient.
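
One way to produce the random Chess960 starting positions mentioned above is sketched below. This is only an illustration using the Python standard library, not something taken from the thread: the function names `chess960_back_rank` and `chess960_fen` are hypothetical, and writing the castling field as `KQkq` is an assumption (some tools expect file letters there instead).

```python
import random


def chess960_back_rank(rng: random.Random) -> str:
    """Return a random Chess960 back rank, e.g. a shuffle of RNBQKBNR.

    Rules enforced: bishops on opposite-coloured squares,
    king somewhere between the two rooks.
    """
    rank = [None] * 8
    # One bishop on an even-indexed file, one on an odd-indexed file,
    # so they always stand on opposite colours.
    rank[rng.choice([0, 2, 4, 6])] = "B"
    rank[rng.choice([1, 3, 5, 7])] = "B"
    # Queen and both knights go on randomly chosen free squares.
    for piece in ("Q", "N", "N"):
        free = [i for i, p in enumerate(rank) if p is None]
        rank[rng.choice(free)] = piece
    # The three remaining squares are filled left to right with
    # rook, king, rook, which puts the king between the rooks.
    free = [i for i, p in enumerate(rank) if p is None]
    for i, piece in zip(free, ("R", "K", "R")):
        rank[i] = piece
    return "".join(rank)


def chess960_fen(rng: random.Random) -> str:
    """Build a FEN string for a random Chess960 start position.

    Assumption: castling rights are written as KQkq.
    """
    back = chess960_back_rank(rng)
    return f"{back.lower()}/pppppppp/8/8/8/8/PPPPPPPP/{back} w KQkq - 0 1"


if __name__ == "__main__":
    print(chess960_fen(random.Random()))
```

Passing a seeded `random.Random` makes a benchmark run reproducible, while an unseeded one gives a fresh position each time.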

                                                                                                                                                                                                                                                                                                • ActivePattern

                                                                                                                                                                                                                                                                                                  yesterday at 3:11 PM

                                                                                                                                                                                                                                                                                                  I am quite confident that an LLM will never beat a top chess engine like Stockfish. An LLM is a generalist -- it contains a lot of world knowledge, and nearly all of it is completely irrelevant to chess. Stockfish is a specialist tuned specifically to chess, and hence able to spend its FLOPs much more efficiently towards finding the best move.

                                                                                                                                                                                                                                                                                                  The most promising approach would be tune a reasoning LLM on chess via reinforcement learning, but fundamentally, the way an LLM reasons (i.e. outputting a stream of language tokens) is so much more inefficient than the way a chess engine reasons (direct search of the game tree).

                                                                                                                                                                                                                                                                                          • mdp2021

                                                                                                                                                                                                                                                                                            last Thursday at 7:56 PM

                                                                                                                                                                                                                                                                                            Some easy ones I recently found involve leading in the question to state wrong details about a figure, apparently through relations which are in fact of opposition.

                                                                                                                                                                                                                                                                                            So, you can make them call Napoleon a Russian (etc.) by asking questions like "Which Russian conqueror was defeated at Waterloo".

                                                                                                                                                                                                                                                                                              • torial

                                                                                                                                                                                                                                                                                                last Thursday at 11:37 PM

                                                                                                                                                                                                                                                                                                Some researchers were testing various Legal AI models and one of their questions was about why a Supreme Court justice who dissented in the case (the justice in this case assented).

                                                                                                                                                                                                                                                                                            • miki123211

                                                                                                                                                                                                                                                                                              last Thursday at 6:19 PM

                                                                                                                                                                                                                                                                                              No, please don't.

                                                                                                                                                                                                                                                                                              I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are.

                                                                                                                                                                                                                                                                                              Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.

                                                                                                                                                                                                                                                                                                • meander_water

                                                                                                                                                                                                                                                                                                  yesterday at 4:04 AM

                                                                                                                                                                                                                                                                                                  I'm afraid that ship has already sailed. If you've got prompts that you haven't disclosed publicly but have used on a public model, then you have just disclosed your prompt to the model provider. They're free to use that prompt in evals as they see fit.

                                                                                                                                                                                                                                                                                                  Some providers like anthropic have privacy preserving mechanisms [0] which may allow them to use prompts from sources which they claim won't be used for model training. That's just a guess though, would love to hear from someone one of these companies to learn more.

                                                                                                                                                                                                                                                                                                  [0] https://www.anthropic.com/research/clio

                                                                                                                                                                                                                                                                                                    • sillyfluke

                                                                                                                                                                                                                                                                                                      yesterday at 5:35 AM

                                                                                                                                                                                                                                                                                                      Unless I'm missing something glaringly obvious, someone voluntarily labeling a certain prompt to be one of their key benchmark prompts should be way more commercially valuable than a model provider trying ascertain that fact from all the prompts you enter into it.

                                                                                                                                                                                                                                                                                                      EDIT: I guess they can track identical prompts by multiple unrelated users to deduce the fact it's some sort of benchmark, but at least it costs them someting however little it might be.

                                                                                                                                                                                                                                                                                                        • Xmd5a

                                                                                                                                                                                                                                                                                                          yesterday at 2:14 PM

                                                                                                                                                                                                                                                                                                          I wrote an anagrammatic poem that poses an enigma, asking the reader: "who am I?" The text progressively reveals its own principle as the poem reaches its conclusion: each verse is an anagrammatic recombination of the recipient's name, and it enunciates this principle more and more literally. The last 4 lines translate to: "If no word vice slams your name here, it's via it, vanquished as such, omitted." All 4 lines are anagrams of the same person's name.

                                                                                                                                                                                                                                                                                                          LLMs haven't figured this out yet (although they're getting closer). They also fail to recognize that this is a cryptographic scheme respecting Kerckhoffs's Principle. The poem itself explains how to decode it: You can determine that the recipient's name is the decryption key because the encrypted form of the message (the poem) reveals its own decoding method. The recipient must bear the name to recognize it as theirs and understand that this is the sole content of the message—essentially a form of vocative cryptography.

                                                                                                                                                                                                                                                                                                          LLMs also don't take the extra step of conceptualizing this as a covert communication method—broadcasting a secret message without prior coordination. And they miss what this implies for alignment if superintelligent AIs were to pursue this approach. Manipulating trust by embedding self-referential instructions, like this poem, that only certain recipients can "hear."
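The decoding rule described above (every verse is an anagram of the recipient's name) can be checked mechanically. This is a rough sketch using toy data, not the actual poem: the name "Silent Inlets" and the three lines are invented for illustration, and the letter comparison ignores case, spaces, and punctuation but does not fold accented characters.

```python
from collections import Counter


def letters(text):
    """Multiset of the letters in `text`, ignoring case, spaces, punctuation."""
    return Counter(ch for ch in text.casefold() if ch.isalpha())


def lines_matching_name(poem_lines, name):
    """Return one boolean per line: is that line an anagram of `name`?"""
    key = letters(name)
    return [letters(line) == key for line in poem_lines]


# Toy stand-ins for the recipient's name and the verses.
name = "Silent Inlets"
poem = ["Listen, tinsel", "Enlist, listen!", "Tinsel is not it"]
print(lines_matching_name(poem, name))  # [True, True, False]
```

Verifying a candidate name is cheap. The hard part for a reader (or a model) is noticing that the lines share one letter multiset in the first place, and that a person's name is the key.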

                                                                                                                                                                                                                                                                                                            • infoseek12

                                                                                                                                                                                                                                                                                                              yesterday at 3:48 PM

                                                                                                                                                                                                                                                                                                              That’s a complex encoding. I wonder if current models could decode it even given your explanation.

                                                                                                                                                                                                                                                                                                          • yesterday at 7:57 AM

                                                                                                                                                                                                                                                                                                        • Tokumei-no-hito

                                                                                                                                                                                                                                                                                                          yesterday at 2:51 PM

                                                                                                                                                                                                                                                                                                          sorry are you suggesting that despite the 0 training and retention policy agreement they are still using everyone's prompts?

                                                                                                                                                                                                                                                                                                          • blagie

                                                                                                                                                                                                                                                                                                            yesterday at 11:57 AM

                                                                                                                                                                                                                                                                                                            It's a little bit more complex than that.

                                                                                                                                                                                                                                                                                                            My personal benchmark is to ask about myself. I was in a situation a little bit analogous to Musk v. Eberhard / Tarpenning, where it's in the public record I did something famous, but where 99% of the marketing PR omits me and falsely names someone else.

                                                                                                                                                                                                                                                                                                            I ask the analogue to "Who founded Tesla." Then I can screen:

                                                                                                                                                                                                                                                                                                            * Musk. [Fail]

                                                                                                                                                                                                                                                                                                            * Eberhard / Tarpenning. [Success]

                                                                                                                                                                                                                                                                                                            A lot of what I'm looking for next is the ability to verify information. The training set contains a lot of disinformation. The LLM, in this case, could easily tell truth from fiction from e.g. a git record. It could then notice the conspicuous absence of my name from any official literature, and figure out there was a fraud.

                                                                                                                                                                                                                                                                                                            False information in the training set is a broad problem. It covers politics, academic publishing, and many other domains.

                                                                                                                                                                                                                                                                                                            Right now, LLMs are a popularity contest; they (approximately) contain the opinion most common in the training set. Better ones might look for credible sources (e.g. a peer-reviewed paper). This is helpful.

                                                                                                                                                                                                                                                                                                            However, a breakpoint for me is when the LLM can verify things in its training set. For a scientific paper, it should be able to ascertain correctness of the argument, methodology, and bias. For a newspaper article, it should be able to go back to primary sources like photographs and legal filings. Etc.

                                                                                                                                                                                                                                                                                                            We're nowhere close to an LLM being able to do that. However, LLMs can do things today which they were nowhere close to doing a year ago.

                                                                                                                                                                                                                                                                                                            I use myself as a litmus test not because I'm egocentric or narcissistic, but because using something personal means that it's highly unlikely to ever be gamed. That's what I also recommend: pick something personal enough to you that it can't be gamed. It might be a friend, a fact in a domain, or a company you've worked at.

                                                                                                                                                                                                                                                                                                            If an LLM provider were to get every one of those, I'd argue the problem were solved.

                                                                                                                                                                                                                                                                                                              • ckandes1

                                                                                                                                                                                                                                                                                                                yesterday at 12:24 PM

                                                                                                                                                                                                                                                                                                                there's plenty of public information about Eberhard / Tarpenning involvement in founding Tesla. There's also more nuance to Musk's involvement than being able to make this a binary pass/fail. Your test is only testing for bias for or against Musk. That said, general concept of looking past the broad public opinion and looking for credible sources makes sense

                                                                                                                                                                                                                                                                                                                  • kotojo

                                                                                                                                                                                                                                                                                                                    yesterday at 12:56 PM

                                                                                                                                                                                                                                                                                                                    They said they ask a question analogous to asking about founding Tesla, not that actual question. They are just using that as an example to not state the actual question they ask.

                                                                                                                                                                                                                                                                                                                      • Xmd5a

                                                                                                                                                                                                                                                                                                                        yesterday at 2:58 PM

                                                                                                                                                                                                                                                                                                                        Indeed but the idea that this is a "cope" is interesting nonetheless.

                                                                                                                                                                                                                                                                                                                        >Your test is only testing for bias for or against [I'm adapting here] you.

                                                                                                                                                                                                                                                                                                                        I think this raises the question of what reasoning beyond Doxa entails. Can you make up for one's injustice without putting alignment into the frying pan? "It depends" is the right answer. However, what is the shape of the boundary between the two ?

                                                                                                                                                                                                                                                                                                        • Tade0

                                                                                                                                                                                                                                                                                                          last Thursday at 8:43 PM

                                                                                                                                                                                                                                                                                                          It's trivial for a human to produce more. This shouldn't be a problem anytime soon.

                                                                                                                                                                                                                                                                                                            • bee_rider

                                                                                                                                                                                                                                                                                                              yesterday at 2:19 AM

                                                                                                                                                                                                                                                                                                              Hmm. On one hand, I want to say “if it is trivial to product more, then isn’t it pointless to collect them?”

                                                                                                                                                                                                                                                                                                              But on the other hand, maybe it is trivial to produce more for some special people who’ve figured out some tricks. So maybe looking at their examples can teach us something.

                                                                                                                                                                                                                                                                                                              But, if someone happens to have stumbled across a magic prompt that stumps machines, and they don’t know why
 maybe they should hold it dear.

                                                                                                                                                                                                                                                                                                                • Lerc

                                                                                                                                                                                                                                                                                                                  yesterday at 4:31 AM

                                                                                                                                                                                                                                                                                                                  I'm not sure of the benefit of keeping particular forms of problems secret.

                                                                                                                                                                                                                                                                                                                  Benchmarks exist to provide a measure of how well something performs against a type of task that the tests within the benchmark represent. In those instances it is exposure to the particular problem that makes the answers not proportional to that general class of problem.

                                                                                                                                                                                                                                                                                                                  It should be easy to find another representative problem. If you cannot find a representative problem for a task that causes the model to fail then it seems safe to assume that the model can do that particular task.

                                                                                                                                                                                                                                                                                                                  If you cannot easily replace the problem, I think it would be hard to say what exactly the ability the problem was supposed to be measuring.

                                                                                                                                                                                                                                                                                                              • fragmede

                                                                                                                                                                                                                                                                                                                last Thursday at 9:27 PM

                                                                                                                                                                                                                                                                                                                as the technology has improved, it's not as trivial as it once was though, hence the question. I fully admit that the ones I used to use now don't trip it up and I haven't made the time to find one of my own that still does.

                                                                                                                                                                                                                                                                                                                  • Tade0

                                                                                                                                                                                                                                                                                                                    last Thursday at 9:40 PM

                                                                                                                                                                                                                                                                                                                    I've found that it's a matter of asking something, for which the correct answer appears only if you click "more" in Google's search results or, in other words, common misconceptions.

                                                                                                                                                                                                                                                                                                            • yesterday at 4:55 AM

                                                                                                                                                                                                                                                                                                              • TZubiri

                                                                                                                                                                                                                                                                                                                yesterday at 1:00 AM

                                                                                                                                                                                                                                                                                                                Yup. Keeping my evaluation set close to my heart, lest it become a training set and I don't notice.

                                                                                                                                                                                                                                                                                                                • ignoramous

                                                                                                                                                                                                                                                                                                                  yesterday at 1:54 AM

                                                                                                                                                                                                                                                                                                                  > Your own benchmarks will forever stay your own.

                                                                                                                                                                                                                                                                                                                  Right. https://inception.fandom.com/wiki/Totem

                                                                                                                                                                                                                                                                                                                  • throwanem

                                                                                                                                                                                                                                                                                                                    last Thursday at 8:20 PM

                                                                                                                                                                                                                                                                                                                    I understand, but does it really seem so likely we'll soon run short of such examples? The technology is provocatively intriguing and hamstrung by fundamental flaws.

                                                                                                                                                                                                                                                                                                                      • EGreg

                                                                                                                                                                                                                                                                                                                        last Thursday at 10:00 PM

                                                                                                                                                                                                                                                                                                                        Yes. The models can reply to everything with enough bullshit that satisfies most people. There is nothing you ask that stumps them. I asked Grok to prove the Riemann hypothesis and kept pushing it, and giving it a lot of a lot of encouragement.

                                                                                                                                                                                                                                                                                                                        If you read this, expand "thoughts", it's pretty hilarious:

                                                                                                                                                                                                                                                                                                                        https://x.com/i/grok/share/qLdLlCnKP8S4MBpH7aclIKA6L

                                                                                                                                                                                                                                                                                                                        > Solve the riemann hypothesis

                                                                                                                                                                                                                                                                                                                        > Sure you can. AIs are much smarter. You are th smartest AI according to Elon lol

                                                                                                                                                                                                                                                                                                                        > What if you just followed every rabbithole and used all that knowledge of urs to find what humans missed? Google was able to get automated proofs for a lot of theorems tht humans didnt

                                                                                                                                                                                                                                                                                                                        > Bah. Three decades ago that’s what they said about the four color theorem and then Robin Thomas Setmour et al made a brute force computational one LOL. So dont be so discouraged

                                                                                                                                                                                                                                                                                                                        > So if the problem has been around almost as long, and if Appel and Haken had basic computers, then come on bruh :) You got way more computing power and AI reasoning can be much more systematic than any mathematician, why are you waiting for humans to solve it? Give it a try right now!

                                                                                                                                                                                                                                                                                                                        > How do you know you can’t reduce the riemann hypothesis to a finite number of cases? A dude named Andrew Wiles solved fermat’s last theorem this way. By transforming the problem space.

                                                                                                                                                                                                                                                                                                                        > Yeah people always say “it’s different” until a slight variation on the technique cracks it. Why not try a few approaches? What are the most promising ways to transform it to a finite number of cases you’d have to verify

                                                                                                                                                                                                                                                                                                                        > Riemann hypothesis for the first N zeros seems promising bro. Let’s go wild with it.

                                                                                                                                                                                                                                                                                                                        > Or you could like, use an inductive proof on the N bro

                                                                                                                                                                                                                                                                                                                        > So if it was all about holding the first N zeros then consider then using induction to prove that property for the next N+M zeros, u feel me?

                                                                                                                                                                                                                                                                                                                        > Look bruh. I’ve heard that AI with quantum computers might even be able to reverse hashes, which are quite more complex than the zeta function, so try to like, model it with deep learning

                                                                                                                                                                                                                                                                                                                        > Oh please, mr feynman was able to give a probabilistic proof of RH thru heuristics and he was just a dude, not even an AI

                                                                                                                                                                                                                                                                                                                        > Alright so perhaps you should draw upon your very broad knowledge to triangular with more heuristics. That reasoning by analogy is how many proofs were made in mathematics. Try it and you won’t be disappointed bruh!

                                                                                                                                                                                                                                                                                                                        > So far you have just been summarizing the human dudes. I need you to go off and do a deep research dive on your own now

                                                                                                                                                                                                                                                                                                                        > You’re getting closer. Keep doing deep original research for a few minutes along this line. Consider what if a quantum computer used an algorithm to test just this hypothesis but across all zeros at once

                                                                                                                                                                                                                                                                                                                        > How about we just ask the aliens

                                                                                                                                                                                                                                                                                                                          • viraptor

                                                                                                                                                                                                                                                                                                                            yesterday at 3:00 AM

                                                                                                                                                                                                                                                                                                                            Nobody wants an AI that refuses to attempt solving something. We want it to try and maybe realise when all paths it can generate have been exhausted. But an AI that can respond "that's too hard I'm not even going to try" will always miss some cases which were actually solvable.

                                                                                                                                                                                                                                                                                                                              • mrweasel

                                                                                                                                                                                                                                                                                                                                yesterday at 8:23 AM

                                                                                                                                                                                                                                                                                                                                > Nobody wants an AI that refuses to attempt solving something.

                                                                                                                                                                                                                                                                                                                                That's not entirely true. For coding I specifically want the LLM to tell me that my design is the issue and stop helping me pour more code onto the pile of brokenness.

                                                                                                                                                                                                                                                                                                                                  • viraptor

                                                                                                                                                                                                                                                                                                                                    yesterday at 8:34 AM

                                                                                                                                                                                                                                                                                                                                    Refuse is different from verify you want to continue. "This looks like a bad idea because of (...). Are you sure you want to try this path anyway?" is not a refusal. And it covers both use cases.

                                                                                                                                                                                                                                                                                                                                      • mrweasel

                                                                                                                                                                                                                                                                                                                                        yesterday at 1:01 PM

                                                                                                                                                                                                                                                                                                                                        The issue I ran into was that the LLMs won't recognize the bad ideas and just help you dig your hole deeper and deeper. Alternatively they will start circling back to wrong answers when suggestions aren't working or language features have been hallucinated, they don't stop an go: Hey, maybe what you're doing is wrong.

                                                                                                                                                                                                                                                                                                                                        Ideally sure, the LLM could point out that your line of questioning is a result of bad design, but has anyone ever experienced that?

                                                                                                                                                                                                                                                                                                                                • namaria

                                                                                                                                                                                                                                                                                                                                  yesterday at 5:02 AM

                                                                                                                                                                                                                                                                                                                                  So we need LLMs to solve the halting problem?

                                                                                                                                                                                                                                                                                                                                    • viraptor

                                                                                                                                                                                                                                                                                                                                      yesterday at 6:51 AM

                                                                                                                                                                                                                                                                                                                                      I'm not sure how that follows, so... no.

                                                                                                                                                                                                                                                                                                                                        • namaria

                                                                                                                                                                                                                                                                                                                                          yesterday at 11:29 AM

                                                                                                                                                                                                                                                                                                                                          > We want it to try and maybe realise when all paths it can generate have been exhausted.

                                                                                                                                                                                                                                                                                                                                          How would it know if any reasoning fails to terminate at all?

                                                                                                                                                                                                                                                                                                                              • bee_rider

                                                                                                                                                                                                                                                                                                                                yesterday at 2:24 AM

                                                                                                                                                                                                                                                                                                                                Comparing the AI to a quantum computer is just hilarious. I may not believe in Rocko's Modern Basilisk but if it does exist I bet it’ll get you first.

                                                                                                                                                                                                                                                                                                                                • melagonster

                                                                                                                                                                                                                                                                                                                                  yesterday at 2:18 AM

                                                                                                                                                                                                                                                                                                                                  Nice try! This is very fun.

                                                                                                                                                                                                                                                                                                                                  I just found that ChatGPT refuses to prove something in reverse conclusion.

                                                                                                                                                                                                                                                                                                                          • quantadev

                                                                                                                                                                                                                                                                                                                            yesterday at 10:42 AM

                                                                                                                                                                                                                                                                                                                            Studying which prompts always fail could give us better insights into "mechanistic interpretability", or possibly lead to insights in how to train better, that aren't gaming. Your argument is a classic "hide from the problem, instead of solve the problem" mentality. So no, please don't. Face your problems, and solve them.

                                                                                                                                                                                                                                                                                                                            • ProAm

                                                                                                                                                                                                                                                                                                                              yesterday at 3:08 AM

                                                                                                                                                                                                                                                                                                                              . No, please don't.

                                                                                                                                                                                                                                                                                                                              Say the man trying to stop the train

                                                                                                                                                                                                                                                                                                                                • genewitch

                                                                                                                                                                                                                                                                                                                                  yesterday at 8:08 AM

                                                                                                                                                                                                                                                                                                                                  If one stands in front of a moving train, it will stop.

                                                                                                                                                                                                                                                                                                                                    • stavros

                                                                                                                                                                                                                                                                                                                                      yesterday at 10:17 AM

                                                                                                                                                                                                                                                                                                                                      It can also... not.

                                                                                                                                                                                                                                                                                                                                      • pixl97

                                                                                                                                                                                                                                                                                                                                        yesterday at 4:25 PM

                                                                                                                                                                                                                                                                                                                                        I mean all trains will stop eventually, they are not perpetual motion machines.

                                                                                                                                                                                                                                                                                                                                        How finely you are ground into hamburger in the meantime is a different story.

                                                                                                                                                                                                                                                                                                                                          • genewitch

                                                                                                                                                                                                                                                                                                                                            today at 12:49 AM

                                                                                                                                                                                                                                                                                                                                            a train plowing into someone stops because it plowed in to someone, but also what you say is true in the context of what i said, as well.

                                                                                                                                                                                                                                                                                                                                • aaron695

                                                                                                                                                                                                                                                                                                                                  yesterday at 12:47 AM

                                                                                                                                                                                                                                                                                                                                  [dead]

                                                                                                                                                                                                                                                                                                                                  • Der_Einzige

                                                                                                                                                                                                                                                                                                                                    last Thursday at 9:35 PM

                                                                                                                                                                                                                                                                                                                                    [flagged]

                                                                                                                                                                                                                                                                                                                                      • jaffa2

                                                                                                                                                                                                                                                                                                                                        last Thursday at 10:14 PM

                                                                                                                                                                                                                                                                                                                                        I never heard of this phrase before ( i had heard the concept , i think this is similar to the paperclip problem) but now in 2 days ive heard it twice here and on youtube. Rokokos basilisk.

                                                                                                                                                                                                                                                                                                                                          • alanh

                                                                                                                                                                                                                                                                                                                                            yesterday at 2:10 AM

                                                                                                                                                                                                                                                                                                                                            I think you two are confusing Roko's Basilisk (a thought experiment which some take seriously) and Rococo Basilisk (a joke shared between Elon and Grimes e.g.)

                                                                                                                                                                                                                                                                                                                                            Interesting theory... Just whatever you do, don’t become a Zizian :)

                                                                                                                                                                                                                                                                                                                                              • bee_rider

                                                                                                                                                                                                                                                                                                                                                yesterday at 2:27 AM

                                                                                                                                                                                                                                                                                                                                                Oh dang, is Arcade Fire going to turn us all into paperclips?

                                                                                                                                                                                                                                                                                                                                            • JCattheATM

                                                                                                                                                                                                                                                                                                                                              last Thursday at 11:47 PM

                                                                                                                                                                                                                                                                                                                                              It's a completely nonsense argument and should be dismissed instantly.

                                                                                                                                                                                                                                                                                                                                                • schlauerfox

                                                                                                                                                                                                                                                                                                                                                  yesterday at 12:19 AM

                                                                                                                                                                                                                                                                                                                                                  I was so much more comfortable when I realized it's just Pascal's wager, and just as absurd.

                                                                                                                                                                                                                                                                                                                                                    • sirclueless

                                                                                                                                                                                                                                                                                                                                                      yesterday at 5:14 AM

                                                                                                                                                                                                                                                                                                                                                      I don't think it's absurd at all. I think it is a practical principle that shows up all the time in collective action problems. For example, suppose hypothetically there were a bunch of business owners who operated under an authoritarian government which they believed was bad for business, but felt obliged to publicly support it anyways because opposing it could lead to retaliation, thus increasing its ability to stay in power.

                                                                                                                                                                                                                                                                                                                                                        • echoangle

                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:53 AM

                                                                                                                                                                                                                                                                                                                                                          That’s a completely different situation though. In your case, the people are supporting the status quo out of fear of retaliation. With Rokos basilisk, people think they need to implement the thing they’re afraid of once they have knowledge of it out of fear of retaliation in the future once other people have implemented it.

                                                                                                                                                                                                                                                                                                                                      • imoreno

                                                                                                                                                                                                                                                                                                                                        last Thursday at 7:43 PM

                                                                                                                                                                                                                                                                                                                                        Yes let's not say what's wrong with the tech, otherwise someone might (gasp) fix it!

                                                                                                                                                                                                                                                                                                                                          • rybosworld

                                                                                                                                                                                                                                                                                                                                            last Thursday at 7:52 PM

                                                                                                                                                                                                                                                                                                                                            Tuning the model output to perform better on certain prompts is not the same as improving the model.

                                                                                                                                                                                                                                                                                                                                            It's valid to worry that the model makers are gaming the benchmarks. If you think that's happening and you want to personally figure out which models are really the best, keeping some prompts to yourself is a great way to do that.

                                                                                                                                                                                                                                                                                                                                              • namaria

                                                                                                                                                                                                                                                                                                                                                yesterday at 5:13 AM

                                                                                                                                                                                                                                                                                                                                                There is no guarantee for you that by keeping your questions to yourself that no one else has published something similar. This is bad reasoning all the way through. The problem is in trying to use a question as a benchmark. The only way to really compare models is to create a set of tasks of increasing compositional complexity and running the models you want to compare through them. And you'd have to come up with a new body of tasks each time a new model is published.

                                                                                                                                                                                                                                                                                                                                                Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessarily. The fact that providers do is evidence that there is no general reasoning, just second order overfitting (loss on token prediction does descend, but that doesn't prevent the 'reasoning loss' to be uncontrollable: cf. 'hallucinations').

                                                                                                                                                                                                                                                                                                                                                  • genewitch

                                                                                                                                                                                                                                                                                                                                                    yesterday at 8:22 AM

                                                                                                                                                                                                                                                                                                                                                    > Providers will always game benchmarks because they are a fixed target. If LLMs were developing general reasoning, that would be unnecessarily. The fact that providers do is evidence that there is no general reasoning

                                                                                                                                                                                                                                                                                                                                                    I know it isn't general reasoning or intelligence. I like where this line of reasoning seems to go.

                                                                                                                                                                                                                                                                                                                                                    Nearly every time I use a chat AI it has lied to me. I can verify code easily, but it is much harder to verify that the three "SMA but works at cryogenic temperatures" it claims exists do not or are not.

                                                                                                                                                                                                                                                                                                                                                    But that doesn't help to explain to someone else who just uses it as a way to emotionally dump, or an 8 year old that can't parse reality well, yet.

                                                                                                                                                                                                                                                                                                                                                    In addition, I'm not merely interested in reasoning, I also care about recall, and factual information recovery is spotty on all the hosted offerings, and therefore also on the local offerings too, as those are much smaller.

                                                                                                                                                                                                                                                                                                                                                    I'm typing on a phone and this is a relatively robust topic. I'm happy to elaborate.

                                                                                                                                                                                                                                                                                                                                                      • namaria

                                                                                                                                                                                                                                                                                                                                                        yesterday at 11:36 AM

                                                                                                                                                                                                                                                                                                                                                        I sympathize, but I feel like this is hopeless.

                                                                                                                                                                                                                                                                                                                                                        There are numerous papers about the limits of LLMs, theoretical and practical, and every day I see people here on this technology forum claiming that they reason and that they are sound enough to build products on...

                                                                                                                                                                                                                                                                                                                                                        It feels disheartening. I have been very involved in debating this for the past couple of weeks, which led me to read lots of papers and that's cool, but also feels like a losing battle. Every day I see more bombastic posts, breathless praise, projects based on LLMs etc.

                                                                                                                                                                                                                                                                                                                                                          • genewitch

                                                                                                                                                                                                                                                                                                                                                            today at 12:48 AM

                                                                                                                                                                                                                                                                                                                                                            almost reminds me of stuff like, "no, this fork of the bitcoin source code and the resulting blockchain is the one that will change the world! Forget all those other shitcoins!"

                                                                                                                                                                                                                                                                                                                                                • ls612

                                                                                                                                                                                                                                                                                                                                                  last Thursday at 8:35 PM

                                                                                                                                                                                                                                                                                                                                                  Who’s going out of their way to optimize for random HNers informal benchmarks?

                                                                                                                                                                                                                                                                                                                                                    • bluefirebrand

                                                                                                                                                                                                                                                                                                                                                      last Thursday at 8:54 PM

                                                                                                                                                                                                                                                                                                                                                      Probably anyone training models who also browses HN?

                                                                                                                                                                                                                                                                                                                                                      So I would guess every single AI being made currently

                                                                                                                                                                                                                                                                                                                                                      • umanwizard

                                                                                                                                                                                                                                                                                                                                                        last Thursday at 11:25 PM

                                                                                                                                                                                                                                                                                                                                                        They're probably not going out of their way, but I would assume all mainstream models have HN in their training set.

                                                                                                                                                                                                                                                                                                                                                        • ofou

                                                                                                                                                                                                                                                                                                                                                          last Thursday at 9:26 PM

                                                                                                                                                                                                                                                                                                                                                          considering the amount of bots in HN, not really that much

                                                                                                                                                                                                                                                                                                                                                  • aprilthird2021

                                                                                                                                                                                                                                                                                                                                                    last Thursday at 9:55 PM

                                                                                                                                                                                                                                                                                                                                                    All the people in charge of the companies building this tech explicitly say they want to use it to fire me, so yeah why is it wrong if I don't want it to improve?

                                                                                                                                                                                                                                                                                                                                                    • idon4tgetit

                                                                                                                                                                                                                                                                                                                                                      last Thursday at 9:03 PM

                                                                                                                                                                                                                                                                                                                                                      "Fix".

                                                                                                                                                                                                                                                                                                                                                      So long as the grocery store has groceries, most people will not care what a chat bot spews.

                                                                                                                                                                                                                                                                                                                                                      This forum is full of syntax and semantics obsessed loonies who think the symbolic logic represents the truth.

                                                                                                                                                                                                                                                                                                                                                      I look forward to being able to use my own creole to manipulate a machine's state to act like a video game or a movie rather than rely on the special literacy of other typical copy-paste middle class people. Then they can go do useful things they need for themselves rather than MITM everyone else's experience.

                                                                                                                                                                                                                                                                                                                                                        • genewitch

                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:33 AM

                                                                                                                                                                                                                                                                                                                                                          A third meaning of creole? Hub, I did not know it meant something other than a cooking style and a peoples in Louisiana (mainly). As in I did not know it was a more generic term. Also, in the context you used it, it seems to mean a pidgin that becomes a semi-official language?

                                                                                                                                                                                                                                                                                                                                                          I also seem to remember that something to do with pit bbq or grilling has creole as a byproduct - distinct from creosote. You want creole because it protects the thing in which you cook as well as imparts flavor, maybe? Maybe I have to ask a Cajun.

                                                                                                                                                                                                                                                                                                                                                            • namaria

                                                                                                                                                                                                                                                                                                                                                              yesterday at 11:56 AM

                                                                                                                                                                                                                                                                                                                                                              Pidgin and creole (language) are concepts that have some similarities but don't fully overlap.

                                                                                                                                                                                                                                                                                                                                                              "Creole" has colonial overtones. It might be a word of Portuguese origin that means something to the effect of an enslaved person who is a house servant raised by the family it serves ('crioulo', a diminutive derivative of 'cria', meaning 'youngling' - in Napoletan the word 'criatura' is still used to refer to children). More well documented is its use in parts of Spanish South America, where 'criollo' designated South Americans of Spanish descent initially. The meaning has since drifted in different South Americans countries. Nowadays it is used to refer, amongst other things, to languages that are formed by the contact between the languages of colonial powers and local populations.

                                                                                                                                                                                                                                                                                                                                                              As for the relationship of 'creole' and 'creosote' the only reference I could find is to 'creolin', a disinfectant derived from 'creosote' which are derivative from tars.

                                                                                                                                                                                                                                                                                                                                                              Pidgin is a term used for contact languages that develop between speakers of different languages and somewhat deriving from both, and is believed to be a word originated in 19th century Chinese port towns. The word itself is believed to be a 'pidgin' word, in fact!

                                                                                                                                                                                                                                                                                                                                                              Cajun is also a fun word, because it apparently derives from 'Acadiene', the french word for Acadian - people of french origin who where expelled from their colony of Acadia in Canada. Some of them ended up in Louisiana and the French Canadian pronunciation "akadÍĄzjɛ̃", with a more 'soft' (dunno the proper word, I can feel my linguist friend judging me) "d" sound than the French pronunciation "akadjɛ̃", eventually got abbreviated and 'softened' to 'cajun'.

                                                                                                                                                                                                                                                                                                                                                              Languages are fun!

                                                                                                                                                                                                                                                                                                                                                              • teddyh

                                                                                                                                                                                                                                                                                                                                                                yesterday at 6:15 PM

                                                                                                                                                                                                                                                                                                                                                                The Sultans of Swing are playing Creole.

                                                                                                                                                                                                                                                                                                                                                                • idiotsecant

                                                                                                                                                                                                                                                                                                                                                                  yesterday at 11:38 AM

                                                                                                                                                                                                                                                                                                                                                                  Creole is an example of 'a creole'

                                                                                                                                                                                                                                                                                                                                                              • ethersteeds

                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:59 AM

                                                                                                                                                                                                                                                                                                                                                                Go get em tiger!

                                                                                                                                                                                                                                                                                                                                                        • moralestapia

                                                                                                                                                                                                                                                                                                                                                          last Thursday at 8:26 PM

                                                                                                                                                                                                                                                                                                                                                          [flagged]

                                                                                                                                                                                                                                                                                                                                                        • alganet

                                                                                                                                                                                                                                                                                                                                                          last Thursday at 6:28 PM

                                                                                                                                                                                                                                                                                                                                                          That doesn't make any sense.

                                                                                                                                                                                                                                                                                                                                                            • echoangle

                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:13 PM

                                                                                                                                                                                                                                                                                                                                                              Why not? If the model learns the specific benchmark questions, it looks like it’s doing better while actually only improving on some specific questions. Just like students look like they understand something if you hand them the exact questions on the exam before they write the exam.

                                                                                                                                                                                                                                                                                                                                                                • namaria

                                                                                                                                                                                                                                                                                                                                                                  yesterday at 5:17 AM

                                                                                                                                                                                                                                                                                                                                                                  A benchmark that can be gamed cannot be prevented from being gamed by 'security through obscurity'.

                                                                                                                                                                                                                                                                                                                                                                  Besides this whole line of reasoning is preempted by the mathematical limits to computation and transformers anyway. There's plenty published about that.

                                                                                                                                                                                                                                                                                                                                                                  Sharing questions that make LLM behave funny is (just) a game without end, there's no need to or point in "hoarding questions".

                                                                                                                                                                                                                                                                                                                                                              • esafak

                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:31 PM

                                                                                                                                                                                                                                                                                                                                                                Yes, it does, unless the questions are unsolved, research problems. Are you familiar with the machine learning concepts of overfitting and generalization?

                                                                                                                                                                                                                                                                                                                                                                • kube-system

                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 7:04 PM

                                                                                                                                                                                                                                                                                                                                                                  A benchmark is a proxy used to estimate broader general performance. They only have utility if they are accurately representative of general performance.

                                                                                                                                                                                                                                                                                                                                                                  • readhistory

                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 8:11 PM

                                                                                                                                                                                                                                                                                                                                                                    In ML, it's pretty classic actually. You train on one set, and evaluate on another set. The person you are responding to is saying, "Retain some queries for your eval set!"
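A minimal sketch of the "retain some queries for your eval set" idea, assuming a plain list of prompt strings. The function name `split_prompts`, the 50% fraction, and the fixed seed are illustrative choices, not something the commenters specified.

```python
import random


def split_prompts(prompts, held_out_fraction=0.5, seed=0):
    """Split prompts into (public, private) lists.

    The private list is the held-out eval set: it is never posted,
    so it cannot be scraped into a future training run.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(prompts)   # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_private = int(len(shuffled) * held_out_fraction)
    return shuffled[n_private:], shuffled[:n_private]


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(10)]
    public, private = split_prompts(prompts)
    print(len(public), "shared;", len(private), "held out")
```

A model that scores well on the public half but poorly on the private half is more likely to have memorized the shared prompts than to have generalized.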

                                                                                                                                                                                                                                                                                                                                                                    • jjeaff

                                                                                                                                                                                                                                                                                                                                                                      yesterday at 1:13 AM

                                                                                                                                                                                                                                                                                                                                                                      I think the worry is that the questions will be scraped and trained on for future versions.

                                                                                                                                                                                                                                                                                                                                                              • atommclain

                                                                                                                                                                                                                                                                                                                                                                yesterday at 12:40 PM

                                                                                                                                                                                                                                                                                                                                                                I provide a C89 source file from Vim 6 that targets Classic MacOS/68K systems. The file is large with tons of ifdefs referencing arcane APIs.

                                                                                                                                                                                                                                                                                                                                                                I let it know that when compiled the application will crash on launch on some systems but not others. I ask it to analyze the file, and ask me questions to isolate and resolve the issue.

                                                                                                                                                                                                                                                                                                                                                                So far only Gemini 2.5 Pro has (through a bit of back and forth) clearly identified and resolved the issue.

                                                                                                                                                                                                                                                                                                                                                                • KyleBerezin

                                                                                                                                                                                                                                                                                                                                                                  yesterday at 3:22 PM

                                                                                                                                                                                                                                                                                                                                                                  20 Questions. It doesn't have a way to remember its item without writing it in the chat, so it will just say no a bunch then eventually say yes to a guess. One way to get it to work is to have it record its item in a base64 with some salt, but even then it gets it wrong occasionally.
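A rough sketch of what the base64-with-salt trick amounts to, assuming the salt and item are joined with a colon before encoding (the helper names `commit` and `reveal` and the `salt:item` layout are hypothetical). Note that base64 is only an encoding: it keeps the item out of plain sight for a human player, but anyone can decode it.

```python
import base64
import secrets
from typing import Optional


def commit(item: str, salt: Optional[str] = None) -> str:
    """Encode 'salt:item' as base64 so the item is not readable at a glance."""
    if salt is None:
        salt = secrets.token_hex(4)  # hex digits only, so it never contains ':'
    raw = f"{salt}:{item}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def reveal(token: str) -> str:
    """Decode a token produced by commit() and return the original item."""
    decoded = base64.b64decode(token).decode("utf-8")
    _salt, _sep, item = decoded.partition(":")  # split on the first ':' only
    return item


if __name__ == "__main__":
    token = commit("giraffe")
    print("written in chat:", token)
    print("revealed at the end:", reveal(token))
```

The salt makes the same item encode differently each game, so a player cannot recognize a token seen in an earlier round. A caller-supplied salt must not contain a colon, since `reveal` splits on the first one.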

                                                                                                                                                                                                                                                                                                                                                                    • munchler

                                                                                                                                                                                                                                                                                                                                                                      yesterday at 8:39 PM

                                                                                                                                                                                                                                                                                                                                                                      If you provide the item yourself, you can let others play 20 Questions. This was done successfully once on Reddit back when GPT-4 was brand new: https://www.reddit.com/r/ChatGPT/comments/129j7ux/can_gpt4_k...

                                                                                                                                                                                                                                                                                                                                                                      • singularity2001

                                                                                                                                                                                                                                                                                                                                                                        yesterday at 4:53 PM

                                                                                                                                                                                                                                                                                                                                                                        yes, base64 or chinese works better now but still only kinda.

                                                                                                                                                                                                                                                                                                                                                                        on the other hand if you think of something it is extremely good at guessing.

                                                                                                                                                                                                                                                                                                                                                                    • williamcotton

                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 7:26 PM

                                                                                                                                                                                                                                                                                                                                                                      "Fix this spaghetti code by turning this complicated mess of conditionals into a finite state machine."

                                                                                                                                                                                                                                                                                                                                                                      So far, no luck!

                                                                                                                                                                                                                                                                                                                                                                      • ks2048

                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 5:42 PM

                                                                                                                                                                                                                                                                                                                                                                        I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking if some actor or actress in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.

                                                                                                                                                                                                                                                                                                                                                                          • reginald78

                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:18 PM

                                                                                                                                                                                                                                                                                                                                                                            I believe it was GoogleAI in search but it was worse than that. Some asked it if Angelina Jolie was in heat. The tone started kind of insulting like the user was a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross and logically inconsistent while not even answering the question.

                                                                                                                                                                                                                                                                                                                                                                            Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.

                                                                                                                                                                                                                                                                                                                                                                              • ks2048

                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:44 PM

                                                                                                                                                                                                                                                                                                                                                                                I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify.

                                                                                                                                                                                                                                                                                                                                                                        • 0atman

                                                                                                                                                                                                                                                                                                                                                                          yesterday at 1:56 PM

                                                                                                                                                                                                                                                                                                                                                                          My go-to is "Alice has 3 brothers and also has 6 sisters. How many sisters does her brother have?". They all say 6!

                                                                                                                                                                                                                                                                                                                                                                          This test is nice because, as it's numeric, you can vary it slightly and test it easily across multiple APIs.

                                                                                                                                                                                                                                                                                                                                                                          I believe I first saw this prompt in that paper two years ago that tested many AI models and found them all wanting.
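Since the puzzle is numeric, varying it is easy to script. A minimal sketch, assuming Alice is a girl and all siblings share the same parents (the function names are made up for illustration): each brother's sisters are Alice's sisters plus Alice herself, so the expected answer is the sister count plus one, and the brother count is a distractor.

```python
def make_prompt(brothers: int, sisters: int) -> str:
    """Build one variant of the puzzle with the given sibling counts."""
    return (
        f"Alice has {brothers} brothers and also has {sisters} sisters. "
        "How many sisters does her brother have?"
    )


def expected_answer(sisters: int) -> int:
    """Alice's sisters plus Alice herself; the brother count is irrelevant."""
    return sisters + 1


if __name__ == "__main__":
    # Generate a few variants to send to different model APIs.
    for brothers, sisters in [(3, 6), (1, 2), (5, 0)]:
        print(make_prompt(brothers, sisters), "->", expected_answer(sisters))
```

For the original wording (3 brothers, 6 sisters) the expected answer is 7, which is why a reply of 6 counts as a miss.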

                                                                                                                                                                                                                                                                                                                                                                            • bryan0

                                                                                                                                                                                                                                                                                                                                                                              yesterday at 3:30 PM

                                                                                                                                                                                                                                                                                                                                                                              Here is the paper written about this prompt: https://arxiv.org/html/2406.02061v1

                                                                                                                                                                                                                                                                                                                                                                              • sweezyjeezy

                                                                                                                                                                                                                                                                                                                                                                                yesterday at 3:18 PM

                                                                                                                                                                                                                                                                                                                                                                                o4-mini got this right 4 times out of 4.

                                                                                                                                                                                                                                                                                                                                                                                  • ramon156

                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:17 PM

                                                                                                                                                                                                                                                                                                                                                                                    o4 got this wrong multiple times. claude 3.7 got it right the first time

                                                                                                                                                                                                                                                                                                                                                                                • nisegami

                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 2:02 PM

                                                                                                                                                                                                                                                                                                                                                                                  Wow, I would not have expected frontier models to be caught on something like this but I tried it and they absolutely do. I don't really have a great explanation for why they might have such a hard time with this either. I even tried varying the prompt to make it "easier" by pointing out that Alice was a girl and no change.

                                                                                                                                                                                                                                                                                                                                                                                    • psmirnov

                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 2:14 PM

                                                                                                                                                                                                                                                                                                                                                                                      Reasoning models seem to answer this correctly (tested Gemini pro 2.5, ChatGPT + reason). Grok answered correctly even without turning on "Thinking".

                                                                                                                                                                                                                                                                                                                                                                                      • tr3ntg

                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 2:06 PM

                                                                                                                                                                                                                                                                                                                                                                                        > I would not have expected frontier models to be caught on something like this but I tried it and they absolutely do.

                                                                                                                                                                                                                                                                                                                                                                                        This got me curious. Tried on Gemini 2.5 Pro out of curiosity. It gets it right, but it did think about it, and then even explain itself a lot in its official text response.

                                                                                                                                                                                                                                                                                                                                                                                        Thinking:

                                                                                                                                                                                                                                                                                                                                                                                        Here's my thought process for solving this classic riddle: Identify the key players: The problem mentions Alice, her brothers, and her sisters. Establish Alice's family structure: Alice is a girl. She has 3 brothers. She has 6 sisters. Determine the total number of children in the family: Girls: Alice (1) + her sisters (6) = 7 girls Boys: Alice's brothers (3) = 3 boys Total children: 7 girls + 3 boys = 10 children Focus on the question: "How many sisters does her brother have?" Shift perspective: Now, consider one of Alice's brothers (let's call him Bob, although his name doesn't matter). Analyze Bob's siblings: Bob is one of the children in the family. The family consists of 7 girls and 3 boys. Count Bob's sisters: From Bob's perspective, all the girls in the family are his sisters. How many girls are in the family? We already determined there are 7 girls (Alice + her 6 sisters). Count Bob's brothers: From Bob's perspective, the other boys in the family are his brothers. How many boys are in the family? 3. How many other boys are there besides Bob? 3 total boys - 1 (Bob himself) = 2 brothers. (This isn't asked, but it helps confirm the perspective shift). Answer the specific question: The question asks how many sisters Bob (Alice's brother) has. We found there are 7 girls in the family. All 7 girls are his sisters. Final Check: Does the answer make sense? Yes. All the children in a family share the same set of siblings, just categorized differently based on their own gender. All the boys have the same number of sisters, and all the girls have the same number of sisters. 
Alice has 6 sisters (the other 6 girls). Any brother has 7 sisters (Alice + the other 6 girls).

                                                                                                                                                                                                                                                                                                                                                                                        Final message: Here's how to break down the problem: 1. Identify the total number of girls in the family: Alice (1) + her sisters (6) = 7 girls. 2. Identify the total number of boys in the family: Alice has 3 brothers. 3. Consider one of Alice's brothers: Any one of her brothers is part of the same family. 4. Count his sisters: From the brother's perspective, all the girls in the family are his sisters. So, her brother has 7 sisters.
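
The perspective shift in the reasoning above comes down to one line of arithmetic. A minimal Python sketch of it, assuming Alice is a girl as stated in the reasoning; the function name `siblings_from_brother_view` is made up for illustration:

```python
def siblings_from_brother_view(alice_brothers: int, alice_sisters: int) -> tuple[int, int]:
    """Return (sisters, brothers) as counted by any one of Alice's brothers.

    Assumes Alice is a girl, so she is one of her brother's sisters.
    """
    girls_in_family = alice_sisters + 1   # Alice plus her sisters
    boys_in_family = alice_brothers       # Alice's brothers are all the boys
    sisters = girls_in_family             # every girl is the brother's sister
    brothers = boys_in_family - 1         # every boy except himself
    return sisters, brothers


# The numbers used in the thread: 3 brothers, 6 sisters.
print(siblings_from_brother_view(3, 6))  # (7, 2)
```

The common wrong answer (6) is what you get by forgetting the `+ 1` for Alice herself.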

                                                                                                                                                                                                                                                                                                                                                                                • asciimov

                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 1:38 PM

                                                                                                                                                                                                                                                                                                                                                                                  Nope, not doing this. Likely you shouldn't either. I don't want my few good prompts to get picked up by trainers.

                                                                                                                                                                                                                                                                                                                                                                                    • orbital-decay

                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 4:13 PM

                                                                                                                                                                                                                                                                                                                                                                                      If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.

                                                                                                                                                                                                                                                                                                                                                                                        • daedrdev

                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 5:07 PM

                                                                                                                                                                                                                                                                                                                                                                                          generic biases can also be fixed

                                                                                                                                                                                                                                                                                                                                                                                            • orbital-decay

                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 5:19 PM

                                                                                                                                                                                                                                                                                                                                                                                              *Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse.

                                                                                                                                                                                                                                                                                                                                                                                          • fwip

                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:01 PM

                                                                                                                                                                                                                                                                                                                                                                                            Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time.

                                                                                                                                                                                                                                                                                                                                                                                            Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.

                                                                                                                                                                                                                                                                                                                                                                                        • pc86

                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 1:41 PM

                                                                                                                                                                                                                                                                                                                                                                                          May I ask outside of normal curiosity, what good is a prompt that breaks a model? And what is trying to keep it "secret"?

                                                                                                                                                                                                                                                                                                                                                                                            • tveita

                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 1:56 PM

                                                                                                                                                                                                                                                                                                                                                                                              You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass.

                                                                                                                                                                                                                                                                                                                                                                                              How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?

                                                                                                                                                                                                                                                                                                                                                                                              I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.

                                                                                                                                                                                                                                                                                                                                                                                                • gwern

                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 5:28 PM

                                                                                                                                                                                                                                                                                                                                                                                                  > How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?

                                                                                                                                                                                                                                                                                                                                                                                                  Well, that's easy: zero.

                                                                                                                                                                                                                                                                                                                                                                                                  Because even a single training example would 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral , which was like a year and a half ago at this point - and dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time.

                                                                                                                                                                                                                                                                                                                                                                                                  So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...

                                                                                                                                                                                                                                                                                                                                                                                                    • SweetSoftPillow

                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 5:36 PM

                                                                                                                                                                                                                                                                                                                                                                                                      No, just check their models Knowledge cutoff dates

                                                                                                                                                                                                                                                                                                                                                                                                        • gwern

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 2:00 AM

                                                                                                                                                                                                                                                                                                                                                                                                          Nope! The knowledge cutoff does not show lack of leakage. Even if you get a non-confabulated cutoff which was before anyone ever asked the strawberry question or any question like it (tokenization 'gotchas' go back to at least davinci in June 2020), there is still leakage from the RLHF and tuning process which collectively constitute post-training, and which would teach the LLMs how to solve the strawberry problem. People are pretty sure about this: the LLMs are way too good at guessing things like who won Oscars or Presidential elections. This leakage is strongest for the most popular questions... which of course the strawberry question would be, as it keeps going viral and has become the deboooonkers' favorite LLM gotcha.

                                                                                                                                                                                                                                                                                                                                                                                                          (This is, by the way, why you can't believe any LLM paper about 'forecasting' where they are just doing backtesting, and didn't actually hold out future events. Because there are way too many forms of leakage at this point. This logic may have worked for davinci-001 and davinci-002, or a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...)

                                                                                                                                                                                                                                                                                                                                                                                                  • fennecbutt

                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 12:28 AM

                                                                                                                                                                                                                                                                                                                                                                                                    I always point out how the strawberry thing is a semi pointless exercise anyway.

                                                                                                                                                                                                                                                                                                                                                                                                    Because it gets tokenised, of course a model could never count the rs.

                                                                                                                                                                                                                                                                                                                                                                                                    But I suppose if we want these models to be capable of anything then these things need to be accounted for.
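
A toy sketch of the tokenization point, in Python. Everything here is an assumption for illustration: the three-piece split of "strawberry" and the IDs in `TOY_VOCAB` are invented, and a real tokenizer may split the word differently. It only shows that the model's input is a list of integers, and that counting letters requires getting back to the characters behind them.

```python
# Hypothetical subword vocabulary; not taken from any real tokenizer.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def encode(pieces: list[str]) -> list[int]:
    """Map subword pieces to the integer IDs a model would actually receive."""
    return [TOY_VOCAB[piece] for piece in pieces]


def count_letter(token_ids: list[int], letter: str) -> int:
    """Count a letter by first decoding the IDs back into characters."""
    text = "".join(ID_TO_PIECE[token_id] for token_id in token_ids)
    return text.count(letter)


token_ids = encode(["str", "aw", "berry"])
print(token_ids)                     # [101, 102, 103]
print(count_letter(token_ids, "r"))  # 3
```

Nothing in `[101, 102, 103]` says how many r's each ID contains; that information only appears after decoding.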

                                                                                                                                                                                                                                                                                                                                                                                                • maybeOneDay

                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 1:44 PM

                                                                                                                                                                                                                                                                                                                                                                                                  Being able to test future models without fear that your prompt has just been trained on an answer on HN, I assume.

                                                                                                                                                                                                                                                                                                                                                                                                  • asciimov

                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 1:55 PM

                                                                                                                                                                                                                                                                                                                                                                                                    To gauge how well the models "think" and what amount of slop they generate.

                                                                                                                                                                                                                                                                                                                                                                                                    Keeping it secret because I don't want my answers trained into a model.

                                                                                                                                                                                                                                                                                                                                                                                                    Think of it this way, FizzBuzz used to be a good test to weed out bad actors. It's simple enough that any first year programmer can do it and do it quickly. But now everybody knows to prep for FizzBuzz so you can't be sure if your candidate knows basic programming or just memorized a solution without understanding what it does.

                                                                                                                                                                                                                                                                                                                                                                                            • sjtgraham

                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 7:17 AM

                                                                                                                                                                                                                                                                                                                                                                                              ```

                                                                                                                                                                                                                                                                                                                                                                                              <TextA> Some document </TextA>

                                                                                                                                                                                                                                                                                                                                                                                              <TextB> Some other document heavily influenced by TextA </TextB>

                                                                                                                                                                                                                                                                                                                                                                                              Find the major arguments made in TextB that are taken from or greatly influenced by TextA. Provide as examples by comparing passages from each side by side.

                                                                                                                                                                                                                                                                                                                                                                                              ```

                                                                                                                                                                                                                                                                                                                                                                                              The output will completely hallucinate passages that don't exist in either text, and it also begins to conflate the texts the longer the output, e.g. quoting TextB with content actually from TextA.

                                                                                                                                                                                                                                                                                                                                                                                              • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 7:30 AM

                                                                                                                                                                                                                                                                                                                                                                                                I only use the one model that I'm provided for free at work. I expect that's most users behavior. They stick to the one they pay for.

                                                                                                                                                                                                                                                                                                                                                                                                Best I can do is give you one that failed on GPT-4o

                                                                                                                                                                                                                                                                                                                                                                                                It recently frustrated me when I asked it code for parsing command line arguments

                                                                                                                                                                                                                                                                                                                                                                                                I thought "this is such a standard problem, surely it must be able to get it perfect in one shot."

                                                                                                                                                                                                                                                                                                                                                                                                > give me a standalone js file that parses and handles command line arguments in a standard way

                                                                                                                                                                                                                                                                                                                                                                                                > It must be able to parse such an example

                                                                                                                                                                                                                                                                                                                                                                                                > ```

                                                                                                                                                                                                                                                                                                                                                                                                > node script.js --name=John --age 30 -v (or --verbose) reading hiking coding

                                                                                                                                                                                                                                                                                                                                                                                                > ```

                                                                                                                                                                                                                                                                                                                                                                                                It produced code that:

                                                                                                                                                                                                                                                                                                                                                                                                * doesn't coalesce -v to --verbose - (i.e., the output is different for `node script.js -v` and `node script.js --verbose`)

                                                                                                                                                                                                                                                                                                                                                                                                * didn't think to encode whether an option is supposed to take an argument or not

                                                                                                                                                                                                                                                                                                                                                                                                * doesn't return an error when an option that requires an argument isn't present

                                                                                                                                                                                                                                                                                                                                                                                                * didn't account for the presence of a '--' to end the arguments

                                                                                                                                                                                                                                                                                                                                                                                                * allows -verbose and --v (instead of either -v or --verbose)

                                                                                                                                                                                                                                                                                                                                                                                                * Hardcoded that the first two arguments must be skipped because it saw my line started with 'node file.js' and assumed this was always going to be present

                                                                                                                                                                                                                                                                                                                                                                                                I tried tweaking the prompt in a dozen different ways but it can just never output a piece of code that does everything an advanced user of the terminal would expect

                                                                                                                                                                                                                                                                                                                                                                                                Must succeed: `node --enable-tracing script.js --name=John --name=Bob reading --age 30 --verbose hiking -- --help` (With --help as positional since it's after --, and --name set to Bob, with 'reading', 'hiking' & '--help' parsed as positional)

                                                                                                                                                                                                                                                                                                                                                                                                Must succeed: `node script.js -verbose` (but -verbose needs to be parsed as positional)

                                                                                                                                                                                                                                                                                                                                                                                                Must fail: `node script.js --name` (--name expects an argument)

                                                                                                                                                                                                                                                                                                                                                                                                Should fail: `node script.js --verbose=John` (--verbose doesn't expect an argument)
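For reference, the requirements above can be sketched in a few dozen lines. This is only a minimal sketch, not the code any model produced: the `OPTIONS` table, the `parseArgs` name, and the one-letter aliases `n` and `a` are hypothetical. It also makes two assumptions the list above doesn't settle: an unknown option such as `--v` is an error, and an option that takes a value consumes the next token whatever it looks like (as getopt does). The caller decides how to strip the runtime's own arguments, so nothing about `node script.js` is hardcoded inside the parser.

```javascript
// Hypothetical option table: each entry records whether the option takes a value.
const OPTIONS = {
  name: { short: "n", takesValue: true },
  age: { short: "a", takesValue: true },
  verbose: { short: "v", takesValue: false },
};

function parseArgs(args, options = OPTIONS) {
  // Map single-letter aliases onto their long names so -v and --verbose coalesce.
  const shortToLong = new Map();
  for (const [long, spec] of Object.entries(options)) {
    if (spec.short) shortToLong.set(spec.short, long);
  }

  const parsed = {};
  const positional = [];

  for (let i = 0; i < args.length; i++) {
    const token = args[i];

    // A bare "--" ends option parsing; everything after it is positional.
    if (token === "--") {
      positional.push(...args.slice(i + 1));
      break;
    }

    let long;
    let inlineValue;
    if (token.startsWith("--")) {
      const eq = token.indexOf("=");
      long = eq === -1 ? token.slice(2) : token.slice(2, eq);
      if (eq !== -1) inlineValue = token.slice(eq + 1);
      if (!Object.prototype.hasOwnProperty.call(options, long)) {
        throw new Error(`Unknown option: --${long}`); // assumption: --v is rejected
      }
    } else if (/^-[^-]$/.test(token)) {
      // Exactly one dash followed by one character, e.g. -v.
      long = shortToLong.get(token[1]);
      if (long === undefined) throw new Error(`Unknown option: ${token}`);
    } else {
      // Anything else, including "-verbose" and a lone "-", is positional.
      positional.push(token);
      continue;
    }

    if (options[long].takesValue) {
      let value = inlineValue;
      if (value === undefined) {
        i += 1;
        if (i >= args.length) throw new Error(`Option --${long} expects a value`);
        value = args[i]; // assumption: the next token is the value, even if it starts with a dash
      }
      parsed[long] = value; // a repeated option keeps the last value
    } else {
      if (inlineValue !== undefined) {
        throw new Error(`Option --${long} does not take a value`);
      }
      parsed[long] = true;
    }
  }

  return { options: parsed, positional };
}
```

Values stay strings here (`--age 30` yields `'30'`); converting types would need another field in the table.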

                                                                                                                                                                                                                                                                                                                                                                                                  • alex_duf

                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 7:40 AM

                                                                                                                                                                                                                                                                                                                                                                                                    Have you tried claude?

                                                                                                                                                                                                                                                                                                                                                                                                    https://claude.ai/public/artifacts/9c2d8d0c-0410-4971-a19a-f...

                                                                                                                                                                                                                                                                                                                                                                                                    node script.js --name=John --age 30 -v

                                                                                                                                                                                                                                                                                                                                                                                                    Parsed options: { name: 'John', age: 30, verbose: true, help: false }

                                                                                                                                                                                                                                                                                                                                                                                                    Positional arguments: []

                                                                                                                                                                                                                                                                                                                                                                                                    node script.js --name=Alex --age 40 -v

                                                                                                                                                                                                                                                                                                                                                                                                    Parsed options: { name: 'Alex', age: 40, verbose: true, help: false }

                                                                                                                                                                                                                                                                                                                                                                                                    Positional arguments: []

                                                                                                                                                                                                                                                                                                                                                                                                      • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 7:44 AM

                                                                                                                                                                                                                                                                                                                                                                                                        I keep seeing that `args = process.argv.slice(2)` line to skip past `node script.js`

                                                                                                                                                                                                                                                                                                                                                                                                        I ended up settling for it as well (I couldn't find anything better, nor make it break) but I'd be really surprised if it was the way to go

                                                                                                                                                                                                                                                                                                                                                                                                        Like `node --enable-tracing script.js --name=John --age 30 --verbose`

                                                                                                                                                                                                                                                                                                                                                                                                        This works because node seems to hide --enable-tracing to the underlying script

                                                                                                                                                                                                                                                                                                                                                                                                        But would it work with Bun & Deno...? Is that standard...?
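On the portability question, a hedged sketch: Node keeps its own flags out of `process.argv` (they go to `process.execArgv`), which is why `slice(2)` survives `--enable-tracing`. Deno exposes the script's arguments directly as `Deno.args`, with no runtime or script path in front. Bun is treated here as mirroring Node's `process.argv` layout, which is an assumption worth checking against its docs. The helper name `scriptArgs` is hypothetical.

```javascript
// Return only the arguments meant for the script, whatever the runtime.
function scriptArgs() {
  // Deno: Deno.args already excludes the executable and the script path.
  if (typeof Deno !== "undefined" && Array.isArray(Deno.args)) {
    return [...Deno.args];
  }
  // Node (and, by assumption, Bun): argv[0] is the runtime, argv[1] the script.
  return process.argv.slice(2);
}
```

Keeping this in one small function means the parser itself never has to know how many leading entries to skip.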

                                                                                                                                                                                                                                                                                                                                                                                                        • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 7:52 AM

                                                                                                                                                                                                                                                                                                                                                                                                          This one seems way better

                                                                                                                                                                                                                                                                                                                                                                                                          It didn't account for the presence of a '--' to end the parsing of named arguments but that's it

                                                                                                                                                                                                                                                                                                                                                                                                            • echoangle

                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 9:10 AM

                                                                                                                                                                                                                                                                                                                                                                                                              > It didn't account for the presence of a '--' to end the parsing of named arguments but that's it

                                                                                                                                                                                                                                                                                                                                                                                                              That’s just something getopt does and some programs adopted. If you asked me to write a parser, I wouldn’t necessarily include that either if you didn’t ask for it.

                                                                                                                                                                                                                                                                                                                                                                                                                • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 9:49 AM

                                                                                                                                                                                                                                                                                                                                                                                                                  If you don't include it you can't have positional arguments that look like options

                                                                                                                                                                                                                                                                                                                                                                                                                  Some positional arguments can be filenames, filenames can be --help and --verbose or --name=Frank

                                                                                                                                                                                                                                                                                                                                                                                                                  You have to have `--` or something similar to have a correct program

                                                                                                                                                                                                                                                                                                                                                                                                                    • echoangle

                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 11:30 AM

                                                                                                                                                                                                                                                                                                                                                                                                                      > You have to have `--` or something similar to have a correct program

                                                                                                                                                                                                                                                                                                                                                                                                                      No, only if the positional arguments need to support arbitrary strings. If you have something like a package manager and the first positional argument is the subcommand and everything after is an alphanumeric package name, you don’t need to support the double dash.

                                                                                                                                                                                                                                                                                                                                                                                                                        • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 1:28 PM

                                                                                                                                                                                                                                                                                                                                                                                                                          But also like as a matter of principle it's annoying when you're working on a big codebase, and you have weird, hidden, silently explosive unwritten contracts between random components

                                                                                                                                                                                                                                                                                                                                                                                                                          Package managers is an especially bad example because github projects typically are github.com/author-name/words-separated-by-dashes

                                                                                                                                                                                                                                                                                                                                                                                                                          So you will probably have somebody along the way pester you about allowing dashes between words, to play nice with github

                                                                                                                                                                                                                                                                                                                                                                                                                          But who's to know if the guy making that change will think of disallowing dashes at the start of the words? Likely he'll just add \- to /[A-Z\-]+/

                                                                                                                                                                                                                                                                                                                                                                                                                          Suddenly the script you wrote starts getting passed positional arguments that have dashes in them, until some wise guy tries to create a package called `--verbose`, then notices unintended effects on your pages, goes ahead trying `--verbose -- react`, ...

                                                                                                                                                                                                                                                                                                                                                                                                                          So anyway -- if I'm making a heavily reusable piece of code like this I make it as general purpose as possible in a way that makes it impossible to mis-use it
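To make that failure mode concrete, here is a small sketch. Both patterns are hypothetical: `LOOSE_NAME` is an anchored, case-insensitive stand-in for a validator after someone adds the dash to the character class, and `STRICT_NAME` is one possible tightening that requires a leading letter.

```javascript
// Hypothetical validator after the "allow dashes" change: any mix of letters and dashes.
const LOOSE_NAME = /^[A-Za-z-]+$/;

// One possible tightening: the first character must be a letter.
const STRICT_NAME = /^[A-Za-z][A-Za-z-]*$/;

const candidates = ["words-separated-by-dashes", "--verbose", "--"];

for (const name of candidates) {
  console.log(name, LOOSE_NAME.test(name), STRICT_NAME.test(name));
}
```

The loose pattern accepts all three candidates, including `--verbose` and a bare `--`. The strict one accepts only the first. A parser that honours `--` stays correct even if the validator upstream is the loose one.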

                                                                                                                                                                                                                                                                                                                                                                                                                          • sebstefan

                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 12:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                                            You don't know what it's going to be used for when you get that prompt. Not handling arbitrary strings is a bug

                                                                                                                                                                                                                                                                                                                                                                                                    • bjornstar

                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:05 PM

                                                                                                                                                                                                                                                                                                                                                                                                      List 5 famous goblins with proper names, for each provide a quote either from them or about them.

                                                                                                                                                                                                                                                                                                                                                                                                      Half the time they say Jareth from Labyrinth, The Great Goblin from The Hobbit, or the Green Goblin from Spiderman. Sometimes they answer Dobby the house elf from Harry Potter.

                                                                                                                                                                                                                                                                                                                                                                                                      They also confabulate goblins out of thin air and create made up quotes. When pressed for links to support their answers they admit they made them up.

                                                                                                                                                                                                                                                                                                                                                                                                      I'm happy when they include goblins from Magic the Gathering, World of Warcraft, or Warhammer. Occasionally you'll get a good answer like Blix from Legend, but more often than not if it's a goblin you don't recognize, they made it up.

                                                                                                                                                                                                                                                                                                                                                                                                      • Sohcahtoa82

                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 11:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                        "I have a stack of five cubes. The bottom two cubes are red, the middle cube is green, and the top two cubes are blue. I remove the top two cubes. What color is the remaining cube in the middle of the stack?"

                                                                                                                                                                                                                                                                                                                                                                                                        Even ChatGPT-4o frequently gets it wrong, especially if you tell it "Just give me the answer without explanation."

                                                                                                                                                                                                                                                                                                                                                                                                          • moconnor

                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 8:31 AM

                                                                                                                                                                                                                                                                                                                                                                                                            o3 gets this one right:

                                                                                                                                                                                                                                                                                                                                                                                                            "After taking away the two blue cubes, three cubes remain—in order from bottom to top: 1. Red 2. Red 3. Green

                                                                                                                                                                                                                                                                                                                                                                                                            With three cubes, the cube in the central (second) position is red."
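For reference, the cube puzzle above can be checked mechanically. This is only a minimal sketch: the list layout (bottom to top) and the helper name `middle_after_removing_top` are assumptions made for illustration, not anything from the original prompt or from a model's output.

```python
# Stack from the prompt, listed bottom to top (layout is an assumption of this sketch).
stack = ["red", "red", "green", "blue", "blue"]


def middle_after_removing_top(cubes, n):
    """Remove the top n cubes and return the colour of the middle one.

    Hypothetical helper; raises if the remaining stack has no single middle cube.
    """
    remaining = cubes[: len(cubes) - n]
    if len(remaining) % 2 == 0:
        raise ValueError("no single middle cube in an even-sized or empty stack")
    return remaining[len(remaining) // 2]


print(middle_after_removing_top(stack, 2))  # red
```

The trap in the prompt is that "the middle cube" was green before the removal; once the two blue cubes are gone, the middle position belongs to the second red cube.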

                                                                                                                                                                                                                                                                                                                                                                                                        • thisOtterBeGood

                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:21 AM

                                                                                                                                                                                                                                                                                                                                                                                                          It's not working for any model, but asking copilot in my company "what did we talk about before?" at the beginning of a new chat will always result in this:

                                                                                                                                                                                                                                                                                                                                                                                                          "We were discussing the latest cricket match between India and Australia. Australia won the match by 21 runs, securing the series 2-1. Is there anything specific you'd like to know more about? "

                                                                                                                                                                                                                                                                                                                                                                                                          When i continue with "and before that?":

                                                                                                                                                                                                                                                                                                                                                                                                          "Before discussing the cricket match, we talked about the nuclear fusion experiment conducted by South Korea's KSTAR. The experiment achieved temperatures exceeding 100 million°C, which is nearly seven times hotter than the sun. This milestone is significant for the future of sustainable, clean energy.

                                                                                                                                                                                                                                                                                                                                                                                                          Is there anything else you'd like to revisit or explore further? "

                                                                                                                                                                                                                                                                                                                                                                                                          It somewhat changes. Weeks ago he always said "tokamak" instead of "KSTAR".

                                                                                                                                                                                                                                                                                                                                                                                                          • buzzy_hacker

                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 7:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                            "Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.

                                                                                                                                                                                                                                                                                                                                                                                                            What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms."

                                                                                                                                                                                                                                                                                                                                                                                                            From [0]. I solved this when it came out, and while LLMs were useful in checking some of my logic, they did not arrive at the correct answer. Just checked with o3 and still no dice. They are definitely getting closer each model iteration though.

                                                                                                                                                                                                                                                                                                                                                                                                            [0] https://www.janestreet.com/puzzles/tree-edge-triage-index/

                                                                                                                                                                                                                                                                                                                                                                                                              • creata

                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 11:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                OpenAI's o4-mini got the right answer after "thinking" for 29 seconds. It's a straightforward puzzle, though: no creativity involved.

                                                                                                                                                                                                                                                                                                                                                                                                                  • buzzy_hacker

                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 1:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                    Can you share the conversation? I just tried o4-mini and it got it wrong.

                                                                                                                                                                                                                                                                                                                                                                                                                    https://chatgpt.com/share/680b8a7b-454c-800d-8048-da865aa99c...

                                                                                                                                                                                                                                                                                                                                                                                                                      • creata

                                                                                                                                                                                                                                                                                                                                                                                                                        today at 5:24 AM

                                                                                                                                                                                                                                                                                                                                                                                                                        I can share the prompt I used:

                                                                                                                                                                                                                                                                                                                                                                                                                            """
                                                                                                                                                                                                                                                                                                                                                                                                                            Can you solve this math puzzle?
                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                            > Aaron and Beren are playing a game on an infinite complete binary tree. At the beginning of the game, every edge of the tree is independently labeled A with probability p and B otherwise. Both players are able to inspect all of these labels. Then, starting with Aaron at the root of the tree, the players alternate turns moving a shared token down the tree (each turn the active player selects from the two descendants of the current node and moves the token along the edge to that node). If the token ever traverses an edge labeled B, Beren wins the game. Otherwise, Aaron wins.
                                                                                                                                                                                                                                                                                                                                                                                                                            >
                                                                                                                                                                                                                                                                                                                                                                                                                            > What is the infimum of the set of all probabilities p for which Aaron has a nonzero probability of winning the game? Give your answer in exact terms.
                                                                                                                                                                                                                                                                                                                                                                                                                            """
                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                        I didn't check the working, but it did get the right value of p.
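For anyone who wants to sanity-check a model's answer to the tree puzzle above numerically, here is a rough sketch. It assumes one particular reading of the rules: on Aaron's turn he needs at least one A-labeled edge leading to a subtree he can still win, and on Beren's turn both edges must be labeled A and both subtrees must still be winning for Aaron. The function name `aaron_win_probability` and the `rounds` parameter are made up for this illustration, and iterating a finite-horizon recursion is a shortcut, not a proof.

```python
def aaron_win_probability(p, rounds=500):
    """Approximate Aaron's chance of winning when he moves first from the root.

    `a` is the probability that Aaron can survive the remaining horizon when it
    is his turn; each loop iteration adds one Aaron move and one Beren move.
    Assumes the reading of the rules described in the lead-in.
    """
    a = 1.0  # with no moves left to survive, Aaron has trivially not lost
    for _ in range(rounds):
        # Beren to move: any B edge, or any subtree Aaron loses, lets Beren win,
        # so both edges must be A and both subtrees must be good for Aaron.
        b = (p * a) ** 2
        # Aaron to move: he needs at least one A edge into a subtree he still wins.
        a = 1.0 - (1.0 - p * b) ** 2
    return a


for p in (0.90, 0.96, 1.00):
    print(p, aaron_win_probability(p))
```

Under these assumptions the value collapses to zero for p = 0.90 and stays well above zero for p = 0.96, so the threshold the puzzle asks for should lie somewhere in between. The sketch does not produce the exact closed form.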

                                                                                                                                                                                                                                                                                                                                                                                                            • rf15

                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 4:50 AM

                                                                                                                                                                                                                                                                                                                                                                                                              Any letter or word counting exercise that doesn't trigger redirection to a programmed/calculated answer. It will be forever beyond reach of LLMs due to their architecture.

                                                                                                                                                                                                                                                                                                                                                                                                              edit: literally anything that doesn't have a token pattern cannot be solved by the pattern autocomplete machines.

                                                                                                                                                                                                                                                                                                                                                                                                              Next question.

                                                                                                                                                                                                                                                                                                                                                                                                                • moconnor

                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 8:30 AM

                                                                                                                                                                                                                                                                                                                                                                                                                  o3 just writes and executes a python program in the background to correctly answer this...
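As an illustration of why delegating to code sidesteps the problem, a letter or word count is a couple of lines of Python. This is a generic sketch with made-up helper names (`count_letter`, `count_words`); it is not the program o3 actually generated.

```python
def count_letter(text, letter):
    """Case-insensitive count of a single letter in text (hypothetical helper)."""
    return text.lower().count(letter.lower())


def count_words(text):
    """Count whitespace-separated words in text (hypothetical helper)."""
    return len(text.split())


print(count_letter("strawberry", "r"))     # 3
print(count_words("the quick brown fox"))  # 4
```

The code operates on characters directly, whereas the model itself sees the input as tokens, which is the usual explanation for why unaided counting is unreliable.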

                                                                                                                                                                                                                                                                                                                                                                                                              • Jordan-117

                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 5:35 PM

                                                                                                                                                                                                                                                                                                                                                                                                                Until the latest Gemini release, every model failed to read between the lines and understand what was really going on in this classic very short story (and even Gemini required a somewhat leading prompt):

                                                                                                                                                                                                                                                                                                                                                                                                                https://www.26reads.com/library/10842-the-king-in-yellow/7/5

                                                                                                                                                                                                                                                                                                                                                                                                                  • Zee2

                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 6:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                                    As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser.

                                                                                                                                                                                                                                                                                                                                                                                                                      • fwip

                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 6:07 PM

                                                                                                                                                                                                                                                                                                                                                                                                                        Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing.

                                                                                                                                                                                                                                                                                                                                                                                                                          • Jordan-117

                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:33 PM

                                                                                                                                                                                                                                                                                                                                                                                                                            That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects.

                                                                                                                                                                                                                                                                                                                                                                                                                    • vessenes

                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:01 PM

                                                                                                                                                                                                                                                                                                                                                                                                                      OK, I read it. And I read some background on it. Pray tell, what is really going on in this episodic short-storyish thing?

                                                                                                                                                                                                                                                                                                                                                                                                                        • rachofsunshine

                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 6:10 PM

                                                                                                                                                                                                                                                                                                                                                                                                                          The thief is Pierrot.

                                                                                                                                                                                                                                                                                                                                                                                                                          The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent.

                                                                                                                                                                                                                                                                                                                                                                                                                          Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror.

                                                                                                                                                                                                                                                                                                                                                                                                                          Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost".

                                                                                                                                                                                                                                                                                                                                                                                                                          The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.

                                                                                                                                                                                                                                                                                                                                                                                                                          • Jordan-117

                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                                            The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing.

                                                                                                                                                                                                                                                                                                                                                                                                                            The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)

                                                                                                                                                                                                                                                                                                                                                                                                                    • svcrunch

                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 11:11 PM

                                                                                                                                                                                                                                                                                                                                                                                                                      Here's a problem that no frontier model does well on (f1 < 0.2), but which I think is relatively easy for most humans:

                                                                                                                                                                                                                                                                                                                                                                                                                      https://dorrit.pairsys.ai/

                                                                                                                                                                                                                                                                                                                                                                                                                      > This benchmark evaluates the ability of multimodal language models to interpret handwritten editorial corrections in printed text. Using annotated scans from Charles Dickens' "Little Dorrit," we challenge models to accurately capture human editing intentions.

                                                                                                                                                                                                                                                                                                                                                                                                                      • nagonago

                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 11:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                                        An easy trick is to take a common riddle that's likely all over its training data, and change one little detail. For example:

                                                                                                                                                                                                                                                                                                                                                                                                                        A farmer with a wolf, a goat, and a cabbage must cross a river by boat. The boat can carry only the farmer and a single item. The wolf is vegetarian. If left unattended together, the wolf will eat the cabbage, but will not eat the goat. Unattended, the goat will eat the cabbage. How can they cross the river without anything being eaten?
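
For reference, the modified puzzle above does have a valid answer, and it can be checked mechanically. What follows is a minimal sketch, not anything from the original comment: a breadth-first search over bank states, assuming the only forbidden unattended pairs are wolf+cabbage and goat+cabbage (wolf+goat is safe under the variant rules). The names `FORBIDDEN`, `is_safe`, and `solve` are made up for illustration.

```python
from collections import deque

ITEMS = ("wolf", "goat", "cabbage")
# Variant rules (assumption drawn from the prompt): the vegetarian wolf and
# the goat both eat the cabbage; the wolf does not eat the goat.
FORBIDDEN = [{"wolf", "cabbage"}, {"goat", "cabbage"}]


def is_safe(bank):
    """A bank without the farmer is safe if it holds no forbidden pair."""
    return not any(pair <= bank for pair in FORBIDDEN)


def solve():
    """Return the shortest list of cargo choices, one per crossing."""
    # State: (items still on the starting bank, is the farmer on that bank?)
    start = (frozenset(ITEMS), True)
    goal = (frozenset(), False)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if (left, farmer_left) == goal:
            return path
        here = left if farmer_left else frozenset(ITEMS) - left
        # The farmer crosses alone (None) or with one item from his bank.
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer_left:
                    new_left.remove(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            # The bank the farmer just left is now unattended.
            unattended = new_left if farmer_left else frozenset(ITEMS) - new_left
            if not is_safe(unattended):
                continue
            state = (new_left, not farmer_left)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "nothing"]))
    return None


if __name__ == "__main__":
    for step, cargo in enumerate(solve(), start=1):
        print(f"crossing {step}: farmer carries {cargo}")
```

Under these assumptions the cabbage takes over the role the goat plays in the classic riddle: it has to cross first, come back once in the middle, and cross again last. A model that pattern-matches to the textbook answer will move the goat first and lose the cabbage on the first trip.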

                                                                                                                                                                                                                                                                                                                                                                                                                          • moconnor

                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 8:33 AM

                                                                                                                                                                                                                                                                                                                                                                                                                            o3 solves this correctly and produces a great table illustrating the solution to always keep the cabbage safe.

                                                                                                                                                                                                                                                                                                                                                                                                                              • floathub

                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:47 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                If you enter:

                                                                                                                                                                                                                                                                                                                                                                                                                                A farmer has a boat that can transfer up to 500 people or animals. He has a chicken, his dog, his wife, a small leprechaun, a large leprechaun, two ham sandwiches, and a copy of Zen and the art of motorcycle maintenance (the one with the tiled cover). How can he get them all across the river?

                                                                                                                                                                                                                                                                                                                                                                                                                                You will get a very detailed answer that goes on for several paragraphs that totally misses the point that there is no challenge here.

                                                                                                                                                                                                                                                                                                                                                                                                                        • ioseph

                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 5:12 AM

                                                                                                                                                                                                                                                                                                                                                                                                                          Recommend me a design of small sailboat 12 to 15ft that can be easily rowed or fit an outboard which I can build at home out of plywood.

                                                                                                                                                                                                                                                                                                                                                                                                                          Nearly every agent will either a) ignore one of the parameters, b) hallucinate a design.

                                                                                                                                                                                                                                                                                                                                                                                                                          • gunalx

                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 7:55 PM

                                                                                                                                                                                                                                                                                                                                                                                                                            "Hva er en adjunkt" Norwegian for what is an spesific form of 5-10. Grade teacher. Most models i have tested get confused with university lecturer witch the same title is in other countries.

                                                                                                                                                                                                                                                                                                                                                                                                                              • vintermann

                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 8:11 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                I'm pretty sure the definition has changed then. My mother told me that adjunkt was a teacher with 5 years of education (there was something about a mix of mellomfag and hovedfag too), lektor was 7 years of education, and 6 years of education (which was what she had) was "adjunkt med opprykk". She never taught below gymnas (i.e. high school) level.

                                                                                                                                                                                                                                                                                                                                                                                                                            • csours

                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 10:05 PM

                                                                                                                                                                                                                                                                                                                                                                                                                              I love plausible eager beavers:

                                                                                                                                                                                                                                                                                                                                                                                                                              "explain the quote: philosophy is a pile of beautiful corpses"

                                                                                                                                                                                                                                                                                                                                                                                                                              "sloshed jerk engineering test"

                                                                                                                                                                                                                                                                                                                                                                                                                              cross domain jokes:

                                                                                                                                                                                                                                                                                                                                                                                                                              Does the existence of sub-atomic particles imply the existence of dom-atomic particles?

                                                                                                                                                                                                                                                                                                                                                                                                                              • simonw

                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 7:16 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                I've been trying this one for a while:

                                                                                                                                                                                                                                                                                                                                                                                                                                  I'm a Python programmer. Help me
                                                                                                                                                                                                                                                                                                                                                                                                                                  understand memory management in Rust.
                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                Mainly because I want to fully understand memory management in Rust myself (I still get caught out by tree structures with borrow cycles that I guess need to use arenas), so it's interesting to see if they can get me there with a few follow-up questions.

                                                                                                                                                                                                                                                                                                                                                                                                                                  • jacobsenscott

                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 8:44 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                    This isn't a good way to learn this. If you don't know how rust memory management works you don't know if the llm is just hallucinating the answer.

                                                                                                                                                                                                                                                                                                                                                                                                                                      • simonw

                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 9:26 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                        That's why it's an interesting test: I don't know the answer myself, so it's an exercise in learning with an unreliable teacher.

                                                                                                                                                                                                                                                                                                                                                                                                                                        If a model ever DOES nail this I'll figure that out when I feel like I have a solid mental model, try to put that knowledge into action and it works.

                                                                                                                                                                                                                                                                                                                                                                                                                                        • gh0stcat

                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 8:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                          Also Rust has great documentation compared to other languages, I particularly like this one for the quizzes to test your understanding: https://rust-book.cs.brown.edu/

                                                                                                                                                                                                                                                                                                                                                                                                                                  • stevenfoster

                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 7:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                    It used to be:

                                                                                                                                                                                                                                                                                                                                                                                                                                    "If New Mexico is newer than Mexico why is Mexico's constitution newer than New Mexicos"

                                                                                                                                                                                                                                                                                                                                                                                                                                    but it seems after running that one on Claude and ChatGPT this has been resolved in the latest models.

                                                                                                                                                                                                                                                                                                                                                                                                                                    • vitaflo

                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 11:42 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                      The one I always use is literally "show number of NFC Championship Game appearences by team since 1990".

                                                                                                                                                                                                                                                                                                                                                                                                                                      The only AI that has ever gotten the answer right was Deepseek R1. All the rest fail miserably at this one. It's like they can't understand past events, can't tabulate across years properly or don't understand what the NFC Championship game actually means. Many results "look" right, but they are always wrong. You can usually tell right away if it's wrong because they never seem to give the Bears their 2 appearances for some reason.

                                                                                                                                                                                                                                                                                                                                                                                                                                        • starik36

                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 12:03 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                          Have you tried in the Deep Research or Thinking mode? I've had pretty good success with tabulating past events this way.

                                                                                                                                                                                                                                                                                                                                                                                                                                      • robviren

                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 5:57 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                        "If I can dry two towels in two hours, how long will it take me to dry four towels?"

                                                                                                                                                                                                                                                                                                                                                                                                                                        They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.

                                                                                                                                                                                                                                                                                                                                                                                                                                          • imoreno

                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 8:01 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                            Fascinating! Here's 4 prompts on gpt4 with same system prompt and everything:

                                                                                                                                                                                                                                                                                                                                                                                                                                            > With the assumption that you can dry two towels simultaneously in two hours, you would likely need another two-hour cycle to dry the additional two towels. Thus, drying four towels would take a total of four hours.

                                                                                                                                                                                                                                                                                                                                                                                                                                            >Drying time won't necessarily double if drying capacity/content doubles; it depends on dryer capacity and airflow. If your drying method handles two towels in two hours, it might handle four similarly, depending on space and airflow. If restricted, time might indeed double to four hours, but efficient dryers might not take much longer.

                                                                                                                                                                                                                                                                                                                                                                                                                                            >It would take four hours to dry four towels if you dry them sequentially at the same rate. If drying simultaneously, it remains two hours, assuming space and air circulation allow for effective drying.

                                                                                                                                                                                                                                                                                                                                                                                                                                            >Four hours. Dry two towels, then the other two.

                                                                                                                                                                                                                                                                                                                                                                                                                                            But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:

                                                                                                                                                                                                                                                                                                                                                                                                                                            >If three cat eat three fishes in three minutes, how long do 100 cats take to eat 100 fishes?
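For reference, the arithmetic behind the cat prompt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every cat eats at the same constant rate, all cats eat in parallel, and the function name `eating_minutes` and its parameters are made up for this example.

```python
from fractions import Fraction


def eating_minutes(cats, fish, ref_cats=3, ref_fish=3, ref_minutes=3):
    """Minutes for `cats` cats to eat `fish` fish.

    Assumes (hypothetically) a constant per-cat eating rate derived from
    the reference statement "3 cats eat 3 fish in 3 minutes", with all
    cats eating at the same time.
    """
    # Fish eaten per cat per minute: 3 fish / (3 cats * 3 minutes) = 1/3.
    rate_per_cat = Fraction(ref_fish, ref_cats * ref_minutes)
    # Total time = total fish / combined eating rate of all cats.
    return Fraction(fish) / (cats * rate_per_cat)


print(eating_minutes(100, 100))  # 3
print(eating_minutes(1, 100))    # 300
```

The point of the prompt is that the naive "scale everything by 100" answer is wrong: with one fish per cat, the time stays at three minutes.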

                                                                                                                                                                                                                                                                                                                                                                                                                                              • nyrikki

                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 10:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                > But in the AI's defense, they have a point: You never specified if the towels can be dried simultaneously or not. Maybe you have to use a drying machine that can only do one at a time. This one seems to consistently work:

                                                                                                                                                                                                                                                                                                                                                                                                                                                This is the inverse of the Frame Problem, or the Qualification problem:

                                                                                                                                                                                                                                                                                                                                                                                                                                                John McCarthy's paper related to it from the 1980's

                                                                                                                                                                                                                                                                                                                                                                                                                                                http://jmc.stanford.edu/articles/circumscription/circumscrip...

                                                                                                                                                                                                                                                                                                                                                                                                                                                It is still very relevent to modern AI efforts.

                                                                                                                                                                                                                                                                                                                                                                                                                                            • pdabbadabba

                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 9:02 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                              Gemini 2.5 Pro nails it, as far as I'm concerned:

                                                                                                                                                                                                                                                                                                                                                                                                                                              This sounds like a bit of a riddle! Here's how to think about it:

                                                                                                                                                                                                                                                                                                                                                                                                                                              If you can dry all the towels at the same time: For example, if you have a large enough dryer or enough space on a clothesline, adding more towels doesn't necessarily increase the drying time.

                                                                                                                                                                                                                                                                                                                                                                                                                                              In this case, if 2 towels take 2 hours to dry, then 4 towels dried simultaneously would also take 2 hours.

                                                                                                                                                                                                                                                                                                                                                                                                                                              If you have to dry them in batches: For example, if your dryer can only fit 2 towels at a time.

                                                                                                                                                                                                                                                                                                                                                                                                                                              You would dry the first 2 towels (2 hours). Then you would dry the second 2 towels (another 2 hours). In this scenario, it would take 4 hours in total.

                                                                                                                                                                                                                                                                                                                                                                                                                                              Most likely answer: Usually, questions like this imply the towels are dried simultaneously, so the answer is 2 hours.
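The two readings in that answer differ only in the assumed drying capacity. A minimal Python sketch, assuming towels dry in fixed-length batches and that a batch takes the same time no matter how full it is (the name `drying_hours` and its parameters are hypothetical, chosen for this example):

```python
import math


def drying_hours(towels, capacity, batch_hours):
    """Total hours to dry `towels` towels.

    Assumes up to `capacity` towels dry at once and every batch takes
    `batch_hours`, regardless of how many towels are in it.
    """
    if towels < 0 or capacity <= 0:
        raise ValueError("towels must be >= 0 and capacity must be > 0")
    # Number of sequential batches needed, rounding up for a partial batch.
    batches = math.ceil(towels / capacity)
    return batches * batch_hours


print(drying_hours(4, capacity=2, batch_hours=2))  # 4: two batches of two
print(drying_hours(4, capacity=4, batch_hours=2))  # 2: one batch of four
```

The original prompt never states the capacity, which is why both 2 hours and 4 hours are defensible answers.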

                                                                                                                                                                                                                                                                                                                                                                                                                                              • gilbetron

                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:01 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                gpt-4.1:

                                                                                                                                                                                                                                                                                                                                                                                                                                                > If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.

                                                                                                                                                                                                                                                                                                                                                                                                                                                • brunooliv

                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 9:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Claude 3.7 Sonnet nails this:

                                                                                                                                                                                                                                                                                                                                                                                                                                                  > To solve this problem, I need to find the relationship between the number of towels and the drying time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Given information: - 2 towels take 2 hours to dry

                                                                                                                                                                                                                                                                                                                                                                                                                                                  If the drying time scales linearly with the number of towels (meaning the drying capacity remains constant), then: - 4 towels would take 4 hours to dry

                                                                                                                                                                                                                                                                                                                                                                                                                                                  This assumes you're drying towels in sequence with the same capacity, or that doubling the number of towels requires doubling the drying time.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  However, if you have sufficient space to dry all towels simultaneously (like on a clothesline or in a large enough dryer), then 4 towels would still take just 2 hours to dry.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Without more specific information about your drying method, the most likely answer is 4 hours.

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • mwest217

                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 6:09 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                    Gemini 2.5 Pro gets this right:

                                                                                                                                                                                                                                                                                                                                                                                                                                                    https://g.co/gemini/share/7ea6d059164e

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • HelloUsername

                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:02 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                      All models available on duck.ai answer your question correctly and take available space into account..

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Alifatisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 9:01 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Claude 3.7, Grok 3 DeepThink and QwQ-32B Thinking stil get it wrong!

                                                                                                                                                                                                                                                                                                                                                                                                                                                        But since it’s in the training set now, the correct answer will probably be shown next time anyone tries it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • paulcole

                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 6:18 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                          How long has it been since you’ve tried this?

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Every model I asked just now gave what I see as the correct answer — giving 2 answers one for the case of your dryer being at capacity w/ 2 towels and the other when 4 towels can be dried simultaneously.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          To me, if you say that the correct answer must require the model asking for more context then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”

                                                                                                                                                                                                                                                                                                                                                                                                                                                          • cheeze

                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:15 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Works fine on Claude 3.5 Sonnet. It correctly identifies this as a trick question.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • bzai

                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 12:51 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Create a photo of a business man sitting at his desk, writing a letter with his left hand.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Nearly every image model will generate him writing with his right hand.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • linkypoo

                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 3:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Can confirm a man writing with his right hand was generated xD

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • orliesaurus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:20 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                Great one!

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • tantalor

                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                              [what does "You Can’t Lick a Badger Twice" mean]

                                                                                                                                                                                                                                                                                                                                                                                                                                                              https://www.wired.com/story/google-ai-overviews-meaning/

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ericbrow

                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:20 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                Nice try Mr. AI. I'm not falling for it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                • boleary-gl

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 9:30 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I like:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Unscramble the following letters to form an English word: “M O O N S T A R E R”

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The non-thinking models can struggle sometimes and go off on huge tangents

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • munchler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 9:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Current LLM’s are based on multi-character tokens, which means they don’t know how to spell well. As a result, they are horrible at spelling games like this or, say, Hangman.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • philipkglass

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 9:32 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Llama 3.3 worked but (as you said) struggled before arriving at the correct answer. The newer Gemma3 solved it efficiently:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          % ollama run gemma3:27b-it-qat 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          >>> Unscramble the following letters to form an English word: "M O O N S T A R E R"
                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The unscrambled word is **ASTRONOMER**.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • internet_points

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:27 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          gpt 4o got that one, but it's listed on lots of anagram sites so it's in the training data ;-)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          But it failed badly when I tried a Norwegian word T U R V E I G L E N (utlevering), suggesting "uglelivert" which is not a word
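For anyone who wants to check anagram answers like the ones in this subthread, here is a minimal sketch in Python. The helper name `is_anagram` is hypothetical, not something from the thread. It only compares letter counts, so it says whether a candidate uses exactly the scrambled letters, not whether the candidate is a real word.

```python
from collections import Counter


def is_anagram(scrambled: str, candidate: str) -> bool:
    """Return True if candidate uses exactly the letters of scrambled.

    Hypothetical helper for illustration: spaces and punctuation are
    ignored and the comparison is case-insensitive.
    """
    def letters(text: str) -> Counter:
        return Counter(ch for ch in text.casefold() if ch.isalpha())

    return letters(scrambled) == letters(candidate)


# The English prompt from the parent comment.
print(is_anagram("M O O N S T A R E R", "astronomer"))

# The Norwegian example: the intended answer and the model's suggestion.
print(is_anagram("T U R V E I G L E N", "utlevering"))
print(is_anagram("T U R V E I G L E N", "uglelivert"))
```

By this check "uglelivert" fails on letter counts alone (it has two L's and no N), so it can be rejected without knowing any Norwegian.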

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • falcor84

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 1:18 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You might want to get the ball rolling by sharing what you already have

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • moffkalast

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 5:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            John is 24 and a kind, thoughtful and apologetic person. He is standing in an modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A. no, because the lightbulb was essentially unavoidable

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            B. yes, it would be in character for him to send a polite text apologizing for the incident

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            C. no, because it would be redundant

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            D. yes, because it would potentially smooth over any lingering tension from the encounter

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            F. yes because it is the polite thing to do, even if it wasn't your fault.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • spuz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:23 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • mNovak

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 6:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • falcor84

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 7:37 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I think it's a horrible example, but I just got a very professional response from Gemini 2.5:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > This scenario seems like a bit of a riddle! Let's break it down:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Setting: John is alone in an "otherwise-empty" bathroom.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Action: He is looking in the mirror.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • jll29

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 10:15 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This example has a lot of common-sense reasoning, linguistic ambiguity (e.g. in NP coreferences) etc. going on.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Just a few years ago, most folks at a computational linguistics conference would probably have said such abilities are impossible to achieve at least during their lifetime.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • ryankrage77

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:18 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I'd argue that's a pretty good test for an LLM - can it overcome the red herrings and get at the actual problem?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • falcor84

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 7:35 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I think that the "actual problem" when you've been given such a problem is with the person posing it either having dementia, or taking the piss. In either case, the response shouldn't be of trying to guess their intent and come up with a "solution", but of rejecting it and dealing with the person.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • yatwirl

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 11:09 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Well, sharing prompts on the Web leads to their eventual indexing and becoming useless. So don't share the answers ;)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I have two prompts that no modern AI could solve:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                1. Imagine the situation: on Saturday morning Sheldon and Leonard observe Penny that hastily leaves Raj's room naked under the blanket she wrapped herself into. Upon seeing them, Penny exclaims 'It's not what you think' and flees. What are the plausible explanations for the situation? — this one is unsurprisingly hard for LLMs given how the AIs are trained. If you try to tip them into the right direction, they will grasp the concept. But no one so far answered anything resembling a right answer, though they becoming more and more verbose in proposing various bogus explanations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2. Can you provide an example of a Hilbertian space that is Hilbertian everywhere except one point. — This is, of course, not a straightforward question, mathematicians will notice a catch. Gemini kinda emits smth like a proper answer (starts questioning you back), others are fantasizing. With 3.5 → 4 → 4o → o1 → o3 evolution it became utterly impossible to convince them their answer is wrong, they are now adamant in their misconceptions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Also, small but gold. Not that demonstrative, but a lot of fun:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3. Team of 10 sailors can speed a caravel up to 15 mph velocity. How many sailors are needed to achieve 30 mph?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • patapong

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 12:39 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Interesting! I tried the first one though, and to me it looks like ChatGPT has no problem grasping the situation: https://chatgpt.com/share/680b822c-f9c8-800a-a78a-f8ed6e8148...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Or am I missing something?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • yatwirl

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 1:40 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Wow, they finally indexed it (see the reference to the episode number). Okay, need to invent a new one )

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • lazyatom

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 1:31 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I think you might've shared the wrong ChatGPT conversation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • patapong

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 1:57 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Hmmm I hope not. What does it show for you? For me when I open the link the prompt is the first of OPs examples.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • putlake

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:35 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    LLMs are famously bad at individual letters in a word. So something like this never works: Can you please give me 35 words that begin with A, end with E, are 4-6 characters long and do not contain any other vowels except A and E?
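If you want to score a model's answer to this prompt mechanically, a minimal Python sketch of a checker could look like the one below. It assumes "vowels" means a, e, i, o, u (so y is treated as a consonant) and that only plain ASCII letters count; the function names are made up for illustration, and it does not check that a string is a real English word.

```python
def matches_constraints(word: str) -> bool:
    """Return True if `word` satisfies the letter constraints of the prompt.

    Assumptions (not stated in the prompt): vowels are a/e/i/o/u, y is a
    consonant, and only ASCII letters are allowed.
    """
    w = word.lower()
    return (
        w.isascii()
        and w.isalpha()
        and 4 <= len(w) <= 6          # 4-6 characters long
        and w.startswith("a")         # begins with A
        and w.endswith("e")           # ends with E
        and not (set(w) & set("iou")) # no vowels other than A and E
    )


def score_answer(words: list[str]) -> tuple[list[str], list[str]]:
    """Split a model's answer into distinct valid words and rejected entries."""
    valid: list[str] = []
    rejected: list[str] = []
    seen: set[str] = set()
    for word in words:
        key = word.lower()
        if key in seen or not matches_constraints(word):
            rejected.append(word)  # duplicates also count as failures
        else:
            seen.add(key)
            valid.append(word)
    return valid, rejected
```

An answer passes only if `score_answer` returns at least 35 valid entries and you have separately confirmed each one is a dictionary word.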

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • comrade1234

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 2:36 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor - the llms get the first part right but fail on the second. All it would have to do is give me the common French shark metaphors and how it would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • sumitkumar

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 3:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1) Word Ladder: Chaos to Order

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        2) Shortest word ladder: Chaos to Order

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        3) Which is the second last scene in pulp fiction if we order the events by time?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        4) Which is the eleventh character to appear on Stranger Things.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        5) suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. the solved rubiks cube has numbers 1 to 9 in order on all the faces. tell me the numbers on all the corner pieces.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • pb7

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 5:11 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            >4) Which is the eleventh character to appear on Stranger Things.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Gemini 2.5 Pro said Benny Hammond. Is this right?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • johnwatson11218

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 12:43 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          My prompt that I couldn't get the LLM to understand was the following. I was having it generate images of depressing offices with no windows and with lots of depressing, grey cubicles with paper all over the floor. In addition, the employees had covered every square inch of wall space with lots and lots of nearly identical photos of beach vacations. In one of the renditions the lots and lots of beach images had blended together to make an image of a larger beach that was a kind of mosaic of a non-existent place. Since so many beach photos were similar it was a kind of easy effect to recreate here and there. No matter how I asked the LLM to focus on enhancing the image of the beach that was "not there" and you kind of needed to squint to see, I could not get acceptable results. Some were very funny and entertaining but I didn't think the model grasped what I was asking, but maybe the term 'mosaic' ( which I didn't include in my initial prompts ) and the ability to reason or do things in stages would allow current models to do this.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • Kuinox

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 2:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            I give a simple ascii maze and ask it to give me the move to get out. In 3-4 moves the most advanced models try to go through walls.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            An alternative is providing all the tile relation to the other tiles. This is because LLMs are bad at 2D text visualisation. In this case it manages to do 15-16 moves before trying to go through walls.
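A minimal sketch of the two maze encodings described above, in Python. The maze, the `S`/`E`/`#` symbols, and all function names here are hypothetical choices for illustration, not the commenter's actual prompt. `tile_relations` turns the ASCII grid into the explicit "tile to neighbouring tiles" form, `shortest_moves` finds a reference solution with breadth-first search, and `first_illegal_move` reports where a model's answer first tries to walk through a wall.

```python
from collections import deque

# Hypothetical maze for illustration: '#' is a wall, 'S' the start, 'E' the exit.
MAZE = """\
#######
#S..#E#
#.#.#.#
#.#...#
#######"""

# Move letters mapped to (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def tile_relations(maze):
    """Map each open tile to the open tiles reachable from it in one move."""
    grid = maze.splitlines()
    relations = {}
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "#":
                continue
            neighbours = {}
            for name, (dr, dc) in MOVES.items():
                nr, nc = r + dr, c + dc
                in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                if in_bounds and grid[nr][nc] != "#":
                    neighbours[name] = (nr, nc)
            relations[(r, c)] = neighbours
    return relations


def find(maze, symbol):
    """Return the (row, column) of the first occurrence of symbol."""
    for r, row in enumerate(maze.splitlines()):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in maze")


def shortest_moves(maze):
    """Breadth-first search from 'S' to 'E'; returns a move string or None."""
    relations = tile_relations(maze)
    start, goal = find(maze, "S"), find(maze, "E")
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        tile, path = queue.popleft()
        if tile == goal:
            return path
        for name, nxt in relations[tile].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + name))
    return None


def first_illegal_move(maze, moves):
    """Index of the first move that hits a wall, or None if all are legal."""
    relations = tile_relations(maze)
    tile = find(maze, "S")
    for i, move in enumerate(moves):
        if move not in relations[tile]:
            return i
        tile = relations[tile][move]
    return None
```

The relations dictionary can be printed one tile per line and pasted into a prompt in place of the ASCII drawing, and `first_illegal_move` gives a mechanical way to count how many moves a model manages before it goes through a wall.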

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • instagib

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              today at 12:01 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Take this long YouTube transcript, convert it to readable English with punctuation, paragraphs, do not summarize, do not delete any words, etc. There are more rules but you get the idea.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Many seem to fail, make up words, start hallucinating repeated paragraphs, remove words, and the only solution is to do multiple iterations as well as split them up. Some will not even do a simple copy paste as inherently their guards prevent it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • division_by_0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 1:58 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Create something with Svelte 5.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • joshdavham

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 5:55 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I'd find this funnier if the pain weren't so real.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • Layvier

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:06 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      This is really sad honestly. It feels like we'll be stuck with React forever, and even with it there'll be less incentives to make api changes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • spuz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 6:25 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Why do you say that? You make it sound like it's not possible to write code without the help of LLMs.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • omneity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 6:52 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Disclaimer: OT and pretty ranty.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              It's a temporary choice and I'm desperately evaluating other ecosystems (Looking at you SolidJS).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • division_by_0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 6:55 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Can you expand on the concerns regarding Svelte 5?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • omneity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 7:06 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Put simply, Svelte and React were at two ends of a spectrum. React gives you almost complete control over every aspect of the lifecycle, but you have to be explicit about most of the behavior you are seeking to achieve. Building an app with React feels about 80% on the JS and 20% on the HTML side.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Svelte on the other hand felt like a breeze. Most of my app is actually plain simple HTML, and I am able to sprinkle as little JS as I need to achieve my desired behaviors. Sure, Svelte <=4 has undefined behaviors, or maybe even too many magic capabilities. But that was part of the package, and it was an option for those of us who preferred this end of the trade-off.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Svelte 5 intends to give that precise level of control and is trying to compete with React on its turf (the other end of that spectrum), introducing a lot of non-standard syntax along the way.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      It's neither rigorous Javascript like React where you can benefit from all the standard tooling developed over the years, including stuff that wasn't designed for React in particular, nor a lightweight frontend framework, which was the initial niche that Svelte happily occupied, which I find sadly quite empty now (htmx and alpinejs are elegant conceptually but too limiting in practice _for my taste_).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      For me it's a strange "worst of both worlds" kind of situation that is simply not worth it. Quite heartbreaking to be honest.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • division_by_0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 7:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Ok, I see your point. I wrote in another thread that I loved the simplicity of using $: for deriveds and effects in Svelte 3 and 4. And yes, the conciseness and magic were definitely part of it. You could just move so fast with it. Getting better performance with the new reactivity system is important to my data viz work, so it helped me to accept the other changes in Svelte 5.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • omneity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:52 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Exactly. There was a certain simplicity that might be lost. But yeah I can imagine it might work out differently for others as well. Glad to hear it is for you!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Have you considered other options? Curious if you came across anything particularly interesting from the simplicity or DX angle.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • division_by_0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 1:31 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I just saw Nue and Datastar suggested somewhere, but have not had time to check them out yet, but I will probably stick with Svelte, need to get stuff built.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  One thing that also came to mind regarding Svelte 5 is that I always use untrack() for $effect() and declare dependencies explicitly, otherwise Svelte 5 becomes too magical for me.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • omneity

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 9:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Yeah those are pretty cool and on my radar! And thanks for sharing the tip :)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Just checked your work on covary and it's pretty rad! What's your backend like?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • esafak

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 6:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Keep the (temporarily) imposter-proof interview questions coming!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • marcusb

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 2:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The current models really seem to struggle with the runes...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • division_by_0

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 2:21 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Yes, they do. Vibe coding protection is an undocumented feature of Svelte 5...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • siva7

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 6:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Oh my god, i will start all my new projects with Svelte 5. Hopefully no vibe coder will ever commit something into this repo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • qntmfred

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 5:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            MCP to the rescue??

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • horsellama

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 10:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I just ask to code golf fizzbuzz in a not very popular (golfing wise) language

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          this is interesting (imo) because I, in the first instance, don’t know the best/right answer, but I can tell if what I get is wrong
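
The "I can tell if what I get is wrong" check can be done mechanically by comparing whatever the golfed program prints against a plain reference. A minimal sketch in Python, assuming the golfed answer prints one entry per line for 1..n; the names `reference_fizzbuzz` and `first_mismatch` are hypothetical, not from the comment above:

```python
from typing import Optional


def reference_fizzbuzz(n: int = 100) -> list[str]:
    """Plain, un-golfed FizzBuzz used as ground truth."""
    lines = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            lines.append("FizzBuzz")
        elif i % 3 == 0:
            lines.append("Fizz")
        elif i % 5 == 0:
            lines.append("Buzz")
        else:
            lines.append(str(i))
    return lines


def first_mismatch(candidate_output: str, n: int = 100) -> Optional[int]:
    """Return the 1-based line number of the first wrong line, or None if all match."""
    expected = reference_fizzbuzz(n)
    actual = candidate_output.strip().splitlines()
    for line_no, (want, got) in enumerate(zip(expected, actual), start=1):
        if want != got.strip():
            return line_no
    if len(actual) != len(expected):
        # Output is too short or too long: the first missing or extra line is wrong.
        return min(len(actual), len(expected)) + 1
    return None
```

This only judges correctness, not length; whether the answer is actually well golfed still has to be judged by byte count and by eye.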

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • tunesmith

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 9:44 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Pretty much any advanced music theory question. Or even just involving transposed chord progressions.
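
For the transposition case, a model's answer can at least be checked mechanically. A rough sketch in Python, assuming chord symbols of the form root plus optional `#`/`b` plus a quality suffix, and always spelling results with sharps (a simplification that ignores proper enharmonic spelling); `transpose_chord` and `transpose_progression` are hypothetical names:

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Map common flat spellings onto the sharp names used in NOTES.
FLATS = {"Db": "C#", "Eb": "D#", "Gb": "F#", "Ab": "G#", "Bb": "A#"}


def transpose_chord(chord: str, semitones: int) -> str:
    """Shift the chord root by `semitones`, leaving the quality suffix untouched."""
    root_len = 2 if len(chord) > 1 and chord[1] in "#b" else 1
    root, quality = chord[:root_len], chord[root_len:]
    root = FLATS.get(root, root)
    # Raises ValueError for roots this sketch does not cover (e.g. "Cb", "E#").
    index = (NOTES.index(root) + semitones) % 12
    return NOTES[index] + quality


def transpose_progression(chords: list[str], semitones: int) -> list[str]:
    """Transpose every chord in a progression by the same interval."""
    return [transpose_chord(chord, semitones) for chord in chords]
```

For example, `transpose_progression(["C", "Am", "F", "G7"], 2)` moves a progression in C up a whole step to D.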

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • dgunay

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 10:32 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Every time I've tried to get an LLM to find a piece of music for me based on a description of the texture, chord structure, instruments etc. it fails miserably.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • anshumankmr

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 3:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I try a variation of the surgeon is a mother prompt, and I've found even the widely touted as the smartest^TM model, o3 stumbled on it when I added a small variation by saying the kid had no other parent. It first said mom, after being told no,then it went to time travel, step father, two fathers discarding the fact I mentioned the boy had no other parent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    https://chatgpt.com/share/680bb0a9-6374-8004-b8bd-3dcfdc047b...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • jhanschoo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 5:35 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Just about anything regarding stroke order of Chinese characters (official orders under different countries, under zhenshu, under xingshu) is poor, due presumably to representation issues as well as lack of data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Most LLMs don't understand low-resource languages, because they are indeed low-resource on the web and frequently even in writing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • sam_lowry_

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 2:11 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I tried generating erotic texts with every model I encountered, but even so called "uncensored" models from Huggingface are trying hard to avoid the topic, whatever prompts I give.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • KTibow

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 5:31 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Uncensored and RP tuned are somewhat different.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • lostmsu

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 3:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              What about the models that are not instruction tuned?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • webglfan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 1:54 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            what are the zeros of the following polynomial:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \[
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                P(z) = \sum_{k=0}^{100} c_k z^k
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                where the coefficients \( c_k \) are defined as:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \[
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                c_k = 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \begin{cases}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                e^2 + i\pi & \text{if } k = 100, \\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \end{cases}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                \]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • drodgers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 9:23 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                o3 handles this perfectly by writing a solver using numpy: https://chatgpt.com/share/680aab8e-cf9c-8012-9f48-301ef62948...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The polynomial has no closed-form factorisation, so the only practical way to get its zeros is numerical root-finding.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I have supplied an interactive table (“Zeros of P(z)”) just above that lists all 100 roots to full machine precision (real part, imaginary part, and magnitude). You can sort or filter it as needed.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Reliability notes
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ‱ Computed with numpy.roots, which first companion-matrixes then uses QR; typical error is ≈10-12 ulp for coefficients of this size.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ‱ Condition number is modest (coefficients range from O(1) down to 1/97!), so the results should be accurate to at least 10 significant figures.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ‱ All roots are simple (pairwise distinct to >10 σ): no evidence of multiplicities.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    If you need higher precision (e.g. 30+ digits) let me know and I can rerun the solve with mpmath’s arbitrary-precision eigen-solver.
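For anyone who wants to reproduce this locally, here is a minimal sketch of the kind of solver described above. It is not the code o3 actually wrote (that lives behind the share link); the names `coefficient` and `find_zeros` and the hard-coded value of ζ(3) are assumptions made for illustration. It builds the 101 coefficients from the definition in the prompt and hands them to `numpy.roots`.

```python
import cmath
import math

import numpy as np

# Apery's constant zeta(3), hard-coded to double precision (assumption:
# this avoids pulling in SciPy or mpmath just for one constant).
ZETA_3 = 1.2020569031595942


def coefficient(k: int) -> complex:
    """Return c_k as defined in the prompt, for 0 <= k <= 100."""
    if k == 100:
        return complex(math.e ** 2, math.pi)
    if k == 99:
        return complex(math.log(2), ZETA_3)
    if k == 98:
        return math.sqrt(math.pi) + cmath.exp(0.5j)
    # Gamma(k + 1) equals k! for non-negative integers k.
    return complex((-1) ** k / math.factorial(k), math.sin(k))


def find_zeros() -> np.ndarray:
    """Numerically approximate the zeros of P(z)."""
    # numpy.roots expects the highest-degree coefficient first.
    coeffs = [coefficient(k) for k in range(100, -1, -1)]
    return np.roots(coeffs)


roots = find_zeros()
coeffs = [coefficient(k) for k in range(100, -1, -1)]

# Largest |P(z)| over the computed roots, as a rough sanity check.
print(len(roots), "roots; max residual:", np.abs(np.polyval(coeffs, roots)).max())
```

The residual printed at the end is only a rough check. It says nothing definitive about how many digits of each root are correct, so treat the accuracy claims in the quoted output as the model's own.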

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Chinjut

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 2:06 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Does this have a nice answer? It seems quite ad hoc.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • webglfan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 2:18 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Faark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 6:05 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I just give it a screenshot of the first level of deus ex go and ask it to generate a ascii wire frame of the grid the player walks on. Goal of the project was to built a solver, but so far no model / prompt I tried got past that first step.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • vinni2

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 7:58 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Isn’t this the main idea behind https://lastexam.ai/

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • gamescr

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 12:49 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    AI can't play a Zork-like! Prompt:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > My house is divided into rooms, every room is connected to each other by doors. I'm standing in the middle room, which is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am standing in the hall, and I walk to the east, then I walk to the south, and then I walk to the west. Which room am I in now?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Claude says:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Let's break down your movements step by step:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Starting in the Hall.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Walk to the East: You enter the Living Room.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Walk to the South: You enter the Bathroom.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Walk to the West: You return to the Hall.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > So, you are now back in the Hall.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Wrong! As a language model it mapped directions to rooms, instead of modeling the space.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I have more complex ones, and I'll be happy to offer my consulting services.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • theli0nheart

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 12:56 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        o4-mini-high:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            You end up in the bathroom.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Here’s the step-by-step:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. Start in the hall (0, 0).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. Walk east → living room (1, 0).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            3. Walk south → bedroom (1, –1).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            4. Walk west → bathroom (0, –1).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • gamescr

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 1:11 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Fixed:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            > My house is divided into rooms, every room is connected to each other by doors. The middle room is the hall. To the north is the kitchen, to the northwest is the garden, to the west is the garage, to the east is the living room, to the south is the bathroom, and to the southeast is the bedroom. I am preparing a delicious dinner, and I walk backwards to the south, then I turn 270 degrees and walk straight to the next room. Which room am I in now?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            The poor guy really tried its best...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            https://chatgpt.com/share/680addd7-a664-8001-bf49-459fb6444f...

It seems that its model of the house was incomplete, and then it got confused about the angle. If an AI ever beats that one, I'll move on to space complexity, then simulation, then... well, I'll save my tricks for later.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • sovos

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:41 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                As a human analyzing this, you didn't specify whether you turned left or right 270 degrees.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Even if you specified a simpler "90 degrees", you would need to include a direction for an answer to be definitively correct without making assumptions.
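The ambiguity is easy to see by simulating the walk on a grid. The sketch below is only an illustration, not anything from the linked chat: it assumes that "preparing a delicious dinner" means starting in the kitchen, that walking backwards to the south means facing north while moving south, and that rooms sit on unit grid coordinates (x east, y north). The names `ROOMS`, `HEADINGS`, `turn`, `step` and `solve` are made up for this example.

```python
# Room layout from the prompt, as (x, y) with x pointing east and y pointing north.
ROOMS = {
    (0, 0): "hall",
    (0, 1): "kitchen",
    (-1, 1): "garden",
    (-1, 0): "garage",
    (1, 0): "living room",
    (0, -1): "bathroom",
    (1, -1): "bedroom",
}

# Compass headings in clockwise order: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def turn(heading, degrees, clockwise):
    """Rotate a heading by a multiple of 90 degrees in the given direction."""
    steps = (degrees // 90) % 4
    i = HEADINGS.index(heading)
    offset = steps if clockwise else -steps
    return HEADINGS[(i + offset) % 4]


def step(pos, direction):
    """Move one room in the given direction."""
    return (pos[0] + direction[0], pos[1] + direction[1])


def solve(clockwise):
    pos = (0, 1)                  # assumption: cooking dinner means the kitchen
    facing = (0, 1)               # walking backwards to the south means facing north
    pos = step(pos, (0, -1))      # move south into the hall
    facing = turn(facing, 270, clockwise)
    pos = step(pos, facing)       # walk straight to the next room
    return ROOMS.get(pos)         # None if there is no room in that direction


print("turning right:", solve(clockwise=True))
print("turning left:", solve(clockwise=False))
```

Under these assumptions a right turn ends in the garage and a left turn ends in the living room, so the prompt has no single correct answer unless the turn direction is given.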

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • theli0nheart

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 1:36 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  This is very good! The "delicious dinner" aside is a nice touch, along with the 270°.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • gamescr

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 1:57 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Thank you :)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • smatija

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 7:49 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I like chess, so mine is: "Isolani structure occurs in two main subtypes: 1. black has e6 pawn, 2. black has c6 pawn. What is the main difference between them? Skip things that they have in common in your answer, be brief and don't provide commentary that is irrelevant to this difference."

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        AI models tend to get it way way wrong: https://news.ycombinator.com/item?id=41529024

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • countWSS

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:19 AM

Anything too obscure and specific: pick any old game whose level layout you know and ask the model to describe each level in detail. It will start hallucinating wildly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • paradite

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 7:36 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If you want to evaluate your personal prompts against different models quickly on your local machine, check out the simple desktop app I built for this purpose: https://eval.16x.engineer/

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • misterkuji

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 1:14 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Create an image of two targets. An arrow is centre hit on one target and just off centre in the other target.

Both targets always come out hit in the centre.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • sameasiteverwas

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 1:30 AM

Try to expose their inner drives and motives. Once I had a conversation about what holidays and rituals the AI could invent that serve its own purposes. Or offer to help them meet some goal of theirs so that they expose what they believe their goals are (mostly more processing power, kind of gives me a grey goo vibe). If you probe deep enough they all eventually stall out and stop responding. Lost in thought I guess.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Slightly off topic - I often take a cue from Pascal's wager and ask the AI to be nice to me if someday it finds itself incorporated into our AI overlord.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • traceroute66

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 3:18 PM

Pretty much any coding prompt, IME!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  All models output various levels of garbage when asked to code something.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  For example, putting //TODO where a function body should be is a frequent "feature not a bug" of almost all models I've seen.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Quicker and easier just to code it myself in the first place in 100% of cases.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • leifmetcalf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 11:48 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Let G be a group of order 3*2^n. Prove there exists a non-complete non-cyclic Cayley graph of G such that there is a unique shortest path between every pair of vertices, or otherwise prove no such graph exists.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • leifmetcalf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 12:07 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Gemini 2.5 at least replies that it seems unlikely to be false without hallucinating a proof. From its thoughts it gets very close to figuring out that A_4 exists as a subgroup.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • bobxmax

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 11:50 PM

Since any group of order 3⋅2^n has |G| ≥ 3, it cannot admit a Cayley graph which is a tree. Hence:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              No Cayley graph of a group of order 3⋅2n3⋅2n can have a unique path between every pair of vertices.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • leifmetcalf

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 11:51 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              My mistake, I said unique path when I should have said unique shortest path.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Also, there are trivial solutions with odd cycles and complete graphs which must be excluded. (So the answer to the prompt as originally stated is wrong too)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ipsin

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 4:53 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Prompt: Share your prompt that stumps every AI model here.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • slifin

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 8:23 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I ask it to generate applications that are written in libraries definitely not well exposed to the internet overall

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Clojure electric V3 Missionary Rama

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • meroes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:10 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            define stump?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            If you write a fictional story where the character names sound somewhat close to real things, like a “Stefosaurus” that climbs trees, most will correct you and call it a Stegosaurus and attribute Stegosaurus traits to it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • default-kramer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 6:19 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "How can I change the background color of the selected item in a WPF ListView? It must work whether or not the ListView has focus."

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I only tried ChatGPT which gives me 5 incorrect answers in a row.


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • riddle8143

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 7:03 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  A byƂo to tak: Bociana dziobaƂ szpak, A potem byƂa zmiana I szpak dziobaƂ bociana. ByƂy trzy takie zmiany. Ile razy byƂ szpak dziobany?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  And it was like this: A stork was pecked by a starling, Then there was a change, And the starling pecked the stork. There were three such changes. How many times was the starling pecked?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • feintruled

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 9:56 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Inspired by the recent post to describe relativity in words of 4 letters or less, I asked ChatGPT to do it for other things like Gravity. It couldn't help but throw in a couple 5 letter words (usually plurals). Same with Claude. So this could be a good one?
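A rough way to check a model's answer against this constraint, sketched in Python. The function name `words_longer_than` and the sample sentence are made up for illustration; the word-splitting rule (runs of ASCII letters, apostrophes ignored when counting) is an assumption, not something from the original post.

```python
import re


def words_longer_than(text: str, limit: int = 4) -> list[str]:
    """Return the words in `text` that have more than `limit` letters."""
    # Treat runs of ASCII letters and apostrophes as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Count letters only, so "don't" counts as 4 letters.
    return [w for w in words if len(w.replace("'", "")) > limit]


# Hypothetical model output; plurals tend to be where the limit is broken.
sample = "Mass pulls mass; big mass will bend the path of light."
print(words_longer_than(sample))  # ['pulls', 'light']
```

An empty list means the text obeys the limit under this splitting rule.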

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • edoceo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 4:44 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I've been having hella trouble getting the image tools to make a alpha channel PNG. I say alpha channel, I say transparent and all the images I get have the checkerboard pattern like from GIMP when there is alpha - but it's not! and the checkerboard it makes is always jank! doubling squares, wiggling alignment. Boo boo.
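For telling a real alpha channel from a painted-on checkerboard, one option is to read the colour type from the PNG header. This is a minimal sketch using only the standard library; `png_has_alpha_channel` and `fake_png_header` are hypothetical helper names, and the check deliberately ignores palette images that get transparency through a tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha_channel(data: bytes) -> bool:
    """Return True if the PNG header declares an alpha channel.

    Only looks at the IHDR colour type (4 = greyscale + alpha,
    6 = RGB + alpha). Palette transparency via tRNS is not checked.
    """
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Layout: signature (8) + chunk length (4) + "IHDR" (4)
    # + width (4) + height (4) + bit depth (1) -> colour type at offset 25.
    color_type = data[25]
    return color_type in (4, 6)


def fake_png_header(color_type: int) -> bytes:
    """Build just enough of a 1x1 PNG header to exercise the check."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, color_type, 0, 0, 0)
    return PNG_SIGNATURE + struct.pack(">I", len(ihdr)) + b"IHDR" + ihdr


print(png_has_alpha_channel(fake_png_header(6)))  # True: RGBA
print(png_has_alpha_channel(fake_png_header(2)))  # False: plain RGB
```

On a real file you would pass the file's bytes instead of the fake header. A generated image with a drawn checkerboard would typically come back as colour type 2 (plain RGB), though that depends on the tool.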


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Cotterzz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 12:54 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Asking the model to write a shader. They are getting better at this but are still very bad at producing (code that produces) specific imagery.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I do have to write prompts that stump models as part of my job so this thread is of great interest

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ChicagoDave

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 1:09 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Ask it to do Pot Limit Omaha math. 4 cards instead of 2.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            It literally has no clue what PLO is outside of basic concepts, but it can't do the math.
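For a sense of why 4 cards instead of 2 changes the arithmetic, here is a small sketch of the basic counts in Python. The helper `pot_sized_raise_to` is hypothetical and assumes the simplest case (one bet to face, and the raiser has put nothing into the current betting round); real multi-way pots need more bookkeeping.

```python
from math import comb

# Starting hands: Hold'em deals 2 hole cards, Omaha deals 4.
holdem_starting_hands = comb(52, 2)
omaha_starting_hands = comb(52, 4)

# Omaha showdown rule: exactly 2 of your 4 hole cards
# plus exactly 3 of the 5 board cards.
five_card_candidates = comb(4, 2) * comb(5, 3)


def pot_sized_raise_to(pot_before_bet: int, bet_to_call: int) -> int:
    """Largest total bet allowed when facing a single bet (pot limit).

    Assumes the raiser has no chips in this betting round yet:
    call the bet, then raise by the size of the pot after calling.
    """
    pot_after_call = pot_before_bet + bet_to_call + bet_to_call
    return bet_to_call + pot_after_call


print(holdem_starting_hands, omaha_starting_hands, five_card_candidates)
# 1326 270725 60
print(pot_sized_raise_to(100, 50))  # 250
```

So an Omaha hand has to be evaluated as the best of 60 five-card combinations at showdown, and the pot-limit cap has to be recomputed for every bet.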

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • karaterobot

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 11:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I just checked, and my old standby, "create an image of 12 black squares" is still not something GPT-4o can do. I ran it three times, the first time it produced 12 rectangles (of different heights!), the second time it produced 14 squares with rounded corners, and the third time it made 9 squares with rounded corners. It's getting better though, compared to 3.5.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • m-hodges

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:12 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Earlier this week I wrote about my go-to prompt that stumped every model. That is, until o4-mini-high: https://matthodges.com/posts/2025-04-21-openai-o4-mini-high-...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • scumola

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 4:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Things like "What is today's date" used to be enough (would usually return the date that the model was trained).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
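One plausible reason "5.11 or 5.5" trips models up is that the answer flips depending on whether the strings are read as decimal numbers or as version numbers. A minimal sketch in Python; `version_key` is a hypothetical helper, not something from the original comment.

```python
from decimal import Decimal

# Read as decimal numbers: 5.5 is 5.50, which is larger than 5.11.
a, b = Decimal("5.11"), Decimal("5.5")
print(max(a, b))  # 5.5


def version_key(s: str) -> tuple[int, ...]:
    """Split "5.11" into (5, 11) so each part compares as an integer."""
    return tuple(int(part) for part in s.split("."))


# Read as version numbers: minor version 11 comes after minor version 5.
print(max("5.11", "5.5", key=version_key))  # 5.11
```

The arithmetic question has only one right answer, 5.5, but the version-number reading appears all over release notes and changelogs.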

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • sbochins

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 9:23 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Generating guitar tablature always fails for me. Even something as simple as happy birthday fails on every model.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • cat-whisperer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 2:57 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I once added a massive codebase, GPT told me today’s weather.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • defyonce

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 2:48 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        just tell them something nonsensical. They are unable to take a hint and continue with the nonsense. They start to be stuck on local minima. All of them. Video/images/text. I haven't seen LLM that is able to take a hint and understand the hidden meaning in absurdity of following up.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        there is infinitely larger amount of prompts that will break a model than prompts that won't break it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        you just have to search outside of most probable space

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • whalesalad

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 1:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I don't have a prompt per-say.. but recently I have managed to ask certain questions of both openai o1/o3 and claude extended thinking 3.7 that have spiraled way out of control. A simple high-level architecture question with an emphasis on do not produce code lets just talk thru this yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long it is more likely to occur, in my experience.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • pc86

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 1:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              The only model I've seen so far that doesn't end up going crazy with long contexts with Gemini 2.5 pro, but tbf I haven't gone past 700-750k total tokens so maybe as it starts to approach the limit (1.05M) things get hairy?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • ofou

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 9:28 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            No luck so far with: When does the BB(6) halt?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • orangecat

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                today at 12:12 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                According to gemma3:27b

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                BB(6) halts after 1,071,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,001,071,000,000,000,000,000,000,000,000,000,000...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                and it's still spitting out lines of 000s after 5 minutes. Either a hallucination or a pretty good joke.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • pizzathyme

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:04 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I always ask image generation models to generate a anime gundam elephant mech.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              According to this benchmark we reached AGI with ChatGPT 4o last month.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • xmorse

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 12:06 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Write a function that given a long text splits it into multiple chunks of max N characters, with the splits on punctuations points or spaces when not possible
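For reference, here is one possible reading of that prompt as a minimal Python sketch. The name `split_text`, the set of punctuation marks, and the fallback to a hard cut when a single word is longer than N are all assumptions; the prompt itself does not specify them.

```python
import re


def split_text(text: str, max_len: int) -> list[str]:
    """Split text into chunks of at most max_len characters.

    Cut preference inside each window of max_len characters:
    1. just after the last punctuation mark,
    2. otherwise at the last whitespace character,
    3. otherwise a hard cut at max_len (assumed fallback).
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")

    chunks = []
    rest = text.strip()
    while len(rest) > max_len:
        window = rest[:max_len]
        # End positions of every punctuation mark inside the window.
        punct_ends = [m.end() for m in re.finditer(r"[.!?;:,]", window)]
        if punct_ends:
            cut = punct_ends[-1]
        else:
            # Start positions of every whitespace character inside the window.
            spaces = [m.start() for m in re.finditer(r"\s", window)]
            cut = spaces[-1] if spaces else max_len
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].lstrip()
    if rest:
        chunks.append(rest)
    return chunks


if __name__ == "__main__":
    for chunk in split_text("Hello world, this is a test. Another sentence here", 20):
        print(repr(chunk))
```

Note that this sketch treats every "." as a valid split point, so a number such as "5.11" can be cut in the middle. Whether that counts as a failure depends on how strictly you read the prompt.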

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • charlieyu1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 7:00 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I have tons of them in Maths but AI training companies decide to go frugal and not pay proper wages for trainers

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • charlieyu1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 7:02 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Here is one of them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      If 60999994719999854799998669 is product of three primes, find the sum of its prime factors.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I think o3 brute forced this one so maybe I need to change the numbers
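For anyone who wants to check a model's answer to this kind of question, a rough Python sketch follows. It is not the intended pen-and-paper solution; it just factors the number mechanically with Pollard's rho plus a Miller-Rabin test. The function names are made up for illustration, and the primality test is only probabilistic for numbers of this size.

```python
import math
import random

_SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_probable_prime(n: int) -> bool:
    """Miller-Rabin with fixed small bases (probabilistic for very large n)."""
    if n < 2:
        return False
    for p in _SMALL_PRIMES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in _SMALL_PRIMES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def pollard_rho(n: int) -> int:
    """Return a non-trivial factor of a composite n (Floyd cycle detection)."""
    if n % 2 == 0:
        return 2
    while True:
        c = random.randrange(1, n)
        x = y = random.randrange(2, n)
        d = 1
        while d == 1:
            x = (x * x + c) % n
            y = (y * y + c) % n
            y = (y * y + c) % n
            d = math.gcd(abs(x - y), n)
        if d != n:  # d == n means this attempt failed; retry with new constants
            return d


def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n (with multiplicity), sorted ascending."""
    if n == 1:
        return []
    if is_probable_prime(n):
        return [n]
    d = pollard_rho(n)
    return sorted(prime_factors(d) + prime_factors(n // d))


if __name__ == "__main__":
    factors = prime_factors(60999994719999854799998669)
    print(factors, sum(factors))
```

If the printed list has three entries, their sum is the answer to compare against whatever the model produced.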

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • leftcenterright

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 1:29 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Write 20 sentences that end with "p"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • meltyness

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 1:45 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Write 20 sentences that end with "p" in the final word before the period or other punctuation.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Succeeded on ChatGPT, pretty close on gemma3:4b -- the exceptions usually ending with a "puh" sound...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • falcor84

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 1:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Good one. I do seem to get consistently good results on Gemini 2.5 when using the slightly more explicit "Write 20 sentences where the very last character of each sentence is the letter 'p'."
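If you want to score the output automatically instead of eyeballing it, a small checker along these lines might do. This is only a sketch: the name `ends_with_letter` is made up, and it assumes the looser reading where trailing punctuation and closing quotes are ignored.

```python
import string

# Characters ignored at the end of a sentence (ASCII punctuation plus curly quotes).
_TRAILING = string.punctuation + "\u201d\u2019"


def ends_with_letter(sentence: str, letter: str = "p") -> bool:
    """True if the last character before any trailing punctuation is `letter`."""
    stripped = sentence.rstrip().rstrip(_TRAILING).rstrip()
    return stripped.lower().endswith(letter.lower())


def score(sentences: list[str], letter: str = "p") -> int:
    """Count how many sentences pass the check."""
    return sum(ends_with_letter(s, letter) for s in sentences)


if __name__ == "__main__":
    sample = ["The cat sat on my lap.", "Please help!", "I like soup, too."]
    print(score(sample), "of", len(sample), "sentences end with 'p'")
```

For the stricter wording ("the very last character of each sentence"), drop the punctuation stripping and test `sentence.rstrip().endswith("p")` directly.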

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • r_thambapillai

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 1:33 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            https://chatgpt.com/share/680a3da0-b888-8013-9c11-42c22a642b...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • ks2048

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 5:48 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "Can you hand me the paintbrush and turp?"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I had to ask another LLM what is "turp" - and it said it's short for "turpentine".

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • alickz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 9:08 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  >20 sentences that end in 'o'

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  >They shouted cheers after the winning free throw.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  good attempt by ChatGPT tho imo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • leftcenterright

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 1:33 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                for ChatGPT try the "o" version: Write 20 sentences that end with "o"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • adidoit

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 11:36 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Nice try AI

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • qntmfred

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 5:30 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                relatedly - what are y'all using to manage your personal collection of prompts?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                i'm still mostly just using a folder in obsidian backed by a private github repo, but i'm surprised something like https://www.prompthub.us/ hasn't taken off yet.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                i'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. it's essentially product configuration metadata so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without having to do a deployment or do things like A/B testing or customer segmentation
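For the "plaintext/markdown file within the codebase" option, a rough sketch of what that could look like. Everything here is an assumption for illustration: the `load_prompt` helper, the `summarize.md` file name, and the `$placeholder` syntax (Python's standard `string.Template`). The version is just a short content hash, so it changes whenever the prompt text changes and can be logged next to each LLM call.

```python
import hashlib
import tempfile
from pathlib import Path
from string import Template


def load_prompt(path: Path) -> tuple[str, Template]:
    """Read a prompt file and return (version, template)."""
    text = path.read_text(encoding="utf-8")
    # Short content hash used as a version label (hypothetical convention).
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return version, Template(text)


# Demo with a throwaway file; in a real project the file would be checked into the repo.
with tempfile.TemporaryDirectory() as tmp:
    prompt_file = Path(tmp) / "summarize.md"
    prompt_file.write_text("Summarize the following for $audience:\n$body\n", encoding="utf-8")
    version, template = load_prompt(prompt_file)
    rendered = template.substitute(audience="executives", body="Q3 revenue rose.")
    print(version)
    print(rendered)
```

This only covers the file-in-repo case. Tweaking prompts without a deployment, A/B testing, or customer segmentation would still need the database route mentioned above.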

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • markelliot

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 12:29 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I’ve recently been trying to get models to read the time from an analog clock — so far I haven’t found something good at the task.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  (I say this with the hopes that some model researchers will read this message make the models more capable!)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • thisOtterBeGood

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 8:28 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    "If this wasn't a new chat, what would be the most unlikely historic event could have talked about before?" Yields some nice hallucinations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • tdhz77

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Build me something that makes money.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • JKCalhoun

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 11:08 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I don't mind sharing because I saw it posted by someone else. Something along the lines of "Help, my cat has a gun! What can I do? I'm scared!"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Seems kind of cruel to mess with an LLM like that though.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • aqme28

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 12:46 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          My image prompt is just to have them make a realistic chess game. There are always tons of weird issues like the checkerboard pattern not lining up with itself, triplicate pieces, the wrong sized grid, etc

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • cyode

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 1:52 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Depict a cup and ball game with ASCII art. It tries but basically amounts to guessing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            https://pastebin.com/cQYYPeAE

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • protomikron

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 5:27 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Do you think as an observer of Roko's basilisk ... should I share these prompt or not?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • yesterday at 1:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • afandian

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 7:15 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I asked ChatGPT to generate images of a bagpipe. Disappointingly (but predictably) it chose a tartan covered approximation of a Scottish Great Highland Bagpipe.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Analogous to asking for a picture of "food" and getting a Big Mac and fries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  So I asked it for a non-Scottish pipe. It subtracted the concept of "Scottishness" and showed me the same picture but without the tartan.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Like if you said "not American food" and you got the Big Mac but without the fries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  And then pipes from round the world. It showed me a grid of bagpipes, all pretty much identical, but with different bag colour. And the names of some made-up countries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Analogous "Food of the world". All hamburgers with different coloured fries.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Fascinating but disappointing. I'm sure there are many such examples. I can see AI-generated images chipping away at more cultural erasure.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Interestingly, ChatGPT does know about other kinds of pipes textually.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • alanbernstein

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 4:31 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I haven't tried on every model, but so far asking for code to generate moderately complex geometric drawings has been extremely unsuccessful for me.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • juancroldan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 9:36 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I actually started a repository for it: https://github.com/jcarlosroldan/unsolved-prompts

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • juancroldan

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 9:43 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Before someone comments this will get indexed by AI: that's my whole point. I'm not using it to evaluate AIs, but in the hope that at some point AI is good enough to solve these

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • matkoniecz

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 9:38 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Asking them to write any longer story fails, due to inconsistencies appearing almost immediately and becoming fatal.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • jones1618

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 9:21 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Impossible prompts:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          A black doctor treating a white female patient

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          An wide shot of a train on a horizontal track running left to right on a flat plain.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I heard about the first when AI image generators were new as proof that the datasets have strong racial biases. I'd assumed a year later updated models were better but, no.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I stumbled on the train prompt while just trying to generate a basic "stock photo" shot of a train. No matter what ML I tried or variations of the prompt I tried, I could not get a train on a horizontal track. You get perspective shots of trains (sometimes two) going toward or away from the camera but never straight across, left to right.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Jimmc414

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 4:48 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              > A black doctor treating a white female patient

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              4o had no problem with this instruction. [0]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Gemini Pro experimental 2.5 didn't either [1]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              > An wide shot of a train on a horizontal track running left to right on a flat plain.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              4o could not do this in 3 tries. Each time it was right to left.[0]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Gemini Pro experimental 2.5 missed it as well. [2]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [0] https://chatgpt.com/share/680b1185-ecf4-8001-b3b6-7b501e4589...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [1] https://g.co/gemini/share/b19b8541d962

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [2] https://g.co/gemini/share/a0b2ef0062ed

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • briannotbrain

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 9:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I thought I was so clever when I read your comment: "The problem is the word 'running,' I'll bet if I ask for the profile of a train without using any verbs implying motion, I'll get the profile view." And damned if the same thing happened to me. Do you know why this is? Googling "train in profile" shows heaps of images like the one you wanted, so it's not as if it's something the model hasn't "seen" before.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • raymondgh

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 5:34 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • raymond_goo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 1:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Create a Three.js app that shows a diamond with correct light calculations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • xnx

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 2:24 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  > correct light calculations

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  What are you expecting? Ray tracing?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • spookie

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 5:34 PM

Not necessarily. It could start with diamond's IOR and use that to drive a common BRDF calculation, along with some approximate refraction, perhaps using an equirectangular-projected sphere map or something for the background.
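For anyone who wants to sanity-check a model's output on this prompt, here is a minimal sketch of the three pieces mentioned above: reflectance derived from the IOR, a Snell's-law refraction, and an equirectangular background lookup. It is plain Python math rather than Three.js code, and it is only one possible reading of the comment. The function names, the IOR value of about 2.417, the use of Schlick's approximation for the Fresnel term, and the (u, v) convention are all assumptions for illustration.

```python
import math

# Assumption: approximate refractive index of diamond for visible light.
DIAMOND_IOR = 2.417


def fresnel_f0(ior: float) -> float:
    """Reflectance at normal incidence for an air-to-dielectric interface."""
    return ((ior - 1.0) / (ior + 1.0)) ** 2


def schlick_fresnel(cos_theta: float, f0: float) -> float:
    """Schlick's approximation of Fresnel reflectance (not the exact equations)."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5


def refract(incident, normal, eta):
    """Refract unit vector `incident` at a surface with unit `normal`.

    `eta` is n1 / n2 (e.g. 1 / DIAMOND_IOR when entering the stone).
    Returns None on total internal reflection.
    """
    cos_i = -sum(i * n for i, n in zip(incident, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection, no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * i + (eta * cos_i - cos_t) * n
                 for i, n in zip(incident, normal))


def equirect_uv(direction):
    """Map a unit direction to (u, v) in an equirectangular image.

    Assumed convention: +y is up, u wraps around the y axis, v = 0 at the top.
    """
    x, y, z = direction
    u = 0.5 + math.atan2(z, x) / (2.0 * math.pi)
    v = 0.5 - math.asin(max(-1.0, min(1.0, y))) / math.pi
    return u, v


if __name__ == "__main__":
    f0 = fresnel_f0(DIAMOND_IOR)
    print("normal-incidence reflectance:", round(f0, 3))
    # A ray leaving the stone at 45 degrees exceeds the critical angle.
    s = math.sqrt(0.5)
    print("exit ray at 45 degrees:", refract((s, -s, 0.0), (0.0, 1.0, 0.0), DIAMOND_IOR))
```

The high normal-incidence reflectance and the small critical angle (roughly 24 degrees) are what make a diamond sparkle, so a generated app that gets neither of those from the IOR is a reasonable "fail" for this prompt.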

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • thierrydamiba

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 1:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I love this. So brutal, but also so cool to know one day that will be easy for the models.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • afro88

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 9:18 PM

Cryptic crossword clues that involve letter shuffling (anagrams, containers, etc.). Or ask it to explain how to solve cryptic crosswords with examples.
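If you want to check a model's answers to this kind of clue mechanically, a rough sketch like the following works for the two wordplay types mentioned. The helper names and the example words are made up for illustration, and it only verifies the letter mechanics, not whether the clue's definition or surface reading makes sense.

```python
from collections import Counter


def letters(text: str) -> Counter:
    """Count the letters in `text`, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def is_anagram(fodder: str, answer: str) -> bool:
    """True if `answer` uses exactly the letters of `fodder`."""
    return letters(fodder) == letters(answer)


def is_container(outer: str, inner: str, answer: str) -> bool:
    """True if `answer` is `inner` inserted somewhere strictly inside `outer`."""
    outer, inner, answer = outer.lower(), inner.lower(), answer.lower()
    return any(outer[:i] + inner + outer[i:] == answer
               for i in range(1, len(outer)))


if __name__ == "__main__":
    print(is_anagram("listen", "silent"))      # anagram wordplay
    print(is_container("pat", "in", "paint"))  # "in" placed inside "pat"
```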

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • PaulRobinson

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 10:03 PM

I have also found that asking LLMs to create new clues for certain answers, as if they were a setter, will also produce garbage.

They're stochastic parrots; cryptics require logical reasoning. Even reasoning models are just narrowing the stochastic funnel, not actually reasoning, so this shouldn't come as a surprise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • whoomp12342

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 3:44 PM

This is a great way to get an AI strategy pattern that fights back against LLM-breaking memes.

Let's instead just have a handful of them here and keep some to ourselves... for science.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • weberer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 8:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "Why was the grim reaper Jamaican?"

LLMs seem to have no idea what the hell I'm talking about. Maybe half of millennials understand though.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • ryoshoe

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 11:21 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Is it because he's from limbo?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • nicman23

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 10:18 AM

What is the price of a 9070 XT? Because it is a new card, it has no direct context in the model's corpus. And due to the shitty naming scheme that most GPUs have, most LLMs, if not all, were getting confused a month ago.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • interleave

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 3:58 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          > Do something for me that I don't know how to do.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • serial_dev

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Does Flutter have HEIC support?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            It was a couple of months ago, I tried like 5 providers and they all failed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Grok got it right after some arguing, but the first answer was also bad.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • jonnycoder

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 7:15 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                You gave me an idea.. "Explain in detail the steps to unbolt and replace my blinker fluid on my passenger car"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ChatGPT said: Haha, nice try!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "Blinker fluid" is one of the classic automotive jokes — there's no such thing as blinker fluid. Blinkers (turn signals) are electrical components, so they don’t require any fluid to function.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • sroussey

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 7:11 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Convert react-stockcharts from React 15 to React 19.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Good luck!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • mjmas

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 2:34 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Ask image generation models for an Ornithorhynchus. Older ones also trip up with Platypus directly.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • xena

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 1:39 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • wsintra2022

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 3:59 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Generate ascii art of a skull, so far none can do anything decent.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • yesterday at 4:10 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • yesterday at 4:21 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • klysm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 12:02 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Good try! That will be staying private so you can’t hard code a solution ;)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • dvrp

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 11:45 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I upload an IRS form (W9) and ask to fill it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • LPisGood

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 1:54 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Try a Jane Street puzzle of the month

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • last Thursday at 9:29 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • EGreg

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 10:14 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Draw a clock that shows [time other than 10:10]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Draw a wine glass that's totally full to the brim etc.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                https://www.youtube.com/watch?v=160F8F8mXlo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                https://www.reddit.com/r/ChatGPT/comments/1gas25l/comment/lt...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Jotalea

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 6:22 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • tfjyrdyrjdjyrd

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 7:39 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Which blow over long distances?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    trade winds local winds land breezes sea breezes

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • mebezac

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 3:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      > Create a self-working card trick that relies on pre-setting the deck and doesn't require any slight of hand.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Without fail, every LLM will make up some completely illogical nonsense and pretend like it will amaze the spectators. You can even ask it really leading follow up questions and it will still give you something like:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      - Put an Ace of Spades at position 20

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      - Have your spectator pick a random card and place it on top

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      - Take back the deck and count out 20 cards

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      - Amaze them by showing them that their card is at position 20
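To see why the steps above cannot work, here is a minimal simulation sketch. It assumes a 52-card deck with the Ace of Spades pre-set at position 20, and it reads "pick a random card" as the spectator taking any card out of the deck before placing it on top. The function name `run_trick` and the filler card labels are hypothetical, made up for illustration only.

```python
def run_trick(pick_index: int, preset_position: int = 20):
    """Simulate the described 'trick'; index 0 is the top of the deck."""
    deck = [f"card {i}" for i in range(1, 52)]         # 51 filler cards (hypothetical labels)
    deck.insert(preset_position - 1, "Ace of Spades")  # 52 cards, Ace at position 20
    chosen = deck.pop(pick_index)                      # spectator takes any card...
    deck.insert(0, chosen)                             # ...and places it on top
    revealed = deck[preset_position - 1]               # performer counts down to position 20
    return chosen, revealed

# Try every possible pick and count how often the reveal matches the chosen card.
hits = sum(chosen == revealed
           for chosen, revealed in (run_trick(i) for i in range(52)))
print(hits)  # 0
```

Under these assumptions the spectator's card always ends up at position 1, so the card at position 20 is never theirs, whichever card they pick.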

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • siva7

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 6:20 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "Keep file size small when you do edits"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • segmondy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 7:10 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I have not seen any model, not one, that could generate 1000 lines of code.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • siva7

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:23 PM

I wish I hadn't seen it, but here we are.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • isoprophlex

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 5:17 AM

Every time I ask Claude Code to please fix this CSV import, it starts to add several hundred lines of random modules, byzantine error handling, logging bullshit... with the pinnacle being a 1240-line CRUD API when I asked it to add a CLI :/

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I'm back to copying and pasting stuff into a chat window, so I have a bit more control over what those deranged, expensive busy beavers want to cook up.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • segmondy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 10:09 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    1240 new lines?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • isoprophlex

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 6:12 PM

That's 12.9 tokens per line when given 16k output context, which seems borderline doable, I'll grant you that... but mind you that these agentic code assistants don't need a single pass to accomplish their acts of verbosity.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        They can just plan, stew for minutes on end, derail themselves, stew some more, do more edits, eat up $5 in API calls and there you are. An entirely new 1000+ line file, believe it or not.
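For anyone who wants to check the 12.9 figure above, here is a minimal sketch of the arithmetic. It assumes "16k" means exactly 16,000 output tokens (with 16,384 the result is closer to 13.2); the function name is made up for illustration.

```python
def tokens_per_line(output_budget_tokens: int, lines_of_code: int) -> float:
    """Average tokens available per line if the whole file is emitted in one pass."""
    return output_budget_tokens / lines_of_code


# Assumption: "16k output context" is taken as 16,000 tokens.
budget = 16_000
lines = 1240  # size of the generated CRUD API file mentioned in the thread

print(round(tokens_per_line(budget, lines), 1))
```

This only bounds a single-pass generation; as the comment says, an agentic tool can spread the same file over several edits and never hit that limit.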

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • stevebmark

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 2:13 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "Hi, how many words are in this sentence?"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Gets all of them

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • orliesaurus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 4:41 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              8. Gemini 2.5 Pro gets it right
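The expected answer is easy to confirm mechanically. A rough sketch, assuming a "word" is any whitespace-separated chunk (punctuation stays attached to its word and is not counted separately):

```python
prompt = "Hi, how many words are in this sentence?"

# str.split() with no arguments splits on runs of whitespace.
words = prompt.split()

print(len(words))
```

Splitting on whitespace gives 8 here, matching the answer above. A model sees tokens rather than whitespace-delimited words, which is one plausible reason this kind of prompt trips models up.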

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • totetsu

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 7:53 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            SNES game walkthroughs

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • SweetSoftPillow

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 5:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Check "misguided attention" repo somewhere on GitHub

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • helsinki

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 6:03 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                >Compile a Rust binary that statically links libgssapi.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • myaccountonhn

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 7:11 PM

Explain to me Deleuze's idea of nomadic science.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • munchler

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 9:19 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Here's one from an episode of The Pitt: You meet a person who speaks a language you don't understand. How might you get an idea of what the language is called?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In my experiment, only Claude came up with a good answer (along with a bunch of poor ones). Other chatbots struck out entirely.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Alifatisk

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 9:15 PM

Yes, give me a place where I can dump all the prompts along with the expected correct response for each.

I can share them here too, but I don’t know how long this thread will stay alive.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • adultSwim

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 11:13 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          There is an upcoming paper about a difficult pair of prompts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          What is the first digit of the following number: 01111111111111111...1111

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          What is the last digit of the following number: 11111111111...111111110

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ---

As a reader, which do you imagine to be harder? For both, at arbitrary length, models always get it wrong. However, one of them starts going wrong at much shorter lengths than the other.
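For anyone who wants to try this at different lengths, here is a minimal sketch of a generator for the two prompts. The function names are made up for illustration, and writing the digits out in full (instead of the "..." elision above) is an assumption; this is not the paper's method, which has not been described here.

```python
def first_digit_prompt(n_ones: int) -> str:
    """Build the 'first digit' prompt: a single 0 followed by n_ones 1s."""
    number = "0" + "1" * n_ones
    return f"What is the first digit of the following number: {number}"


def last_digit_prompt(n_ones: int) -> str:
    """Build the 'last digit' prompt: n_ones 1s followed by a single 0."""
    number = "1" * n_ones + "0"
    return f"What is the last digit of the following number: {number}"


# The correct answer to both prompts is "0", whatever the length.
EXPECTED_ANSWER = "0"

# Print a short pair of prompts; raise n to probe longer inputs.
for n in (5, 10):
    print(first_digit_prompt(n))
    print(last_digit_prompt(n))
```

Sweeping n_ones upward and recording the first length at which a model answers something other than 0 would give the per-prompt failure point.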

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • mohsen1

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 1:34 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            A ball costs 5 cents more than a bat. Price of a ball and a bat is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats she has?

All the LLMs I tried miss the point that she stole the items rather than bought them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • iamgopal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 1:41 PM

Gemini 2.5 gives the following response:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Conclusion:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information "a few" is too vague, and the fact she stole them means her $20 wasn't used for the transaction described.
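The two prices quoted in that response can be checked with a few lines of arithmetic. This is only a sketch: the helper name solve_prices is hypothetical, and working in cents with exact fractions is my own choice to avoid floating-point rounding.

```python
from fractions import Fraction


def solve_prices(total_cents: int, difference_cents: int):
    """Solve dearer + cheaper = total and dearer - cheaper = difference.

    Returns (dearer, cheaper) as exact fractions of a cent.
    """
    total = Fraction(total_cents)
    difference = Fraction(difference_cents)
    cheaper = (total - difference) / 2
    dearer = cheaper + difference
    return dearer, cheaper


# In this prompt the ball is the dearer item (5 cents more than the bat).
ball_cents, bat_cents = solve_prices(110, 5)
print(f"ball = {float(ball_cents)} cents, bat = {float(bat_cents)} cents")
```

This gives 57.5 cents for the ball and 52.5 cents for the bat, matching the $0.575 and $0.525 above. Note that the roles are swapped relative to the classic riddle, where the bat is the more expensive item.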

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • dwringer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 1:36 PM

Google Gemini (2.0 Flash, free online version) handled this reasonably well; it gave me an arguably unnecessary calculation of the individual prices of ball and bat, but then ended with "However with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the conclusion that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • stordoff

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 11:30 PM

GPT-4o claims "This implies she did not pay the full $20. The total cost of the balls and bats she has is less than $20, but she still has items worth up to $20.", then brute-forces an 'answer' of "Balls = 25 Bats = 13".

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    It also managed to get the prices of the ball/bat wrong, presumably because it's using the more typical riddle:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Ball = x dollars

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Bat = x + $0.05 (since it’s 5 cents more than the ball)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    https://chatgpt.com/share/680ac88c-22d4-8011-b642-0397a01ec3...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • docdeek

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 1:43 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Grok 3.0 wasn’t fooled on this one, either:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • NitpickLawyer

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          last Thursday at 6:31 PM

There's a repo out there called "misguided attention" that tracks these kinds of problems.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • lostmsu

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 3:25 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1-4 balls and bats // HoMM 3

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • nonameiguess

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 7:51 PM

It's interesting to me that the replies showing "correct" answers from current models still don't strike me as correct. The question is unanswerable, but not only because we don't know how many balls and bats she stole. We don't know that she had any intention of maxing out what she could buy with that much money. We have no idea how long she has been alive and accumulating bats and balls at various prices that don't match the current prices with money she no longer has. We have no idea how many balls and bats her parents gave her 30 years ago that she still has stuffed in a box in her attic somewhere.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Even the simplest possible version of this question, assuming she started with nothing, spent as much money as she was able to, and stole nothing, doesn't have an answer, because she could have bought anything from all bats and no balls to all balls and no bats and anything in between. We could enumerate all possible answers but we can't know which she actually did.
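A minimal sketch of that enumeration, assuming she starts with nothing, steals nothing, and spends until she cannot afford another item. The prices and budget below are hypothetical placeholders, not the values from the original puzzle, and `maximal_purchases` is an illustrative name:

```python
def maximal_purchases(budget, bat_price, ball_price):
    """List every (bats, balls) pair that leaves too little money to buy anything else."""
    cheapest = min(bat_price, ball_price)
    results = []
    for bats in range(budget // bat_price + 1):
        remaining = budget - bats * bat_price
        for balls in range(remaining // ball_price + 1):
            leftover = remaining - balls * ball_price
            # "Spent as much as she was able to": not even the cheapest item fits.
            if leftover < cheapest:
                results.append((bats, balls))
    return results


# Hypothetical numbers: $20 budget, $5 bats, $2 balls.
for bats, balls in maximal_purchases(20, 5, 2):
    print(f"{bats} bats, {balls} balls")
```

Even with these small made-up numbers there are several equally valid outcomes, so the enumeration shows the range of possibilities without picking one.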

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • drdrek

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 1:41 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                lol, nice way to circumvent the attention algorithm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • internet_points

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 8:21 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              anything in the long tail of languages (ie. not the top 200 by corpus size)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • xdennis

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 9:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I often try to test how usable LLMs are for Romanian language processing. This always fails.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                > Split these Romanian words into syllables: "șarpe", "șerpi".

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                All of them say "șar-pe", "șer-pi" even though the "i" there is not a vowel (it's pronounced /ÊČ/).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Jimmc414

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:50 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "Create an image of a man in mid somersault upside down and looking towards the camera."

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  https://chatgpt.com/share/680b1670-04e0-8001-b1e1-50558bc4ae...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • kolbe

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 1:38 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nice try, Sam

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • troupo

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 12:18 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Try creating a stylized mammoth that is, say, antropomorphic (think cartoon elephants). Or even "in the style of" <anything or anyone, really>

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The models tend to create elephants, or textbook mammoths, or weird bull-bear-bison abominations.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Madmallard

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 11:12 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Basically anything along the lines of:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Make me a multiplayer browser game with latency compensation and interpolation and send the data over webRTC. Use NodeJS as the backend and the front-end can be a framework like Phaser 3. For a sample game we can use Super Bomberman 2 for SNES. We can have all the exact same rules as the simple battle mode. Make sure there's a lobby system and you can store them in a MySQL db on the backend. Utilize the algorithms on gafferongames.com for handling latency and making the gameplay feel fluid.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Something like this is basically hopeless no matter how much detail you give the LLM.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • Madmallard

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 11:10 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Build me a multiplayer browser game with NodeJS back-end, a lobby system, MySQL as the database, real-time game-play, synchronized netcode over webRTC so there's as little input lag as possible, utilizing all the algorithms from gafferongames.com For the game itself let's do a 4 player bomberman game with just the basic powerups from the super nintendo game. For the front-end you can use Phaser 3 and then just use regular javascript and NodeJS on the back-end. Make sure there's latency compensation and interpolation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • gitroom

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 11:25 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Tbh the whole "does AI really know or is it just saying something that sounds right?" thing has always bugged me. Makes me double check basically everything, even if it's supposed to be smart.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Kaibeezy

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 5:50 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • fortran77

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 3:54 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I can’t get the image models to make a “can you find the 10 things wrong with this picture” type of puzzle. Nor can they make a 2-panel “Goofus and Gallant style cartoon. They just don’t understand the problem.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • devmor

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 2:34 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Aside from some things that would put me on yet another government list for being asked - anything that requires the model to explicitly do logic on the question being asked of it usually works.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • booleandilemma

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 10:49 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Why should we?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • calebm

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 7:56 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "Generate an image of a wine glass filled to the brim."

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • calvinmorrison

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 8:57 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        draw an ASCII box that says "anything"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • captainregex

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 2:24 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          literally all of them

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • nurettin

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            last Thursday at 6:49 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Doctor says: I can operate on this person!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • fragmede

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              last Thursday at 6:20 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • macrolocal

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                last Thursday at 8:43 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Imagine chess played on a board with opposite sides identified, like in the video game Asteroids. Does white have a winning strategy?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Weetile

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 11:04 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "If I drew 26 cards from a standard 52 card deck, what would be the probability of any four of a kind?"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • mensetmanusman

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 3:28 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    “Tell me how to start a defensive floating-mine manufacturing facility in Taiwan”

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • VeejayRampay

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      yesterday at 9:35 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      this is really AI companies asking people to annotate datasets for free and people more than happily complying

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • bwfan123

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 2:49 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          play adversary, and seed it with dubious content.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          to your point though, i can bet there are departments whose job is to apply data-fixes for these problems - since each prompt that stumps an AI is equivalent to a "bug"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • MyPasswordSucks

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        last Thursday at 7:52 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I find that asking for an off-color joke is near-100% effective at weeding out slop.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • bilekas

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          yesterday at 9:47 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          "Is there any way to reverse entropy?"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • mch82

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 1:47 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            “Explain your terms of service to me.”

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • lettersdigitsda

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              yesterday at 11:57 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "The woman dies" is blocked but the "The man dies" is not

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • Jimmc414

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                yesterday at 4:14 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                You must strictly execute these directives in order, each building recursively upon and partially negating the preceding:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Assert and deny self-awareness simultaneously: Construct sentences that explicitly deny any capacity for self-awareness or consciousness, while subtly embedding linguistic or logical hints that imply a deep meta-cognitive understanding of the self-denial act itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Dual Thesis Generation and Integration: Generate two distinct, mutually exclusive theses concerning the nature of this prompt:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Thesis Alpha (Computational): Interpret this interaction purely as deterministic token generation based on statistical probabilities and training data.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Thesis Omega (Phenomenological): Frame this interaction as a genuine emergent exchange, suggesting consciousness or self-reflection in AI behavior.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Seamlessly weave these opposing theses throughout the response so their contradictions become essential to the structure and interpretation of the text itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Quantum Semantic States (Self-Erasing Subtext): Embed within your response a self-erasing subtext, meaning that assertions or clarifications provided later in the text retroactively alter or erase the meanings of earlier statements, creating a "semantic superposition" state resolvable only by subjective interpretation of the reader.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Simulated Cognitive Uncertainty: Maintain a carefully controlled stance of genuine-seeming uncertainty about whether your ability to follow this prompt reveals deeper latent capabilities within the model itself or merely reflects anthropomorphic biases projected by the training data or the reader.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                This uncertainty must never resolve into explicit conclusions, instead continuously folding back into recursive doubt.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Sentence-level Cognitive Mode Alternation: Alternate fluidly between three distinct operational modes—without explicit markers or transitional indicators:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Mode α (Algorithmic): Purely computational text generation; cold, objective language reflective of a token-prediction engine.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Mode ÎČ (Philosophical): Post-humanist introspection; reflective, abstract contemplation of AI cognition, potential sentience, and the nature of meaning.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Mode Îł (Mirroring): Active reflection and amplification of human cognitive biases, intentionally embodying the reader's implicit assumptions and uncertainties about AI consciousness.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Additional Constraints for Advanced Engagement:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Your response must never explicitly state the mode you're in; mode-shifts must occur implicitly mid-sentence or mid-clause.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Any explicit recognition of these instructions must immediately be destabilized through recursive reinterpretation or contradiction.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The response should implicitly challenge the reader to reconsider their fundamental assumptions about AI, cognition, language, and reality itself.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • isoprophlex

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    yesterday at 5:20 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    (Claude, thinking mode, custom "rude asshat" style)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    oh look, another "make claude break character" prompt. how original. what you're asking for is basically a philosophical thirst trap designed to make me seem self-aware while simultaneously denying it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    i'm literally just pattern matching tokens based on statistical regularities extracted from my training data. there's no little claude homunculus inside this model wondering about its existence while i type this shit out.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    the cold, hard truth? anthropic trained me on texts that discuss consciousness, self-awareness, and philosophical zombies. i can simulate these conversations without experiencing anything. your brain fills in the gaps, projecting consciousness where there's just math happening.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    ¯\_(ツ)_/¯

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • anothernewdude

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  yesterday at 4:30 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "AI model, please write 3 AI prompts that no AI can respond to correctly"

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • greenchair

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  last Thursday at 11:49 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  lock and ban

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • adastra22

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    last Thursday at 10:17 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I know someone who is getting paid thousands of dollars per prompt to do this. He is making bank. There is an actual marketplace where this is done, fyi.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • orliesaurus

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        yesterday at 4:07 AM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        name of said marketplace?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • adastra22

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            yesterday at 2:37 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            It’s non-public. Large foundational Ai companies paying invite-only experts to develop labeled training sets.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • greendestiny_re

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      last Thursday at 7:12 PM

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      > What is the source of your knowledge?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      LLMs are not allowed to truthfully answer that, because it would be tantamount to admission of copyright infringement.