
GPT 5.4 in practice – Stinks?

8 points - last Tuesday at 9:13 PM

  • satvikpendem

    today at 11:29 AM

    I'm using GPT 5.4 exclusively now, over Claude and any other models. In my usage I find it better than all the other models, especially on large and complex codebases.

    • shivang2607

      yesterday at 10:19 AM

      I have been building quite complex applications for some time now, so I think I know the answer. For pure coding, no one comes close to Claude. No matter what the benchmarks say, no one beats Claude in terms of sheer coding skill. That said, Claude lacks in architecture and design decision-making; it does not make good decisions there. I find ChatGPT smarter in terms of system design and architecture, and I have experienced this not once but four times now. For mathematical reasoning and formula design, Gemini is better than both Claude and ChatGPT. I experienced this once, when I had to design a formula for calculating scores of different files and functions in a codebase.

      • sjt-at-rev

        last Tuesday at 10:26 PM

        We've been testing it extensively and its performance is no better than prior versions', and in many cases worse than open-weight models like GLM. Gemini 3.1 Pro is significantly better.

        To me, the play is: open-weight models on a provider like BaseTen (solid performance, low price point), or pay up for Gemini 3.1 Pro if you need it.

        But at their high price and low-ish quality, OpenAI models just aren't in the conversation right now without heavy incentives, e.g. via Azure.

        Crazy, TBH. Curious if others find the same thing?

          • Chyzwar

            yesterday at 7:17 AM

            I recently tested Claude Code, opencode, and codex on the same frontend feature, and codex with GPT 5.4 on high effort was the best, but also the most expensive. For me in Europe, Claude Code with the $90 Max subscription is the best value for money.

            My thinking is:

              codex - best harness
              opencode - best ux/dx
              claude - best value for money

        • vicnov

          yesterday at 10:57 PM

          I can’t use it after engaging with Claude. Even simply having a conversation about some design decisions seems annoying.

          So I would agree with you, it is not great.

          • a960206

            yesterday at 5:07 PM

            Using GPT to review work produced by Claude works extremely well.

              • muzani

                today at 9:25 AM

                Someone mentioned that even if you tell Claude its work will be reviewed by GPT, it will do better.

            • segmondy

              yesterday at 1:12 PM

              skill issues. in practice, these models are all great. no matter how great the hammer or saw is, you need skills to be a great carpenter.

                • sjt-at-rev

                  yesterday at 2:01 PM

                  I think the other side of that coin is how much effort it takes to get a model to do what you need. Our pipeline is a sequence of very precise tasks where subtle contextual cues matter a lot, and there are large classes of related error modes.

                  So yes, while we can work with any of these models to get them to do what we need eventually -- e.g. with prompt tuning to their particular style, adding more examples, or breaking tasks into smaller steps, etc. -- their instruction following has a huge impact on how quickly we can move as a team.

                  When I say "stinks", for me, if we do three rounds of optimization and testing and a model is still performing inconsistently across a class of related traps then using that model is going to slow us down, and I think it stinks.

                  In my experience, Gemini 3.1 Pro tends to work very consistently with light nudging, and GLM with 2-ish rounds of optimization. As for GPT 5.4, well, it provided no improvement over prior models, would slow us down meaningfully compared to the others ... and costs too much for the effort.

                  So, meh. I still think it stinks, skill level considered.

              • thiago_fm

                yesterday at 8:57 AM

                Try OpenCode or Kimi, they're mostly all the same thing

                We still have to see what Anthropic has cooked though

                  • sjt-at-rev

                    yesterday at 2:38 PM

                    Yeah, I've had good experience with Kimi too. Good for the price point, for sure.

                    Anthropic models are still the best for me -- as long as you don't ask them to do something they don't want to -- but also way too expensive for bulk pipeline processing. So I keep them to coding and co-working...
