
GenAI Image Editing Showdown

109 points - today at 2:57 AM

  • greatgib

    today at 12:18 PM

    GPT-4o shows the huge annoyance of the company/model acting as a moral judge of your requests, often refusing anything negative.

    It's like 1984, but corporate-enforced. Now there are tasks you are not allowed to do, despite their being legal.

    In the same way, GPT-5 is now nearly unbearable to me, as it almost always starts every response in a conversation with things like: "Great question", "good observation worthy of an expert", "you're totally right", "you're right to ask the question"...

      • ACCount37

        today at 12:54 PM

        People gave Altman shit for enabling NSFW in ChatGPT, but I see that as a step in the right direction. The right direction being: the one that leads to less corporate censorship.

        >In the same way, using gpt5 is now very unbearable to me as it almost always starts all responses of a conversation by things like: "Great question"

        User preference data is toxic. Doing RLHF on it gives LLMs sycophancy brainrot. And by now, all major LLMs have it.

        At least it's not 4o levels of bad - hope they learned that fucking lesson.

        • holoduke

          today at 12:45 PM

          Try some of the Chinese models. Much less restrictive. With some obvious exceptions.

      • snailmailman

        today at 7:30 AM

        There isn’t a date in the article, but I know I read this months ago. And sure enough, the Wayback Machine has the text-to-image page from April.

        But the image editing page linked at the top is more recent, and was added sometime in September. (And it was presumably the intended link.) I hadn’t read that page yet. Odd that there are no dates; at first glance one might think the pages were made at the same time.

          • foofoo12

            today at 10:14 AM

            > There isn’t a date in the article

            SEO guys convinced everyone that articles without dates do better on search engines. I hope both sides of their pillow are hot.

            • jonplackett

              today at 10:10 AM

              Yeah this is very old. Although anything older than a week is reasonably old in AI.

          • thorum

            today at 6:44 AM

            Actual link seems to be: https://genai-showdown.specr.net/image-editing

              • typpilol

                today at 6:46 AM

                This is the editing link yes. I just got done looking at it from the other link.

                The other stuff is text to image (not editing)

            • snowfield

              today at 6:43 AM

              I'd assume that behind the scenes the models generate several passes and only show the user the best one; that would be smart, as it makes their model seem better than others'.

              It's also pretty obvious that the models have some built-in system prompt rules that give the final output a certain style. They seem very consistent.

              It also looks like 4o has the temperature turned way down to ensure maximum adherence, while Midjourney etc. seem to have higher temperature: more interesting end results, flourishes, complex materials and backgrounds.

              Also, what's with 4o's sepia tones? Post-editing in the generation workflows?

              I don't believe any of these just generate the image, though; there are likely several steps in each workflow to present the final images shown to the user in the absolute best light.

                • simonw

                  today at 11:40 AM

                  You can run some image models locally if you want to prove to yourself how well they can do with just a single generation from a prompt with no extra steps.

                  I've done this enough to suspect that most hosted image models don't increase their running costs to try and get better results through additional passes without letting the user know what they are doing.

                  Many of the LLM-driven models do implement a form of prompt rewriting though (since effectively prompting image models is really hard) - some notes on how DALL-E 3 did that here: https://simonwillison.net/2023/Oct/26/add-a-walrus/
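
                  A minimal sketch of the preprocessing pass described above: an LLM rewrites the user's terse prompt before it reaches the image model. The `rewrite_prompt` stub and the added style phrases are hypothetical stand-ins, not any vendor's actual rewriting logic:

```python
# Hypothetical sketch of prompt rewriting as a preprocessing step.
# `rewrite_prompt` stands in for an LLM call (as DALL-E 3 reportedly uses);
# the appended detail phrases here are purely illustrative.
def rewrite_prompt(user_prompt: str) -> str:
    # Stand-in for an LLM that adds style and composition detail.
    return (
        f"{user_prompt}, highly detailed, natural lighting, "
        "coherent composition, no text artifacts"
    )

def generate(user_prompt: str, image_model) -> str:
    expanded = rewrite_prompt(user_prompt)  # preprocessing pass
    return image_model(expanded)

# Toy image model that just echoes the prompt it received.
out = generate("a walrus reading a newspaper", lambda p: f"<image for: {p}>")
```

                  The point of the design is that the user's short prompt is never sent to the image model directly; only the expanded version is.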

                  • phi-go

                    today at 7:01 AM

                    There are numbers on how many tries it took. I would also find the individual prompts and images interesting.

                • sans_souse

                  today at 6:22 AM

                  I had to upvote immediately once I got to Alexander the Great on a Hippity Hop

                    • halflife

                      today at 6:23 AM

                      The horse chimera is much better

                  • isoprophlex

                    today at 6:20 AM

                    The "editing" showdown is very good. Introduced me to the Seedream model, which I didn't know about until now.

                    I don't fully understand the iterative methodology though - they allow multiple attempts, which are judged by another multimodal LLM? Won't that judge have limited accuracy itself?

                      • ACCount37

                        today at 1:02 PM

                        "LLMs judged by LLMs" is the industry standard. Can't put a human judge in a box and have him evaluate and rate 7600 responses on demand.

                        Now, are LLM judges flawed? Obviously. But they are more shelf-stable than humans, so it's easier to compare different results. And as long as you use an LLM judge as a performance thermometer and not a direct optimization target, you aren't going to face too many issues from that.

                        If you are using an LLM judge as a direct optimization target though? You'll see some funny things happen. Like GPT-5 prose. Which isn't even the weirdest it gets.
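
                        A minimal sketch of the best-of-N-with-a-judge setup discussed in this thread. The `fake_generate` and `fake_judge` functions are toy stand-ins for a real image model and a real multimodal judge; the benchmark's actual pipeline is not specified here:

```python
# Hypothetical best-of-N loop: regenerate until an LLM judge accepts,
# up to a retry budget. Real model calls are stubbed out below.
from typing import Callable, Optional

def best_of_n(
    prompt: str,
    generate_image: Callable[[str], str],      # returns an image (placeholder string here)
    judge_passes: Callable[[str, str], bool],  # judge: does the image satisfy the prompt?
    max_attempts: int = 4,
) -> tuple[Optional[str], int]:
    """Return the first image the judge accepts, plus the attempt count."""
    for attempt in range(1, max_attempts + 1):
        image = generate_image(prompt)
        if judge_passes(prompt, image):
            return image, attempt
    return None, max_attempts

# Toy stand-ins so the sketch runs: the "model" succeeds on its third try.
calls = {"n": 0}
def fake_generate(prompt: str) -> str:
    calls["n"] += 1
    return f"image_{calls['n']}"

def fake_judge(prompt: str, image: str) -> bool:
    return image == "image_3"

result, attempts = best_of_n("an analog clock showing 3:15", fake_generate, fake_judge)
```

                        Note that the judge only gates acceptance; nothing in the loop optimizes against it, which is the "thermometer, not target" distinction.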

                    • neilv

                      today at 6:45 AM

                      > "A dolphin is using its fluke to discipline a mermaid by paddling it across the backside."

                      If this one were shown in a US work environment, I might say a collegial something privately to the person, about it not seeming the most work-appropriate.

                        • PieTime

                          today at 7:43 AM

                          I think I’d probably say that the prompts are telling me more about the author than I think is necessary for these tests… I hope they were at least sampled from responses.

                      • konart

                        today at 6:33 AM

                        >Cephalopodic Puppet Show

                        I'm pretty sure that only Gemini made it. Other models did not meet the 'each tentacle covered' criterion.


                        • jedbrooke

                          today at 6:37 AM

                          For the OpenAI 4o model on the octopus sock puppet prompt, the prompt clearly states that each tentacle should have a sock puppet, whereas the OpenAI 4o image only has 6 puppets, with 2 tentacles being puppetless. I’m not sure we can call that a pass.

                          • jumploops

                            today at 6:58 AM

                            Slight nit: it lists “OpenAI 4o”, but the model used by ChatGPT is a distinct model labeled “gpt-image-1”, IIRC.

                            A prompt I’d love to see: a person riding in a kangaroo pouch.

                            Most of the pure diffusion models haven’t been able to do it in my experience.

                            Edit: another commenter pointed out the analog clock test; let’s add “analog clock showing 3:15” as well (:

                              • ZiiS

                                today at 7:37 AM

                                The link is to the imagegen test, not the editing one. Here, 4o was used to preprocess the prompt.

                            • echelon

                              today at 6:48 AM

                              Please fix the title, or change the link.

                              The title of this article is "image editing showdown", but the subject is actually prompt adherence in image generation from prompting.

                              Midjourney and Flux Dev aren't image editing models. (Midjourney is an aesthetically pleasing image generation model with low prompt adherence.)

                              Image editing is a task distinct from image generation. Image editing models include Nano Banana (Gemini Flash), Flux Kontext, and a handful of others. gpt-image-1 sort of counts, though it changes the global image pixels such that it isn't 1:1 with the input.

                              I expect that as image editing models get better and more "instructive", classical tools like Photoshop and modern hacks like ComfyUI will both fall away to a thin facade over the models themselves. Adobe needs to figure out their future, because Photoshop's days are numbered.

                              Edit: Dang, can you please fix this? Someone else posted the actual link, and it's far more interesting than the linked article:

                              https://genai-showdown.specr.net/image-editing

                              This article is great.

                              • croes

                                today at 6:27 AM

                                What about the classic: an analog watch that shows the time 08:15?

                                Did current models overcome the 10:10 bias?

                                  • echelon

                                    today at 7:00 AM

                                    This would be easy to patch the models to fix. Just gather a small amount of training data for these cases, e.g. "change the clock hands to 5:30" with the corresponding edit.

                                    A 3-tuple: (original image, text edit instruction, final image).

                                    Easy to patch for editing models, anyway. Maybe not text to image models.
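
                                    A minimal sketch of what such a triple might look like serialized as JSONL for a fine-tuning run. Field names and file paths are illustrative assumptions, not any vendor's actual training format:

```python
# Hypothetical (original image, edit instruction, edited image) triple,
# serialized one-per-line as JSONL. All names/paths are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class EditTriple:
    source_image: str   # path or URL of the original image
    instruction: str    # text edit instruction
    target_image: str   # path or URL of the edited result

examples = [
    EditTriple("clocks/001_in.png", "change the clock hands to 5:30", "clocks/001_out.png"),
    EditTriple("clocks/002_in.png", "change the clock hands to 3:15", "clocks/002_out.png"),
]

jsonl = "\n".join(json.dumps(asdict(t)) for t in examples)
```

                                    An editing model can be fine-tuned directly on such pairs of before/after images conditioned on the instruction, which is why the patch is more tractable for editing models than for pure text-to-image ones.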