
NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

94 points - today at 5:56 PM

  • linolevan

    today at 10:23 PM

    There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm [0]. It's pretty much exactly the same setup, but with ~200M training tokens instead.

    [0] https://www.alphaxiv.org/abs/2509.14786

      • sdpmas

        today at 10:33 PM

        yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.
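
For readers unfamiliar with ensembling in a language-modeling setting, here is a minimal sketch. It assumes the simplest variant, averaging the next-token distributions of several independently trained models, which may not be what the Slowrun repo or the paper actually does. The three distributions and the 4-token vocabulary are made-up illustrative values, and `ensemble_probs` is a hypothetical helper name.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])

def ensemble_probs(per_model_probs):
    """Average the next-token distributions of several models (assumed ensembling rule)."""
    n_models = len(per_model_probs)
    vocab_size = len(per_model_probs[0])
    return [sum(p[i] for p in per_model_probs) / n_models for i in range(vocab_size)]

# Hypothetical next-token distributions from three models over a 4-token vocabulary.
model_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.55, 0.05, 0.30, 0.10],
]
target = 0  # index of the correct next token in this toy example

individual_losses = [cross_entropy(p, target) for p in model_probs]
mean_individual_loss = sum(individual_losses) / len(individual_losses)
ensemble_loss = cross_entropy(ensemble_probs(model_probs), target)

print(f"mean individual loss: {mean_individual_loss:.4f}")
print(f"ensemble loss:        {ensemble_loss:.4f}")
```

Because the negative log is convex, the loss of the averaged distribution is never higher than the average of the individual losses. That is one reason ensembling is a natural way to spend extra compute when data is fixed.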

    • kseniamorph

      today at 9:08 PM

      Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?

        • timshel1

          today at 10:09 PM

          Modded-nanogpt is also much more data efficient than vanilla NanoGPT, even if some of the individual optimizations trade worse data efficiency for higher throughput.

            • sdpmas

              today at 10:32 PM

              yes, agreed, modded-nanogpt is already a data-efficient variant of original nanogpt. just that the kinds of algorithms it allows are somewhat constrained because it optimizes for wall clock time.

      • archermarks

        today at 7:23 PM

        Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model itself by its performance on that validation set, you're still at risk of overfitting to it.

          • sdpmas

            today at 7:31 PM

            yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts only tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb or even entirely OOD datasets!
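
The concern raised above, that repeatedly selecting methods by their score on one fixed validation set can overstate progress, can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of the Slowrun benchmark: every "method" here is a coin-flipping classifier with no real skill, and the set sizes and number of methods are arbitrary.

```python
import random

random.seed(0)

N_VAL = 100        # size of the fixed validation set (hypothetical)
N_FRESH = 10_000   # size of a fresh held-out set (hypothetical)
N_METHODS = 200    # number of candidate methods compared on the validation set

def accuracy_of_random_guesser(n_examples):
    """Accuracy of a method with no real skill: each binary prediction is a coin flip."""
    correct = sum(random.random() < 0.5 for _ in range(n_examples))
    return correct / n_examples

# Score every candidate on the same small validation set and keep the best score.
val_scores = [accuracy_of_random_guesser(N_VAL) for _ in range(N_METHODS)]
best_val = max(val_scores)

# The selected method is still a coin flipper, so its score on fresh data
# is just another draw from the same no-skill process.
fresh_score = accuracy_of_random_guesser(N_FRESH)

print(f"best validation accuracy: {best_val:.3f}")
print(f"fresh-set accuracy:       {fresh_score:.3f}")
```

The best validation score looks better than chance purely because of selection, while the fresh-set score stays near 0.5. Rotating the validation subset, as described in the reply, is one way to limit this effect.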

        • lzaborowski

          today at 7:52 PM

          I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

          If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

          • suddenlybananas

            today at 6:43 PM

            Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.

              • sdpmas

                today at 6:59 PM

                hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences:

                1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky changes that will get lower loss, but when compared to general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods from BabyLM!

                2. our reasoning behind the repo has nothing to do with how much data a child sees, and our dataset is not tailored towards that either. it's simple pretraining on a random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.

                  • soraki_soladead

                    today at 7:04 PM

                    also, BabyLM is more of a conference track / workshop than an open-repo competition, which creates a different vibe

            • navvyeanand

              today at 7:59 PM

              Amazing job!

              • riajain2525

                today at 8:34 PM

                Super cool!