Here's an extract, the core TL;DR, to give a feel of the article.
"And now for the weirdness: there was never a case where any Transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9's output distribution. Layer 60 is trained on layer 59's. If you rearrange them, feeding layer 60's output into layer 10, you've created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn't that it was a huge leap in performance; it was that the damn thing functioned at all. To this day, I still don't understand why this didn't raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that's robust to architectural rearrangement. The fact that Goliath 120B used a 16-layer block size made me suspect the input and output "processing units" were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn't work.
If that was true, maybe I didn't need to teach a model new facts to make it smarter. I didn't need fine-tuning. I didn't need RLHF. I just needed to give it more layers to think with."
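To make the layer-rearrangement idea concrete, here is a minimal sketch of how a Goliath-style frankenmerge schedule can be laid out: alternating 16-layer blocks taken from two 80-layer donor models, with each block overlapping the previous one by 8 layers. The block size and overlap match the 16-layer blocks mentioned above, but the exact slice points and donor assignment are my illustrative assumptions, not Goliath's actual recipe.

```python
# Illustrative sketch (NOT Goliath's exact recipe): compute a frankenmerge
# layer schedule by alternating fixed-size blocks from two donor models,
# overlapping consecutive blocks so each block's first layers receive
# hidden states produced by a *later* layer of the other donor.
def frankenmerge_schedule(n_layers=80, block=16, overlap=8, models=("A", "B")):
    """Return a list of (model, start, end) slices, half-open [start, end)."""
    schedule = []
    start, i = 0, 0
    while start + block <= n_layers:
        schedule.append((models[i % len(models)], start, start + block))
        start += block - overlap  # next block re-enters earlier in the stack
        i += 1
    return schedule

sched = frankenmerge_schedule()
for model, s, e in sched:
    print(f"{model}: layers {s}..{e - 1}")
```

Note the "weirdness" the quote describes: block *k+1* starts at a layer index that lies inside block *k*, so its first layer consumes hidden states emitted by a deeper layer than itself, a distribution it never saw during training.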