Here's an extract, the core TL;DR, to give a feel of the article.
"And now for the weirdness: there was never a case where any Transformer layer would have seen the output from a future layer!
Layer 10 is trained on layer 9's output distribution. Layer 60 is trained on layer 59's. If you rearrange them, feeding layer 60's output into layer 10, you've created a distribution the model literally never saw during training.
The astounding thing about Goliath wasn't that it was a huge leap in performance; it was that the damn thing functioned at all. To this day, I still don't understand why this didn't raise more eyebrows.
Experimentally, this proved that layers were far more interchangeable than anyone had reason to expect. The internal representations were homogeneous enough that the model could digest out-of-order hidden states without collapsing. The architecture was far more flexible than a rigid pipeline.
Between the Base64 observation and Goliath, I had a hypothesis: Transformers have a genuine functional anatomy. Early layers translate input into abstract representations. Late layers translate back out. And the middle layers, the reasoning cortex, operate in a universal internal language that's robust to architectural rearrangement. The fact that Goliath 120B used a 16-layer block size made me suspect the input and output "processing units" were smaller than 16 layers. I guessed that Alpindale had tried smaller overlaps, and they just didn't work.
If that was true, maybe I didn't need to teach a model new facts to make it smarter. I didn't need fine-tuning. I didn't need RLHF. I just needed to give it more layers to think with."
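To make the layer-rearrangement idea concrete, here is a minimal sketch of how a Goliath-style frankenmerge schedule can be laid out: alternating 16-layer blocks taken from two 80-layer donor models, with each block overlapping the previous one by 8 layers. The block size and overlap match the 16-layer blocks mentioned above, but the exact slice points and donor assignment are my illustrative assumptions, not Goliath's actual recipe.

```python
# Illustrative sketch (NOT Goliath's exact recipe): compute a frankenmerge
# layer schedule by alternating fixed-size blocks from two donor models,
# overlapping consecutive blocks so each block's first layers receive
# hidden states produced by a *later* layer of the other donor.
def frankenmerge_schedule(n_layers=80, block=16, overlap=8, models=("A", "B")):
    """Return a list of (model, start, end) slices, half-open [start, end)."""
    schedule = []
    start, i = 0, 0
    while start + block <= n_layers:
        schedule.append((models[i % len(models)], start, start + block))
        start += block - overlap  # next block re-enters earlier in the stack
        i += 1
    return schedule

sched = frankenmerge_schedule()
for model, s, e in sched:
    print(f"{model}: layers {s}..{e - 1}")
```

Note the "weirdness" the quote describes: block *k+1* starts at a layer index that lies inside block *k*, so its first layer consumes hidden states emitted by a deeper layer than itself, a distribution it never saw during training.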