kmeisthax
today at 6:21 PM
I'm wondering if the big problem is just the lack of recurrent connections in the standard Transformer design, and selective layer duplication is just a weird way to fix the same problem. I have to wonder if it would be possible to deliberately architect a model to discover and exploit layers worth duplicating at training time.
The current model architectures we use have a fixed routing of residuals per layer, from the first to the last. I'm imagining replacing this with a matrix of routing weights[0] that determines how "strong" the connection is between each Transformer layer. We still evaluate each layer "in order", but now instead of just giving the layer the last layer's residuals, it gets the sum of all prior layers times their weight in the routing matrix. Recurrent connections (i.e. output of layer 9 to input of layer 3) could be handled by doing a second pass and using the first pass's recurrent residuals as inputs. You could then "loop" the model as many times as desired per token, or even have it do parallel decoding with each token communicating with the others while also recurring on itself.
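A rough sketch of the routing idea in plain Python, under several assumptions that are not in the comment above: each layer is a callable that returns an update vector, the embedding counts as "source 0" and layer k's output as "source k+1", and the routing matrix gets one extra row that mixes everything into the final readout. The names `mix` and `routed_forward` are made up for illustration. With those assumptions, an all-ones lower triangle reproduces the ordinary residual stream, and any weight above the diagonal is a recurrent edge that only takes effect from the second pass onward.

```python
def mix(weights, sources):
    """Weighted sum of equally sized vectors; a source that is None is skipped."""
    dim = len(next(s for s in sources if s is not None))
    out = [0.0] * dim
    for w, s in zip(weights, sources):
        if s is None or w == 0.0:
            continue
        for d in range(dim):
            out[d] += w * s[d]
    return out


def routed_forward(x, layers, routing, n_passes=1):
    """Run the layers in order, feeding each one a routed mix of all sources.

    routing[i][k] is the weight from source k into layer i (row len(layers)
    is the final readout). Sources up to index i come from the current pass;
    later ones are recurrent and come from the previous pass, if there was one.
    """
    n = len(layers)
    prev = [None] * (n + 1)          # nothing to recur on before the first pass
    for _ in range(n_passes):
        cur = [x] + [None] * n       # source 0 is always the embedding
        for i, layer in enumerate(layers):
            sources = cur[: i + 1] + prev[i + 1 :]
            cur[i + 1] = layer(mix(routing[i], sources))
        prev = cur
    return mix(routing[n], prev)


# Toy stand-ins for Transformer blocks: each returns an update vector.
layers = [lambda v, k=k: [0.5 * a + k for a in v] for k in range(3)]
n = len(layers)

# All-ones lower triangle: every layer sees the embedding plus all earlier updates.
routing = [[1.0 if k <= i else 0.0 for k in range(n + 1)] for i in range(n + 1)]

x = [1.0, -2.0]
baseline = routed_forward(x, layers, routing)

# Hypothetical recurrent edge: feed layer 2's output back into layer 0.
routing[0][3] = 0.5
looped = routed_forward(x, layers, routing, n_passes=2)
```

In a real model the routing entries would be learned parameters and the mix would need some normalization to keep activations from blowing up across passes; this sketch leaves both out.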
You'd probably need some kind of normalization akin to what DeepSeek did with Manifold-Constrained Hyper-Connections (mHC). Hell, mHC might also be useful in combination with this kind of layer routing, so the model could grow different recurrent loops for various bits of its thought-space.
EDIT: if anyone uses it please call it "neuralese recurrence" just to scare the AI safety bros
[0] I'm not sure how you'd initialize these weights. Maybe each row/column is a narrow gaussian centered around the prior layer, with some random or constant weighting everywhere else?
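One way the Gaussian guess could be written down, purely as an illustration. Assumptions: row i holds the input weights for layer i, column k is a source (0 is the embedding, k is the output of layer k-1), each layer emits its full hidden state so that pointing mostly at the immediately prior source starts out close to a normal stack, and `sigma` and `floor` are arbitrary made-up defaults. The function name `init_routing` is hypothetical.

```python
import math


def init_routing(n_layers, sigma=0.5, floor=0.01):
    """Routing matrix with a narrow Gaussian bump on the prior source.

    Entry [i][k] peaks at 1.0 when k == i (the source feeding layer i in a
    plain stack) and decays to a constant `floor` everywhere else.
    """
    size = n_layers + 1  # extra row/column for the readout and the embedding
    rows = []
    for i in range(size):
        row = []
        for k in range(size):
            bump = math.exp(-((k - i) ** 2) / (2 * sigma ** 2))
            row.append(floor + (1.0 - floor) * bump)
        rows.append(row)
    return rows


R = init_routing(4)
```

Because the bump is symmetric, entries just above the diagonal (the recurrent direction) start with the same small weight as skip connections just below it; whether that is a good starting point is an open question.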