Show HN: Llama 3.2 Interpretability with Sparse Autoencoders

278 points - yesterday at 8:37 PM


I spent a lot of time and money on this rather big side project of mine. It attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers from Anthropic [1], OpenAI [2], and DeepMind [3].

I am quite proud of this project, and since I consider myself part of the target audience of Hacker News, I thought that some of you might appreciate this open research replication as well. Happy to answer any questions or field any feedback.

Cheers

[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...

[2] https://arxiv.org/abs/2406.04093

[3] https://arxiv.org/abs/2408.05147

  • foundry27

    yesterday at 11:44 PM

    For anyone who hasn’t seen this before, mechanistic interpretability addresses a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric in which the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most benchmark numbers, which improve as models get better, more powerful models often score worse on tests designed to self-detect “untruthfulness”, because their stronger rhetoric makes them more compelling at justifying lies after the fact. The objective is coherence, not truth.

    Rhetoric isn’t reasoning. True explainability, like what overcomplete Sparse Autoencoders claim to offer, essentially recovers the causal sequence of “thoughts” the model went through as it produced an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.
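    A minimal sketch of what reading those “thoughts” off a model looks like once an SAE is trained: capture a hidden state, encode it, and see which latents fire. All dimensions and weights below are hypothetical stand-ins, not the actual Llama 3.2 or project values.

    ```python
    import torch

    d_model, d_sae = 2048, 16384  # residual stream width, overcomplete SAE width

    # Hypothetical SAE parameters (random here purely for illustration;
    # a real SAE would be trained to sparsely reconstruct hidden states)
    W_enc = torch.randn(d_model, d_sae) / d_model**0.5
    b_enc = torch.zeros(d_sae)
    b_dec = torch.zeros(d_model)

    # Hidden state captured at some layer during a forward pass
    h = torch.randn(d_model)

    # Encode: with a trained SAE, each positive latent is one (ideally
    # monosemantic) feature; random weights won't be sparse, trained ones are
    f = torch.relu((h - b_dec) @ W_enc + b_enc)

    for idx in f.topk(5).indices.tolist():
        print(f"feature {idx}: activation {f[idx].item():.3f}")
    ```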

      • stavros

        today at 12:05 AM

        I want to point out here that people do the same: a lot of the time we don't know why we thought or did something, but we'll confabulate plausible-sounding rhetoric after the fact.

        • benreesman

          today at 2:12 AM

          A lot of the mech interp stuff has seemed to me like a different kind of voodoo: the Integer Quantum Hall Effect? Overloading the term “Superposition” in a weird analogy not governed by serious group representation theory and some clear symmetry? You guys are reaching. And I’ve read all the papers. Spot the postdoc who decided to get paid.

          But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded, nearly orthogonal sets of vectors are wildly counterintuitive in high dimensions, and there are existing results around them that create scope for rigor [1].

          [1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...
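          To see how counterintuitive this gets, here is a quick numerical sketch (dimensions arbitrary): sample random unit vectors and check how close to orthogonal the worst pair is. Concentration of measure, the phenomenon behind Johnson–Lindenstrauss, drives pairwise |cosine| toward zero as the dimension grows, which is what lets far more than d nearly orthogonal directions fit in d dimensions.

          ```python
          import torch

          torch.manual_seed(0)
          n = 1_000  # number of random directions

          for dim in (3, 64, 1024, 16384):
              v = torch.randn(n, dim)
              v = v / v.norm(dim=1, keepdim=True)  # project onto the unit sphere
              sims = (v @ v.T).abs()               # pairwise |cosine similarity|
              sims.fill_diagonal_(0.0)             # ignore self-similarity
              print(f"dim={dim:6d}  worst pair |cos| = {sims.max().item():.3f}")
          ```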

            • drdeca

              today at 3:18 AM

              Where are you seeing the integer quantum Hall effect mentioned? Or are you bringing it up rather than responding to it being brought up elsewhere? I don’t understand what the connection between IQHE and these SAE interpretability approaches is supposed to be.

              • txnf

                today at 3:07 AM

                Superposition coding is a well-known concept in information theory - I think there is certainly more to the story than what's described in the current works, but it does feel like they are going in the right direction

            • Onavo

              today at 12:56 AM

              How does the causality part work? Can it spit out a graphical model?

              • fsndz

                today at 12:55 AM

                I stopped at: “causal sequence of ‘thoughts’”

                  • benchmarkist

                    today at 1:23 AM

                    Interpretability research is basically a projection of the original function implemented by the neural network onto a sub-space of "explanatory" functions that people consider to be more understandable. You're right that the words they use to sell the research are completely nonsensical, because the abstract process has nothing to do with anything causal.

                      • HeatrayEnjoyer

                        today at 3:00 AM

                        All code is causal.

                          • benchmarkist

                            today at 3:05 AM

                            Which makes it entirely irrelevant as a descriptive term.

            • jwuphysics

              today at 12:13 AM

              Incredible, well-documented work -- this is an amazing effort!

              Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of the loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We were also able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to our specific application (SAEs over paper-abstract embeddings) or if they represent more general phenomena.

              [1] https://arxiv.org/abs/2408.00657
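              For readers curious what such an auxiliary loss can look like, here is a simplified sketch loosely modeled on the "AuxK" idea from the OpenAI paper linked in the original post: latents that haven't fired recently are made to reconstruct the SAE's residual error, so they keep receiving gradient. All shapes and thresholds are illustrative, not necessarily the exact recipe from our paper.

              ```python
              import torch
              import torch.nn.functional as F

              def aux_dead_latent_loss(x, x_hat, pre_acts, steps_since_fired,
                                       W_dec, dead_after=1_000, k_aux=64):
                  """Reconstruct the residual error using only 'dead' latents,
                  so they receive gradient and get a chance to revive."""
                  dead = steps_since_fired > dead_after      # (d_sae,) bool mask
                  if not dead.any():
                      return x.new_zeros(())
                  residual = (x - x_hat).detach()            # what live latents missed
                  # keep only the strongest k_aux pre-activations among dead latents
                  masked = pre_acts.masked_fill(~dead, float("-inf"))
                  k = min(k_aux, int(dead.sum()))
                  vals, idx = masked.topk(k, dim=-1)
                  aux_acts = torch.zeros_like(pre_acts).scatter(-1, idx, F.relu(vals))
                  return F.mse_loss(aux_acts @ W_dec, residual)
              ```

              In training, this would be added to the main reconstruction loss with a small coefficient, so it nudges dead latents back to life without dominating the objective.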

                • PaulPauls

                  today at 1:15 AM

                  I'm very happy you appreciate it - particularly the documentation. Writing the documentation was much harder for me than writing the code, so I'm glad it's appreciated. I've also downloaded your paper and will read through it tomorrow morning - thank you for sharing it!

              • Eliezer

                today at 2:37 AM

                This seems like decent alignment-positive work at a glance, though I haven't checked the full details yet. I probably can't make it happen, but how much would someone need to pay you to make up for your time, expenses, and risk?

                • curious_cat_163

                  yesterday at 9:52 PM

                  Hey - Thanks for sharing!

                  Will take a closer look later, but since you are hanging around now, it might be worth asking this right away. I read this blog post recently:

                  https://adamkarvonen.github.io/machine_learning/2024/06/11/s...

                  And the author talks about challenges with evaluating SAEs. I wonder how you tackled that, and where to look inside your repo to understand your approach, if possible.

                  Thanks again!

                    • PaulPauls

                      today at 1:09 AM

                      So evaluating SAEs - determining which SAE is better at creating the most unique features while being as sparse as possible - is a complex topic that is very much at the heart of current research into LLM interpretability through SAEs.

                      Assuming you have already solved the problem of finding multiple promising SAE architectures and training them to perfection (very much an interesting ML engineering problem that this SAE project attempts to solve), deciding which SAE is better comes down to which one performs better on the metrics of your automated interpretability methodology. OpenAI's methodology in particular emphasizes automated interpretability at scale, using a lot of technical metrics upon which the SAEs can be scored _and thereby evaluated_.

                      Since determining the best metrics and methodology is such an open research question that I could've experimented on it for a few additional months, I instead opted for a simple approach in this first release. I talk about my methodology, OpenAI's, and the differences between the two in chapter 4 (Interpretability Analysis) [1] of my Implementation Details & Results section. I can also recommend reading the OpenAI paper directly, or visiting Anthropic's transformer-circuits.pub website [2], which often publishes smaller blog posts on exactly this topic.

                      [1] https://github.com/PaulPauls/llama3_interpretability_sae#4-i...

                      [2] https://transformer-circuits.pub/
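                      For a concrete starting point, the two cheapest numbers to look at before any automated interpretability pass are the average L0 (how sparse the latents are) and the fraction of variance explained by the reconstruction (how faithful the SAE is). A minimal sketch; encode/decode are hypothetical stand-ins, not this repo's actual API.

                      ```python
                      import torch

                      @torch.no_grad()
                      def quick_sae_metrics(encode, decode, hidden_states):
                          f = encode(hidden_states)                # (n, d_sae) latents
                          x_hat = decode(f)                        # (n, d_model) reconstruction
                          l0 = (f > 0).float().sum(dim=-1).mean()  # avg active latents per token
                          resid = (hidden_states - x_hat).pow(2).sum()
                          total = (hidden_states - hidden_states.mean(0)).pow(2).sum()
                          fve = 1.0 - resid / total                # fraction of variance explained
                          return l0.item(), fve.item()
                      ```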

                  • jaykr_

                    yesterday at 9:10 PM

                    This is awesome! I really appreciate the time you took to document everything!

                      • PaulPauls

                        today at 12:36 AM

                        Thank you for saying that! I have a much, much harder time documenting everything and writing out each decision in continuous text than actually writing the code. It took a long time for me to write all of this down, so I'm happy you appreciate it! =)

                    • JackYoustra

                      yesterday at 9:53 PM

                      Very cool work! Any plans to integrate it with SAELens?

                        • PaulPauls

                          today at 12:44 AM

                          Not sure yet, to be honest. I'll definitely consider it, but I'll reorient myself and decide what to do next in the coming week. I've also thought about starting a simpler project that shows people how to build a full, current Llama 3.2 implementation from scratch in pure PyTorch. I love building things from the ground up, and when I looked for documentation for the Llama 3.2 background section of this SAE project, everything I found was either too superficial, or outdated and intended for Llama 1 or 2 - documentation in ML gets outdated so quickly nowadays...