Tiled Hacker news on React Router

Show HN: Llama 3.2 Interpretability with Sparse Autoencoders

579 points - 11/21/2024

I spent a lot of time and money on this rather big side project of mine that attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers by Anthropic [1], OpenAI [2] and Deepmind [3].

I am quite proud of this project and since I consider myself the target audience for HackerNews did I think that maybe some of you would appreciate this open research replication as well. Happy to answer any questions or face any feedback.

Cheers

[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...

[2] https://arxiv.org/abs/2406.04093

[3] https://arxiv.org/abs/2408.05147

Source

foundry27
11/21/2024
For anyone who hasn’t seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric where the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect “untruthfulness” because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth.
Rhetoric isn’t reasoning. True explainability, like what overfitted Sparse Autoencoders claim they offer, basically results in the causal sequence of “thoughts” the model went through as it produces an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.
jwuphysics
11/22/2024
Incredible, well-documented work -- this is an amazing effort!
Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We also were able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to the specific application we performed (over paper abstracts embeddings) or if they represent more general phenomena.
[1] https://arxiv.org/abs/2408.00657
Eliezer
11/22/2024
This seems like decent alignment-positive work on a glance, though I haven't checked full details yet. I probably can't make it happen, but how much would someone need to pay you to make up your time, expense, and risk?
curious_cat_163
11/21/2024
Hey - Thanks for sharing!
Will take a closer look later but if you are hanging around now, it might be worth asking this now. I read this blog post recently:
https://adamkarvonen.github.io/machine_learning/2024/06/11/s...
And the author talks about challenges with evaluating SAEs. I wonder how you tackled that and where to look inside your repo for understanding the your approach around that if possible.
Thanks again!
OrangeMusic
11/29/2024
This has been "taken down" and the repo archived. No explanation of what happened.
JackYoustra
11/21/2024
Very cool work! Any plans to integrate it with SAELens?
monkeycantype
11/22/2024
Thank you for posting this PaulPauls,
can I please ask a wacky question that I have about mech.interp. ?
we know that when we use a metric to measure humans, we game the metric, I wonder about future ai, gaming mech.interp.
for simplicity let me propose a 2d matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2d makes my example easier to explain.
if we had the following tokens mapped in 2d space
```
  Apple 1a
  Pear  1b
  Donkey 2a
  Horse 2b
```
it becomes impossible to understand if the neurons 1,2,a,b, all excited represents apple and horse or donkey and pear?
I wonder if a future, vastly more competent AI overseeing its own training might use this form of ambiguity as means to create a model in which the weights are selected so the potential for encoding collisions exist, so that it is possible to deceive an mech.int. observer as to the true thoughts of the model, essentially enabling the ai to think in euphemisms?
samstevens
11/22/2024
I’m really excited to see some more open SAE work! The engineering effort is non trivial and I’m going to check out your dataloading code tomorrow. You might be interested in an currently in-progress project of mine to train SAEs on vision models: https://github.com/samuelstevens/saev
jaykr_
11/21/2024
This is awesome! I really appreciate the time you took to document everything!
moconnor
11/22/2024
Find a latent for the Golden Gate bridge and put a Golden Gate Llama 3.2 on HuggingFace. This will get even more attention and love, more so if you include link to a space to chat with it!
Also, you didn't ask for suggestions but putting some interesting results / visualizations at the top of the README is a very good idea.
vivekkalyan
11/22/2024
This is great work! Mechanistic interpretability has tons of use cases, it's great to see open research in that field.
You mentioned you spent your own time and money on it, would you be willing to share how much you spent? It would help others who might be considering independent research.
westurner
11/23/2024
The relative performance in err/watts/time compared to deep learning for feature selection instead of principal component analysis and standard xgboost or tabular xt TODO for optimization given the indicating features.
XAI: Explainable AI: https://en.wikipedia.org/wiki/Explainable_artificial_intelli...
/? XAI , #XAI , Explain, EXPLAIN PLAN , error/energy/time
imranhou
11/22/2024
Coming from a layman's perspective, a genuine question regarding: "Implements SAE training with auxiliary loss to prevent and revive dead latents, and gradient projection to stabilize training dynamics".
I struggle to understand this phrase "to prevent and revive ", perhaps this is simple speak to those that understand the subject of SAEs, but it feels a bit self contradictory to me, could anyone elaborate?
yangwang92
11/22/2024
Nice！ You did what I wanted. Have you tried to train SAE for vision encoder and language encoder? I am working on this idea. May we work together, let me initial an issue.
batterylake
11/22/2024
This is incredible!
PaulPauls, how would you like us to cite your work?
11/22/2024
enterthedragon
11/22/2024
This is amazing, the documentation is very well organized
Carrentt
11/22/2024
Fantastic work! I absolutely love all the documentation.
coolvision
11/23/2024
nice! did you use cloud GPUs or built your own machine?
S-Kaenel
11/22/2024
Amazing research!!
jzjsj
11/23/2024
whwjwj
11/22/2024

Show HN: Llama 3.2 Interpretability with Sparse Autoencoders

foundry27

stavros

LoganDark

btbuildem

sinuhe69

TeMPOraL

mdp2021

sinuhe69

TeMPOraL

worldsayshi

TeMPOraL

legel

HeavyStorm

mdp2021

catskul2

mdp2021

og_kalu

mdp2021

og_kalu

mdp2021

mdp2021

stavros

mdp2021

og_kalu

mdp2021

og_kalu

mdp2021

DSingularity

stavros

f_devd

stavros

f_devd

snthpy

jamesemmott

snthpy

omgwtfbyobbq

snthpy

shshshshs

Hedepig

mdp2021

Hedepig

mdp2021

Hedepig

mdp2021

Hedepig

mdp2021

Hedepig

fragmede

snthpy

mdp2021

briffid

snthpy

benreesman

txnf

drdeca

benreesman

bubaumba

Onavo

fsndz

benchmarkist

HeatrayEnjoyer

benchmarkist

mdp2021

jwuphysics

PaulPauls

Eliezer

curious_cat_163

PaulPauls

curious_cat_163

OrangeMusic

dimitry12

JackYoustra

PaulPauls

monkeycantype

Majromax

samstevens

lynx23

jaykr_

PaulPauls