- NLA thoughts
- NLAs are pretty cool.
NLA evals
- How do we know whether the AV we made is actually extracting meaningful aspects of the model’s thoughts from the activations? That is, is the AV actually putting useful, human-understandable aspects of the thinking into the verbalization? It seems like it would be useful to benchmark how good NLAs are at figuring out whether the model is scheming (or doing some other bad thing), compared to just giving a transcript to a model and asking if there’s scheming going on.
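One way such a benchmark could be scored, as a minimal sketch: assume each detector (one that sees the NLA verbalization, one that sees only the transcript) outputs a scheming score per example, and that ground-truth labels exist. The function names and the choice of AUROC as the metric are assumptions for illustration, not something these notes settle on.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compare_detectors(nla_scores, transcript_scores, labels):
    """AUROC of each detector, plus the NLA's margin over the transcript-only baseline."""
    nla = auroc(nla_scores, labels)
    base = auroc(transcript_scores, labels)
    return {"nla": nla, "transcript_only": base, "margin": nla - base}
```

A positive `margin` would suggest the verbalizations carry signal about scheming beyond what is readable from the transcript; a margin near zero would suggest they mostly restate it.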
Are AVs better than good prompting?
- The explanations from AVs are better (at minimizing reconstruction error) than those you would get by generating explanations using the prompt from the warm-start SFT phase. This doesn’t happen automatically: in the warm-start phase Sonnet 4.6 gets the full input but not the activations, while the AV gets the activations but not the full input. So the AV is probably learning, at a minimum, how to extract bits of the input text from the activation. But is it actually learning about interesting features, or is it just learning a better way to format the explanations? A good baseline for NLAs might be to measure how much better the AV’s explanations are than those from an LLM that is given only the input, prompted with several examples of good explanations, and told to match them closely.
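The baseline comparison could be computed roughly as follows. This is a sketch under assumptions: both sets of explanations are fed through the same AR, activations are plain lists of floats, and the error is squared error normalized by the target's squared norm. The helper names are hypothetical.

```python
def normalized_mse(target, recon):
    """Squared reconstruction error divided by the squared norm of the target."""
    if len(target) != len(recon):
        raise ValueError("target and reconstruction must have the same width")
    err = sum((t - r) ** 2 for t, r in zip(target, recon))
    norm = sum(t * t for t in target)
    return err / norm  # assumes the target is not all zeros


def mean_error(targets, recons):
    """Average normalized error over a batch of activations."""
    return sum(normalized_mse(t, r) for t, r in zip(targets, recons)) / len(targets)


def relative_improvement(targets, av_recons, baseline_recons):
    """Fraction of the baseline's error removed by the AV (0 = no gain, 1 = perfect)."""
    av = mean_error(targets, av_recons)
    base = mean_error(targets, baseline_recons)
    return (base - av) / base  # assumes the baseline error is nonzero
```

If the few-shot baseline closes most of the gap, that would point toward the AV mainly having learned a better explanation format rather than new features.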
Activation decomposition
- I wrote a bit about this before, but it seems like it would help the AV to decompose the input activation over multiple tokens. Reasoning: for normal models that take text as input, each input token carries ~18 bits of information (for a 200k vocabulary), while the residual stream is much bigger (even for a small model like Gemma-3B, the residual stream has 2560 dimensions, or 40960 bits at bf16). So it seems important to have a lot of extra “space” in the residual stream, and an input activation takes up all of that space at its position. In my small tests with AOs, repeating the input token seems to help somewhat, and decomposing it doesn’t help more than repeating does (possibly because my decomposition scheme isn’t good enough).
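The arithmetic behind the bit counts above, plus a sketch of the two input layouts. The bit figures are raw storage sizes, so they are an upper bound rather than a measure of usable information. `chunk_decompose` is a hypothetical scheme made up for illustration; it is not the decomposition used in the tests mentioned here.

```python
import math

VOCAB_SIZE = 200_000   # vocabulary size from the note
D_MODEL = 2560         # residual stream width from the note
BITS_PER_VALUE = 16    # bf16

token_bits = math.log2(VOCAB_SIZE)        # bits needed to pick one token ID
residual_bits = D_MODEL * BITS_PER_VALUE  # raw bits in one residual vector
print(f"{token_bits:.1f} bits per token vs {residual_bits} bits per residual vector")


def repeat_activation(act, k):
    """Place the same activation at k consecutive input positions."""
    return [list(act) for _ in range(k)]


def chunk_decompose(act, k):
    """Hypothetical scheme: split act into k contiguous chunks, each zero-padded
    to full width, so the k positions sum back to the original activation."""
    d = len(act)
    size = math.ceil(d / k)
    out = []
    for i in range(k):
        lo, hi = i * size, min((i + 1) * size, d)
        out.append([act[j] if lo <= j < hi else 0.0 for j in range(d)])
    return out
```

Either layout gives the model k positions to work with; the chunked one also leaves most of each position's dimensions empty, which is the extra “space” the argument is about.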
The AR is too smart
- The AVs (especially for the open models) love to give quotes from the input text. I think this is probably very useful for reconstructing the activations, but it’s not useful for interpretability (I already know what the input is! I want deeper insights than that!).
What if we gave the entire input text to the AR? That would solve the quote problem by forcing the AV to pass along only information that improves on the input text the AR already has. This might work, but the AR might be able to reconstruct the activation almost entirely from the input text alone, which would obviate the need for the AV altogether.
I think part of the cause of this is that the AR is too smart: it is too good at putting together an activation just from quotes from the input text. The obvious fix would be to make the AR a smaller model, but this is hard because we want the activation to be the same size as the residual stream, so you’d have to decompose it (which might be a good thing! see above!), and we want the model to have a kind of native understanding of the activations. Another way to make the AR worse is to use various regularization techniques during training. The goal of all of this is to force the AV to lay things out clearly so the AR doesn’t have to do much understanding.
What if the AR was a flow chart
- What if the AR was an extremely simple model, like a learned flow chart? I haven’t thought about it much but this seems vaguely like it might be a good idea.
The AV output could be more useful
- Right now AV outputs read kinda like a jumble of thoughts. Maybe we should force the AV output to be in a more structured format (or even just provide more structure in the warm-start data).
NLAs for activation diffing
- What if you used an NLA to interpret the difference between the activations of a fine-tuned model and the activations of a base model (on the same input)? I think this was done for AOs, but I don’t think it has been tried for NLAs.
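A minimal sketch of the diffing step, assuming both activations come from the same layer and token position and are plain lists of floats. The rescaling helper is an assumption added for illustration: an AV trained on full activations may not handle a much smaller difference vector well, so the diff is scaled to a typical activation norm before being verbalized.

```python
import math


def activation_diff(finetuned_act, base_act):
    """Elementwise difference between the two models' activations on the same input."""
    if len(finetuned_act) != len(base_act):
        raise ValueError("activations must have the same width")
    return [f - b for f, b in zip(finetuned_act, base_act)]


def rescale_to_norm(vec, target_norm):
    """Assumption: scale the diff to a norm the AV has seen during training."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)  # identical activations: nothing to interpret
    return [v * target_norm / norm for v in vec]
```

The rescaled diff would then be passed to the AV in place of a normal activation. Whether the AV generalizes to such inputs is exactly the open question.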
What if we used the AR for something
- The AR isn’t very useful for interp right now, but maybe it should be??