2026-01-30

ARENA W3D5
- Experiments
  - When does the refusal direction emerge?
    - exists in base models
      - plausible? conversations with AIs are already in the training data
    - later layers are more important
  - Mech interp on "end thinking" tokens
  - Does it matter if you do helpful training first or harmless training first?
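
Rough sketch for the refusal-direction experiment, assuming the direction is taken as the difference of mean activations between harmful and harmless prompts (the usual setup, not confirmed in these notes). The activations here are synthetic random data, not from a real model, and `refusal_direction` and `ablate_direction` are hypothetical helper names. For the "when does it emerge" question, the same computation could be repeated per layer and per checkpoint.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in for residual-stream activations at a single layer.
# Assumption: harmful prompts are shifted along one planted axis.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

direction = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, direction)
```

With real activations, `harmful` and `harmless` would be replaced by cached activations of shape `(n_prompts, d_model)` at a chosen layer and token position.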