- Toronto AI Safety Meetup
- retrieval-interleaved generation
- Sparse autoencoders (SAEs)
- Autoencoders
- Large image -> small latent-space representation -> large image (reconstruction)
- Feature extraction using the AE's latent space
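A minimal sketch of that large -> small -> large shape, assuming a PyTorch setup; the dimensions and names are illustrative, not from the talk:

```python
import torch
import torch.nn as nn

# Illustrative autoencoder: large input -> small latent -> large reconstruction.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)        # small latent representation (the "features")
        return self.decoder(z), z  # reconstruction + latent

model = AutoEncoder()
x = torch.randn(8, 784)                  # batch of flattened "images"
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss
```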
- linear representation hypothesis
- non-linear features exist, e.g. circular things like days of the week
- a linear day-of-week feature wouldn't let you add one day and loop back around (Sun + 1 should land on Mon)
- not considering these here (illustrated below)
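A small illustration of the wrap-around problem; the encodings here are hypothetical, not from the talk:

```python
import numpy as np

# A scalar "linear" day-of-week feature can't wrap: Sun (6) + 1 = 7, not Mon (0).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
linear = {d: i for i, d in enumerate(days)}
print(linear["Sun"] + 1)  # 7 -- falls off the end instead of looping to Mon

# A 2-D circular embedding wraps naturally: advancing a day is a rotation.
angle = 2 * np.pi / 7
embed = {d: np.array([np.cos(i * angle), np.sin(i * angle)])
         for i, d in enumerate(days)}
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
print(np.allclose(rot @ embed["Sun"], embed["Mon"]))  # True: Sun + 1 day = Mon
```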
- monosemanticity: 1 neuron = 1 feature; polysemanticity: 1 neuron = many features
- low-rank adaptation (LoRA)
- approximating a big matrix as the matrix multiplication of two tiny matrices (sketch below)
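A rough sketch of that factorization, assuming PyTorch; the dimensions and rank are made up for illustration:

```python
import torch

# LoRA idea: leave the big pretrained weight frozen and learn a low-rank
# update, written as the product of two tiny matrices (rank r << d).
d, r = 1024, 8
W = torch.randn(d, d)          # frozen pretrained weight (illustrative)
A = torch.randn(r, d) * 0.01   # trainable, r x d
B = torch.zeros(d, r)          # trainable, d x r; zero init so the update starts at 0

x = torch.randn(2, d)          # a couple of input vectors
y = x @ W.T + x @ A.T @ B.T    # W x + B A x, without ever forming the d x d update

full, lora = d * d, 2 * d * r
print(f"LoRA trains {lora} params instead of {full} ({lora / full:.1%})")
```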
- causal inner product?
- Sparse autoencoder (SAE)
- small (16K) -> big (34M) -> small (16K)
- sparse: most feature activations are zero (or close to it)
- trained to reconstruct the input while also minimizing the sum of feature activations (an L1 sparsity penalty)
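A toy version of that setup, assuming PyTorch; the dimensions are scaled way down from the 16K/34M above, and the L1 coefficient is a placeholder:

```python
import torch
import torch.nn as nn

# Sparse autoencoder sketch: small -> big -> small, with an L1 penalty
# pushing most of the big hidden layer's activations toward zero.
class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model=512, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # feature activations, mostly zero
        return self.decoder(f), f

sae = SparseAutoEncoder()
x = torch.randn(8, 512)                  # e.g. residual-stream activations
x_hat, f = sae(x)
l1_coeff = 1e-3                          # placeholder sparsity coefficient
loss = nn.functional.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
```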
- "Just clamp code error feature to zero" doesn't work
- Model might not know code errors
- Multiple features
- E.g. there are a lot of golden gate bridge features, most golden gate bridge stuff doesn't activate the one they used for golden gate claude
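A hypothetical sketch of what the clamping intervention looks like mechanically; this uses an untrained toy SAE and a made-up feature index, purely to show the mechanism:

```python
import torch
import torch.nn as nn

# Toy SAE pieces (untrained, illustrative dimensions).
encoder = nn.Linear(512, 8192)
decoder = nn.Linear(8192, 512)
feature_idx = 1234                      # hypothetical "code error" feature

with torch.no_grad():
    x = torch.randn(1, 512)             # stand-in model activations
    f = torch.relu(encoder(x))          # SAE feature activations
    f[:, feature_idx] = 0.0             # clamp one feature to zero
    x_steered = decoder(f)              # decode and substitute back into the model
# Related features elsewhere in f stay active, so the concept can persist --
# which is why zeroing a single feature is not a reliable off-switch.
```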