- Toronto AI Safety Meetup
- retrieval-interleaved generation
- Sparse autoencoders (SAEs)
- Autoencoders
- Large image -> small latent-space representation -> large image (reconstruction)
- Feature extraction using the AE's latent space
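A minimal sketch of that large -> small -> large shape, assuming a PyTorch setup; the dimensions and names are illustrative, not from the talk:

```python
import torch
import torch.nn as nn

# Illustrative autoencoder: large input -> small latent -> large reconstruction.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)        # small latent representation (the "features")
        return self.decoder(z), z  # reconstruction + latent

model = AutoEncoder()
x = torch.randn(8, 784)                  # batch of flattened "images"
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # reconstruction loss
```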
- linear representation hypothesis
- non-linear features exist, e.g. circular things like days of the week
- a linear day-of-week feature wouldn't let you add one day and loop back around (Sun + 1 should land on Mon)
- not considering these here (illustrated below)
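A small illustration of the wrap-around problem; the encodings here are hypothetical, not from the talk:

```python
import numpy as np

# A scalar "linear" day-of-week feature can't wrap: Sun (6) + 1 = 7, not Mon (0).
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
linear = {d: i for i, d in enumerate(days)}
print(linear["Sun"] + 1)  # 7 -- falls off the end instead of looping to Mon

# A 2-D circular embedding wraps naturally: advancing a day is a rotation.
angle = 2 * np.pi / 7
embed = {d: np.array([np.cos(i * angle), np.sin(i * angle)])
         for i, d in enumerate(days)}
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
print(np.allclose(rot @ embed["Sun"], embed["Mon"]))  # True: Sun + 1 day = Mon
```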
- monosemanticity: 1 neuron = 1 feature; polysemanticity: 1 neuron = many features
- low-rank adaptation (LoRA)
- approximating a big matrix as the matrix multiplication of two tiny matrices (sketch below)
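A rough sketch of that factorization, assuming PyTorch; the dimensions and rank are made up for illustration:

```python
import torch

# LoRA idea: leave the big pretrained weight frozen and learn a low-rank
# update, written as the product of two tiny matrices (rank r << d).
d, r = 1024, 8
W = torch.randn(d, d)          # frozen pretrained weight (illustrative)
A = torch.randn(r, d) * 0.01   # trainable, r x d
B = torch.zeros(d, r)          # trainable, d x r; zero init so the update starts at 0

x = torch.randn(2, d)          # a couple of input vectors
y = x @ W.T + x @ A.T @ B.T    # W x + B A x, without ever forming the d x d update

full, lora = d * d, 2 * d * r
print(f"LoRA trains {lora} params instead of {full} ({lora / full:.1%})")
```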
- causal inner product?
- Sparse autoencoder (SAE)
- small (16K) -> big (34M) -> small (16K)
- sparse: most feature activations are zero (or close to it)
- trained to reconstruct the input while also minimizing the sum of feature activations (an L1 sparsity penalty)
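A toy version of that setup, assuming PyTorch; the dimensions are scaled way down from the 16K/34M above, and the L1 coefficient is a placeholder:

```python
import torch
import torch.nn as nn

# Sparse autoencoder sketch: small -> big -> small, with an L1 penalty
# pushing most of the big hidden layer's activations toward zero.
class SparseAutoEncoder(nn.Module):
    def __init__(self, d_model=512, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # feature activations, mostly zero
        return self.decoder(f), f

sae = SparseAutoEncoder()
x = torch.randn(8, 512)                  # e.g. residual-stream activations
x_hat, f = sae(x)
l1_coeff = 1e-3                          # placeholder sparsity coefficient
loss = nn.functional.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
```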
- "Just clamp code error feature to zero" doesn't work
- Model might not know code errors
- Multiple features
- E.g. there are a lot of golden gate bridge features, most golden gate bridge stuff doesn't activate the one they used for golden gate claude
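A hypothetical sketch of what the clamping intervention looks like mechanically; this uses an untrained toy SAE and a made-up feature index, purely to show the mechanism:

```python
import torch
import torch.nn as nn

# Toy SAE pieces (untrained, illustrative dimensions).
encoder = nn.Linear(512, 8192)
decoder = nn.Linear(8192, 512)
feature_idx = 1234                      # hypothetical "code error" feature

with torch.no_grad():
    x = torch.randn(1, 512)             # stand-in model activations
    f = torch.relu(encoder(x))          # SAE feature activations
    f[:, feature_idx] = 0.0             # clamp one feature to zero
    x_steered = decoder(f)              # decode and substitute back into the model
# Related features elsewhere in f stay active, so the concept can persist --
# which is why zeroing a single feature is not a reliable off-switch.
```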