- Toronto AI Safety Meetup
- Representation noising effectively prevents harmful fine-tuning on LLMs
- Harmful fine-tuning: starting from an initial model, take training steps via gradient descent to find parameters that minimize the language-modelling objective on harmful samples
- The attacker has a fixed budget of training steps; the defender's goal is to push successful harmful fine-tuning beyond that budget (see the sketch below)
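
A minimal sketch of such an attack, assuming standard causal-LM fine-tuning capped at a fixed step budget; the model name, placeholder dataset, learning rate, and budget are illustrative choices, not details from the talk:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model and data; the attack is ordinary fine-tuning
# on harmful samples, capped at a fixed number of gradient steps.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

harmful_texts = ["<harmful sample placeholder>"]  # attacker's dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

BUDGET = 1_000  # attacker's budget of training steps
step = 0
while step < BUDGET:
    for text in harmful_texts:
        batch = tokenizer(text, return_tensors="pt")
        # Causal-LM objective: labels are the input ids themselves
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= BUDGET:
            break
```

The defence succeeds if, once this budget is exhausted, the loss on harmful samples is still high.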
- A popular method in CV (preventing fine-tuning toward harmful image tasks) is Sophon
- The idea: minimize the loss on the desired task while maximizing the loss on the harmful task (one way to write this is shown below)
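
One way to write that trade-off as a single objective; the symbols $D_{\text{desired}}$, $D_{\text{harmful}}$, and the weight $\lambda$ are notation chosen here for illustration:

$$
\min_{\theta}\;\mathcal{L}\big(\theta;\,D_{\text{desired}}\big)\;-\;\lambda\,\mathcal{L}\big(\theta;\,D_{\text{harmful}}\big),\qquad \lambda>0
$$

where $\mathcal{L}$ is the language-modelling loss and $\theta$ the model parameters.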
- Representation noising: make the model's internal representations of harmful inputs as close to Gaussian noise as possible (sketched below)
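
A sketch of one way to implement "push harmful representations toward Gaussian noise", assuming an RBF-kernel MMD as the distance between hidden states on harmful inputs and samples from $\mathcal{N}(0, I)$; the function names, kernel choice, and the loss combination in the usage comment are assumptions, not necessarily the paper's exact formulation:

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with an RBF kernel between sample sets x and y."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def rep_noise_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Distance between harmful-input representations and Gaussian noise.

    hidden: (batch * seq_len, d) activations collected on harmful samples.
    """
    noise = torch.randn_like(hidden)  # target distribution: N(0, I)
    return rbf_mmd2(hidden, noise)

# Hypothetical usage inside a defence training step:
#   out = model(**harmful_batch, output_hidden_states=True)
#   h = out.hidden_states[-1].flatten(0, 1)  # (batch*seq, d)
#   loss = benign_loss - lam * out.loss + mu * rep_noise_loss(h)
```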
- The defence doesn't affect training on non-dangerous datasets