- Toronto AI Safety Meetup
- Representation noising effectively prevents harmful fine-tuning on LLMs
- Harmful fine-tuning: starting from an initial model, take training steps via gradient descent to find parameters that minimize the language-modelling objective on harmful samples
- The attacker has a fixed budget of training steps; the defender's goal is to push successful harmful fine-tuning beyond that budget (see the sketch below)
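
A minimal sketch of such an attack, assuming standard causal-LM fine-tuning capped at a fixed step budget; the model name, placeholder dataset, learning rate, and budget are illustrative choices, not details from the talk:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model and data; the attack is ordinary fine-tuning
# on harmful samples, capped at a fixed number of gradient steps.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

harmful_texts = ["<harmful sample placeholder>"]  # attacker's dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

BUDGET = 1_000  # attacker's budget of training steps
step = 0
while step < BUDGET:
    for text in harmful_texts:
        batch = tokenizer(text, return_tensors="pt")
        # Causal-LM objective: labels are the input ids themselves
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= BUDGET:
            break
```

The defence succeeds if, once this budget is exhausted, the loss on harmful samples is still high.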
- A popular method in CV (preventing fine-tuning toward harmful image tasks) is Sophon
- The idea: minimize the loss on the desired task while maximizing the loss on the harmful task (one way to write this is shown below)
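
One way to write that trade-off as a single objective; the symbols $D_{\text{desired}}$, $D_{\text{harmful}}$, and the weight $\lambda$ are notation chosen here for illustration:

$$
\min_{\theta}\;\mathcal{L}\big(\theta;\,D_{\text{desired}}\big)\;-\;\lambda\,\mathcal{L}\big(\theta;\,D_{\text{harmful}}\big),\qquad \lambda>0
$$

where $\mathcal{L}$ is the language-modelling loss and $\theta$ the model parameters.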
- Representation noising: make the model's internal representations of harmful inputs as close to Gaussian noise as possible (sketched below)
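
A sketch of one way to implement "push harmful representations toward Gaussian noise", assuming an RBF-kernel MMD as the distance between hidden states on harmful inputs and samples from $\mathcal{N}(0, I)$; the function names, kernel choice, and the loss combination in the usage comment are assumptions, not necessarily the paper's exact formulation:

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with an RBF kernel between sample sets x and y."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def rep_noise_loss(hidden: torch.Tensor) -> torch.Tensor:
    """Distance between harmful-input representations and Gaussian noise.

    hidden: (batch * seq_len, d) activations collected on harmful samples.
    """
    noise = torch.randn_like(hidden)  # target distribution: N(0, I)
    return rbf_mmd2(hidden, noise)

# Hypothetical usage inside a defence training step:
#   out = model(**harmful_batch, output_hidden_states=True)
#   h = out.hidden_states[-1].flatten(0, 1)  # (batch*seq, d)
#   loss = benign_loss - lam * out.loss + mu * rep_noise_loss(h)
```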
- The defence doesn't affect training on non-dangerous datasets