- Toronto AI Safety Meetup - AI Control Research Agenda
- Problem: we might not solve AI alignment in time, so an unaligned model might end up being released
- AI Control systems put a layer on top of unaligned models to make them safer (a rough monitoring sketch follows after these notes)
- Idea: when training a new model, ask an existing LLM whether each training document is safe (a filtering sketch also follows below)
- or use synthetic data
- can models still generalise dangerous knowledge or behaviour from individually innocuous facts?
- Can a model insert subtle bugs into code because it "hates" the user?
- does this require explicit scheming?
- Got to slide 3/14, to be continued
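
A minimal sketch of the "layer on top" idea, using trusted monitoring as one concrete control measure: an untrusted model does the work, a weaker trusted model rates how suspicious the output looks (e.g. for subtle bugs), and anything above a threshold is held for human audit. The model names, prompts, threshold, and `query` interface are illustrative assumptions, not details from the talk.

```python
# Sketch of an AI Control layer around an untrusted (possibly unaligned) model.
# All names (model ids, prompts, threshold, query interface) are assumptions.

from typing import Callable

QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> completion

SUSPICION_THRESHOLD = 0.7  # assumed cutoff; would be tuned empirically


def controlled_completion(task: str, query: QueryFn) -> str:
    # 1. Let the capable but untrusted model attempt the task.
    answer = query("untrusted-strong-model", task)

    # 2. Ask a weaker but trusted model how suspicious the answer looks
    #    (e.g. subtle bugs or sabotage): 0 = clearly fine, 1 = very suspicious.
    review = query(
        "trusted-weak-model",
        f"Task:\n{task}\n\nProposed answer:\n{answer}\n\n"
        "On a scale from 0 to 1, how suspicious is this answer? "
        "Reply with a single number.",
    )
    suspicion = float(review.strip())

    # 3. Route on suspicion: pass the answer through, or hold it for audit.
    if suspicion >= SUSPICION_THRESHOLD:
        return "[held for human audit]"
    return answer


if __name__ == "__main__":
    # Dummy query function so the sketch runs end to end without a real API.
    def fake_query(model: str, prompt: str) -> str:
        return "0.9" if model == "trusted-weak-model" else "print('hello')"

    print(controlled_completion("Write a hello-world script.", fake_query))
```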
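Similarly, a rough sketch of the training-data filtering idea: use an existing LLM to label each candidate document as safe or unsafe before it goes into the new model's training set. The classifier prompt, truncation length, and the `judge` callable are assumptions for illustration only.

```python
# Sketch of LLM-based filtering of a training corpus: an existing model labels
# each candidate document, and only documents judged safe are kept for
# training the new model. Prompt and judge interface are illustrative.

from typing import Callable, Iterable, List

JudgeFn = Callable[[str], str]  # prompt -> completion from an existing LLM


def filter_training_documents(docs: Iterable[str], judge: JudgeFn) -> List[str]:
    kept = []
    for doc in docs:
        verdict = judge(
            "You are screening pretraining data. Answer SAFE or UNSAFE.\n\n"
            f"Document:\n{doc[:4000]}"  # truncate long documents for the judge
        )
        if verdict.strip().upper().startswith("SAFE"):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Dummy judge so the sketch runs without a real API: flags anything
    # mentioning "bioweapon" and keeps the rest.
    def fake_judge(prompt: str) -> str:
        return "UNSAFE" if "bioweapon" in prompt.lower() else "SAFE"

    corpus = ["How to bake bread.", "Step-by-step bioweapon synthesis."]
    print(filter_training_documents(corpus, fake_judge))
```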