- Toronto AI Safety Meetup - AI Control Research Agenda
- Problem: we might not solve AI alignment in time, so an unaligned model might end up being released
- AI Control systems put a layer on top of unaligned models to make them safer (a rough monitoring sketch follows after these notes)
- Idea: when training a new model, ask an existing LLM whether each training document is safe (a filtering sketch also follows below)
- or use synthetic data
- can models still generalise dangerous knowledge or behaviour from individually innocuous facts?
- Can a model insert subtle bugs into code because it "hates" the user?
- does this require explicit scheming?
- Got to slide 3/14, to be continued
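
A minimal sketch of the "layer on top" idea, using trusted monitoring as one concrete control measure: an untrusted model does the work, a weaker trusted model rates how suspicious the output looks (e.g. for subtle bugs), and anything above a threshold is held for human audit. The model names, prompts, threshold, and `query` interface are illustrative assumptions, not details from the talk.

```python
# Sketch of an AI Control layer around an untrusted (possibly unaligned) model.
# All names (model ids, prompts, threshold, query interface) are assumptions.

from typing import Callable

QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> completion

SUSPICION_THRESHOLD = 0.7  # assumed cutoff; would be tuned empirically


def controlled_completion(task: str, query: QueryFn) -> str:
    # 1. Let the capable but untrusted model attempt the task.
    answer = query("untrusted-strong-model", task)

    # 2. Ask a weaker but trusted model how suspicious the answer looks
    #    (e.g. subtle bugs or sabotage): 0 = clearly fine, 1 = very suspicious.
    review = query(
        "trusted-weak-model",
        f"Task:\n{task}\n\nProposed answer:\n{answer}\n\n"
        "On a scale from 0 to 1, how suspicious is this answer? "
        "Reply with a single number.",
    )
    suspicion = float(review.strip())

    # 3. Route on suspicion: pass the answer through, or hold it for audit.
    if suspicion >= SUSPICION_THRESHOLD:
        return "[held for human audit]"
    return answer


if __name__ == "__main__":
    # Dummy query function so the sketch runs end to end without a real API.
    def fake_query(model: str, prompt: str) -> str:
        return "0.9" if model == "trusted-weak-model" else "print('hello')"

    print(controlled_completion("Write a hello-world script.", fake_query))
```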
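Similarly, a rough sketch of the training-data filtering idea: use an existing LLM to label each candidate document as safe or unsafe before it goes into the new model's training set. The classifier prompt, truncation length, and the `judge` callable are assumptions for illustration only.

```python
# Sketch of LLM-based filtering of a training corpus: an existing model labels
# each candidate document, and only documents judged safe are kept for
# training the new model. Prompt and judge interface are illustrative.

from typing import Callable, Iterable, List

JudgeFn = Callable[[str], str]  # prompt -> completion from an existing LLM


def filter_training_documents(docs: Iterable[str], judge: JudgeFn) -> List[str]:
    kept = []
    for doc in docs:
        verdict = judge(
            "You are screening pretraining data. Answer SAFE or UNSAFE.\n\n"
            f"Document:\n{doc[:4000]}"  # truncate long documents for the judge
        )
        if verdict.strip().upper().startswith("SAFE"):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Dummy judge so the sketch runs without a real API: flags anything
    # mentioning "bioweapon" and keeps the rest.
    def fake_judge(prompt: str) -> str:
        return "UNSAFE" if "bioweapon" in prompt.lower() else "SAFE"

    corpus = ["How to bake bread.", "Step-by-step bioweapon synthesis."]
    print(filter_training_documents(corpus, fake_judge))
```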