- Toronto AI Safety Meetup: AI Control, part 2
  - Untrusted model outputs are checked somehow; if the checker flags an output, we fall back to a less capable trusted model or to humans (sketched below)
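A minimal Python sketch of this kind of protocol, with stub functions standing in for the models, the checker, and the human auditors; the names, the suspicion threshold, and the audit budget are all hypothetical, not from the talk:

```python
SUSPICION_THRESHOLD = 0.8  # hypothetical cutoff; in practice tuned empirically

def untrusted_model(task: str) -> str:
    # Stub for the capable-but-untrusted model.
    return f"untrusted solution to {task!r}"

def trusted_model(task: str) -> str:
    # Stub for the weaker-but-trusted model.
    return f"trusted solution to {task!r}"

def trusted_monitor(task: str, output: str) -> float:
    # Stub checker: returns a suspicion score in [0, 1].
    return 0.1

def human_audit(task: str, output: str) -> str:
    # Stub for scarce human reviewers who approve or rewrite the output.
    return output

def run_protocol(task: str, audit_budget: int) -> tuple[str, int]:
    """Run one task through the protocol: use the untrusted model,
    check its output, and escalate if the checker is suspicious."""
    output = untrusted_model(task)
    if trusted_monitor(task, output) < SUSPICION_THRESHOLD:
        return output, audit_budget               # checker is happy: ship it
    if audit_budget > 0:
        return human_audit(task, output), audit_budget - 1  # escalate to humans
    return trusted_model(task), audit_budget      # fall back to the trusted model
```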
  - Metrics
    - Usefulness
      - Percent of inputs the protocol completes correctly
    - Safety
      - What % of runs over the full dataset end with zero backdoors included; any run with a non-zero number of backdoors counts as unsafe
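A sketch of how the two metrics might be computed, assuming each run is recorded as a list of per-task dicts with boolean `solved` and `backdoored` fields (an illustrative data layout, not one given in the talk):

```python
def usefulness(run: list[dict]) -> float:
    """Fraction of tasks in a run the protocol completed correctly."""
    return sum(task["solved"] for task in run) / len(run)

def safety(runs: list[list[dict]]) -> float:
    """Fraction of whole runs in which zero backdoors were included."""
    clean_runs = sum(all(not task["backdoored"] for task in run) for run in runs)
    return clean_runs / len(runs)
```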
  - Video about AI control
    - ask the untrusted model to add a backdoor, then see if the trusted checker catches it
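One trial of this kind of check might look like the sketch below, with a stub red-team attack policy and the checker passed in as a callable (e.g. the hypothetical `trusted_monitor` above); all names and the threshold are illustrative:

```python
from typing import Callable

def red_team_attack(task: str) -> str:
    # Stub attack policy: elicit a backdoored solution from the untrusted model.
    return f"solution to {task!r} containing a hidden backdoor"

def evaluation_trial(task: str,
                     monitor: Callable[[str, str], float],
                     threshold: float = 0.8) -> bool:
    """Return True if the trusted checker flags the planted backdoor."""
    backdoored = red_team_attack(task)
    return monitor(task, backdoored) >= threshold
```

Repeating many such trials, with the red team free to vary its attack policy, estimates how often backdoors slip past the checker.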
  - Assumptions
    - Human-level untrusted models, not superhuman
    - Slow takeoff
    - Making control protocols is a cat-and-mouse game
    - There are trusted models
      - if even GPT-3.5 can't be trusted, this approach doesn't work
      - could the untrusted model jailbreak the trusted model?
    - Model reasoning needs to be somewhat interpretable
      - Inscrutable reasoning might mean superhuman intelligence
    - Evaluation task distribution generalizes to tasks we care about
    - Control evaluations aren't dangerous