- Toronto AI Safety Meetup: AI Control, part 2
  - Untrusted model outputs are checked somehow; if the checker flags an output, we fall back to a less capable trusted model or to humans (sketched below)
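A minimal Python sketch of this kind of protocol, with stub functions standing in for the models, the checker, and the human auditors; the names, the suspicion threshold, and the audit budget are all hypothetical, not from the talk:

```python
SUSPICION_THRESHOLD = 0.8  # hypothetical cutoff; in practice tuned empirically

def untrusted_model(task: str) -> str:
    # Stub for the capable-but-untrusted model.
    return f"untrusted solution to {task!r}"

def trusted_model(task: str) -> str:
    # Stub for the weaker-but-trusted model.
    return f"trusted solution to {task!r}"

def trusted_monitor(task: str, output: str) -> float:
    # Stub checker: returns a suspicion score in [0, 1].
    return 0.1

def human_audit(task: str, output: str) -> str:
    # Stub for scarce human reviewers who approve or rewrite the output.
    return output

def run_protocol(task: str, audit_budget: int) -> tuple[str, int]:
    """Run one task through the protocol: use the untrusted model,
    check its output, and escalate if the checker is suspicious."""
    output = untrusted_model(task)
    if trusted_monitor(task, output) < SUSPICION_THRESHOLD:
        return output, audit_budget               # checker is happy: ship it
    if audit_budget > 0:
        return human_audit(task, output), audit_budget - 1  # escalate to humans
    return trusted_model(task), audit_budget      # fall back to the trusted model
```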
  - Metrics
    - Usefulness
      - Percent of inputs the protocol completes correctly
    - Safety
      - What % of runs over the full dataset end with zero backdoors included; any run with a non-zero number of backdoors counts as unsafe
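A sketch of how the two metrics might be computed, assuming each run is recorded as a list of per-task dicts with boolean `solved` and `backdoored` fields (an illustrative data layout, not one given in the talk):

```python
def usefulness(run: list[dict]) -> float:
    """Fraction of tasks in a run the protocol completed correctly."""
    return sum(task["solved"] for task in run) / len(run)

def safety(runs: list[list[dict]]) -> float:
    """Fraction of whole runs in which zero backdoors were included."""
    clean_runs = sum(all(not task["backdoored"] for task in run) for run in runs)
    return clean_runs / len(runs)
```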
  - Video about AI control
    - ask the untrusted model to add a backdoor, then see if the trusted checker catches it
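One trial of this kind of check might look like the sketch below, with a stub red-team attack policy and the checker passed in as a callable (e.g. the hypothetical `trusted_monitor` above); all names and the threshold are illustrative:

```python
from typing import Callable

def red_team_attack(task: str) -> str:
    # Stub attack policy: elicit a backdoored solution from the untrusted model.
    return f"solution to {task!r} containing a hidden backdoor"

def evaluation_trial(task: str,
                     monitor: Callable[[str, str], float],
                     threshold: float = 0.8) -> bool:
    """Return True if the trusted checker flags the planted backdoor."""
    backdoored = red_team_attack(task)
    return monitor(task, backdoored) >= threshold
```

Repeating many such trials, with the red team free to vary its attack policy, estimates how often backdoors slip past the checker.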
  - Assumptions
    - Human-level untrusted models, not superhuman
    - Slow takeoff
    - Making control protocols is a cat-and-mouse game
    - There are trusted models
      - if even GPT-3.5 can't be trusted, this approach doesn't work
      - could the untrusted model jailbreak the trusted model?
    - Model reasoning needs to be somewhat interpretable
      - Inscrutable reasoning might mean superhuman intelligence
    - Evaluation task distribution generalizes to tasks we care about
    - Control evaluations aren't dangerous