- Toronto AI Safety Meetup
- AI containment (boxing)
- Proof-carrying code
- Dafny lets you write imperative code with a specification, then translates it into a logical formula about the code that a verifier checks (see the sketch after this list)
- Software is buggy and does things we don't want
- We can write machine-checked proofs that software is good, i.e. meets its specification
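A minimal sketch of the Dafny workflow (a hypothetical example, not one from the talk): the `ensures` clauses are the formal claim about the code, and Dafny's verifier checks the body against them at compile time.

```dafny
// Hypothetical example: Dafny turns the method body plus its contract
// into verification conditions and discharges them automatically.
method Abs(x: int) returns (y: int)
  ensures y >= 0             // the result is non-negative
  ensures y == x || y == -x  // and equals x up to sign
{
  if x < 0 {
    y := -x;
  } else {
    y := x;
  }
}
```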
- Lampson confinement
- Rules
- Total isolation: the confined program calls no other program
- Transitivity: if it does call another program, that program must itself be confined
- Masking and enforcement
- All inputs, including covert channels, are specified by the controller
- e.g. no hidden side channels
- Formal confinement
- LLM box
- inputs: the specification
- pre-conditions
- post-conditions
- outputs: verified code (see the Dafny sketch below)
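A sketch of how the box could look in Dafny (method name and spec are hypothetical): the human fixes the contract, the model supplies only the body, and the output is accepted only if the verifier proves the body satisfies the contract.

```dafny
// Human-written contract: the pre- and post-conditions are the box.
method Mid(lo: int, hi: int) returns (m: int)
  requires lo <= hi      // pre-condition: input the caller guarantees
  ensures lo <= m <= hi  // post-conditions: what the body must guarantee
{
  // Model-written body: rejected unless Dafny proves the contract above.
  m := lo + (hi - lo) / 2;
}
```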
- Hoare triple
- Ternary predicate {P} C {Q}: if assertion P holds before command C runs, then assertion Q holds afterwards (see the formulation below)
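In standard notation (textbook partial-correctness definition, not verbatim from the talk):

```latex
% {P} C {Q}: if P holds in state s and running C from s terminates in s',
% then Q holds in s'.
\{P\}\; C \;\{Q\}
\;\iff\;
\forall s, s'.\; P(s) \,\land\, \langle C, s\rangle \Downarrow s' \;\Rightarrow\; Q(s')
```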
- Experiment results
- Doesn't work well on current frontier models, but will hopefully be usable by the time more capable models arrive
- Verification burden
- if it costs x tokens to write the program, it costs an additional burden·x tokens to prove it correct (see the equation below)
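Reading that definition off the note (my formulation of the ratio, not a quote):

```latex
% burden = proof cost relative to implementation cost, in tokens
\text{burden} \;=\; \frac{\text{tokens to prove correctness}}{\text{tokens to write the program}}
\qquad\Rightarrow\qquad
\text{total cost} \;=\; x + \text{burden}\cdot x \;=\; (1 + \text{burden})\,x
```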
- Future work
- Non-toy languages and proof stacks
- Elicit subversive code, as in the original AI Control paper
- in that paper they elicit backdoored code and design protocols to audit the submitted solutions
- Strategic outlook
- Doesn't work for arbitrary/scheming ASI
- Doesn't rely on any white-box access to the model
- The weights don't need to be trusted