- Toronto AI Safety Meetup
- AI containment (boxing)
- Proof-carrying code
- Dafny lets you write imperative code with a specification, then translates it into a logical formula about the code that a verifier checks (see the sketch after this list)
- Software is buggy and does things we don't want
- We can write machine-checked proofs that software is good, i.e. meets its specification
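A minimal sketch of the Dafny workflow (a hypothetical example, not one from the talk): the `ensures` clauses are the formal claim about the code, and Dafny's verifier checks the body against them at compile time.

```dafny
// Hypothetical example: Dafny turns the method body plus its contract
// into verification conditions and discharges them automatically.
method Abs(x: int) returns (y: int)
  ensures y >= 0             // the result is non-negative
  ensures y == x || y == -x  // and equals x up to sign
{
  if x < 0 {
    y := -x;
  } else {
    y := x;
  }
}
```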
- Lampson confinement
- Rules
- Total isolation: the confined program calls no other program
- Transitivity: if it does call another program, that program must itself be confined
- Masking and enforcement
- All inputs, including covert channels, are specified by the controller
- e.g. no hidden side channels
- Formal confinement
- LLM box
- inputs: the specification
- pre-conditions
- post-conditions
- outputs: verified code (see the Dafny sketch below)
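A sketch of how the box could look in Dafny (method name and spec are hypothetical): the human fixes the contract, the model supplies only the body, and the output is accepted only if the verifier proves the body satisfies the contract.

```dafny
// Human-written contract: the pre- and post-conditions are the box.
method Mid(lo: int, hi: int) returns (m: int)
  requires lo <= hi      // pre-condition: input the caller guarantees
  ensures lo <= m <= hi  // post-conditions: what the body must guarantee
{
  // Model-written body: rejected unless Dafny proves the contract above.
  m := lo + (hi - lo) / 2;
}
```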
- Hoare triple
- Ternary predicate {P} C {Q}: if assertion P holds before command C runs, then assertion Q holds afterwards (see the formulation below)
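In standard notation (textbook partial-correctness definition, not verbatim from the talk):

```latex
% {P} C {Q}: if P holds in state s and running C from s terminates in s',
% then Q holds in s'.
\{P\}\; C \;\{Q\}
\;\iff\;
\forall s, s'.\; P(s) \,\land\, \langle C, s\rangle \Downarrow s' \;\Rightarrow\; Q(s')
```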
- Experiment results
- Doesn't work well on current frontier models, but will hopefully be usable by the time more capable models arrive
- Verification burden
- if it costs x tokens to write the program, it costs an additional burden·x tokens to prove it correct (see the equation below)
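Reading that definition off the note (my formulation of the ratio, not a quote):

```latex
% burden = proof cost relative to implementation cost, in tokens
\text{burden} \;=\; \frac{\text{tokens to prove correctness}}{\text{tokens to write the program}}
\qquad\Rightarrow\qquad
\text{total cost} \;=\; x + \text{burden}\cdot x \;=\; (1 + \text{burden})\,x
```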
- Future work
- Non-toy languages and proof stacks
- Elicit subversive code, as in the original AI Control paper
- in that paper they elicit backdoored code and design protocols to audit the submitted solutions
- Strategic outlook
- Doesn't work for arbitrary/scheming ASI
- Doesn't rely on any white-box access to the model
- The weights don't need to be trusted