2025-03-20

Toronto AI Safety Meetup
- Jam
  - Control Hackathon on weekend
  - things
    - ControlArena challenges
    - Control Protocol design
    - Red Team & Vulnerability Research
- Do models know they’re being evaluated? (Giles)
  - done as a part of MATS
  - background
    - alignment faking
      - would look different in the wild
        
        training, evaluation, and real-world requests might sound different, allowing a misaligned model to behave differently depending on which environment it thinks it is in
  - difficult to measure things
    - what the model thinks
      - just ask it
      - logprobs
      - whitebox approaches
        
        linear probes
    - if it’s correct
      - the researchers had low agreement with each other
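
Side note on the "low agreement" point: one common way to put a number on how much two labelers agree is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The notes do not say which metric the researchers actually used, so this is only a sketch; the function name and the "eval"/"real" labels below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labelled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: did each transcript look like an evaluation
# or a real-world request?
annotator_a = ["eval", "eval", "real", "real", "eval", "real"]
annotator_b = ["eval", "real", "real", "eval", "eval", "real"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

A kappa near 0 means agreement is about what chance would give, and 1 means perfect agreement. If the human labels themselves score low, there is no solid ground truth to check the model's guesses against.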