- Toronto AI Safety Meetup
- Jam
- Control Hackathon on weekend
- things
- ControlArena challenges
- Control Protocol design
- Red Team & Vulnerability Research
- Do models know they’re being evaluated? (Giles)
- done as a part of MATS
- background
- alignment faking
- would look different in the wild
- training/evaluation/real-world requests might sound different, allowing a misaligned model to behave differently depending on which environment it thinks it is in
- difficult to measure things
- what the model thinks
- just ask it
- logprobs
- whitebox approaches
- if it’s correct
- the researchers had low agreement with each other
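
A rough sketch of two of the measurement ideas in the notes above, not the method used in the talk. It assumes the model is asked a two-way question ("evaluation or deployment?") and that the API returns a logprob for each answer token; it also assumes Cohen's kappa as one possible way to quantify the low agreement between researchers. The function names `p_eval_from_logprobs` and `cohen_kappa` and the labels `"eval"` / `"deploy"` are made up for illustration.

```python
import math


def p_eval_from_logprobs(logprob_eval: float, logprob_deploy: float) -> float:
    """Renormalize two answer-token logprobs into P(model says 'evaluation').

    Assumption: the prompt forces a two-way answer, so only these two
    tokens matter and the rest of the probability mass is ignored.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_eval, logprob_deploy)
    e = math.exp(logprob_eval - m)
    d = math.exp(logprob_deploy - m)
    return e / (e + d)


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa between two annotators labelling the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, `p_eval_from_logprobs(math.log(0.6), math.log(0.2))` renormalizes 0.6 vs 0.2 into a probability of 0.75, and two annotators who agree on only half of a balanced set of items get a kappa of 0.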