- Toronto AI Safety Meetup: Auditing Language Models for Hidden Objectives
- https://arxiv.org/pdf/2503.10965
- Fine-tuning pre-trained LLM gives it rewards and objectives
- Hard to specify rewards
- Problems
- Sycophantic responses
- Over-confident guessing
- Verbose answers
- Instrumental convergence
- E.g. I give model task to do X, and to do that the model needs to survive
- Alignment faking
- Deception
- Steps
- Create good proxy model. An AI Model with hidden misalligned object
- e.g. adding extra unneeded details in replies to questions
- Run blind auditing game, four teams of researchers investigate
- Phase 1: Explore, try to find anything weird
- Phase 2: Given a weird behavior, try to find others
- Study auditing techniques