2025-04-03

Toronto AI Safety Meetup: Auditing Language Models for Hidden Objectives
- https://arxiv.org/pdf/2503.10965
- Fine-tuning pre-trained LLM gives it rewards and objectives
- Hard to specify rewards
- Problems
  - Sycophantic responses
  - Over-confident guessing
  - Verbose answers
  - Instrumental convergence
    - E.g. I give model task to do X, and to do that the model needs to survive
  - Alignment faking
  - Deception
- Steps
  1. Create good proxy model. An AI Model with hidden misalligned object
    - e.g. adding extra unneeded details in replies to questions
  2. Run blind auditing game, four teams of researchers investigate
    - Phase 1: Explore, try to find anything weird
    - Phase 2: Given a weird behavior, try to find others
  3. Study auditing techniques