2024-09-19

Toronto AI Safety Meetup: Weak-to-strong generalization
- Paper: Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Empirical methods
  - narrow ASIs
    - e.g. chess
  - https://www.alignmentforum.org/posts/nekLYqbCEBDEfbLzF/artificial-sandwiching-when-can-we-test-scalable-alignment
- Weak-to-strong generalization
  - Used various NLP classification tasks
  - Fine-tuning GPT-4 (strong student) on task labels generated by GPT-2 (weak supervisor)
  - Fine-tuning on GPT-2 labels does better than 0-shot prompting
- Models can still be accurate when 99% of training labels are random
  - https://arxiv.org/pdf/1705.10694
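
Toy sketch of the label-noise point (not the setup from either paper): a nearest-centroid classifier on synthetic 1-D data, trained once on clean labels and once with 99% of labels replaced by uniformly random ones. The data generator, the classifier, and all names (`make_data`, `corrupt`, `fit_centroids`) are assumptions made up for illustration. The intuition is that uniform noise pulls both class centroids toward the middle but does not flip their order, so with enough samples the decision boundary barely moves.

```python
import random

def make_data(n, rng, mu=2.0):
    """Two 1-D Gaussian classes centred at -mu (label 0) and +mu (label 1)."""
    data = []
    for _ in range(n):
        y = rng.randrange(2)
        x = rng.gauss(mu if y == 1 else -mu, 1.0)
        data.append((x, y))
    return data

def corrupt(data, noise_rate, rng):
    """With probability noise_rate, replace a label with a uniformly random one.

    A random label is still correct half the time, so at noise_rate=0.99
    roughly 50.5% of labels end up correct.
    """
    return [(x, rng.randrange(2) if rng.random() < noise_rate else y)
            for x, y in data]

def fit_centroids(data):
    """Mean of x for each label: the whole 'model'."""
    sums, counts = [0.0, 0.0], [0, 0]
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return [sums[k] / counts[k] for k in range(2)]

def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1

def accuracy(centroids, data):
    return sum(predict(centroids, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(400_000, rng)   # large training set: noise averages out
test = make_data(2_000, rng)      # test labels stay clean
noisy_train = corrupt(train, 0.99, rng)

clean_acc = accuracy(fit_centroids(train), test)
noisy_acc = accuracy(fit_centroids(noisy_train), test)
print(f"clean labels: {clean_acc:.3f}")
print(f"99% random labels: {noisy_acc:.3f}")
```

This only works because the training set is large relative to the noise; shrinking `train` to a few thousand points should make the noisy-label run unreliable.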