2024-08-08
- Toronto AI Safety Meetup
- News
- BlueDot Impact biosecurity course is running
- Apply by Sep 15th, 2024
- OpenAI stuff
- Staff/board changes
- Paper: Tamper-Resistant Safeguards for Open-Weight LLMs
- Similar to presentation a few weeks ago about preventing harmful fine-tuning
- Different people though
- Uses meta-learning
- Simulates adversarial fine-tuning runs during training and nudges the weights so those attacks work less well
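The meta-learning idea can be sketched on a toy model (illustrative only: the names, losses, and toy linear "model" here are made up, not the paper's actual algorithm). The inner loop simulates an attacker fine-tuning toward a harmful objective; the outer loop updates the defender so that the simulated attack recovers less of the harmful capability.

```python
import torch

def harmful_loss(w, x, y):
    # Toy stand-in for "performance on the harmful task" (linear regression MSE)
    return ((x @ w - y) ** 2).mean()

def tamper_resistance_loss(w, x, y, inner_lr=0.1, inner_steps=3):
    # Inner loop: simulate an attacker fine-tuning toward the harmful task,
    # keeping the updates differentiable w.r.t. the defender's weights w
    w_attacked = w
    for _ in range(inner_steps):
        g = torch.autograd.grad(
            harmful_loss(w_attacked, x, y), w_attacked, create_graph=True
        )[0]
        w_attacked = w_attacked - inner_lr * g
    # Outer objective: the defender wants the *attacked* weights to still
    # fail at the harmful task, so it maximizes (here: minimizes the negated)
    # post-attack harmful loss
    return -harmful_loss(w_attacked, x, y)

torch.manual_seed(0)
x, y = torch.randn(32, 8), torch.randn(32)
w = torch.zeros(8, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.05)
for _ in range(20):
    opt.zero_grad()
    tamper_resistance_loss(w, x, y).backward()
    opt.step()
```

After a few outer steps, the same simulated attack lowers the harmful loss less than it did at initialization, which is the tamper-resistance property the training aims for.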
- Paper: Predicting Results of Social Science Experiments Using Large Language Models
- Kurzgesagt video on AI safety
- Doomy
- METR compared which tasks humans can complete in X minutes to which tasks various models can complete given unlimited time
- At the 30-minute level, both GPT-4o and humans succeeded on about 40% of tasks
- Refusal mechanisms
- How
- Finetuning and RLHF
- Representations
- Has to be in the residual stream
- Ablating refusal direction
- Just subtract the (normalized) refusal direction at every layer and token position
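The ablation step above can be sketched in NumPy (a minimal sketch; `ablate_direction` and the toy shapes are illustrative, not the paper's code). It removes the component of each residual-stream activation along the normalized refusal direction:

```python
import numpy as np

def ablate_direction(resid: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of residual-stream activations.

    resid: activations of shape (..., d_model), e.g. (positions, d_model)
    refusal_dir: the refusal direction, shape (d_model,)
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # normalize the direction
    # Component of each activation along r, subtracted at every position
    coeffs = np.einsum("...d,d->...", resid, r)
    return resid - coeffs[..., None] * r

# Toy check: 4 token positions, d_model = 16
x = np.random.randn(4, 16)
r = np.random.randn(16)
x_abl = ablate_direction(x, r)
```

In a real model this would be applied to the residual stream at every layer and token position; after ablation the activations have zero component along the refusal direction.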