2024-08-08
- Toronto AI Safety Meetup
- News
- BlueDot Impact biosecurity course is running
- Apply by Sep 15th, 2024
- OpenAI stuff
- Staff/board changes
- Paper: Tamper-Resistant Safeguards for Open-Weight LLMs
- Similar to presentation a few weeks ago about preventing harmful fine-tuning
- Different people though
- Uses meta-learning
- Simulates adversarial fine-tuning runs during training and nudges the weights so those attacks work less well
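The meta-learning idea can be sketched on a toy model (illustrative only: the names, losses, and toy linear "model" here are made up, not the paper's actual algorithm). The inner loop simulates an attacker fine-tuning toward a harmful objective; the outer loop updates the defender so that the simulated attack recovers less of the harmful capability.

```python
import torch

def harmful_loss(w, x, y):
    # Toy stand-in for "performance on the harmful task" (linear regression MSE)
    return ((x @ w - y) ** 2).mean()

def tamper_resistance_loss(w, x, y, inner_lr=0.1, inner_steps=3):
    # Inner loop: simulate an attacker fine-tuning toward the harmful task,
    # keeping the updates differentiable w.r.t. the defender's weights w
    w_attacked = w
    for _ in range(inner_steps):
        g = torch.autograd.grad(
            harmful_loss(w_attacked, x, y), w_attacked, create_graph=True
        )[0]
        w_attacked = w_attacked - inner_lr * g
    # Outer objective: the defender wants the *attacked* weights to still
    # fail at the harmful task, so it maximizes (here: minimizes the negated)
    # post-attack harmful loss
    return -harmful_loss(w_attacked, x, y)

torch.manual_seed(0)
x, y = torch.randn(32, 8), torch.randn(32)
w = torch.zeros(8, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.05)
for _ in range(20):
    opt.zero_grad()
    tamper_resistance_loss(w, x, y).backward()
    opt.step()
```

After a few outer steps, the same simulated attack lowers the harmful loss less than it did at initialization, which is the tamper-resistance property the training aims for.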
- Paper: Predicting Results of Social Science Experiments Using Large Language Models
- Kurzgesagt video on AI safety
- Doomy
- METR compared which tasks humans can complete in X minutes to which tasks various models can complete given unlimited time
- At the 30-minute level, both GPT-4o and humans succeeded on about 40% of tasks
- Refusal mechanisms
- How
- Finetuning and RLHF
- Representations
- Has to be in the residual stream
- Ablating refusal direction
- Just subtract the (normalized) refusal direction at every layer and token position
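The ablation step above can be sketched in NumPy (a minimal sketch; `ablate_direction` and the toy shapes are illustrative, not the paper's code). It removes the component of each residual-stream activation along the normalized refusal direction:

```python
import numpy as np

def ablate_direction(resid: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project the refusal direction out of residual-stream activations.

    resid: activations of shape (..., d_model), e.g. (positions, d_model)
    refusal_dir: the refusal direction, shape (d_model,)
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)  # normalize the direction
    # Component of each activation along r, subtracted at every position
    coeffs = np.einsum("...d,d->...", resid, r)
    return resid - coeffs[..., None] * r

# Toy check: 4 token positions, d_model = 16
x = np.random.randn(4, 16)
r = np.random.randn(16)
x_abl = ablate_direction(x, r)
```

In a real model this would be applied to the residual stream at every layer and token position; after ablation the activations have zero component along the refusal direction.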