- Toronto AI Safety Meetup
- News
- EAGxToronto
- Deadline to apply is end of July
- Llama 3.1 405B
- Mistral Large 2
- Challenges in Aligning LLMs
- Based on paper: https://arxiv.org/pdf/2404.09932
- Concepts
- Safety
- Alignment
- When developer intent matches model intent
- May not match model behaviour
- E.g. because of limited model capabilities
- Assurance
- Proving(ish) a system is safe/aligned
- Includes interpretability
- Categories
- Scientific understanding of LLMs
- In-context learning is magical
- Learning to do a task from examples in the prompt
- Dunno how it works
- Mechanistic interpretability for specific tasks
- Hypotheses
- Pattern matching
- Mesa-optimization
- Transformer learns to learn during training
- "capabilities" is poorly defined
- Combining multiple points in the paper
- Current measurements are sloppy
- Involves a bunch of benchmarks with informal discussions about what the benchmarks demonstrate
- Measuring general-purposeness and reasoning is hard
- Scaffolds
- Defined
- Prompting of the LLM that does not come from the user
- Anything attached to an LLM that is not an LLM
- Like an OpenAI "tool"
- Including ChatGPT function calls
- Scratchpad
- Effects of scale not well understood
- LLM agents
- Natural language can underspecify goals
- Safety-performance tradeoffs poorly understood
- "Alignment tax"
- How big of a trade-off is there?
- Over-refusals
- A drug discovery AI could find known-dangerous viruses by minimizing goodness instead of maximizing goodness
- Development and deployment methods
- Fine-tuning struggles to ensure alignment/safety
- Doesn't fully unlearn bad things
- Problems with generalizing between languages
- Hallucinations
- How to prevent?
- Broad token distributions = large uncertainty?
- Ask the question several times, then ask the model whether the various responses agree
- Adversarial challenges
- Jailbreaks
- Prompt injection
- Data poisoning
- Sociotechnical challenges
- What values to encode
- Dual-use capabilities enable malicious use
- Trustworthiness of LLMs
- Socioeconomic impacts of AI
- Lacking governance/regulation
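
The two hallucination heuristics noted under "Hallucinations" can be sketched roughly as follows. This is a minimal illustration, not a method from the paper: `token_entropy` and `agreement` are hypothetical helper names, the probabilities and sampled answers are assumed to come from whatever model API is in use, and exact string matching stands in for the step where the model itself is asked whether the responses agree.

```python
import math
from collections import Counter


def token_entropy(probs):
    """Shannon entropy (in bits) of a single next-token distribution.

    A broad distribution gives high entropy, which the notes suggest
    may signal uncertainty. `probs` is assumed to sum to 1.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def agreement(answers):
    """Return the most common answer and the fraction of samples giving it.

    Simplification (assumption): answers are compared by normalized
    exact match instead of asking the model whether they agree.
    """
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    print(token_entropy([0.97, 0.01, 0.01, 0.01]))   # peaked: low entropy
    print(token_entropy([0.25, 0.25, 0.25, 0.25]))   # broad: high entropy
    print(agreement(["Paris", " paris ", "Lyon"]))
```

A low agreement fraction or a high entropy would only flag a response for closer inspection; neither is established here as a reliable hallucination detector.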