2024-07-25

Toronto AI Safety Meetup
- News
  - EAGxToronto
    - Deadline to apply is end of July
  - Llama 3.1 405B
  - Mistral Large 2
- Challenges in Aligning LLMs
  - Based on paper: https://arxiv.org/pdf/2404.09932
  - Concepts
    - Safety
    - Alignment
      - When developer intent matches model intent
      - May not match model behaviour
        - E.g. because of limited model capabilities
    - Assurance
      - Proving(ish) a system is safe/aligned
      - Includes interpretability
  - Categories
    - Scientific understanding of LLMs
      - In-context learning is magical
        - Learning to do a thing based on examples in the prompt
        - Dunno how it works
        - Mechanistic interpretability for specific tasks
          - https://arxiv.org/pdf/2310.15213
          - https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
        - Hypotheses
          - Pattern matching
            - But it also works on novel tasks
          - Mesa-optimization
            - Transformer learns to learn during training
      - "Capabilities" is poorly defined
        - Combining multiple points in the paper
        - Current measurements are sloppy
          - Involves a bunch of benchmarks with informal discussions about what the benchmarks demonstrate
        - Measuring general-purposeness and reasoning is hard
        - Scaffolds
          - Defined
            - Non-user prompting of the LLM
            - Anything attached to an LLM that is not an LLM
          - Like an OpenAI "tool"
            - Including ChatGPT function calls
          - Scratchpad
        - Effects of scale not well understood
          - Scaling laws
      - LLM agents
        - Natural language can underspecify goals
      - Safety-performance trade-offs poorly understood
        - "Alignment tax"
          - How big of a trade-off is there?
        - Over-refusals
        - A drug discovery AI could find known-dangerous toxic molecules (e.g. the nerve agent VX) by flipping its objective: rewarding toxicity instead of penalizing it
          - Probably this: https://www.nature.com/articles/s42256-022-00465-9.epdf
          - Summary and interview with author: https://www.theverge.com/2022/3/17/22983197/ai-new-possible-chemical-weapons-generative-models-vx
    - Development and deployment methods
      - Fine-tuning struggles to ensure alignment/safety
        - Doesn't fully unlearn bad things
        - Problems with generalizing between languages
        - Hallucinations
          - How to prevent?
            - Broad token distributions = large uncertainty?
            - Ask the question several times, then ask again whether the various responses agree
      - Adversarial challenges
        - Jailbreaks
        - Prompt injection
        - Data poisoning
    - Sociotechnical challenges
      - What values to encode
      - Dual-use capabilities enable malicious use
      - Trustworthiness of LLMs
      - Socioeconomic impacts of AI
      - Lacking governance/regulation
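
The two hallucination-detection ideas raised under "Hallucinations" (broad token distributions as a sign of uncertainty, and asking the same question several times to see whether the answers agree) can be sketched roughly as below. This is a minimal illustration, not a method from the paper: the function names, the probability lists, and the sample answers are all hypothetical, and simple string matching stands in for asking the model whether the responses agree.

```python
import math
from collections import Counter


def token_entropy(probs):
    """Shannon entropy (in bits) of one next-token probability distribution.

    Higher entropy means a broader distribution, which the notes suggest
    might signal higher uncertainty.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def agreement_rate(answers):
    """Fraction of sampled answers that match the most common answer.

    Assumption: answers are compared after trimming whitespace and
    lowercasing. A real check would need semantic comparison.
    """
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Hypothetical next-token distributions: one peaked, one broad.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

# Hypothetical answers from asking the same question five times.
samples = ["Ottawa", "ottawa", "Toronto", "Ottawa ", "Ottawa"]

print(f"entropy (peaked): {token_entropy(confident):.2f} bits")
print(f"entropy (broad):  {token_entropy(uncertain):.2f} bits")
print(f"agreement rate:   {agreement_rate(samples):.2f}")
```

A low agreement rate or high entropy would only flag an answer for a closer look; neither number shows that an answer is wrong, and a model can be consistently and confidently mistaken.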