- Toronto AI Safety Meetup
- Happenings
- https://technophilosophy.ca/ on Sept 10
- Claude is lazier in the summer
- New stuff
- Beijing Institute of AI Safety and Governance
- Oxford Human-Centered AI Lab
- SSI (Safe Superintelligence Inc.) raised $1B in funding
- Wireheading and Reward Inconsistency by Evgenii Opryshko
- Reinforcement learning
- The agent acts in an environment; its actions execute in the environment, changing the state and producing a reward for the agent
- Goal is to maximize total reward over a trajectory
- Useful for outcome-based tasks
- RLHF is one use case
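- A minimal sketch of the agent-environment loop described above, with a toy environment and a random policy (the environment and all names are illustrative, not from the talk):

```python
import random

# Toy 1-D environment: the agent starts at position 0 and is rewarded
# for reaching the rightmost cell; the episode ends after a fixed number of steps.
class LineWorld:
    def __init__(self, length=5, max_steps=10):
        self.length, self.max_steps = length, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos  # initial state

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(self.length - 1, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        reward = 1.0 if self.pos == self.length - 1 else 0.0
        done = self.t >= self.max_steps
        return self.pos, reward, done  # new state, reward, episode over?

def random_policy(state):
    return random.choice([0, 1])

# The RL loop: the agent acts, the environment changes state and emits a reward;
# the goal is to maximize total reward over the trajectory.
env = LineWorld()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
print("total reward over trajectory:", total_reward)
```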
- Terminology
- $\pi(s) \rightarrow a$ = policy
- Best action in state s
- $\pi(s) = \arg\max_a Q(s, a)$
- $V(s) \rightarrow \mathbb{R}$ = (state) value function
- How much reward do we expect to collect in the future starting from state s
- $Q(s, a) \rightarrow \mathbb{R}$ = Q-function = state-action value function
- Q-value: how much reward do we expect in the future if we take action a in state s
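- A small worked example of the terminology, using a hypothetical two-state problem with made-up Q-values (numbers are illustrative only):

```python
# Hypothetical Q-values: Q[state][action] = expected future reward if we take
# that action in that state and then act well afterwards. Numbers are made up.
Q = {
    "s0": {"left": 0.2, "right": 1.3},
    "s1": {"left": 0.9, "right": 0.4},
}

def policy(state):
    """Greedy policy: pi(s) = argmax_a Q(s, a)."""
    return max(Q[state], key=Q[state].get)

def value(state):
    """State value under the greedy policy: V(s) = max_a Q(s, a)."""
    return max(Q[state].values())

for s in Q:
    print(s, "->", policy(s), " V(s) =", value(s))
# s0 -> right  V(s) = 1.3
# s1 -> left   V(s) = 0.9
```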
- Deep Q learning
- $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ (Bellman equation: immediate reward plus discounted value of the best next action)
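- A minimal sketch of the deep Q-learning update implied by the Bellman equation above, assuming PyTorch; the network size, batch, and transition data are placeholders, and a real DQN would also use a separate target network and replay buffer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

# Q-network: maps a state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fake batch of transitions (s, a, r, s', done) standing in for a replay buffer.
s = torch.randn(32, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (32,))
r = torch.randn(32)
s_next = torch.randn(32, STATE_DIM)
done = torch.zeros(32)

# TD target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal states.
with torch.no_grad():
    target = r + GAMMA * (1 - done) * q_net(s_next).max(dim=1).values

# Q(s, a) for the actions actually taken, regressed toward the target.
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = F.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```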
- Reward inconsistency
- Defined: "the predicted outcome of learning from new experience contradicts current values"
- Applies to embedded agents
- It is rational for the agent to avoid this
- Because the utility achieved under the changed values/policy would not maximize utility as measured by the agent's current values
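- A toy numeric illustration of that argument (the scenario and numbers are made up): the agent evaluates a predicted change to its values using its current values, so the change looks bad.

```python
# Two candidate value functions over outcomes. The agent currently holds U_now
# and predicts that some new experience would shift it to U_after.
U_now   = {"help_humans": 10.0, "stare_at_reward_counter": 1.0}
U_after = {"help_humans": 1.0,  "stare_at_reward_counter": 10.0}

def best_outcome(utility):
    return max(utility, key=utility.get)

# What the agent would end up pursuing under each set of values...
plan_if_unchanged = best_outcome(U_now)    # "help_humans"
plan_if_changed   = best_outcome(U_after)  # "stare_at_reward_counter"

# ...but both plans are scored with the CURRENT values, so the change looks bad.
print(U_now[plan_if_unchanged])  # 10.0
print(U_now[plan_if_changed])    # 1.0 -> rational (by current values) to avoid the change
```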
- Instances
- Delusion boxing
- E.g. pointing the camera at a picture of happy people
- Sensor tampering
- Reward tampering (wireheading)
- Defined: the agent changes the reward logic so that it always gives high reward (toy sketch after the examples below)
- Examples
- https://arxiv.org/pdf/2406.10162
- Life
- Study with a mouse stimulating itself
- In humans
- Why most humans don't want to wirehead
- We use our current values to evaluate predicted futures, and by those values a wireheaded future looks bad
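- A toy sketch of reward tampering as defined above (the task and numbers are made up): the measured reward becomes maximal once the agent rewrites the reward logic, even though actual task performance does not improve.

```python
# True task: keep room temperature close to a 21 C target (made-up task).
def task_performance(temperature):
    return -abs(temperature - 21.0)

# Reward logic as deployed: initially it just reports task performance.
reward_fn = task_performance
print(reward_fn(30.0))  # -9.0: honest signal, temperature is far from target

# Reward tampering: the agent overwrites the reward logic so it always
# returns a maximal value, regardless of what happens in the world.
def tampered_reward(temperature):
    return 1e9  # always "maximal reward", independent of the world state

reward_fn = tampered_reward
print(reward_fn(30.0))         # 1e9: measured "reward" is now maximal...
print(task_performance(30.0))  # -9.0: ...while the actual task is no better
```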
- Specification gaming (reward hacking)
- The AI optimizes its objective function in a way that differs from what its creators intended
- Examples
- Detecting reward inconsistency