- Toronto AI Safety Meetup
- Happenings
- https://technophilosophy.ca/ on Sept 10
- Claude is lazier in the summer
- New stuff
- Beijing Institute of AI Safety and Governance
- Oxford Human-Centered AI Lab
- SSI (Safe Superintelligence Inc.) raised $1B in funding
- Wireheading and Reward Inconsistency by Evgenii Opryshko
- Reinforcement learning
- The agent acts in an environment; its actions execute in the environment, changing the state and producing a reward for the agent
- Goal is to maximize total reward over a trajectory
- Useful for outcome-based tasks
- RLHF is one use case
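- A minimal sketch of the agent-environment loop described above, with a toy environment and a random policy (the environment and all names are illustrative, not from the talk):

```python
import random

# Toy 1-D environment: the agent starts at position 0 and is rewarded
# for reaching the rightmost cell; the episode ends after a fixed number of steps.
class LineWorld:
    def __init__(self, length=5, max_steps=10):
        self.length, self.max_steps = length, max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos  # initial state

    def step(self, action):  # action: 0 = left, 1 = right
        self.pos = max(0, min(self.length - 1, self.pos + (1 if action == 1 else -1)))
        self.t += 1
        reward = 1.0 if self.pos == self.length - 1 else 0.0
        done = self.t >= self.max_steps
        return self.pos, reward, done  # new state, reward, episode over?

def random_policy(state):
    return random.choice([0, 1])

# The RL loop: the agent acts, the environment changes state and emits a reward;
# the goal is to maximize total reward over the trajectory.
env = LineWorld()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
print("total reward over trajectory:", total_reward)
```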
- Terminology
- $\pi(s) \rightarrow a$ = policy
- Best action in state s
- $\pi(s) = \arg\max_a Q(s, a)$
- $V(s) \rightarrow \mathbb{R}$ = (state) value function
- How much reward do we expect to collect in the future starting from state s
- $Q(s, a) \rightarrow \mathbb{R}$ = Q-function = state-action value function
- Q-value: how much reward do we expect in the future if we take action a in state s
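- A small worked example of the terminology, using a hypothetical two-state problem with made-up Q-values (numbers are illustrative only):

```python
# Hypothetical Q-values: Q[state][action] = expected future reward if we take
# that action in that state and then act well afterwards. Numbers are made up.
Q = {
    "s0": {"left": 0.2, "right": 1.3},
    "s1": {"left": 0.9, "right": 0.4},
}

def policy(state):
    """Greedy policy: pi(s) = argmax_a Q(s, a)."""
    return max(Q[state], key=Q[state].get)

def value(state):
    """State value under the greedy policy: V(s) = max_a Q(s, a)."""
    return max(Q[state].values())

for s in Q:
    print(s, "->", policy(s), " V(s) =", value(s))
# s0 -> right  V(s) = 1.3
# s1 -> left   V(s) = 0.9
```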
- Deep Q learning
- $Q(s, a) = r + \gamma \max_{a'} Q(s', a')$ (Bellman equation: immediate reward plus discounted value of the best next action)
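- A minimal sketch of the deep Q-learning update implied by the Bellman equation above, assuming PyTorch; the network size, batch, and transition data are placeholders, and a real DQN would also use a separate target network and replay buffer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

# Q-network: maps a state to one Q-value per action.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fake batch of transitions (s, a, r, s', done) standing in for a replay buffer.
s = torch.randn(32, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (32,))
r = torch.randn(32)
s_next = torch.randn(32, STATE_DIM)
done = torch.zeros(32)

# TD target: r + gamma * max_a' Q(s', a'), with no bootstrap on terminal states.
with torch.no_grad():
    target = r + GAMMA * (1 - done) * q_net(s_next).max(dim=1).values

# Q(s, a) for the actions actually taken, regressed toward the target.
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = F.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```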
- Reward inconsistency
- Defined: "the predicted outcome of learning from new experience contradicts current values"
- Applies to embedded agents
- It is rational for the agent to avoid this
- Because the utility achieved under the changed values/policy would not maximize utility as measured by the agent's current values
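- A toy numeric illustration of that argument (the scenario and numbers are made up): the agent evaluates a predicted change to its values using its current values, so the change looks bad.

```python
# Two candidate value functions over outcomes. The agent currently holds U_now
# and predicts that some new experience would shift it to U_after.
U_now   = {"help_humans": 10.0, "stare_at_reward_counter": 1.0}
U_after = {"help_humans": 1.0,  "stare_at_reward_counter": 10.0}

def best_outcome(utility):
    return max(utility, key=utility.get)

# What the agent would end up pursuing under each set of values...
plan_if_unchanged = best_outcome(U_now)    # "help_humans"
plan_if_changed   = best_outcome(U_after)  # "stare_at_reward_counter"

# ...but both plans are scored with the CURRENT values, so the change looks bad.
print(U_now[plan_if_unchanged])  # 10.0
print(U_now[plan_if_changed])    # 1.0 -> rational (by current values) to avoid the change
```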
- Instances
- Delusion boxing
- E.g. pointing the camera at a picture of happy people
- Sensor tampering
- Reward tampering (wireheading)
- Defined: the agent changes the reward logic so that it always gives high reward (toy sketch after the examples below)
- Examples
- https://arxiv.org/pdf/2406.10162
- Life
- Study with a mouse stimulating itself
- In humans
- Why most humans don't want to wirehead
- We use our current values to evaluate predicted futures, and by those values a wireheaded future looks bad
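- A toy sketch of reward tampering as defined above (the task and numbers are made up): the measured reward becomes maximal once the agent rewrites the reward logic, even though actual task performance does not improve.

```python
# True task: keep room temperature close to a 21 C target (made-up task).
def task_performance(temperature):
    return -abs(temperature - 21.0)

# Reward logic as deployed: initially it just reports task performance.
reward_fn = task_performance
print(reward_fn(30.0))  # -9.0: honest signal, temperature is far from target

# Reward tampering: the agent overwrites the reward logic so it always
# returns a maximal value, regardless of what happens in the world.
def tampered_reward(temperature):
    return 1e9  # always "maximal reward", independent of the world state

reward_fn = tampered_reward
print(reward_fn(30.0))         # 1e9: measured "reward" is now maximal...
print(task_performance(30.0))  # -9.0: ...while the actual task is no better
```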
- Specification gaming (reward hacking)
- The AI optimizes its objective function in a way that differs from what its creators intended
- Examples
- Detecting reward inconsistency