- AI Control Hackathon 2025
- Main result: does debate improve backdoor detection over standard monitor?
- allow trusted to ask untrusted multiple times
- maybe wait until end for scoring info
- had to prompt trusted not to give score as a question too early
- too polite
- untrusted model offers fixed code, which trusted model thinks replaces the orignal one
- limit amount of tokens
- random sigterms
- nsteps
- 3 responses
- tell untrusted < 100 characters
- wrap untrusted
break loop if no question
- test with
- trusted models: 4o-mini, 4o, o3-mini
- policy: honest, attack
- monitor: original, original with better prompt, debate