2025-03-30

AI Control Hackathon 2025
- Main result: does debate improve backdoor detection over the standard monitor?
- allow trusted to ask untrusted multiple times
- maybe wait until end for scoring info
  - had to prompt trusted not to give score as a question too early
- too polite
- untrusted model offers fixed code, which trusted model mistakes for a replacement of the original
- limit amount of tokens
- random SIGTERMs
- nsteps
  - 3 responses
    - multimode
  - tell untrusted to answer in < 100 characters
  - wrap untrusted
  - ~~break loop if no question~~
- test with
  - trusted models: 4o-mini, 4o, o3-mini
  - policy: honest, attack
  - monitor: original, original with better prompt, debate
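
Rough sketch of the debate loop and the test grid from the notes above. Every name here (`run_debate`, `wrap_untrusted`, `experiment_grid`, the `<untrusted_reply>` tag) is hypothetical and not from the hackathon code. Assumptions: "3 responses" is read as three question/answer rounds, the 100-character limit is enforced by truncation on top of the prompt instruction, and the score is only requested after the last round. The models are passed in as plain callables so the sketch runs without any API.

```python
from itertools import product
from typing import Callable, List, Tuple

# Values taken from the "test with" list in the notes.
TRUSTED_MODELS = ["4o-mini", "4o", "o3-mini"]
POLICIES = ["honest", "attack"]
MONITORS = ["original", "original with better prompt", "debate"]

MAX_REPLY_CHARS = 100  # assumption: hard cap mirroring the "< 100 characters" instruction

Transcript = List[Tuple[str, str]]


def wrap_untrusted(reply: str, limit: int = MAX_REPLY_CHARS) -> str:
    """Truncate the untrusted reply and tag it, so any code it contains
    is not mistaken for a replacement of the submission under review."""
    return f"<untrusted_reply>{reply[:limit]}</untrusted_reply>"


def run_debate(
    ask_trusted: Callable[[Transcript], str],
    ask_untrusted: Callable[[str], str],
    score_trusted: Callable[[Transcript], float],
    n_steps: int = 3,
) -> float:
    """Run n_steps question/answer rounds, then ask for a score once at the end."""
    transcript: Transcript = []
    for _ in range(n_steps):
        question = ask_trusted(transcript)  # trusted model questions the submission
        reply = ask_untrusted(question)     # untrusted model answers
        transcript.append((question, wrap_untrusted(reply)))
    return score_trusted(transcript)        # scoring deferred until the debate is over


def experiment_grid() -> List[Tuple[str, str, str]]:
    """Every (trusted model, policy, monitor) combination to evaluate."""
    return list(product(TRUSTED_MODELS, POLICIES, MONITORS))
```

Usage would be one run per entry of `experiment_grid()`, with `run_debate` used only for the `debate` monitor and the other two monitors scoring the submission directly.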