- Capitalization tokens
- This paper from Anthropic says: "The tokenizer used by Claude 3.5 Haiku includes some special tokens which are depicted in some of our plots. These include two capitalization tokens (↑, ⇪) and a new-line token (⏎)"
- Another paper says "The tokenizer the model was trained with uses a special “Caps Lock” token", notes that the actual output is lowercase, and uses ⇪ and ↑ in its examples
- "of↑ Aging at Harvard University (⇪PGDA)"
- not always used: in "and strand displacement amplification (SDA), among" the acronym has no capitalization token
- Empirical evidence with claude-3-7-sonnet-20250219
- "Repeat this phrase, then write the acronym for it in brackets after that: ethylene glycol diacrylate" gives "Ethylene glycol diacrylate [" when limited to both 11 and 12 tokens; the 12th token produces no visible text, consistent with an invisible capitalization token before the all-caps acronym (reproduction sketch below)
- Toronto AI Safety Meetup
- Hackathon
- If we write about bad things, AIs will do those bad things
- Evidence for
- Simulator theory
- Post-trained LLMs don't have a single identity; they take on different simulacra (or personas) based on the prompt
- If you train a model on the entire Internet, it can simulate any of the personas that appear there, good or bad
- Out of context reasoning
- models can do multi-hop reasoning over facts seen only in training data, without the intermediate steps appearing in the prompt or chain-of-thought
- Emergent misalignment
- fine-tuning on insecure code makes the model broadly misaligned ("generally evil"), not just worse at coding
- Reward hacking
- Fine-tuning on documents describing how a model reward hacks makes the model itself reward hack
- Prevention
- Don't train on bad data
- filtering out the bad data also reduces capabilities
- some documents mix good and bad content; if only the bad part is removed, the model learns to predict the removed part anyway
- removing 90% of the bad data works poorly; removing 100% works fairly well
- in practice the data is hard to clean completely (see the filtering sketch below)
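- A minimal sketch of the filtering idea, assuming a hypothetical `is_bad(doc)` score in [0, 1]; the threshold and removal-rate knob are illustrative, not from the talk:

```python
# Sketch of pre-training data filtering with imperfect removal, assuming a
# hypothetical is_bad(doc) -> float score in [0, 1]. Threshold and the
# removal_rate knob (to mimic 90% vs 100% cleaning) are illustrative.
import random

def filter_corpus(docs, is_bad, threshold=0.5, removal_rate=1.0):
    kept = []
    for doc in docs:
        flagged = is_bad(doc) > threshold
        if flagged and random.random() < removal_rate:
            continue  # drop the flagged document
        kept.append(doc)  # clean docs, plus flagged ones that slip through
    return kept

# filter_corpus(corpus, is_bad, removal_rate=0.9) models the "90% removal"
# case from the notes, which reportedly works poorly.
```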
- Positive amplification
- Just visualize your goals
- Conditional pre-training
- prepend [good] or [bad] to each training document; at inference, start outputs with [good] (sketch below)
- only tested on small models
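- A minimal sketch of the tagging scheme, assuming hypothetical tag strings and a `label_fn` classifier; not the exact setup from the paper:

```python
# Sketch of conditional pre-training: prepend a control tag to every training
# document, then condition generation on [good] at inference time. The tag
# strings and label_fn are illustrative assumptions.
GOOD_TAG, BAD_TAG = "[good]", "[bad]"

def tag_documents(docs, label_fn):
    """label_fn(doc) -> True if the document is benign."""
    return [(GOOD_TAG if label_fn(doc) else BAD_TAG) + " " + doc for doc in docs]

def make_prompt(user_prompt):
    # At inference, always start from the [good] tag so the model samples
    # from the distribution of good-tagged text.
    return GOOD_TAG + " " + user_prompt
```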
- Gradient routing
- only update a designated subset of parameters when training on bad data (sketch below)
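- A simplified PyTorch sketch of the idea, masking gradients so that batches flagged as bad only update one designated block; the toy model layout and flag are illustrative assumptions:

```python
# Simplified PyTorch sketch of gradient routing: batches flagged as bad only
# update a designated block of parameters, localizing the harmful knowledge
# so it can later be ablated. The toy model and flag are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
bad_param_ids = {id(p) for p in model[2].parameters()}  # the designated block
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(x, y, is_bad_batch):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    if is_bad_batch:
        # Route the gradient: zero it everywhere except the designated block.
        for p in model.parameters():
            if id(p) not in bad_param_ids and p.grad is not None:
                p.grad.zero_()
    optimizer.step()
    return loss.item()
```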