- Penumbra community call
- Toronto AI Safety Meetup: Epoch AI: Can AI Scaling Continue Through 2030?
- Epoch AI is trying to provide grounded evidence about the trajectory of AI
- Training compute has been growing at ~4x per year since 2010
- Probably no hard constraint on scaling through 2030, but power is right on the line
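A quick sanity check on what that growth rate implies, as a sketch: if the ~4x/year trend continues from 2024 through 2030, frontier training compute grows by roughly 4^6 ≈ 4,000x, the same ballpark as the ~5,000x scale-up assumed in the power estimate below. The 2024 baseline year is my assumption for illustration, not a figure from the talk.

```python
# Sketch: extrapolate the ~4x/year training-compute trend out to 2030.
# The 2024 baseline year is an assumption for illustration, not from the talk.
growth_per_year = 4
years = 2030 - 2024          # 6 years of continued growth

scale_up = growth_per_year ** years
print(f"Implied compute scale-up by 2030: ~{scale_up:,}x")  # ~4,096x, close to the ~5,000x used below
```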
- Bottlenecks
- Power constraints
- Llama 3 405B required ~27 MW
- For comparison, Amazon acquired a data center campus powered by a 960 MW nuclear plant
- Expected 2030 power draw for a frontier training run (assuming a 5000x compute scale-up) is ~5.6 GW
- Assumes 4x more energy-efficient hardware, fp16->fp8 (2x efficiency gain), and 3x longer training runs (arithmetic sketched below)
- Sourcing lots of power
- No existing power plants have enough capacity; new builds take at least 3 years from permit approval
- Grid power
- Northern Virginia is the best place for data center power
- Geographically distributed training runs
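A minimal sketch of the power arithmetic behind the ~5.6 GW figure, assuming the factors above simply multiply: Llama 3 405B's ~27 MW, a ~5,000x compute scale-up, and a combined 4 × 2 × 3 = 24x reduction from more efficient hardware, fp8, and longer runs.

```python
# Sketch: the ~5.6 GW estimate from the figures above (factors assumed to multiply).
llama3_power_mw = 27      # ~27 MW for Llama 3 405B
compute_scale_up = 5000   # assumed compute scale-up by 2030

hw_efficiency_gain = 4    # 4x more energy-efficient hardware
fp8_gain = 2              # fp16 -> fp8
longer_runs = 3           # 3x longer training runs spread the same compute over more time

power_gw = llama3_power_mw * compute_scale_up / (hw_efficiency_gain * fp8_gain * longer_runs) / 1000
print(f"Projected 2030 frontier-run power draw: ~{power_gw:.1f} GW")  # ~5.6 GW
```

For scale, that is roughly six times the 960 MW nuclear capacity mentioned above.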
- Chip production
- If the largest lab gets 20% of all H100s, ~100M would need to be produced (see the sketch below)
- 1.5M-2M H100s projected shipped in 2024
- Advanced packaging
- HBM (high-bandwidth memory) chips
- Nvidia's HBM supplier is sold out until 2026
- Silicon wafers not a constraint
- AI chips use only ~5% of TSMC's 5nm production
- TSMC is worried about scaling up capacity too much and then hitting an AI winter
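A sketch of the GPU-count arithmetic: the 20% share and 100M total together imply the largest lab would need on the order of 20M H100-equivalents for a 2030-scale run, which the code below compares against the 1.5M-2M projected 2024 shipments.

```python
# Sketch: how the ~100M H100 figure relates to a 20% share and 2024 shipments.
lab_share = 0.20                      # assumed share of all H100s going to the largest lab
total_h100s_needed = 100_000_000      # total H100-equivalents that would have to exist
lab_gpus = lab_share * total_h100s_needed   # ~20M H100-equivalents for the lab's run

shipped_2024_low, shipped_2024_high = 1_500_000, 2_000_000  # projected 2024 shipments
print(f"Largest lab would need ~{lab_gpus / 1e6:.0f}M GPUs; "
      f"total production is ~{total_h100s_needed / shipped_2024_high:.0f}-"
      f"{total_h100s_needed / shipped_2024_low:.0f}x 2024 shipment levels")
```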
- Data scarcity
- Text tokens
- The internet has ~500T text tokens (~100T if you only count pre-compiled datasets, ~3,000T if you include private data)
- The largest dataset used so far is ~18T tokens
- Multimodal data
- ~10T seconds of video, ~1T seconds of audio, ~10T images
- Should allow training a 6e28-2e32 FLOP model
- Chinchilla scaling laws
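A sketch of how the token counts above map to the quoted FLOP range under Chinchilla scaling, using the standard approximations C ≈ 6·N·D and compute-optimal D ≈ 20·N, which combine to C ≈ 0.3·D². The two token counts plugged in are the ~500T indexed-web and ~3,000T private-data estimates above; reaching the upper end of the 6e28-2e32 range presumably also requires the multimodal data and/or multiple epochs.

```python
# Sketch: Chinchilla-style compute estimate from an available token budget.
# Uses C ~= 6*N*D and compute-optimal D ~= 20*N, so C ~= 0.3 * D**2.
def chinchilla_compute(tokens: float) -> float:
    """Rough compute-optimal training FLOP for a dataset of `tokens` text tokens."""
    return 0.3 * tokens ** 2

for label, tokens in [("~500T indexed web", 5e14), ("~3,000T incl. private data", 3e15)]:
    print(f"{label}: ~{chinchilla_compute(tokens):.1e} FLOP")
# ~7.5e+28 and ~2.7e+30 FLOP; multimodal data and multi-epoch training would be
# needed to approach the upper end of the 6e28-2e32 range above.
```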
- Synthetic data
- Easier to verify than to generate
- e.g. it is hard to generate good code samples, but easy to keep only the ones that compile (see the sketch after this list)
- Problems
- Model collapse
- Promising research about how to avoid model collapse
- Diminishing returns
- Lots of uncertainty
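A minimal code sketch of the "easier to verify than generate" point for code data: generate candidates however you like, then keep only the ones that pass a cheap check (here, whether they parse as valid Python). The sample strings are stand-ins for model outputs.

```python
# Sketch: filter synthetic code samples by a cheap verification step (does it compile?).

def compiles(source: str) -> bool:
    """Cheap check: does the sample parse as valid Python?"""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def filter_synthetic_code(candidates: list[str]) -> list[str]:
    """Verification is cheaper than generation: keep only samples that compile."""
    return [src for src in candidates if compiles(src)]

# Stand-in "generated" samples: one valid, one broken.
samples = ["def add(a, b):\n    return a + b\n", "def broken(:\n    pass\n"]
print(filter_synthetic_code(samples))  # keeps only the first sample
```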
- Latency wall
- Minimum time to process a data point is proportional to the number of layers, since the forward and backward passes run through layers sequentially (sketched below)
- Least problematic constraint, but inflexible
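A toy sketch of the latency-wall argument: because each optimizer step has a wall-clock floor proportional to model depth, a fixed-length training run can only fit so many sequential steps, and hence only so many tokens given a bounded batch size. Every number below is made up purely for illustration.

```python
# Toy sketch of the latency wall. All numbers are illustrative, not from the talk.
num_layers = 200             # hypothetical model depth
per_layer_latency_s = 1e-3   # hypothetical per-layer latency, incl. communication
passes_per_step = 3          # ~1 forward pass + ~2x-cost backward pass

min_step_time_s = num_layers * per_layer_latency_s * passes_per_step  # 0.6 s floor per step

run_length_s = 270 * 86_400  # hypothetical ~9-month training run
max_steps = run_length_s / min_step_time_s
tokens_per_step = 60e6       # hypothetical batch size in tokens
print(f"Max sequential steps: ~{max_steps:.1e}; max tokens processed: ~{max_steps * tokens_per_step:.1e}")
# Bigger batches or shallower models relax the cap, which is why this is the least
# problematic constraint, but the per-step floor itself is hard to engineer around.
```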