- Toronto School of Foundation Modeling
- Frontier-scale MoE serving at Modular by Brendan Duke
- Performance
- FLOPs = floating point operations
- Tmath, Tcomm = math (compute) time and communication time
- lower bound on total time = max(Tmath, Tcomm) (perfect overlap)
- upper bound = Tmath + Tcomm (no overlap)
- whichever term is larger dominates the bound
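- A minimal numeric sketch of the two bounds (the millisecond values are made up for illustration):

```python
# Lower/upper bounds on step time from math time and communication time.
# The example numbers are illustrative, not measurements.

def step_time_bounds(t_math: float, t_comm: float) -> tuple[float, float]:
    """Lower bound assumes perfect compute/communication overlap,
    upper bound assumes fully serialized compute then communication."""
    lower = max(t_math, t_comm)
    upper = t_math + t_comm
    return lower, upper

# e.g. 3 ms of math against 5 ms of communication
lo, hi = step_time_bounds(t_math=3e-3, t_comm=5e-3)
print(f"lower bound {lo*1e3:.1f} ms, upper bound {hi*1e3:.1f} ms")
```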
- arithmetic intensity = (computation FLOPs)/(communication bytes)
- Math-bound (Tmath > Tcomm) when?
- Can be derived from arithmetic intensity: math-bound once intensity exceeds the hardware's peak FLOPs/s divided by its peak bytes/s
- Intensity(dot product) = (Total FLOPs)/(Total bytes) = (N + N - 1)/(2N + 2N + 2) ≈ 1/2 for large N
- N is vector length: N multiplies + (N - 1) adds, over two length-N inputs (2N bytes each in bf16) plus a 2-byte output
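- Quick numeric check of that limit (a small sketch; the 2-byte/bf16 element size is my assumption):

```python
# Back-of-envelope arithmetic intensity of a bf16 dot product of length N.
# N multiplies + (N - 1) adds; 2N input elements + 1 output, 2 bytes each.

def dot_product_intensity(n: int, bytes_per_elem: int = 2) -> float:
    flops = n + (n - 1)                         # multiplies + adds
    bytes_moved = (2 * n + 1) * bytes_per_elem  # two inputs + one output
    return flops / bytes_moved

for n in (16, 1024, 1_000_000):
    print(n, round(dot_product_intensity(n), 3))  # approaches 0.5
```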
- Roofline model graph
- Y = FLOPs/s realized (log scale)
- X = arithmetic intensity (log scale)
- Goes up/right to ridge point, then flat
- Prefill GEMM -> compute bound region (fills KV cache)
- matmul arithmetic intensity
- With the usual assumptions (batch much smaller than the weight dims), the weight-loading bytes dominate and the terms cancel so that arithmetic intensity ≈ batch size (worked numbers in the sketch below)
- -> so doubling the batch size (roughly) doubles throughput
- assumptions stop working out for batch sizes above ~512
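- Sketch of that cancellation (dims and the 2-byte element size are illustrative assumptions):

```python
# Arithmetic intensity of a (B, D) x (D, F) matmul in 2-byte precision.
# When B << D, F the D*F weight term dominates the bytes, so intensity ~ B.

def matmul_intensity(b: int, d: int, f: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * b * d * f                                   # multiply-accumulates
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)  # inputs, weights, output
    return flops / bytes_moved

for b in (1, 8, 64, 512, 4096):
    print(b, round(matmul_intensity(b, d=7168, f=18432), 1))  # ~b for small b
```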
- Usually much more GPU compute than memory bandwidth
- -> memory bandwidth is usually bottleneck
- Taalas
- company that makes hardware with loads of memory bandwidth
- Mixture of Experts (MoE)
- Routing layer chooses experts to run
- attention is a (soft) lookup into a KV store; similarly, MoE routing is a lookup of which expert to use
- but attention is continuous, MoE routing is discrete
- Router is usually lightweight linear layer
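- A minimal sketch of such a router (plain linear gate + softmax + top-k; shapes, names, and the renormalization step are illustrative, not a specific model's recipe):

```python
# Minimal top-k MoE router: linear gate over the hidden state, softmax,
# pick the top-k experts per token, renormalize their weights.
import torch

def route(x: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    """x: [tokens, hidden], gate_w: [hidden, num_experts]."""
    logits = x @ gate_w                           # [tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(k, dim=-1)   # per-token expert choices
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over k
    return expert_ids, weights

x = torch.randn(4, 16)       # 4 tokens, hidden dim 16
gate_w = torch.randn(16, 8)  # 8 experts
expert_ids, weights = route(x, gate_w, k=2)
```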
- Dense transformers
- Before: layer norm, self-attention, residual add, layer norm, feed-forward network (FFN), residual add
- After: the FFN is replaced with an MoE layer
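- Sketch of the resulting pre-norm block, with `attention` and `moe_layer` left as assumed callables:

```python
# Pre-norm transformer block where the FFN slot is an MoE layer.
# `attention` and `moe_layer` are assumed single-tensor callables.
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, hidden: int, attention: nn.Module, moe_layer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = attention
        self.norm2 = nn.LayerNorm(hidden)
        self.moe = moe_layer               # replaces the dense feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # layer norm, self-attention, residual add
        x = x + self.moe(self.norm2(x))    # layer norm, MoE (was FFN), residual add
        return x
```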
- Dense vs sparse
- Dense MoE = fewer, larger experts; sparse MoE = many smaller experts
- Why dense?
- The number of experts is around the same as the number of GPUs in a node
- DeepSeek ablation study on number of experts
- More experts -> better model quality
- Sparse
- Can't just shard all the expert weights
- Need different parallelism setup
- In DeepSeek-V3, the first 3 layers use a dense feed-forward network; the remaining 58 are MoE layers
- Shared router or per-layer router?
- Current models use per-layer routers
- Shared router (like per-layer routers but with tied weights) could be better?
- REAP
- paper on pruning low-value experts
- MoE scaling
- More devices -> less compute per device, more AlltoAll communication
- Because there are fewer experts per device
- Larger batch size -> more throughput, but memory becomes the constraint
- More experts -> more parallelism, more router complexity
- More experts per device -> less communication, more local computation
- Perf calculations
- Simplifications
- Uniformly distributed expert routing
- (In practice, sometimes some experts don't get used much)
- All experts active in given step
- If batch size is large enough, usually all experts are active
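- Back-of-envelope check of that claim under the uniform-routing simplification (expert count, top-k, and batch size are illustrative):

```python
# Expected number of inactive experts when B tokens each pick k of E experts
# uniformly and independently (the simplification above).

def expected_inactive_experts(num_experts: int, top_k: int, tokens: int) -> float:
    p_missed_by_one_token = 1 - top_k / num_experts
    return num_experts * p_missed_by_one_token ** tokens

# e.g. 256 experts, top-8 routing, 1024 tokens in the batch
print(expected_inactive_experts(256, 8, 1024))  # ~1e-12: essentially all active
```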
- Memory bandwidth bound
- GPUs have more compute than memory bandwidth
- Ignore link latency
- (in reality there is a fixed cost per transfer)
- Upper bound
- TMoE = Tdispatch + Texperts (grouped GEMM) + Tcombine
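- A hedged cost-model sketch of that upper bound under the simplifications above; every hardware number and dim in it is an illustrative assumption, and it ignores the node coalescing discussed next:

```python
# Upper-bound model for one MoE layer: uniform routing, all experts active,
# memory-bandwidth bound, no link latency, no compute/communication overlap.

def moe_layer_time(
    tokens: int,                  # tokens in the batch
    hidden: int,                  # model hidden dim
    top_k: int,                   # experts per token
    expert_bytes_per_gpu: float,  # MoE weight bytes resident on one GPU
    hbm_bw: float,                # HBM bandwidth, bytes/s
    link_bw: float,               # effective per-GPU AlltoAll bandwidth, bytes/s
    bytes_per_elem: int = 2,
) -> float:
    # dispatch/combine: each token's activations travel to/from its top-k experts
    # (upper bound: no node coalescing, so up to top_k remote copies per token)
    a2a_bytes = tokens * top_k * hidden * bytes_per_elem
    t_dispatch = a2a_bytes / link_bw
    t_combine = a2a_bytes / link_bw
    # grouped GEMM at small batch: time to stream the expert weights from HBM
    t_experts = expert_bytes_per_gpu / hbm_bw
    return t_dispatch + t_experts + t_combine

# e.g. 4096 tokens, hidden 7168, top-8, 20 GB of expert weights per GPU,
# 3.35 TB/s HBM, 50 GB/s effective AlltoAll bandwidth per GPU
print(moe_layer_time(4096, 7168, 8, 20e9, 3.35e12, 50e9))
```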
- MoE node coalescing
- At most, you route a token to 4 remote nodes
- SemiAnalysis has good articles about optimal training node topologies
- Shared experts fusion
- Expert routing is actually very non-uniform
- Very spiky
- Depending on the dataset, we see different spikes
- Solution: expert parallel load balancing
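- A toy greedy placement heuristic to illustrate the idea (not the actual EPLB algorithm used in production systems):

```python
# Spread experts across devices so that observed (spiky) routing load is
# roughly even: heaviest experts first, each onto the least-loaded device.
from collections import defaultdict

def balance_experts(expert_load: dict[int, float], num_devices: int) -> dict[int, list[int]]:
    placement: dict[int, list[int]] = defaultdict(list)
    device_load = [0.0] * num_devices
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        dev = min(range(num_devices), key=lambda d: device_load[d])
        placement[dev].append(expert)
        device_load[dev] += load
    return dict(placement)

# e.g. 8 experts with spiky measured load, spread across 4 devices
print(balance_experts({0: 9.0, 1: 1.0, 2: 0.5, 3: 4.0, 4: 0.2, 5: 3.0, 6: 0.1, 7: 2.0}, 4))
```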
- Disaggregated inference
- Do prefill on a different node than decode (toy sketch after this block)
- How to get KV cache on decode node?
- Why is this better?
- Good when you have an SLA that requires outputting tokens at a consistent pace
- Worse for batch use cases
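- Toy sketch of the prefill/decode split; `model.prefill` and `model.decode_step` are hypothetical stand-ins, not a real serving API:

```python
# Disaggregated inference, in-process toy: a prefill worker builds the KV
# cache, a separate decode worker consumes it. Real systems move the cache
# between nodes over the interconnect rather than passing it in memory.
from typing import Any

def prefill_worker(model: Any, prompt_tokens: list[int]) -> Any:
    # one compute-bound pass over the whole prompt, returns the KV cache
    return model.prefill(prompt_tokens)          # hypothetical API

def decode_worker(model: Any, kv_cache: Any, max_new_tokens: int) -> list[int]:
    # token-by-token, bandwidth-bound generation against the transferred cache
    out: list[int] = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)  # hypothetical API
        out.append(token)
    return out
```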
- Modular
- AI stack built on LLVM
- Nearly hardware agnostic
- Stack
- Mammoth
- k8s-native cluster management
- MAX
- like vLLM + PyTorch
- MAX Serving
- single container that works across hardware
- Mojo
- Pythonic programming language
- For writing portable model kernels
- Comparison
- Triton: usable and portable, but not performant
- Can't write good Blackwell kernels in Triton
- CUDA: usable and performant, but not portable
- Mojo: usable, portable, and performant
- Downsides
- Lack of maturity
- Small community
- Is fun
- Graph Compiler
- Inputs
- MLIR DAG of DL ops
- Packaged kernels
- Output
- DAG of executable kernels
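- Toy illustration of that input/output contract (names and structures are invented for illustration, not the MAX API):

```python
# Map a DAG of DL ops plus a table of packaged kernels to a DAG of executable
# kernels. A real graph compiler would also fuse, schedule, and lower ops.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                        # e.g. "layer_norm", "matmul"
    inputs: list[str] = field(default_factory=list)

@dataclass
class ExecutableKernel:
    op: Op
    kernel: str                      # selected kernel implementation

def compile_graph(ops: dict[str, Op], kernel_table: dict[str, str]) -> list[ExecutableKernel]:
    return [ExecutableKernel(op, kernel_table[op.name]) for op in ops.values()]

ops = {
    "ln0": Op("layer_norm"),
    "mm0": Op("matmul", inputs=["ln0"]),
}
kernels = {"layer_norm": "layer_norm_gpu_v2", "matmul": "matmul_tensorcore"}
print(compile_graph(ops, kernels))
```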