- Toronto School of Foundation Modeling
- Frontier-scale MoE serving at Modular by Brendan Duke
- Performance
- FLOPs = floating point operations
- Tmath, Tcomm = math (compute) time and communication time
- lower bound on total time = max(Tmath, Tcomm) (perfect overlap)
- upper bound = Tmath + Tcomm (no overlap)
- whichever term is larger dominates the bound
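- A minimal numeric sketch of the two bounds (the millisecond values are made up for illustration):

```python
# Lower/upper bounds on step time from math time and communication time.
# The example numbers are illustrative, not measurements.

def step_time_bounds(t_math: float, t_comm: float) -> tuple[float, float]:
    """Lower bound assumes perfect compute/communication overlap,
    upper bound assumes fully serialized compute then communication."""
    lower = max(t_math, t_comm)
    upper = t_math + t_comm
    return lower, upper

# e.g. 3 ms of math against 5 ms of communication
lo, hi = step_time_bounds(t_math=3e-3, t_comm=5e-3)
print(f"lower bound {lo*1e3:.1f} ms, upper bound {hi*1e3:.1f} ms")
```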
- arithmetic intensity = (computation FLOPs)/(communication bytes)
- Math-bound (Tmath > Tcomm) when?
- Can be derived from arithmetic intensity: math-bound once intensity exceeds the hardware's peak FLOPs/s divided by its peak bytes/s
- Intensity(dot product) = (Total FLOPs)/(Total bytes) = (N + N - 1)/(2N + 2N + 2) ≈ 1/2 for large N
- N is vector length: N multiplies + (N - 1) adds, over two length-N inputs (2N bytes each in bf16) plus a 2-byte output
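- Quick numeric check of that limit (a small sketch; the 2-byte/bf16 element size is my assumption):

```python
# Back-of-envelope arithmetic intensity of a bf16 dot product of length N.
# N multiplies + (N - 1) adds; 2N input elements + 1 output, 2 bytes each.

def dot_product_intensity(n: int, bytes_per_elem: int = 2) -> float:
    flops = n + (n - 1)                         # multiplies + adds
    bytes_moved = (2 * n + 1) * bytes_per_elem  # two inputs + one output
    return flops / bytes_moved

for n in (16, 1024, 1_000_000):
    print(n, round(dot_product_intensity(n), 3))  # approaches 0.5
```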
- Roofline model graph
- Y = FLOPs/s realized (log scale)
- X = arithmetic intensity (log scale)
- Goes up/right to ridge point, then flat
- Prefill GEMM -> compute bound region (fills KV cache)
- matmul arithmetic intensity
- With the usual assumptions (batch much smaller than the weight dims), the weight-loading bytes dominate and the terms cancel so that arithmetic intensity ≈ batch size (worked numbers in the sketch below)
- -> so doubling the batch size (roughly) doubles throughput
- assumptions stop working out for batch sizes above ~512
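- Sketch of that cancellation (dims and the 2-byte element size are illustrative assumptions):

```python
# Arithmetic intensity of a (B, D) x (D, F) matmul in 2-byte precision.
# When B << D, F the D*F weight term dominates the bytes, so intensity ~ B.

def matmul_intensity(b: int, d: int, f: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * b * d * f                                   # multiply-accumulates
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)  # inputs, weights, output
    return flops / bytes_moved

for b in (1, 8, 64, 512, 4096):
    print(b, round(matmul_intensity(b, d=7168, f=18432), 1))  # ~b for small b
```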
- Usually much more GPU compute than memory bandwidth
- -> memory bandwidth is usually bottleneck
- Taalas
- company that makes hardware with loads of memory bandwidth
- Mixture of Experts (MoE)
- Routing layer chooses experts to run
- attention is a (soft) lookup into a KV store; similarly, MoE routing is a lookup of which expert to use
- but attention is continuous, MoE routing is discrete
- Router is usually lightweight linear layer
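- A minimal sketch of such a router (plain linear gate + softmax + top-k; shapes, names, and the renormalization step are illustrative, not a specific model's recipe):

```python
# Minimal top-k MoE router: linear gate over the hidden state, softmax,
# pick the top-k experts per token, renormalize their weights.
import torch

def route(x: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    """x: [tokens, hidden], gate_w: [hidden, num_experts]."""
    logits = x @ gate_w                           # [tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    weights, expert_ids = probs.topk(k, dim=-1)   # per-token expert choices
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over k
    return expert_ids, weights

x = torch.randn(4, 16)       # 4 tokens, hidden dim 16
gate_w = torch.randn(16, 8)  # 8 experts
expert_ids, weights = route(x, gate_w, k=2)
```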
- Dense transformers
- Before: layer norm, self-attention, residual add, layer norm, feed-forward network (FFN), residual add
- After: the FFN is replaced with an MoE layer
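- Sketch of the resulting pre-norm block, with `attention` and `moe_layer` left as assumed callables:

```python
# Pre-norm transformer block where the FFN slot is an MoE layer.
# `attention` and `moe_layer` are assumed single-tensor callables.
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, hidden: int, attention: nn.Module, moe_layer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = attention
        self.norm2 = nn.LayerNorm(hidden)
        self.moe = moe_layer               # replaces the dense feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # layer norm, self-attention, residual add
        x = x + self.moe(self.norm2(x))    # layer norm, MoE (was FFN), residual add
        return x
```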
- Dense vs sparse
- Dense MoE = fewer, larger experts; sparse MoE = many smaller experts
- Why dense?
- The number of experts is around the same as the number of GPUs in a node
- DeepSeek ablation study on number of experts
- More experts -> better model quality
- Sparse
- Can't just shard all the expert weights
- Need different parallelism setup
- In DeepSeek-V3, the first 3 layers use a dense feed-forward network; the remaining 58 are MoE layers
- Shared router or per-layer router?
- Current models use per-layer routers
- Shared router (like per-layer routers but with tied weights) could be better?
- REAP
- paper on pruning low-value experts
- MoE scaling
- More devices -> less compute per device, more AlltoAll communication
- Because there are fewer experts per device
- Larger batch size -> more throughput, but memory becomes the constraint
- More experts -> more parallelism, more router complexity
- More experts per device -> less communication, more local computation
- Perf calculations
- Simplifications
- Uniformly distributed expert routing
- (In practice, sometimes some experts don't get used much)
- All experts active in given step
- If batch size is large enough, usually all experts are active
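- Back-of-envelope check of that claim under the uniform-routing simplification (expert count, top-k, and batch size are illustrative):

```python
# Expected number of inactive experts when B tokens each pick k of E experts
# uniformly and independently (the simplification above).

def expected_inactive_experts(num_experts: int, top_k: int, tokens: int) -> float:
    p_missed_by_one_token = 1 - top_k / num_experts
    return num_experts * p_missed_by_one_token ** tokens

# e.g. 256 experts, top-8 routing, 1024 tokens in the batch
print(expected_inactive_experts(256, 8, 1024))  # ~1e-12: essentially all active
```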
- Memory bandwidth bound
- GPUs have more compute than memory bandwidth
- Ignore link latency
- (in reality there is a fixed cost per transfer)
- Upper bound
- TMoE = Tdispatch + Texperts (grouped GEMM) + Tcombine
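- A hedged cost-model sketch of that upper bound under the simplifications above; every hardware number and dim in it is an illustrative assumption, and it ignores the node coalescing discussed next:

```python
# Upper-bound model for one MoE layer: uniform routing, all experts active,
# memory-bandwidth bound, no link latency, no compute/communication overlap.

def moe_layer_time(
    tokens: int,                  # tokens in the batch
    hidden: int,                  # model hidden dim
    top_k: int,                   # experts per token
    expert_bytes_per_gpu: float,  # MoE weight bytes resident on one GPU
    hbm_bw: float,                # HBM bandwidth, bytes/s
    link_bw: float,               # effective per-GPU AlltoAll bandwidth, bytes/s
    bytes_per_elem: int = 2,
) -> float:
    # dispatch/combine: each token's activations travel to/from its top-k experts
    # (upper bound: no node coalescing, so up to top_k remote copies per token)
    a2a_bytes = tokens * top_k * hidden * bytes_per_elem
    t_dispatch = a2a_bytes / link_bw
    t_combine = a2a_bytes / link_bw
    # grouped GEMM at small batch: time to stream the expert weights from HBM
    t_experts = expert_bytes_per_gpu / hbm_bw
    return t_dispatch + t_experts + t_combine

# e.g. 4096 tokens, hidden 7168, top-8, 20 GB of expert weights per GPU,
# 3.35 TB/s HBM, 50 GB/s effective AlltoAll bandwidth per GPU
print(moe_layer_time(4096, 7168, 8, 20e9, 3.35e12, 50e9))
```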
- MoE node coalescing
- At most, you route a token to 4 remote nodes
- SemiAnalysis has good articles about optimal training node topologies
- Shared experts fusion
- Expert routing is actually very non-uniform
- Very spiky
- Depending on the dataset, we see different spikes
- Solution: expert parallel load balancing
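- A toy greedy placement heuristic to illustrate the idea (not the actual EPLB algorithm used in production systems):

```python
# Spread experts across devices so that observed (spiky) routing load is
# roughly even: heaviest experts first, each onto the least-loaded device.
from collections import defaultdict

def balance_experts(expert_load: dict[int, float], num_devices: int) -> dict[int, list[int]]:
    placement: dict[int, list[int]] = defaultdict(list)
    device_load = [0.0] * num_devices
    for expert, load in sorted(expert_load.items(), key=lambda kv: -kv[1]):
        dev = min(range(num_devices), key=lambda d: device_load[d])
        placement[dev].append(expert)
        device_load[dev] += load
    return dict(placement)

# e.g. 8 experts with spiky measured load, spread across 4 devices
print(balance_experts({0: 9.0, 1: 1.0, 2: 0.5, 3: 4.0, 4: 0.2, 5: 3.0, 6: 0.1, 7: 2.0}, 4))
```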
- Disaggregated inference
- Do prefill on a different node than decode (toy sketch after this block)
- How to get KV cache on decode node?
- Why is this better?
- Good when you have an SLA that requires outputting tokens at a consistent pace
- Worse for batch use cases
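- Toy sketch of the prefill/decode split; `model.prefill` and `model.decode_step` are hypothetical stand-ins, not a real serving API:

```python
# Disaggregated inference, in-process toy: a prefill worker builds the KV
# cache, a separate decode worker consumes it. Real systems move the cache
# between nodes over the interconnect rather than passing it in memory.
from typing import Any

def prefill_worker(model: Any, prompt_tokens: list[int]) -> Any:
    # one compute-bound pass over the whole prompt, returns the KV cache
    return model.prefill(prompt_tokens)          # hypothetical API

def decode_worker(model: Any, kv_cache: Any, max_new_tokens: int) -> list[int]:
    # token-by-token, bandwidth-bound generation against the transferred cache
    out: list[int] = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)  # hypothetical API
        out.append(token)
    return out
```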
- Modular
- AI stack built on LLVM
- Nearly hardware agnostic
- Stack
- Mammoth
- k8s-native cluster management
- MAX
- like vLLM + PyTorch
- MAX Serving
- single container that works across hardware
- Mojo
- Pythonic programming language
- For writing portable model kernels
- Comparison
- Triton: usable and portable, but not performant
- Can't write good Blackwell kernels in Triton
- CUDA: usable and performant, but not portable
- Mojo: usable, portable, and performant
- Downsides
- Lack of maturity
- Small community
- Is fun
- Graph Compiler
- Inputs
- MLIR DAG of DL ops
- Packaged kernels
- Output
- DAG of executable kernels
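- Toy illustration of that input/output contract (names and structures are invented for illustration, not the MAX API):

```python
# Map a DAG of DL ops plus a table of packaged kernels to a DAG of executable
# kernels. A real graph compiler would also fuse, schedule, and lower ops.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                        # e.g. "layer_norm", "matmul"
    inputs: list[str] = field(default_factory=list)

@dataclass
class ExecutableKernel:
    op: Op
    kernel: str                      # selected kernel implementation

def compile_graph(ops: dict[str, Op], kernel_table: dict[str, str]) -> list[ExecutableKernel]:
    return [ExecutableKernel(op, kernel_table[op.name]) for op in ops.values()]

ops = {
    "ln0": Op("layer_norm"),
    "mm0": Op("matmul", inputs=["ln0"]),
}
kernels = {"layer_norm": "layer_norm_gpu_v2", "matmul": "matmul_tensorcore"}
print(compile_graph(ops, kernels))
```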