The Ultra-Scale Playbook
A dense, interactive reference for distributed training at scale: it covers tensor parallelism, pipeline parallelism, and data parallelism, and how to combine them without destroying your throughput. This is not a blog post with pretty diagrams; it is closer to a living engineering document with concrete numbers, code references, and tradeoff tables. What makes it worth your Saturday is the specificity: it tells you when ZeRO-3 stops paying off, how pipeline bubbles eat your GPU utilization, and which combinations of parallelism strategies actually make sense at which scale. Reservation: if you are not training models yourself and are purely an API consumer, this is background reading at best. But for anyone running multi-GPU or multi-node training jobs, this is the reference that replaces three separate blog posts you have already lost in your bookmarks.
-> Best for: ML researcher or AI engineer running distributed training on real hardware
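For a taste of the arithmetic the playbook walks through, here is a back-of-envelope sketch of two of the claims above. It is my own illustration, not code from the playbook, and it assumes the textbook GPipe-style bubble formula and the usual 16-bytes-per-parameter accounting for mixed-precision Adam:

```python
def pipeline_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle fraction of a training step for a GPipe-style schedule.

    With p stages and m microbatches, each device sits idle while the
    pipeline fills and drains, giving bubble = (p - 1) / (m + p - 1).
    """
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)


def zero3_model_state_gb(params_billions: float, n_gpus: int) -> float:
    """Per-GPU model-state memory (GB) under full ZeRO-3 sharding.

    Mixed-precision Adam holds ~16 bytes per parameter (2 fp16 weights,
    2 fp16 grads, 12 fp32 master weights + momentum + variance); ZeRO-3
    splits all of it across the data-parallel group.
    """
    return params_billions * 16 / n_gpus


if __name__ == "__main__":
    # With 8 pipeline stages, raising the microbatch count from 8 to 64
    # shrinks the bubble from ~47% of the step to ~10%.
    for m in (8, 16, 32, 64):
        print(f"p=8, m={m:2d} -> bubble {pipeline_bubble_fraction(8, m):.0%}")

    # Weights + grads + optimizer states for a 70B model on 64 GPUs:
    print(f"70B under ZeRO-3 on 64 GPUs: "
          f"{zero3_model_state_gb(70, 64):.1f} GB/GPU")
```

The point of the exercise: the bubble formula explains why more microbatches (or a smarter schedule) matter, and the ZeRO-3 line shows what the sharding buys before communication costs start eating the savings, which is exactly the crossover the playbook quantifies.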