How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0
www.semianalysis.com

Summary

OpenAI's Triton and PyTorch 2.0 are disrupting Nvidia's CUDA monopoly in machine learning. Eager mode executes each operation immediately, like standard Python scripting, while graph mode has two phases: first defining a computation graph, then executing it in a deferred fashion. Google's generative AI models are based on Jax, not TensorFlow. FLOPS and memory are the two key resources in machine learning: FLOPS have increased by multiple orders of magnitude thanks to Moore's Law and architectural changes, but memory has not followed the same path. Even with heavy optimization, 60% FLOPS utilization is considered a high utilization rate for large language model training.
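
As a minimal sketch of the eager-versus-graph distinction described above (my own illustration, not code from the article): in eager mode each PyTorch op runs as soon as its Python line executes, while torch.compile, PyTorch 2.0's entry point to graph mode, first captures the computation graph and then runs a compiled version on later calls; on GPUs its default Inductor backend emits Triton kernels.

    import torch

    def f(x, w):
        # Eager mode: each line dispatches its kernel immediately as Python runs.
        y = x @ w               # matrix multiplication
        return torch.relu(y)    # separate pointwise kernel

    x = torch.randn(1024, 1024)
    w = torch.randn(1024, 1024)
    out_eager = f(x, w)

    # Graph mode in PyTorch 2.0: capture the graph once, then run the compiled
    # (and potentially fused) version on subsequent calls.
    f_graph = torch.compile(f)
    out_graph = f_graph(x, w)   # first call triggers compilation; later calls reuse it

    assert torch.allclose(out_eager, out_graph, atol=1e-5)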

Top Highlights

  • Even in 2018, purely compute-bound workloads made up 99.8% of FLOPS but only 61% of the runtime. The normalization and pointwise ops achieve 250x and 700x less FLOPS than matrix multiplications, respectively, yet they consume nearly 40% of the model’s runtime. (A rough arithmetic-intensity sketch follows this list.)
  • Memory follows a hierarchy from close and fast to slow and cheap. The nearest shared memory pool is on the same chip and is generally made of SRAM. Some machine-learning ASICs attempt to utilize huge pools of SRAM to hold model weights, but there are issues with this approach. Even Cerebras’ ~$2,500,000 wafer scale chips only have 40GB of SRAM on the chip.
  • 1GB of SRAM on TSMC’s 5nm process node would require ~200mm^2 of silicon (a back-of-envelope area check follows this list).
  • While capacity is a significant bottleneck, it is intimately tied to the other major bottleneck, bandwidth. Increased memory bandwidth is generally obtained through parallelism.
  • why PyTorch won.
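
A rough roofline-style sketch of why the normalization and pointwise ops in the first highlight end up limited by memory bandwidth rather than compute (illustrative numbers of my own, not the article's measurements; assumes fp32 and a single pass with no on-chip reuse):

    def matmul_intensity(n):
        flops = 2 * n**3                 # n*n*n multiply-accumulates
        bytes_moved = 3 * n * n * 4      # read A, read B, write C (fp32)
        return flops / bytes_moved

    def pointwise_intensity(n):
        flops = n * n                    # one op per element (e.g. ReLU)
        bytes_moved = 2 * n * n * 4      # read input, write output (fp32)
        return flops / bytes_moved

    n = 4096
    print(f"matmul:    {matmul_intensity(n):8.1f} FLOP/byte")    # ~683
    print(f"pointwise: {pointwise_intensity(n):8.3f} FLOP/byte") # 0.125

At roughly 0.1 FLOP per byte, the pointwise op's runtime is set almost entirely by how fast bytes move through DRAM, which is why fusing such ops into neighboring kernels is such a large win for compilers like Triton and TorchInductor.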
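
And a back-of-envelope check of the SRAM-area highlight, assuming a 6T bit cell of roughly 0.021 µm^2 on TSMC N5 (a commonly cited figure, not taken from the article; real macros add sense amplifiers, decoders, and routing on top of the raw cells):

    bitcell_um2 = 0.021                      # assumed N5 SRAM bit-cell area
    bits = 8 * 1024**3                       # 1 GB of SRAM expressed in bits
    raw_area_mm2 = bits * bitcell_um2 / 1e6  # um^2 -> mm^2
    print(f"raw bit-cell area: {raw_area_mm2:.0f} mm^2")  # ~180 mm^2 before overhead

Adding array overhead on top of the ~180 mm^2 of raw cells lands near the ~200 mm^2 figure quoted above.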

Tags

AI
Hardware
ML
PyTorch
machine learning
