On GPU vs CPU Algorithms Compilation, Cpp, Parallel Computation, Wolfram Physics (2023)

By Xah Lee. Date: .

on GPU vs CPU algorithms compilation, C++, parallel computation, Wolfram Physics

by BruceKnowsHow, 2023-05-22

  • Finally understood something interesting about GPU programming today.
  • We have this ubiquitous problem in GPU programming with too many "shader permutations".
  • In this context a shader just means a complete compiled GPU program.
  • For a long time I wondered why this is even a problem; you don't have anything like this when writing C++.
  • The analogy would be a C++ function with, say, 10 inputs, where you pull those inputs out and compile a different version of the function just to have those values be compile-time constants (see the sketch after this list).
  • Now in C++ you would not do this, because the linking time of any such scheme would add way more overhead, and also because on the CPU the savings from compile-time folding a few values are so small.
  • So why would you do this on the GPU?
  • I have wondered for 10 years.
  • I realized the answer today: it is because a line of GPU code will get executed roughly 1 million times as much as a line of CPU code.
  • Because it's executed in batched parallel workloads, the impact of each line is magnified by the number of threads.
  • So it becomes worthwhile to have a speed-of-light, hyper-specific shader for every run.
  • Just adding an if() statement that is always false, but must be evaluated at runtime, could have a 1% impact on the throughput of your GPU code.
  • This leads to things like GPU compilers inlining every function (every GPU compiler does this), and application developers coming up with schemes to compile a specific version of their shader to remove as many runtime instructions as possible.
  • So the fundamental reason behind "shader permutations" is that massively parallel programs are worth optimizing more than serial CPU programs.
  • I believe that there is a lot of unrealized value in GPU-specialized systems, where most of the algorithms run entirely on the GPU.
  • I think one of the major reasons we haven't seen more of this type of thing is that it is an order of magnitude more complicated than writing C code.
  • I found this frustrating for a long time, but now I am realizing there are kind of fundamental reasons for that complexity.
  • I feel less upset about it now.
  • Machine learning frameworks and GPU-dataframe libraries are two recent examples where GPU code managed to succeed, because both problems fit well into this model of a simple (usually Python-based) API, which drives some C code, which drives some GPU code.
  • The key thing here is that good abstractions were found so this worked out with a manageable amount of complexity.
  • I think there are more of these domains out there waiting to be found, and long term I would hope the trend is that parallel solutions are found for all the parallel problems. But there are not a lot of GPU programmers, and most of them are working on video games, their female voice, or other lame crap, so time will tell.
  • I feel like parallelism is a deep concept in algorithms and computation.
  • No matter how fast computers get, a parallel processor will be a lot faster at its problems than a CPU.
  • A godlike AI would run parallel code when it is applicable.
  • Makes me think that if reality were running on a computational substrate, it would make a lot of sense if it were parallel in nature.
  • General relativity and the speed of light seem to suggest this, where there are boundaries on how fast information can propagate.
  • That means the computation generating it could be split into different threads.
  • I liked listening to the Wolfram Physics project because it gave me ideas for weird GPU algorithms.
  • The fact that they are interested in sets of hypergraph rewrite rules that are invariant under the sequence in which you apply the rules means that there is a lot of parallelism in the problem of running those rules.
  • And apparently this invariance corresponds to rules that can generate general relativity.
GPU vs CPU algo compilation 2023-05-22
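
Below is a minimal C++ sketch of the analogy described in the chat above. It is an illustration under assumptions, not code from the original post: the function names, the fog flag, and the constants are made up. The point is only that baking a value in as a compile-time constant lets the compiler fold it and delete the always-false branch, which is what a specialized shader permutation achieves on the GPU.

```cpp
// Illustrative only: a flexible function with a runtime flag, versus a
// specialized version where the flag is a compile-time constant.

#include <cstdio>
#include <vector>

// Flexible version: every option is a runtime value, so every invocation
// pays for the branch, even when use_fog is always false.
float shade_runtime(float x, bool use_fog, float fog_density) {
    float c = x * 0.5f;
    if (use_fog)                       // evaluated at runtime, every call
        c *= (1.0f - fog_density);
    return c;
}

// "Permuted" version: the option is a template (compile-time) parameter.
// For UseFog == false the compiler removes the branch entirely, which is
// what compiling a dedicated shader permutation does.
template <bool UseFog>
float shade_specialized(float x, float fog_density) {
    float c = x * 0.5f;
    if constexpr (UseFog)
        c *= (1.0f - fog_density);
    return c;
}

int main() {
    std::vector<float> pixels(8, 1.0f);
    float sum = 0.0f;
    for (float p : pixels) {
        sum += shade_runtime(p, /*use_fog=*/false, 0.1f);  // branch kept
        sum += shade_specialized<false>(p, 0.1f);          // branch gone
    }
    std::printf("%f\n", sum);
    return 0;
}
```

With ten such flags, as in the post's example, an application would be choosing among up to 2^10 = 1024 pre-compiled variants, which is exactly the "shader permutation" explosion.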

Grok explains

### Overview of the Article

  • The article, titled "on GPU vs CPU algorithms compilation, C++, parallel computation, Wolfram Physics," is a reflective discussion formatted as a series of chat messages from a user named "BruceKnowsHow." It explores why GPU programming often involves more aggressive optimizations than CPU programming, particularly for algorithms in parallel computing.
  • The core idea revolves around how GPUs handle massive parallelism differently from CPUs, making certain techniques—like compiling specialized versions of code—highly valuable for performance gains.
  • The piece ties this into broader implications for fields like machine learning and even theoretical physics, emphasizing the untapped potential of parallel algorithms.

### Key Realization: Shader Permutations and Why GPUs Need Them

  • The article kicks off with a personal "aha" moment about "shader permutations." In GPU programming, developers often compile multiple tailored versions of their code (called shaders) for different scenarios, treating inputs as compile-time constants to eliminate unnecessary runtime checks.
  • This seems wasteful at first—why not just write flexible code with if-statements?—but it's justified because GPUs execute code in huge batches, running the same line millions of times in parallel.
  • A tiny inefficiency, like an always-false `if()` statement evaluated at runtime, can shave off 1% of throughput across the entire workload.
  • On CPUs, which process tasks more serially, such optimizations yield negligible benefits and add compilation overhead, making them impractical.
  • In essence, the author argues: a single line of GPU code gets executed roughly **1 million times more often** than the equivalent on a CPU due to this batched parallelism (a rough back-of-envelope sketch follows this list).
  • This amplifies the payoff of hyper-optimized, "permuted" shaders.
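
To make the execution-count claim concrete, here is a rough, hypothetical back-of-envelope; the resolution and frame rate are assumptions for illustration, not figures from the article. A full-screen pixel shader runs once per pixel per frame, so a single line of it executes hundreds of millions of times per second.

```cpp
// Hypothetical back-of-envelope, not a benchmark: count how many times a
// single line of a full-screen pixel shader executes per second.

#include <cstdio>

int main() {
    const long long width  = 3840;   // assumed 4K resolution
    const long long height = 2160;
    const long long fps    = 60;     // assumed frame rate

    // One invocation per pixel per frame.
    const long long invocations_per_second = width * height * fps;

    std::printf("invocations per second: %lld\n", invocations_per_second);
    // Prints 497664000: roughly half a billion executions per second for ONE
    // line of shader code, which is why an always-false if() that still has
    // to be evaluated at runtime can cost a measurable slice of throughput.
    return 0;
}
```

A serial CPU loop that touched the equivalent line a few thousand times per frame offers nowhere near the same leverage, which is the asymmetry behind the "1 million times" estimate.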

### GPU vs. CPU: Core Differences in Algorithm Execution

The discussion contrasts the two architectures head-on:

  • **GPUs**: Built for parallel workloads, like rendering graphics or training neural networks. They thrive on simple, repetitive tasks where thousands of threads run the same code simultaneously. Compilers can inline everything aggressively, and developers exploit this by pre-compiling variants to strip out dead code (a hypothetical permutation-cache sketch follows this list).
  • **CPUs**: Optimized for sequential, general-purpose tasks with branching logic. Here, compile-time tricks like constant folding (simplifying expressions known at compile time) save little time, and the linking overhead from permutations outweighs any gains.
  • This leads to a deeper insight: GPU algorithms succeed when problems fit a "simple API → C code → GPU code" pipeline, where abstractions hide the complexity.
  • CPU algorithms, by contrast, don't demand such specialization because execution frequency is lower.
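
The following is a hedged sketch of the kind of application-side scheme being described; the feature flags, bitmask key, and cache are hypothetical, not taken from the article or any particular engine. Each combination of boolean features maps to its own pre-compiled variant, so with N flags there can be up to 2^N permutations.

```cpp
// Hypothetical sketch: keying shader permutations by a bitmask of boolean
// features. With N independent feature flags there are up to 2^N variants.

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

enum Feature : uint32_t {
    FOG        = 1u << 0,
    SHADOWS    = 1u << 1,
    NORMAL_MAP = 1u << 2,
    // ... 10 flags would already mean 2^10 = 1024 permutations
};

// Stand-in for a compiled GPU program; in a real engine this would hold a
// driver handle produced by compiling the shader with the flags baked in.
struct CompiledShader { std::string variant_name; };

std::unordered_map<uint32_t, CompiledShader> cache;

const CompiledShader& get_permutation(uint32_t key) {
    auto it = cache.find(key);
    if (it == cache.end()) {
        // Compile a variant with each flag folded to a compile-time
        // constant, so always-off branches never exist in the GPU binary.
        it = cache.emplace(key, CompiledShader{"variant_" + std::to_string(key)}).first;
    }
    return it->second;
}

int main() {
    const CompiledShader& s = get_permutation(FOG | SHADOWS);
    std::printf("%s\n", s.variant_name.c_str());
    return 0;
}
```

Real engines differ in the details, but the shape is the same: the cost of selecting a permutation is paid once on the CPU so that the GPU binary contains no dead branches.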

### Examples of GPU-Winning Algorithms

The article doesn't dive into code snippets but highlights real-world successes:

  • **Machine Learning Frameworks** (e.g., TensorFlow or PyTorch): These use Python APIs to orchestrate C++ backends that feed data to GPUs. Parallel matrix multiplications and convolutions shine here, as GPUs crunch massive datasets far faster than CPUs (see the data-parallel sketch after this list).
  • **GPU-Dataframe Libraries** (like RAPIDS cuDF): Similar to Pandas but GPU-accelerated, they parallelize data operations like filtering or joins, leveraging the same batched model.
  • It also nods to the **Wolfram Physics Project**, where hypergraph rewrite rules (a way to model the universe computationally) are sequence-independent, hinting at inherent parallelism ripe for GPU implementation.
  • No specific benchmarks are given, but the 1% throughput hit from a bad `if()` serves as a proxy for how sensitive GPU algos are to inefficiencies.
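
As a concrete (if CPU-side) stand-in for the batched, uniform pattern these libraries exploit, here is a minimal sketch using C++17 parallel algorithms; it is not how cuDF or PyTorch are implemented, only an illustration of the element-wise, order-independent work they push onto GPU threads.

```cpp
// Minimal illustration of a batched, uniform, data-parallel operation: the
// same simple function applied independently to every element. This is the
// shape of work (element-wise ops, filters, reductions) that GPU dataframes
// and ML frameworks run on thousands of GPU threads; here it simply runs on
// CPU threads via the C++17 parallel algorithms.

#include <algorithm>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> prices(1'000'000);
    std::iota(prices.begin(), prices.end(), 0.0f);

    // Element-wise transform: no element depends on any other, so the work
    // can be split across as many threads as the hardware offers.
    std::vector<float> taxed(prices.size());
    std::transform(std::execution::par, prices.begin(), prices.end(),
                   taxed.begin(), [](float p) { return p * 1.08f; });

    // Parallel reduction: an associative combine, again order-independent.
    float total = std::reduce(std::execution::par, taxed.begin(), taxed.end(), 0.0f);

    std::printf("total: %f\n", total);
    return 0;
}
```

On some toolchains the parallel execution policies need an extra threading library (for example TBB with GCC), but the structure is the point: when a problem decomposes into independent per-element work like this, it is a candidate for the simple API → C code → GPU code pipeline described above.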

### Broader Insights and Philosophical Ties

  • Beyond tech, the author muses on parallelism as a "deep concept" in computation:
  • GPUs outpace CPUs on parallelizable problems, but few domains beyond games and AI fully exploit this—there's huge potential for more "GPU-specialized systems."
  • An advanced AI would intuitively choose parallel code when it fits, much like humans pick tools for tasks.
  • Tying into physics: If the universe is a "computational substrate" (like Wolfram's model), it might run parallel by nature, with the speed of light as a parallelism boundary (preventing instant info sharing across distances, akin to relativity).
  • GPU programming is "an order of magnitude more complicated" than C++, but that's not a flaw—it's a feature born from the architecture's power.

### Conclusions and Takeaways

  • The piece ends optimistically, hoping for more breakthroughs in parallel algos despite the scarcity of GPU experts (many stuck in gaming).
  • It doesn't prescribe fixes but implies: Look for problems with batched, uniform computations—they're GPU gold.
  • If you're into parallel computing, this article underscores why GPUs aren't just "fast hardware" but a mindset shift for algorithm design.
  • Overall, it's a concise, insightful read blending engineering with big-picture thinking, perfect for devs pondering why their CUDA code feels so finicky compared to vanilla C++.
