Impure world resurrection: Implicit parallelism through non-deterministic evaluation strategy with implicit side-effect control
When I found out that there exists a pure functional programming language that is actually practical (Haskell), I was excited: I expected that purity would allow optimizations such as automatic memoization and automatic parallelism.
However, when I started learning Haskell I was surprised that even though memoization and parallelization are possible, they have to be explicit. Later I understood the reason for this: Haskell has a deterministic evaluation order; it is lazy unless you explicitly demand eager computation. With deterministic evaluation, performance is also deterministic: the programmer can expect certain performance characteristics from certain algorithms, and won't be surprised by an out-of-memory condition caused by implicit memoization or speculative execution. That's nice. Haskell's purity lets you verify that parallelization is safe, but the compiler won't make that decision for you. (Although it can apply such an optimization when it can prove it won't make performance worse, which is rare.)
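For example, here is a minimal sketch of what "explicit" means in practice, using the standard parallel package (compiled with -threaded and run with +RTS -N); fib is just a placeholder for expensive pure work:

import Control.Parallel.Strategies (parMap, rdeepseq)

-- A deliberately expensive pure function, standing in for real work.
fib :: Integer -> Integer
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

-- The parallelism is opt-in: parMap evaluates the list elements in
-- parallel only because we asked for it; the compiler would not do
-- this rewrite on its own, even though fib is pure.
main :: IO ()
main = print (sum (parMap rdeepseq fib [25 .. 32]))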
But I still think that automatic parallelization would be awesome, so I tried to design a language, or execution model, where it is not just possible but natural. Here's my thought process:
Let's start with something simple, like a couple of function calls:
foo(bar(i), baz(j));
The compiler can execute bar(i) and baz(j) in parallel if at least one of them is pure, i.e. it neither reads nor writes any external state. For example, if bar is pure and baz isn't, bar won't be affected by baz's side effects, and it can't affect baz because it has no side effects of its own.
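To make that concrete, here is roughly the rewrite such a compiler could perform, written out as explicit Haskell; foo, bar and baz are hypothetical stand-ins for pure functions, and the inputs are arbitrary:

import Control.Parallel (par, pseq)

-- Hypothetical pure argument computations standing in for bar(i) and baz(j).
bar :: Int -> Int
bar i = sum [1 .. i]

baz :: Int -> Int
baz j = sum (map (* 2) [1 .. j])

-- foo just combines the two results.
foo :: Int -> Int -> Int
foo x y = x + y

-- The call as written: both arguments evaluated on demand, on one core.
serial :: Int
serial = foo (bar 5000000) (baz 5000000)

-- The rewrite a compiler could apply once both arguments are proven pure:
-- spark bar's result for another core, force baz's result here, then call foo.
parallelised :: Int
parallelised = x `par` (y `pseq` foo x y)
  where
    x = bar 5000000
    y = baz 5000000

main :: IO ()
main = print (serial == parallelised)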
The compiler can analyze a function's code to determine whether it is pure, but parallelization at the level of individual function calls is less than ideal:
- it won't be able to use more than two computation units (e.g. CPU cores): we can only run bar and baz in parallel, but not foo, because foo depends on their results.
- if the functions' execution times differ, we won't get even a 2x speedup: e.g. if computing bar takes negligible time, one CPU core ends up computing baz on its own, so parallelization has no effect at all
- a function might have both pure and impure parts, so we could compute the pure parts in parallel, and perhaps the impure parts too, if they don't touch the same resources
- a lot of things about a program are not known before run time; they are either undecidable or very hard to analyze statically. (Loops, conditionals and recursion might depend on parameters that are known only at run time.) So static parallelization is going to be limited or not possible at all.
- the parallelization decision often depends on the number of elements or iterations: if we work with an array of a million elements we might want to process them in parallel on many CPU cores, but if there are only a handful of elements, it's faster to do it serially, because the communication overhead would be relatively significant (see the sketch after this list)
- the parallelization decision also depends on context and hardware details: how many CPU cores are available, how much RAM we can use, how expensive communication is, and so on
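To illustrate the size-threshold point, here is the kind of hand-written heuristic a programmer writes today; the threshold and chunk size below are arbitrary tuning knobs I picked for the sketch:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Process a list in parallel only when it is large enough for the
-- scheduling/communication overhead to pay off; otherwise stay serial.
smartMap :: (Int -> Int) -> [Int] -> [Int]
smartMap f xs
  | length xs >= threshold = map f xs `using` parListChunk 1024 rdeepseq
  | otherwise              = map f xs
  where
    threshold = 10000

main :: IO ()
main = print (sum (smartMap (\x -> x * x + 1) [1 .. 1000000]))

An implicitly parallel runtime would have to make this kind of judgement by itself, at run time, which is exactly the problem the rest of this design tries to address.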