Impure world resurrection: Implicit parallelism through non-deterministic evaluation strategy with implicit side-effect control

When I found out that there exists a pure functional programming language which is actually practical (Haskell), I was excited: I expected that purity would allow optimizations such as automatic memoization and parallelization. However, when I started learning Haskell I was surprised that even though memoization and parallelization are possible, they have to be explicit. Later I understood the reason for this: Haskell has a deterministic evaluation order; it is lazy unless you explicitly demand eager computation. With deterministic evaluation, performance is also deterministic: a programmer can expect certain performance characteristics from certain algorithms, and won't be surprised by an out-of-memory condition caused by memoization or speculative execution. That's nice. Haskell's purity lets you make sure that parallelization is safe, but the compiler won't make the decision for you. (Although it can do such an optimization if it can prove that it won't make performance worse, which is rare.)

But I still think that automatic parallelization would be awesome, so I tried to design a language, or an execution model, where it is not just possible but natural. Here's my thought process.

Let's start with something simple, like a couple of function calls: foo(bar(i), baz(j)); The compiler can execute bar(i) and baz(j) in parallel if one of them, or both, is pure, i.e. it neither reads nor writes any external information. E.g. if bar is pure and baz isn't, bar won't be affected by any side effects of baz, and neither will it affect baz, because it has no side effects. The compiler can analyze a function's code to see whether the function is pure or not, but parallelization at the function-call level is less than ideal (for contrast, a sketch of today's explicit approach in Haskell follows this list):
  • it won't be able to use more than two computation units (e.g. CPU cores): we can only run bar and baz in parallel, but not foo, because foo depends on their results.
  • if the functions' execution times differ, we won't get even a 2x speedup: e.g. if computing bar takes negligible time, then effectively a single CPU core is busy computing baz, and there is no gain from parallelization at all.
  • a function might have both pure and impure parts, so we could compute the pure parts in parallel, and maybe the impure parts too, if they don't touch the same resources.
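For contrast, here is roughly what the explicit style looks like in today's Haskell. This is a minimal sketch: bar, baz and foo are made-up stand-ins for pure functions, and the decision to run them in parallel is spelled out by the programmer rather than inferred by the compiler.

    import Control.Parallel.Strategies (runEval, rpar, rseq)

    bar :: Int -> Int
    bar i = i * i          -- stands in for some expensive pure computation

    baz :: Int -> Int
    baz j = j + 100        -- another pure computation

    foo :: Int -> Int -> Int
    foo x y = x + y

    -- foo (bar i) (baz j), with bar evaluated as a parallel "spark"
    -- while baz is evaluated on the current core
    fooParallel :: Int -> Int -> Int
    fooParallel i j = runEval $ do
      x <- rpar (bar i)    -- may run on another core
      y <- rseq (baz j)    -- runs here, to completion
      _ <- rseq x          -- wait for the spark before combining
      return (foo x y)

    main :: IO ()
    main = print (fooParallel 3 4)

Nothing here is unsafe, because both arguments are pure; the point of this post is that a compiler could, in principle, insert the rpar itself.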
So it looks like we need to go deeper, into the definitions of foo, bar and baz, to do optimal, full-scale parallelization. Imagine the compiler traversing the definitions of these functions and of the functions they call, inlining their code into the call site. Eventually this process transforms the call into a number of primitive pure computations (such as arithmetic: 2+2) and impure ones (such as I/O), glued together with data flow and control structures. All pure primitives can be computed in any order (thus the evaluation order is non-deterministic) as long as data dependencies are respected. I.e. if bar computes some number using only the integer i as input, and foo further does some math with it, we can compute this independently from other computations, but of course the computation in bar has to be done before the computation in foo, because foo doesn't have a number to work with until bar finishes.

What to do about the impure parts? We could execute them serially, in the order they appear, like in other programming languages. But that's not optimal. Imagine bar reads text from the file "bar.txt" and baz reads text from the file "baz.txt". Theoretically, reading "bar.txt" might trigger some external condition which would also affect "baz.txt", so ideally we should read them in that order. But practically, programmers do not care about such obscure conditions, so they don't care in what order the files are read. So we can come up with this rule for impure operations: operations which affect the same object must be serialized with respect to each other, but not with respect to operations which affect different, independent objects. Here an object might mean a file descriptor, a socket, or a variable in memory. (A sketch of this rule follows the list below.) In some cases you might want something different: to serialize operations on different objects, or to run operations on the same object in parallel. But such cases are rare, so it's acceptable to require explicit control for them. (Try to think of cases where you might want this and you'll see that there aren't many. Note that because of the operating system's buffering, read-ahead and write-back caching, you don't have full control over I/O anyway.)

So far so good. But compile-time parallelization like this won't work so well, because:
  • a lot of things about a program are not known before run time; they are either undecidable or very hard to analyze. (Loops, conditional constructs and recursion might depend on parameters which are known only at run time.) So parallelization is going to be limited, or not possible at all.
  • the parallelization decision often depends on the number of elements or iterations: if we work with an array of a million elements we might want to process them in parallel on many CPU cores, but if there are only a handful of elements, it's faster to do it serially, as the communication overhead would be relatively significant (a size-based cutoff is sketched after this list).
  • the parallelization decision also depends on context and hardware details: how many CPU cores are available, how much RAM we can use, how expensive communication is, and so on.
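To make the "serialize per object" rule concrete, here is a minimal run-time sketch, assuming (purely for illustration) that every external object is identified by a file path and gets its own lock; lockFor and withResource are names invented for the sketch, and error handling is omitted. Operations queued on the same lock take turns, while operations on different locks may overlap.

    import Control.Concurrent.MVar (MVar, newMVar, takeMVar, putMVar)
    import Control.Concurrent.Async (concurrently_)   -- from the async package
    import qualified Data.Map as Map

    -- one lock per object, discovered at run time
    type Locks = MVar (Map.Map FilePath (MVar ()))

    -- look up (or create) the lock for a given object
    lockFor :: Locks -> FilePath -> IO (MVar ())
    lockFor locks path = do
      m <- takeMVar locks
      case Map.lookup path m of
        Just l  -> putMVar locks m >> return l
        Nothing -> do
          l <- newMVar ()
          putMVar locks (Map.insert path l m)
          return l

    -- run an impure action while holding the lock of the object it touches
    withResource :: Locks -> FilePath -> IO a -> IO a
    withResource locks path act = do
      l <- lockFor locks path
      takeMVar l
      r <- act
      putMVar l ()
      return r

    main :: IO ()
    main = do
      locks <- newMVar Map.empty
      -- reads of two independent files are allowed to proceed in parallel;
      -- two operations on the same file would be forced to take turns
      concurrently_
        (withResource locks "bar.txt" (readFile "bar.txt" >>= putStrLn))
        (withResource locks "baz.txt" (readFile "baz.txt" >>= putStrLn))

In the design described in this post, the association between an operation and the object it touches would be discovered by the run-time traversal described below, not written by hand.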
So it makes sense to delay parallelization decisions till run time, which is similar to just-in-time (JIT) compilation. We need to traverse function calls and control structures at run time (at least those parts which aren't certain at compile time). This has a certain overhead, sure, but with modern multi-core CPUs we often have a couple of cores sitting idle, and we can do a lot of traversal on a multi-GHz CPU core. Besides that, the traversal resolves data flow and control structures at run time, so it does part of the job we would have to do anyway through plain execution.

So here's how it could work: we traverse the program's function calls and control structures to produce a directed acyclic graph which represents data and control dependencies. Graph nodes which do not depend on anything can be executed right away, on any available CPU core. Once they are executed, they are removed from the graph, and things that depend on them can become computable. This graph can be built dynamically, on the fly: we can stop the traversal at a certain point (e.g. when the graph gets too large) and start again once some nodes are removed after being computed. Computing dependencies for impure operations (side effects) is trickier, but, generally, they depend on other impure operations before them in traversal order until we know what object they are associated with, and then we can queue them. (Note that we need to know that no preceding operation in traversal order affects this object; this is the tricky part.) A small sketch of such dependency-graph execution is given below.

But what to do with mutable data structures, variables and such? Well, we need to treat operations on them as impure. However, that doesn't mean we need to explicitly serialize each operation on them, as through various tricks this serialization can be done implicitly or in a lightweight fashion. For example, local variables are not visible from outside, so there is no need to serialize access to them globally: the compiler can just serialize access in the code it generates (and often such variables are 'compiled away', i.e. not present as a recognizable machine-code feature). On comp.lang.lisp people coined the term 'funperative' for such cases: a function might be implemented in an imperative style internally, but to an external observer it looks purely functional, as it has no side effects and doesn't depend on 'hidden' information. (I believe such 'funperative' functions can easily be transformed into recursive functions, but there is no need to do this, as the compiler has to emit imperative machine code in the end.) Data structures might be mutable while they are being initialized, but then be converted into read-only ones. A function which works only with read-only data is pure. (Both of these patterns are sketched below.)
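Here is a small sketch of the dependency-graph execution described above, assuming the graph has already been built by the traversal; Node, nodeDeps, runDag and the example graph are all invented for the sketch. Each node becomes a lightweight thread that first waits for the results of its dependencies, so nodes with no unmet dependencies run immediately and the runtime spreads them over the available cores.

    import Control.Concurrent.Async (async, wait)
    import Control.Monad (foldM)
    import qualified Data.Map as Map

    data Node = Node
      { nodeName :: String
      , nodeDeps :: [String]          -- names of the nodes this node depends on
      , nodeRun  :: [Int] -> IO Int   -- receives dependency results, returns its own
      }

    -- start every node; a node's thread blocks until its dependencies are done,
    -- so independent nodes execute in parallel automatically.
    -- (the list is assumed to be in dependency order, and the graph acyclic)
    runDag :: [Node] -> IO (Map.Map String Int)
    runDag nodes = do
      started <- foldM start Map.empty nodes
      traverse wait started
      where
        start running (Node name deps run) = do
          a <- async $ do
            inputs <- mapM wait [running Map.! d | d <- deps]
            run inputs
          return (Map.insert name a running)

    main :: IO ()
    main = do
      results <- runDag
        [ Node "bar" []             (\_      -> return (2 * 2))
        , Node "baz" []             (\_      -> return (3 + 3))
        , Node "foo" ["bar", "baz"] (\[x, y] -> return (x + y))
        ]
      print (Map.lookup "foo" results)

Compiled with -threaded and run with +RTS -N, the independent nodes really do run on separate cores; in the proposed design the runtime would build and schedule this graph itself.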
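Interestingly, Haskell already has concrete versions of both of these ideas, which I'll borrow here purely as an illustration (a minimal sketch, not part of the proposed design): runST gives exactly the 'funperative' shape, a function that mutates local state internally yet is observably pure, and Data.Vector's create shows the "mutable while initialized, then read-only" pattern.

    import Control.Monad.ST (runST)
    import Data.STRef (newSTRef, readSTRef, modifySTRef')
    import qualified Data.Vector as V
    import qualified Data.Vector.Mutable as MV

    -- imperative loop inside, pure function outside: no caller can observe the
    -- mutation, so calls to sumTo are free to be reordered or run in parallel
    sumTo :: Int -> Int
    sumTo n = runST $ do
      acc <- newSTRef 0
      mapM_ (\i -> modifySTRef' acc (+ i)) [1 .. n]
      readSTRef acc

    -- built by in-place writes during initialization, but the result is an
    -- ordinary immutable vector
    squares :: Int -> V.Vector Int
    squares n = V.create $ do
      v <- MV.new n
      mapM_ (\i -> MV.write v i (i * i)) [0 .. n - 1]
      return v

    main :: IO ()
    main = do
      print (sumTo 10)     -- 55
      print (squares 5)    -- [0,1,4,9,16]

The difference is that here the programmer must reach for ST or create explicitly; the design described in this post would grant ordinary imperative-looking code the same treatment automatically.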
Honestly, I don't have a particular set of language features in mind, and perhaps it is possible to build many different languages around these concepts. But it's clear that the language doesn't have to be 100% pure functional: it might allow both pure functional and imperative, impure parts. In some cases the impurities can be optimized away, in other cases they cannot, but that's not a problem with this design, as there is run-time side-effect control as a last resort. So the programmer has a choice between functional and imperative constructs, and imperative constructs will only be punished with failures to optimize, not failures to compile. (Also, people who worry about a lack of safety due to unexpected side-effect combinations will hopefully be able to demand analysis from the compiler for further investigation and decision making, so, again, it's the programmer's choice, but the compiler should provide tools to make that choice as well-informed as possible.)

As a person who tends to think about algorithms in an imperative way (i.e. in terms of steps and mutation of data), I find this very interesting: I won't need to change the way I think to make my programs better.

OK, so it might work. But is it worth the effort? If parallelization is the only goal, probably not: explicit parallelization works fairly well. But there is another thing a non-deterministic evaluation order can do: fine-grained prioritization, which can help improve responsiveness. I think that's far more important.

P.S. If you wonder what 'impure world resurrection' means: I included these words just because the title sounded too academic-y otherwise, and this really isn't an academic paper. It's actually the name of a totally unrelated spell from the Naruto anime. But I think it sounds cool, so maybe it will be used as the name of a system, if this is ever implemented. Also, I should note that I'm not an expert in compiler/programming-language design, so take everything I wrote with a grain of salt. It's just a thought process.
