Caching, lazy sequences, and streams

On the Clojure Google group there’s been a good discussion about lazy sequences and streams. The exchange goes to some of the fundamentals of the language, but if I have a decent grip on the issues, it sounds as though Mark Engelberg is concerned about the inefficiency of caching the results of evaluating n elements of a sequence like (def whole-numbers (iterate inc 0)). Here’s a bit from an excellent blog post in which he lays out his points:

No one would ever want to cache the whole-numbers sequence. It’s significantly faster to generate the sequence via incrementing every time you need to traverse the sequence, than it is to cache and hold the values. Furthermore, because the sequence represents an infinite sequence, there’s no upper limit to the memory consumption.

Engelberg would prefer non-cached lazy sequences, better described as streams, for most operations. Rich Hickey reveals that he in fact has an implementation of streams more or less up his sleeve, but he’s been holding back on adding it to Clojure’s core because streams bring their own issues and because it sounds non-trivial to add them–it would likely postpone the language’s 1.0 release. That’s my disco summary: please read Mark’s post and the whole discussion thread for everything I’ve missed and left out.

My own contribution to this debate might be to point out that there are plenty of scenarios in which caching is crucial to the construction of procedures. As I often do, I go back to SICP for examples. In the chapter on streams, Abelson and Sussman discuss how nice it would be to write recursive procedures, and gain the simplicity and readability of recursive code, but gain iterative efficiency, that is, linear growth of memory and execution time. They claim that laziness gets them that. As illustration, they provide a recursive definition of the Fibonacci sequence, which I here translate into Clojure:

> (def fibs (lazy-cons 0 (lazy-cons 1 (map + (rest fibs) fibs))))
> (take 10 fibs)
(0 1 1 2 3 5 8 13 21 34)

Without caching, this would be a bear. With caching, getting the nth Fibonacci number should be a matter of one addition operation. And we get beautiful code. Sure, this is an academic example, but consider how powerful the one-line definition is. It reads just like the natural language definition of the sequence: the first term is 0, the next is 1, and each successive term is the sum of the two previous terms. Awesoma powa.

Rich’s main worry about streams seems to be that since Clojure promises to let you treat many Java structures as sequences, it can’t be sure it is wrapping a source of immutable data. Between two evaluations of a lazy sequence, something could change in the Java object, and then the n elements of the sequence that your two evaluations share could be different.

That isn’t supposed to happen. With caching, it wouldn’t. Evaluating the first ten elements in a sequence like this fixes them for as long as the sequence is in scope. Later evaluating the first twenty, you will get the same items 0 through 9 as you got before.

So streams would not promise immutability, necessitating a separate abstraction interface from sequences and complicating the current everything-is-a-sequence/seq-able beauty of Clojure.

Perhaps there are other ways to get non-caching performance out of sequences. For example, some Clojure data structures are lazy without using lazy-cons. Stephen Gilardi explains how a clojure.lang.Range object responds to rest with another Range. Something like (doall (range 10000000000)) actually discards each of these “rest” Ranges as it goes. Maybe if you’re careful about your collections and the operations you perform on them, you can get better performance now without streams. Perhaps new seq-able structures that iterate like Range can help enough that we can avoid muddying the language.

When not handled in a functional setting, i.e. when side effects matter, structures like dorun and doseq that don’t retain the head of the sequences should provide speed and efficiency too.

Finally, as Mark pointed out, there’s scoping to consider. If you don’t bind your sequence to a name, then as soon as it’s evaluated, the JVM’s garbage collector should be able to clean it up. Not ideal, but at least the GC is very efficient, and the unneeded sequence will be unlinked quickly.

I’m curious to see what happens next with streams in Clojure. It will be hard to make everyone happy either way.