Wednesday, February 1, 2012

Clojure: lazy seq + database = bad

In my work on topoged-hibernate I naively thought that it would be great to return a lazy-seq of the results of a query like:
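
A minimal sketch of that pattern, not the actual topoged-hibernate code: the "session" here is just an atom, and `open-session`, `close-session!`, `fetch-row` and this version of `with-session` are hypothetical stand-ins made up for illustration.

```clojure
;; Hypothetical stand-ins for a real database session (assumptions, not a real API).
(defn open-session [] (atom true))            ; true means "open"
(defn close-session! [s] (reset! s false))

(defn fetch-row
  "Simulates reading row n; throws if the session has already been closed."
  [s n]
  (if @s
    {:id n}
    (throw (IllegalStateException. "session is closed"))))

(defmacro with-session
  "Binds sym to a fresh session, runs body, and always closes the session."
  [[sym] & body]
  `(let [~sym (open-session)]
     (try ~@body
          (finally (close-session! ~sym)))))

(defn lazy-query
  "Naive version: returns a lazy seq whose rows have not been read yet."
  [n]
  (with-session [s]
    (map #(fetch-row s %) (range n))))

;; Returns immediately; no row has been fetched at this point.
(def result (lazy-query 5))
```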

However, this has several problems, not the least of which is that the underlying session is closed and the data is inaccessible.  When the with-session macro finishes, the session and transaction are closed, and only then is the still-unrealized lazy-seq returned.  It no longer has a connection to the database, so an exception is thrown as soon as the sequence is read.  This is just one problem with accessing data from a database via a lazy-seq.

The same problem applies to any attempt to use a lazy-seq to access a database.  The problem is that there is no way to know when the seq is no longer in use, and so no way to know when to close the connection.  One could easily add a check to the seq for when the end of the dataset has been reached.  At that point, it could close the connection.

However, a lot of the time the entire dataset need not be read, as in (take 10 dataset).  The seq would never get to the end and would leave the connection open.  This is not really a Clojure problem, but a problem with Java.  In Java, there is no way to know when an Object is done being used.  Of course there is the finalizer, but that has proven to have so many problems that its use is discouraged.
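
To make the leak concrete, here is a rough sketch of a seq that closes its own connection at the end of the data.  The atom-based session and the names `open-session`, `close-session!` and `self-closing-seq` are hypothetical, invented only for this illustration.

```clojure
;; Hypothetical stand-ins for a real connection (assumptions, not a real API).
(defn open-session [] (atom true))            ; true means "open"
(defn close-session! [s] (reset! s false))

(defn self-closing-seq
  "Lazy seq of n fake rows that closes the session only when the
   consumer walks past the last row."
  [session n]
  (letfn [(step [i]
            (lazy-seq
              (if (< i n)
                (cons {:id i} (step (inc i)))
                (do (close-session! session) nil))))]
    (step 0)))

;; Reading everything reaches the end, so the session gets closed.
(def s1 (open-session))
(dorun (self-closing-seq s1 100))
(def s1-open? @s1)   ; false

;; Reading only the first ten rows never reaches the end.
(def s2 (open-session))
(def first-ten (doall (take 10 (self-closing-seq s2 100))))
(def s2-open? @s2)   ; true: the session is left open
```

Nothing in the second case ever tells the seq that its consumer has gone away, which is the whole problem.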

One solution is to read all the data before returning.  This is a common way of handling data and works for small datasets (small enough to fit into memory).  It is an easy and efficient way to get the desired results and still manage the database connections.
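
As a sketch, assuming the same kind of made-up atom-based session (`open-session`, `close-session!`, `fetch-row` and `with-session` are hypothetical names, not a real library), forcing the seq with `doall` before the session closes looks like this:

```clojure
;; Hypothetical stand-ins for a real database session (assumptions, not a real API).
(defn open-session [] (atom true))            ; true means "open"
(defn close-session! [s] (reset! s false))

(defn fetch-row
  "Simulates reading row n; throws if the session has already been closed."
  [s n]
  (if @s
    {:id n}
    (throw (IllegalStateException. "session is closed"))))

(defmacro with-session
  "Binds sym to a fresh session, runs body, and always closes the session."
  [[sym] & body]
  `(let [~sym (open-session)]
     (try ~@body
          (finally (close-session! ~sym)))))

(defn all-rows
  "Reads every row while the session is still open, then returns them."
  [n]
  (with-session [s]
    (doall (map #(fetch-row s %) (range n)))))

;; Safe to use after the session is gone: everything is already in memory.
(def rows (all-rows 5))
```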

Larger datasets, that cannot be stored in memory, have to be processed as the data is read, thusly:
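
A minimal sketch of that shape, again with a made-up atom-based session; `process-rows!` and the helpers around it are hypothetical names used only for illustration.

```clojure
;; Hypothetical stand-ins for a real database session (assumptions, not a real API).
(defn open-session [] (atom true))            ; true means "open"
(defn close-session! [s] (reset! s false))

(defn fetch-row
  "Simulates reading row n; throws if the session has already been closed."
  [s n]
  (if @s
    {:id n}
    (throw (IllegalStateException. "session is closed"))))

(defmacro with-session
  "Binds sym to a fresh session, runs body, and always closes the session."
  [[sym] & body]
  `(let [~sym (open-session)]
     (try ~@body
          (finally (close-session! ~sym)))))

(defn process-rows!
  "Calls f on each row, one at a time, while the session is open.
   Nothing is retained, so the dataset never has to fit in memory."
  [n f]
  (with-session [s]
    (doseq [i (range n)]
      (f (fetch-row s i)))))

;; Example: sum the ids of 1000 rows without holding on to any of them.
(def total (atom 0))
(process-rows! 1000 #(swap! total + (:id %)))
```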

Now the data is processed within a closure that will close the connection once the processing is completed.  Of course, one must take care to not accidentally return a lazy-seq by returning a map or filter of the data.
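
The trap is easy to fall into.  In this sketch (same hypothetical atom-based session as a stand-in for a real one), the two functions differ only in `map` versus `mapv`, yet only the second one survives the session closing:

```clojure
;; Hypothetical stand-ins for a real database session (assumptions, not a real API).
(defn open-session [] (atom true))            ; true means "open"
(defn close-session! [s] (reset! s false))

(defn fetch-row
  "Simulates reading row n; throws if the session has already been closed."
  [s n]
  (if @s
    {:id n}
    (throw (IllegalStateException. "session is closed"))))

(defmacro with-session
  "Binds sym to a fresh session, runs body, and always closes the session."
  [[sym] & body]
  `(let [~sym (open-session)]
     (try ~@body
          (finally (close-session! ~sym)))))

(defn ids-lazy
  "Looks harmless, but map is lazy: no row is read before the session closes."
  [n]
  (with-session [s]
    (map #(:id (fetch-row s %)) (range n))))

(defn ids-eager
  "mapv builds the whole vector while the session is still open."
  [n]
  (with-session [s]
    (mapv #(:id (fetch-row s %)) (range n))))

;; Realizing the lazy result after the fact fails; the eager one is fine.
(def lazy-outcome  (try (doall (ids-lazy 3)) (catch Exception _ :session-closed)))
(def eager-outcome (ids-eager 3))
```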

What I wish we could do is return a seq that closes its own connection once it is no longer in use.  There are some huge issues with that, including:
  • Java is not helpful: it only figures out that an Object is no longer in use at garbage collection, which is not guaranteed to happen for any given Object.
  • The onus then falls on Clojure to keep track of each resultset and "determine" when it is no longer in use.  This might entail writing a secondary garbage collector in Clojure.
  • Another problem is that the seq might never be cleaned up, for example if it were "def"ed.
In conclusion, while using lazy-seqs to process datasets might seem like a good idea, you will quickly find it is not worth the trouble.