Saturday, March 26, 2011

Using closures to optimize speed of couchdb-python view functions

Python's standard library (and third-party modules) brings a lot of pluggable functionality that you can use in Couch view (map/reduce) and list/show functions.

Here, we'll be using the couchdb-python library, which is, to my knowledge, the only well-known CouchDB library for Python that implements a view server (couchdbkit does not, which rules it out for the things I commonly need to do).

Let's say we have a couch application that keeps track of airline flight itineraries. A couch document for a single flight might look like:

{
  "_id": "123456",
  "departure-time": "2011-03-26T08:30:00-0700",
  "arrival-time":   "2011-03-26T09:15:00-0800"
}

Mind you, this is of course only a portion of the information that would be in such a document. The times are specified in ISO 8601 format. We don't need to store the flight's duration in the document, since it can be derived from the departure and arrival times. To do that, we can write a couch view like this:

def fun(doc):
    # This import statement runs on every call, i.e. once per document.
    from dateutil.parser import parse
    departure = parse(doc['departure-time'])
    arrival = parse(doc['arrival-time'])
    duration = arrival - departure
    # Convert the timedelta to whole minutes.
    duration = (duration.days * 24 * 60 +
                duration.seconds // 60)
    yield doc['_id'], duration


This view function simply associates the document id to the duration, in minutes, of the flight.
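As a quick sanity check of the arithmetic, here is a minimal sketch that computes the duration for the sample document outside of CouchDB. It assumes Python 3 and swaps dateutil's parse for the standard library's datetime.strptime (whose %z directive accepts offsets such as -0700), so it runs without third-party packages. The names ISO_FORMAT, flight_minutes and sample_doc are made up for this illustration.

```python
from datetime import datetime

# Assumption: timestamps always look like the sample document's,
# e.g. "2011-03-26T08:30:00-0700" (ISO 8601 with a numeric UTC offset).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def flight_minutes(doc):
    """Return the flight duration in whole minutes (hypothetical helper)."""
    departure = datetime.strptime(doc["departure-time"], ISO_FORMAT)
    arrival = datetime.strptime(doc["arrival-time"], ISO_FORMAT)
    duration = arrival - departure  # both values are timezone-aware
    return duration.days * 24 * 60 + duration.seconds // 60


sample_doc = {
    "_id": "123456",
    "departure-time": "2011-03-26T08:30:00-0700",
    "arrival-time": "2011-03-26T09:15:00-0800",
}

print(flight_minutes(sample_doc))
```

Departure at 08:30 with a -0700 offset is 15:30 UTC, and arrival at 09:15 with a -0800 offset is 17:15 UTC, so the sample flight comes out at 105 minutes even though the local clock times are only 45 minutes apart.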

If you need to import functionality, it's important to note that repeated imports can decrease performance. Python caches a module after its first import, so later imports don't reload it, but every import statement still has to look the module up and rebind the name. Since this function is called once per document, and it performs an import on every call, there is one redundant import per document, which is particularly undesirable if you're managing thousands or millions of documents. More information can be found on the subject of Python import performance here.
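To get a rough feel for that overhead on your own machine, something like the following sketch can help. It uses the standard library's json module as a stand-in for dateutil so that it runs anywhere, and the function names are hypothetical. The numbers will vary with the module, the Python version and how much real work each call does, so treat it as a measurement aid rather than a benchmark.

```python
import timeit
from json import loads  # imported once, at module level


def per_call_import(text):
    # The import statement runs on every call, as in the view function.
    from json import loads
    return loads(text)


def hoisted_import(text):
    # Relies on the module-level import above.
    return loads(text)


sample = '{"_id": "123456"}'

per_call_time = timeit.timeit(lambda: per_call_import(sample), number=20000)
hoisted_time = timeit.timeit(lambda: hoisted_import(sample), number=20000)

print("import on every call: %.4f s" % per_call_time)
print("import done once:     %.4f s" % hoisted_time)
```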

The current implementation of the view-map function compiler in couchdb-python expects your code to produce only one identifier. If not for this restriction, you'd be able to write your code as you normally would in any other situation, but until couchdb-python is updated to be top-level import-friendly, we have to do some extra work.
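To make the restriction concrete, here is a simplified stand-in for such a compiler. It is not couchdb-python's actual code: compile_view is a hypothetical function written for this post, and it assumes only what is described above, namely that the source is executed and must leave exactly one name behind, bound to a function.

```python
from types import FunctionType


def compile_view(source):
    """Hypothetical, simplified stand-in for a view-function compiler.

    Assumption: the source is executed and must leave exactly one name
    behind, and that name must be bound to a plain function.
    """
    namespace = {}
    exec(source, {}, namespace)  # top-level names land in `namespace`
    if len(namespace) != 1:
        raise ValueError("expected exactly one name, got: %s" % sorted(namespace))
    (function,) = namespace.values()
    if not isinstance(function, FunctionType):
        raise ValueError("the single name must be bound to a function")
    return function


single_def = (
    "def fun(doc):\n"
    "    yield doc['_id'], None\n"
)

top_level_import = (
    "from json import loads\n"
    "def fun(doc):\n"
    "    yield doc['_id'], None\n"
)

fun = compile_view(single_def)
print(list(fun({"_id": "123456"})))

try:
    compile_view(top_level_import)
except ValueError as error:
    print("rejected:", error)
```

Under these assumptions the first source is accepted, while the second is rejected because the top-level import adds a second name next to the function.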

To get around this, we can use a closure. Common in functional languages like Lisp, and gaining popularity in JavaScript, closures are not frequently used in Python in this form (generally for lack of need, though they are used to make decorators). For all of those technical folks out there, yes, Python may not support true closures in the strictest sense. At any rate, we can strike a balance between functionality and performance by taking care of all our imports and doc-agnostic pre-processing outside of the real map function, as shown below.

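def outer():
    # Runs once, when this source is executed.
    from dateutil.parser import parse

    def inner(doc):
        # Runs once per document; `parse` comes from the enclosing scope.
        departure = parse(doc['departure-time'])
        arrival = parse(doc['arrival-time'])
        duration = arrival - departure
        duration = (duration.days * 24 * 60 +
                    duration.seconds // 60)
        yield doc['_id'], duration

    return inner

# Rebind the single name to the real map function.
outer = outer()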
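If you want to convince yourself that the setup really happens only once, the following sketch runs the same pattern locally. It is an illustration under a few assumptions: it uses the standard library's datetime.strptime in place of dateutil so it needs no third-party packages, the setup_runs list exists only to make the setup visible, and the second document is made up for the example.

```python
setup_runs = []  # hypothetical marker list, only here to count setup runs


def outer():
    from datetime import datetime  # stand-in for dateutil.parser.parse
    setup_runs.append(1)
    iso_format = "%Y-%m-%dT%H:%M:%S%z"  # doc-agnostic pre-processing

    def inner(doc):
        departure = datetime.strptime(doc["departure-time"], iso_format)
        arrival = datetime.strptime(doc["arrival-time"], iso_format)
        duration = arrival - departure
        yield doc["_id"], duration.days * 24 * 60 + duration.seconds // 60

    return inner


outer = outer()

docs = [
    {"_id": "123456",
     "departure-time": "2011-03-26T08:30:00-0700",
     "arrival-time": "2011-03-26T09:15:00-0800"},
    # Made-up second flight that crosses midnight.
    {"_id": "123457",
     "departure-time": "2011-03-26T22:00:00-0500",
     "arrival-time": "2011-03-27T01:30:00-0500"},
]

rows = [row for doc in docs for row in outer(doc)]
print(rows)
print(len(setup_runs))
```

However many documents are fed through the map function, the import and the other setup in the outer function run a single time.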

Note that there are currently discussions underway on the couchdb-python Google Code page regarding ways to side-step the need for closures altogether by improving the view server's compilation layer. Check back in a few weeks, and there might be some news on that subject.
