Saturday, March 26, 2011

Cleaner pseudo-randomized views in CouchDB

Lately I've been running into several explanations of how to efficiently get random documents out of a Couch database, and so far the majority of the approaches settle on storing a random value in the document itself, as demonstrated here. This seems like a waste, since it adds administrative clutter to the document.

Of course, it's an understandable approach, since Couch view functions are required to be idempotent (that is, deterministic: running the same function with the same input must produce the same output), and at face value you'd think that generating a random number in the view function would produce different results each time, even with the same input. However, that's not necessarily the case.

Note: none of the following should be used for anything even remotely cryptographic (as that's not the intended or achieved purpose).

The way around this is to make the random-number generation itself idempotent. After all, unless you specifically ask for true randomness, 'random' implementations use pseudo-random number generation anyway, so why not? An example of how to accomplish this with a Python view function is:

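def outer():
    import random
    #from hashlib import md5
    def inner(doc):
        # Seed the PRNG from the document id, so the same document
        # always produces the same 'random' key.
        seed = doc['_id']
        #seed = md5(seed).digest()
        random.seed(seed)
        yield random.random(), {'_id': doc['_id']}

    return inner
# Rebind the name so that 'outer' is now the actual map function.
outer = outer()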

Running the seed through a hashing function like md5 first might improve the distribution of values, but it probably doesn't (if it did, that would say a lot about your PRNG implementation), and it takes more time.
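To convince yourself that seeding from the document id really is repeatable, the map function can be exercised outside of Couch. The following is only a sketch: the names make_map and map_doc are made up for this illustration, and it uses a private random.Random instance rather than the module-level generator, which is a small deviation from the view function above.

```python
import random

def make_map():
    # Private generator, so the module-level random state is left alone.
    rng = random.Random()

    def map_doc(doc):
        # Same seed in, same pseudo-random key out.
        rng.seed(doc['_id'])
        yield rng.random(), {'_id': doc['_id']}

    return map_doc

map_doc = make_map()

doc = {'_id': 'abc123', 'title': 'example'}
first = list(map_doc(doc))
second = list(map_doc(doc))

# Mapping the same document twice emits exactly the same row.
assert first == second
```

Mapping other documents in between does not matter either, because the generator is re-seeded on every call.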

This function gives each document a random position in the view, but that general position will not change over the lifetime of the document because the document id will never change. If the position of a document within the 'random' view needs to change each time the document is updated, we can use the following function:

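def outer():
    import random
    #from hashlib import md5
    def inner(doc):
        # Seed from the whole document, so any change to the document
        # changes the key (and therefore its position in the view).
        seed = repr(doc)
        #seed = md5(seed).digest()
        random.seed(seed)
        yield random.random(), {'_id': doc['_id']}

    return inner
# Rebind the name so that 'outer' is now the actual map function.
outer = outer()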
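One caveat with repr(doc) is that it reflects whatever order the dict's keys happen to be in, which is an implementation detail rather than a property of the document. A possible way to sidestep that, sketched here under the assumption that the document is plain JSON data, is to seed from a serialization with sorted keys. The helper canonical_seed and the names make_map and map_doc are hypothetical and only exist for this example.

```python
import json
import random

def canonical_seed(doc):
    # Hypothetical helper: sorted keys make the seed independent of
    # the order in which the dict was built.
    return json.dumps(doc, sort_keys=True)

def make_map():
    rng = random.Random()

    def map_doc(doc):
        rng.seed(canonical_seed(doc))
        yield rng.random(), {'_id': doc['_id']}

    return map_doc

map_doc = make_map()

# Same content, different key order: both map to the same row.
a = {'_id': 'abc123', '_rev': '1-x', 'type': 'post'}
b = {'type': 'post', '_rev': '1-x', '_id': 'abc123'}
assert list(map_doc(a)) == list(map_doc(b))
```

Any update to the document still changes the seed, since the serialized text changes with it.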

It is common for Couch documents to contain a 'type' value. We can also trivially make views that do randomization within type-groups:

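def outer():
    import random
    #from hashlib import md5
    def inner(doc):
        seed = repr(doc)
        #seed = md5(seed).digest()
        random.seed(seed)
        # Compound key: rows sort by type first, then by the random value,
        # so each type forms its own randomly ordered group.
        key = [doc.get('type'), random.random()]
        yield key, {'_id': doc['_id']}

    return inner
# Rebind the name so that 'outer' is now the actual map function.
outer = outer()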
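The query side isn't covered here, so the following is only a guess at how such a view might be used: pick a fresh random number, jump to the first row of the wanted type whose key is at or after it, and wrap around to the start of the group if there is none. The sketch simulates the sorted view rows with an in-memory list instead of talking to Couch; pick_random_id_of_type and the row layout (key, doc_id) are assumptions made for the example.

```python
import bisect
import random

def pick_random_id_of_type(rows, doc_type, rng=random):
    """rows: list of ((type, random_key), doc_id), sorted by key."""
    keys = [key for key, _ in rows]
    # Bounds of the group for this type; random keys lie in [0.0, 1.0).
    lo = bisect.bisect_left(keys, (doc_type, 0.0))
    hi = bisect.bisect_left(keys, (doc_type, 1.0))
    if lo == hi:
        return None  # no documents of this type
    # First row in the group at or after a fresh random point.
    i = bisect.bisect_left(keys, (doc_type, rng.random()), lo, hi)
    if i == hi:
        i = lo  # ran off the end of the group, wrap around
    return rows[i][1]

rows = [
    (('post', 0.2), 'p1'),
    (('post', 0.7), 'p2'),
    (('user', 0.4), 'u1'),
]
print(pick_random_id_of_type(rows, 'post'))
```

Note that a document is picked with a probability proportional to the gap between its key and the previous one, so this gives a cheap pseudo-random pick rather than a perfectly uniform one.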

In all of these cases, we're doing what views are meant to do: keeping computable information out of the document. This is especially sensible when you consider that a random value for these purposes really is just junk data. It's a means to an end, but rarely do you actually care about the random value itself. Furthermore, because this method of generating (pseudo) random numbers is idempotent, it's semantically equivalent to recipes that store the random value in the document.
