Cursors, Iterators and Context Managers, on my!
Lately I’ve been looking at how various gramps modules access the database. We tend to have patterns like:
some_list = self.db.get_some_handles(sort_handles=False)
for handle in some_list:
some_data = \
self.db.get_something_from_handle(handle)
This works just fine of course, but while looking at these patterns I started to get worried about the performance implications of this approach. If you drill down to find out what “get_some_handles()” actually does, you will arrive at a line in base.py like this:
return table.keys()
which calls the database keys() function. What does that do? Quoting from the bsddb documentation:
Return the list of keys contained in the DB file. The order of the list is unspecified and should not be relied on. In particular, the order of the list returned is different for different file formats.
Well, OK, but really happens is that the method retrieves all the keys in the database and builds a Python list, which it then returns to the caller. However, looking at our “for” loop, above, we go through the database again, this time extracting the data that we really want. Thinking of the complexity of this approach, we have Omega(n) operations to retrieve the keys and Omega(n) operations to retrieve the data associated with the keys. On top of that, we have a list that is — you guessed it — Omega(n) in size.
Cursors:
Wouldn’t it be nice if we could do both things at the same time and maybe not even have to build a list of keys? Well, we can! You see, there is another method, using a database cursor. I could rewrite the for loop above like this:
mycursor = self.db.get_some_cursor()
for key, data in mycursor:
# do something with the key and data
The complexity of this algorithm is Omega(n) < 2*Omega(n) and the storage used is only O(1) < Omega(n).
Note 1: This approach will not work if you need the data sorted or filtered. Obviously if you need the data sorted, you need to build some kind of list and sort it, which is the default behavior of get_some_handles(). Also, if you need to filter the data, there is currently no way to avoid the first approach, but I’m thinking about that…
Note 2: If you rely on attributes of a list (such as len() or indexing), you can’t use them on an iterator. If you still need to know how many elements there are in some table, you can generally get that with a call to:
self.db.get_number_of_something()
Iterators:
Actually the modified “for” loop using the cursor also uses a new method added to the GrampsCursor class: an iterator. That is, there is a __iter__ method in the class that does the work that allows you do say:
for key, data in mycursor:
Without the iterator, you need to use a longer approach:
data = mycursor.first()
while data:
# use the data
data = mycursor.next()
cursor.close()
Context Managers:
Did you notice the call to the close() method? Did you also see that the revised “for” loop has no call to this method? In practice, this would leave an active cursor with possibilities for deadlocks. However, thanks to another method (two, actually) added to GrampsCursor, it is now possible to code it like this:
with self.db.get_some_cursor() as mycursor:
for key, data in mycursor:
# use the data
If you haven’t used the “with” statement before, spend a few minutes reviewing its use in the Python documentation. Basically it establishes a context inside which you can work. It also guarantees that certain things are done when you leave the context. In this case, when you leave the cursor context, the cursor is closed automagically.
To Do:
The next step is to look at proxy databases and filters and see if we can turn some of their functions into iterators as well. Stay tuned.