Filters, proxies, generators and iterators
Dear devs,
As a follow-up to my earlier work with iterators and context managers, I am taking a serious look at the way we iterate over the database. Typically we do something like:
my_list = db.get_obj_handles()
where “obj” is people, places, events, etc.
for handle in my_list:
    # do something with the handle
Sometimes the very first thing in the loop is:
my_obj = db.get_obj_from_handle(handle)
In this case, the better approach is:
with db.get_obj_cursor() as my_cursor:
    for handle, obj in my_cursor:
        # I already have the handle and the data
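To make the cursor pattern concrete, here is a toy sketch of a cursor implemented as a context manager that yields (handle, obj) pairs together, so no second lookup is needed. The names and the dict-backed "database" are illustrative only; the real Gramps cursor wraps bsddb.

```python
from contextlib import contextmanager

# Toy stand-in for the database; the real Gramps API differs.
DB = {"h1": {"name": "Ada"}, "h2": {"name": "Bob"}}

@contextmanager
def get_person_cursor(db):
    """Yield an iterator of (handle, obj) pairs, like db.get_obj_cursor()."""
    try:
        yield iter(db.items())
    finally:
        pass  # a real cursor would release its bsddb resources here

with get_person_cursor(DB) as cursor:
    for handle, obj in cursor:
        # handle and data arrive together -- no second lookup needed
        print(handle, obj["name"])
```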
There are times when this approach is not applicable, e.g. in Check.py, where we might be modifying the data we’re iterating over (at least, not without major surgery on Check.py, and there’s no compelling reason for that).
Sometimes though, the list is passed to other methods for further processing. The most obvious of these are the various filtering mechanisms, including the proxy databases. I’m experimenting with adding iterator methods to these classes. e.g. if a class has a get_person_handles() method that returns a list of person handles, I’m adding an iter_person_handles() method that returns an iterator over the handles. (For the analytically inclined, the first is an O(n) algorithm and uses O(n) memory and the second is O(1) to set up and uses O(1) memory.) The advantage of this approach shows up when you combine filters.
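The get_/iter_ pairing described above can be sketched like this. The class and handle values are made up; the point is the contrast between a list builder (O(n) memory up front) and a generator (O(1) setup, handles produced on demand).

```python
# Toy database: the real Gramps methods read their handles from bsddb tables.
class ToyDB:
    def __init__(self, handles):
        self._handles = handles

    def get_person_handles(self):
        # list builder: O(n) time and O(n) memory before the caller sees anything
        return list(self._handles)

    def iter_person_handles(self):
        # generator: O(1) setup and O(1) memory; handles are yielded on demand
        for handle in self._handles:
            yield handle

db = ToyDB(["h%d" % i for i in range(5)])
# Both produce the same handles; only the memory profile differs.
assert db.get_person_handles() == list(db.iter_person_handles())
```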
As a test case, I modified (locally) Narrative Web (very slightly) to take advantage of this as follows:
~/gramps32$ svn diff src/plugins/webreport/NarrativeWeb.py
Index: src/plugins/webreport/NarrativeWeb.py
===================================================================
--- src/plugins/webreport/NarrativeWeb.py (revision 12753)
+++ src/plugins/webreport/NarrativeWeb.py (working copy)
@@ -4186,8 +4186,8 @@
# gets the person list and applies the requested filter
- ind_list = self.database.get_person_handles(sort_handles=False)
- self.progress.set_pass(_('Applying Filter...'), len(ind_list))
+ ind_list = self.database.iter_person_handles()
+ self.progress.set_pass(_('Applying Filter...'), self.database.get_number_of_people())
ind_list = self.filter.apply(self.database, ind_list, self.progress)
return ind_list
Note that the apply() method still returns a list, which is fine for the moment. When you generate a Narrative Web Site, you can specify that you want neither living people nor private records included. When you do so without my patches, here’s what happens:
1. All the people in the database are passed through the “living” filter (via the living proxy db) and a list of dead people is returned.
2. Every person in the list returned from the first step is examined to see whether their record is marked private, and is removed from the list if so.
3. The updated list is passed to the filter.apply method for a third pass and updated once again.
In effect, we pass the data by the filters. A more intelligent approach is to pass the filters by the data. This is what happens with my patch in place:
1. An iterator is set up to filter out living people
2. An iterator is set up to filter out records marked private
3. The filter.apply method reads from the second iterator, which reads from the first iterator. In fact, it’s very much like piping data through the shell:
cat my_data | remove_living | remove_private | apply_filter > my_final_data
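The shell analogy above can be sketched directly with Python generators. The record fields and predicates here are made up for illustration; they stand in for the living and private proxy databases.

```python
# Toy records; the real proxies inspect actual person objects.
people = [
    {"handle": "h1", "living": False, "private": False},
    {"handle": "h2", "living": True,  "private": False},
    {"handle": "h3", "living": False, "private": True},
]

def remove_living(records):
    # generator expression: records flow through one at a time
    return (r for r in records if not r["living"])

def remove_private(records):
    return (r for r in records if not r["private"])

def apply_filter(records):
    # like filter.apply(), this is just as happy with an iterator as a list
    return [r["handle"] for r in records]

# cat my_data | remove_living | remove_private | apply_filter
final = apply_filter(remove_private(remove_living(people)))
print(final)  # prints ['h1'] -- only h1 survives both filters
```

Each record makes a single trip through the whole chain, rather than the whole list making one trip per filter.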
A pleasant surprise for me was that the “filter.apply()” method invoked above is just as happy with an iterator as with a list. That means that we do not need to modify any filters to use this approach.
So, what’s next? Extend this idea to the other objects and use iterators instead of list builders wherever possible.
What might go wrong? Well, the biggest immediate hassle is that you cannot write:
len(iterator)
since the length is unknown until the iterator is exhausted. There is one easy way around that: a get_number_of_people() method is available that gets the count directly from bsddb. The count may be high when filters are involved, since it returns the number of items in the database, unfiltered, but it is usable.
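A quick sketch of the len() limitation and the two workarounds (the literal count of 3 here stands in for what get_number_of_people() would return from the backend):

```python
handles = (h for h in ["h1", "h2", "h3"])

try:
    len(handles)
except TypeError:
    pass  # generators have no length until they are exhausted

# Option 1: ask the backend for the unfiltered total, as with
# get_number_of_people() -- it may overcount when filters are in play.
total = 3

# Option 2: count by exhausting the iterator -- but then it is spent,
# and the memory/latency advantage is gone.
count = sum(1 for _ in handles)
assert count == total
```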
When does this not help as much? When you are going to sort or index the data.
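Sorting is the clearest case: sorted() must see every element before it can emit the first one, so it materializes the whole stream into a list internally and the O(1) memory advantage is lost.

```python
# sorted() builds a list from the generator before sorting it,
# so feeding it an iterator saves nothing over passing a list.
handles = (h for h in ["h3", "h1", "h2"])
ordered = sorted(handles)  # materializes the whole stream
assert ordered == ["h1", "h2", "h3"]
```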
Metrics:
To see the advantage of this approach I worked up some metrics comparing 3.1.2 with the 3.2 trunk in svn with my patches. I modified Narrative Web to include this code:
from timeit import Timer
self.progress.set_pass(_('Applying Filter...'), self.database.get_number_of_people())
def t1():
    ind_list = self.database.iter_person_handles()
    ind_list = self.filter.apply(self.database, ind_list, self.progress)
print "time:", Timer(t1).timeit(100)
substituting get_person_handles(sort_handles=False) for the iter_ call when testing 3.1.2.
3.1.2 results:
time: 76.0091130733 or about .76 seconds per repetition
size: 120
3.2 results:
time: 35.1433758736 or about .35 seconds per repetition
size: 24
Thus, not only was the storage reduced (no surprise there) but the time was cut in half. Next, I tried the same with the large sample.gramps in svn which has about 2000 people. I set the repetitions to just one because I'm an impatient fellow:
3.1.2 results:
time: 80.1852030754 or about 80 seconds
size: 5748
3.2 results:
time: 39.8106660843 or about 40 seconds
size: 24
Again, the time was cut approximately in half.
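The constant "size: 24" readings above are consistent with measuring the container object itself, e.g. with sys.getsizeof() (an assumption on my part; exact byte counts vary by Python version and platform). A sketch of that comparison:

```python
import sys

handles = ["h%d" % i for i in range(2000)]

as_list = list(handles)  # size grows with the number of people
as_iter = iter(handles)  # a constant-size iterator object

print(sys.getsizeof(as_list))  # thousands of bytes for 2000 entries
print(sys.getsizeof(as_iter))  # a few dozen bytes, regardless of n
```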
Benny
3 July 2009 @ 12:51 am
Yes, I believe the apply method either takes in no list, obtains an iterator itself and returns a list, or takes in a list and goes over it to check one entry after the next.
Nice work