Optimizing SQL with Gramps-Connect

This is the first blog post about a new project under the Gramps umbrella: Gramps-Connect. The idea is to take as much of Gramps as we can to the web, focusing on collaboration. There are already many ideas from users and developers, a proposal page detailing some of the planning, a student (Kathy) working on adding and editing data from the web, and a demo website, gramps-connect.org. We are building Gramps-Connect on top of Django, a set of Python web development tools.

One of my goals in working on Gramps-Connect is to get a very large family tree on-line for collaboration among a number of researchers. The organizing property of these data are those people who are descended from John Chenoweth & Mary Calvert in the US (see http://www.chenowethsite.com/ for some of this information).

So far, the privatized dataset  has the following numbers:

People     153,529
Families     56,070
Events     362,661
Notes     36,510
Media     0
Sources     7,730
Places     42,209
Repositories     1

The GEDCOM file is 43 MB. I imported the file into Gramps Gtk, and exported it to Django (using current trunk base, which will become Gramps 3.2). This took hours (overnight, on a hefty machine). Afterwards, the SQLite file was 228 MB (this file is stored in src/web/sqlite.db).

Now, the question was how would the response time be in viewing this data through the browser? I didn’t pay too much attention to the stats when not logged in (as this is using probably_alive to determine what info to show, is recursive, and needs some speedup work). So I logged in (id = admin/password = gramps), and began to browse the People View. I was disappointed: it took over 30 seconds to view each page of 15 names! (BTW, this view has all names in it, including all alternate names, unlike Gramps Gtk which only shows primary names).

Was this slowness a limitation of Django? Sqlite? Data organization? To narrow down the problem I installed the Django Debug Toolbar, and turned it on (uncommented all debug_toolbar items, and set DEBUG = True in src/web/settings.py) and reloaded the page. The toolbar shows a pane on the right-hand side of the screen. One of the tabs is SQL queries. It showed 311 queries per people page view. Clicking on the Query tab shows all of the queries for that page, and even a graphical bar for each showing the relative percentage time that that query took across the entire time of that page creation.

There were a few query types that were taking about 800 milliseconds (.8 seconds) each. That adds up fast! So I opened up a console window, started sqlite on the Django database (“sqlite3 sqlite.db”) and started poking around. All of the Gramps tables are prefixed with “grampsdb_”. So the Person model is stored in the “grampsdb_person” table. Some Sqlite meta commands:

sqlite> .tables
sqlite> .indices grampsdb_eventref
sqlite> .timer on

I could see the issue by cutting and pasting the SQL query from the browser window to the sqlite prompt, and saw the same timings from the browser:

sqlite> SELECT “grampsdb_eventref”.”id”, “grampsdb_eventref”.”object_type_id”, “grampsdb_eventref”.”object_id”, “grampsdb_eventref”.”order”, “grampsdb_eventref”.”last_saved”, “grampsdb_eventref”.”last_changed”, “grampsdb_eventref”.”private”, “grampsdb_eventref”.”ref_object_id”, “grampsdb_eventref”.”role_type_id” FROM “grampsdb_eventref” WHERE (“grampsdb_eventref”.”object_type_id” = 25 AND “grampsdb_eventref”.”object_id” = 64647 ) ORDER BY “grampsdb_eventref”.”order” ASC;

I started adding indexes to see if I can get the time down. (One current limitation with Django is that it can’t create multi-field indexes by itself… you have to do that through raw SQL.) I created indexes, testing the queries as a went to see what would help reduce time. I found a couple that help:

sqlite> create index grampsdb_name_surname ON grampsdb_name (surname, first_name);
sqlite> create index grampsdb_eventref_object_id_object_type_id ON grampsdb_eventref
(object_id, object_type_id);

Now the views appear instantaneously! So without adding any code, examining the timings of each individual query, and creating new indexes on the fly (with the browser open, viewing live data), one can optimize functionality throughout Gramps-Connect. I’m really enjoying my Gramps database now!

-Doug

Reblog this post [with Zemanta]

1 Comment

  • Benny

    Yes, you need an index on the column that forms the sort, then use the index to show eg people 30 to 45.

    We do not do this in gramps gtk for the the following reasons at the moment:

    * people can set the name format in the preferences, changing how the people are sorted in the view
    * people can sort on other columns, we cannot put indexes on all columns
    * people have different locales which sort alphabets differently, so an index would only be correct for eg us locale, and then be wrong for swedish viewers.

    I have played with the idea to make a fixed ‘performant’ view for people, in which we add a index over the name format, with the index regenerated on locale change and name format change (which would take some time once), and then not show all people, but also only 30 or 50 at a time, just like on a website

Join the Conversation!