Difference between revisions of "Database Query API"

From Gramps
Jump to: navigation, search
(Implementation)
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Overview ==
+
Starting with Gramps 5.0, there is a new manner to select data from the database. The goal of this API is to make it easy to select data by different criteria, and make selection as fast as possible.
  
Starting with Gramps 5.0, there is a new method on the database object called "select" that works as follows:
+
This page documents the interface for developers.
  
db.select(TABLE-NAME,
+
= QuerySet =  
          SELECT-LIST,
 
          where=WHERE-LIST,
 
          order_by=ORDER-BY-LIST)
 
  
As an example, consider selecting the gramps_id from all people who have a surname of "Smith" and whose name begins with a "J", ordered by the gramps_id:
+
Each database backend now supports the following QuerySet objects:
  
db.select("Person",
+
* db.Person
          ["gramps_id"],
+
* db.Family
          where=["AND", [("primary_name.surname_list.0.surname", "=", "Smith"),
+
* db.Note
                          ("primary_name.first_name", "LIKE", "J%")]],
+
* db.Source
          order_by=[("gramps_id", "ASC")])
+
* db.Citation
 +
* db.Place
 +
* db.Repository
 +
* db.Event
 +
* db.Tag
  
== API ==
+
Each of these QuerySets allows you to chain together a series of QuerySet method calls. QuerySet supports the following methods:
  
There are three items one can specify
+
* .select([fields...]) - returns generator
 +
* .order(fieldname, ...) - order-by fields given as strings; use "-name" for descending
 +
* .filter(obj, args...) - applies a FilterObject or Python callable
 +
* .map(f) - applies a function to all selected objects
 +
* .proxy(name, args...) - applies a named-proxy ("living", "private", or "referenced")
 +
* .limit() - set start or limit or selection
 +
* .count() - returns the number of matches
 +
* .tag(tag_name) - puts a tag on all selected items, if not tagged with tag_name already
  
* select fields
+
These methods can be lined up to create a series of operations:
* where
+
 
* order_by
+
db.Person.filter(lambda person: person.private=True).select("gramps_id")
 +
 
 +
== Database.QuerySet.select() ==
 +
 
 +
db.Person.select()
 +
db.Person.select("gramps_id", "handle")
 +
 
 +
== Database.QuerySet.order() ==
 +
 
 +
db.Person.order("gramps_id").select()
 +
 
 +
Examples:
 +
 
 +
* db.Person.order("gramps_id").select()
 +
* db.Person.order("primary_name.first_name", "-gramps_id").select()
 +
 
 +
The first-listed field name is the primary sort, followed by secondary sorts. In the last example above, the data would be sorted first by first_name ascending, and inside that, by gramps_id descending, such as:
 +
 
 +
Avery, I0003
 +
Avery, I0002
 +
Avery, I0001
 +
Barry, I0013
 +
Barry, I0012
 +
Barry, I0011
 +
 
 +
== Database.QuerySet.filter() ==
 +
 
 +
== Database.QuerySet.map() ==
 +
 
 +
== Database.QuerySet.proxy() ==
 +
 
 +
== Database.QuerySet.limit() ==
 +
 
 +
== Database.QuerySet.count() ==
 +
 
 +
== Database.QuerySet.tag() ==
 +
 
 +
== Joins ==
 +
 
 +
The database.select method can also do joins across primary objects in each of the WHERE, ORDER, or select fields. For example consider this request:
 +
 
 +
>>> db.Family.select("mother_handle.gramps_id")
 +
 
 +
This will return the gramps_id's of the mothers of each family, if there is a mother.
 +
 
 +
You can join across multiple tables with more complex queries, such as:
 +
 
 +
>>> list(db.Family.select("mother_handle.event_ref_list.ref.gramps_id"))
 +
 
 +
That returns all of the Event gramps_ids for all events for the mother (Person) of all families, looking something like:
 +
 
 +
[{"mother_handle.event_ref_list.ref.gramps_id": ["E0001", "E0002"]},
 +
  {"mother_handle.event_ref_list.ref.gramps_id": "E0003"},
 +
  {"mother_handle.event_ref_list.ref.gramps_id": ["E0003", "E0001"]},
 +
  ...
 +
]
 +
 
 +
Note that these joins are done per record and are therefore not yet optimized---each requires another database access. In the future, these could be optimized via a SQL JOIN.
  
 
== Implementation ==
 
== Implementation ==
Line 28: Line 93:
 
There are now two database backends: Berkeley DB (BSDDB), and [https://www.python.org/dev/peps/pep-0249/ Python's DB-API]. BSDDB is a data store with much of the database code written in Python, and DB-API is a common interface to the popular SQL engines. We have used BSDDB in Gramps for many years, but are now transitioning to DB-API.
 
There are now two database backends: Berkeley DB (BSDDB), and [https://www.python.org/dev/peps/pep-0249/ Python's DB-API]. BSDDB is a data store with much of the database code written in Python, and DB-API is a common interface to the popular SQL engines. We have used BSDDB in Gramps for many years, but are now transitioning to DB-API.
  
With BSDDB, Gramps has a pipeline design when it comes to accessing the data. For example, consider getting the People for the flat view. First we get a cursor that iterates over the data. Then we sort it, on whatever criteria we have requested. Finally, we filter the data.
+
With BSDDB, Gramps has a pipeline design when it comes to accessing the data. For example, consider getting the People for the flat view. First we get a cursor that iterates over the data. Then we sort it, on whatever criteria we have requested. Finally, we filter the data. The select method will always perform a linear search on fully expanded data.
  
In order to make this operation faster for DB-API, we need to know the filter information, and sort order when we ask for the data. With SQL we can simply add WHERE clauses and ORDER BY clauses to the basic SELECT statement. But these are only useful if we can have indexes on the relevant data.
+
In order to make the select operation faster for DB-API, we need to know the filter information, and sort order when we ask for the data. With SQL we can simply add WHERE clauses and ORDER BY clauses to the basic SELECT statement. But these are only useful if we can have indexes on the relevant data.
  
 
This is made more difficult because Gramps uses a hierarchical representation of data. For example, we might wish to have the People data sorted by "surname, given" of the primary_name. But that information is actually in:
 
This is made more difficult because Gramps uses a hierarchical representation of data. For example, we might wish to have the People data sorted by "surname, given" of the primary_name. But that information is actually in:
Line 37: Line 102:
 
* person.primary_name.first_name  
 
* person.primary_name.first_name  
  
respectively. We could make special fields for these, and special indexes. But it would be much more flexible if we could create a variety of ad hoc queries on the fly.
+
respectively. We can make special fields for these, and special indexes. Gramps 5.0 creates "secondary" fields and indexes in SQL for every str, int, or bool data on a primary object. These secondary fields are known from the primary object's schema.  
 
 
The BSDDB datastore doesn't have any schema, which means that it has no idea of "gramps_id" or "primary_name" or any field. An idea of schema has been developed over the last few years. This makes possible the Database Differences Report, without having to write any field-specific code: the data knows its own structure.  
 
  
 
The schema idea has been augmented with additional methods based on the idea of "fields". Now, you can ask a person object:
 
The schema idea has been augmented with additional methods based on the idea of "fields". Now, you can ask a person object:
Line 51: Line 114:
 
  "Johnson"
 
  "Johnson"
  
Building on that, a db.select method has been added so that you can get all of those fields from all People, and sort and filter on all of the regular (string, int) fields. This works independently of SQL. However, if you re-implement that method for DB-API, and have the appropriate sql-fields, and sql-indexes, then you have an large speed-up for large data.
+
or even:
 +
 
 +
>> person.get_field("primary_name.surname_list.surname")
 +
["Johnson", "Johansen", "Johnston"]
 +
 
 +
In the last example, the "surname" field was applied to each of the surname_list items.
  
So, our old system required a scan of all data, unpickling, creating objects, and sorting for any use. If you have 100k records, that required processing all of them. With the new DB-API implementation, you can do that same query in a fraction of that time. Views (written appropriately) will appear very quickly (milliseconds) regardless of the size of the database.
+
Each primary object can also have a list of extra secondary fields/indexes. These are specified in the primary object's method "get_extra_secondary_fields()" which returns a list of dotted-path strings.
  
To interactively test it out, you can try these methods with a regular BSDDB database, but you won't see any speed enhancements. To see it work fast, you'll need a DBAPI database built from the latest addons-source/DBAPIBackend. The new webapp takes search text and turns it into a SQL statement directly for seeing the speed and flexibility of the functionality.
+
The dotted-field path strings are mapped to SQL names by replacing dots with two underscores.
  
 
== Speed Tests ==
 
== Speed Tests ==
  
In this stringified field-based API, we would write code as follows. Consider that we want to select the handle of all people whose surname is "Smith", given name starts with a "J", and ordered by gramps_id. We would write:
+
Using he dotted-field path string field-based Select API, we can write code as follows. Consider that we want to select the handle of all people whose surname is "Smith", given name starts with a "J", and ordered by gramps_id:
  
  db.select("Person",
+
  list(db.Person.filter(primary_name__surname_list__0__surname="Smith",  
          ["handle", "gramps_id"],
+
                      primary_name__first_name__LIKE="J%").order("gramps_id").select())
          where=["AND", [("primary_name.surname_list.0.surname", "=", "Smith"),  
 
                          ("primary_name.first_name", "LIKE", "J%")]],
 
          order_by=[("gramps_id", "ASC")])
 
 
   
 
   
 
This code works on BSDDB as well as DB-API. Let's see the difference in timing on databases that have 187,294 people (created from GenFan, this is 20 full generations).  
 
This code works on BSDDB as well as DB-API. Let's see the difference in timing on databases that have 187,294 people (created from GenFan, this is 20 full generations).  
Line 78: Line 143:
 
So, where we can access the data via SQL, we can get a speedup, the biggest will always be in the filter as it makes it so we don't have to load into Python many objects. We have linear code in many places that could benefit from using db.select().
 
So, where we can access the data via SQL, we can get a speedup, the biggest will always be in the filter as it makes it so we don't have to load into Python many objects. We have linear code in many places that could benefit from using db.select().
  
Currently DB-API is automatically creating SQL fields and indexes for all regular (non-list, standard Python types) primary-object attributes (like gramps_id, privacy, etc.). This takes time and space. We may want to manage this a bit more carefully.
+
[[Category:GEPS|D]]

Revision as of 22:33, 9 February 2016

Starting with Gramps 5.0, there is a new manner to select data from the database. The goal of this API is to make it easy to select data by different criteria, and make selection as fast as possible.

This page documents the interface for developers.

QuerySet

Each database backend now supports the following QuerySet objects:

  • db.Person
  • db.Family
  • db.Note
  • db.Source
  • db.Citation
  • db.Place
  • db.Repository
  • db.Event
  • db.Tag

Each of these QuerySets allows you to chain together a series of QuerySet method calls. QuerySet supports the following methods:

  • .select([fields...]) - returns generator
  • .order(fieldname, ...) - order-by fields given as strings; use "-name" for descending
  • .filter(obj, args...) - applies a FilterObject or Python callable
  • .map(f) - applies a function to all selected objects
  • .proxy(name, args...) - applies a named-proxy ("living", "private", or "referenced")
  • .limit() - set start or limit or selection
  • .count() - returns the number of matches
  • .tag(tag_name) - puts a tag on all selected items, if not tagged with tag_name already

These methods can be lined up to create a series of operations:

db.Person.filter(lambda person: person.private=True).select("gramps_id")

Database.QuerySet.select()

db.Person.select()
db.Person.select("gramps_id", "handle")

Database.QuerySet.order()

db.Person.order("gramps_id").select()

Examples:

  • db.Person.order("gramps_id").select()
  • db.Person.order("primary_name.first_name", "-gramps_id").select()

The first-listed field name is the primary sort, followed by secondary sorts. In the last example above, the data would be sorted first by first_name ascending, and inside that, by gramps_id descending, such as:

Avery, I0003
Avery, I0002
Avery, I0001
Barry, I0013
Barry, I0012
Barry, I0011

Database.QuerySet.filter()

Database.QuerySet.map()

Database.QuerySet.proxy()

Database.QuerySet.limit()

Database.QuerySet.count()

Database.QuerySet.tag()

Joins

The database.select method can also do joins across primary objects in each of the WHERE, ORDER, or select fields. For example consider this request:

>>> db.Family.select("mother_handle.gramps_id")

This will return the gramps_id's of the mothers of each family, if there is a mother.

You can join across multiple tables with more complex queries, such as:

>>> list(db.Family.select("mother_handle.event_ref_list.ref.gramps_id"))

That returns all of the Event gramps_ids for all events for the mother (Person) of all families, looking something like:

[{"mother_handle.event_ref_list.ref.gramps_id": ["E0001", "E0002"]},
 {"mother_handle.event_ref_list.ref.gramps_id": "E0003"},
 {"mother_handle.event_ref_list.ref.gramps_id": ["E0003", "E0001"]},
 ...
]

Note that these joins are done per record and are therefore not yet optimized---each requires another database access. In the future, these could be optimized via a SQL JOIN.

Implementation

There are now two database backends: Berkeley DB (BSDDB), and Python's DB-API. BSDDB is a data store with much of the database code written in Python, and DB-API is a common interface to the popular SQL engines. We have used BSDDB in Gramps for many years, but are now transitioning to DB-API.

With BSDDB, Gramps has a pipeline design when it comes to accessing the data. For example, consider getting the People for the flat view. First we get a cursor that iterates over the data. Then we sort it, on whatever criteria we have requested. Finally, we filter the data. The select method will always perform a linear search on fully expanded data.

In order to make the select operation faster for DB-API, we need to know the filter information, and sort order when we ask for the data. With SQL we can simply add WHERE clauses and ORDER BY clauses to the basic SELECT statement. But these are only useful if we can have indexes on the relevant data.

This is made more difficult because Gramps uses a hierarchical representation of data. For example, we might wish to have the People data sorted by "surname, given" of the primary_name. But that information is actually in:

  • person.primary_name.surname_list[0].surname
  • person.primary_name.first_name

respectively. We can make special fields for these, and special indexes. Gramps 5.0 creates "secondary" fields and indexes in SQL for every str, int, or bool data on a primary object. These secondary fields are known from the primary object's schema.

The schema idea has been augmented with additional methods based on the idea of "fields". Now, you can ask a person object:

>> person.get_field("primary_name.first_name")
"Sarah"

and with some additional syntax:

>> person.get_field("primary_name.surname_list.0.surname")
"Johnson"

or even:

>> person.get_field("primary_name.surname_list.surname")
["Johnson", "Johansen", "Johnston"]

In the last example, the "surname" field was applied to each of the surname_list items.

Each primary object can also have a list of extra secondary fields/indexes. These are specified in the primary object's method "get_extra_secondary_fields()" which returns a list of dotted-path strings.

The dotted-field path strings are mapped to SQL names by replacing dots with two underscores.

Speed Tests

Using he dotted-field path string field-based Select API, we can write code as follows. Consider that we want to select the handle of all people whose surname is "Smith", given name starts with a "J", and ordered by gramps_id:

list(db.Person.filter(primary_name__surname_list__0__surname="Smith", 
                      primary_name__first_name__LIKE="J%").order("gramps_id").select())

This code works on BSDDB as well as DB-API. Let's see the difference in timing on databases that have 187,294 people (created from GenFan, this is 20 full generations).

Here is a summary:

       | Filter |  Select All   | Sort All
-------|--------|---------------|----------
BSDDB  | 20.6s  |       9.2s    | 18.1s
DB-API |   .3s  |        .5s    |   .6s

So, where we can access the data via SQL, we can get a speedup, the biggest will always be in the filter as it makes it so we don't have to load into Python many objects. We have linear code in many places that could benefit from using db.select().