Changes

Jump to: navigation, search

Database Query API

1,094 bytes added, 18:24, 29 January 2016
no edit summary
== Overview API ==
Starting with Gramps 5.0, there is a new method on the database object called "select" that works as follows:
db.select(TABLE-NAME,
SELECT-LIST,
where=WHERE-EXPRESSION, order_by=ORDER-BY-LIST, order_bystart=START-ROW, limit=LIMIT-ROW-COUNT) The following are required: * TABLE-NAME - the name of the table. That would be "Person", Family", "Media", "Repository", "Place", "Note", "Source", "Citation", or "Tag"* SELECT-LIST - a list of dot-separated field path strings from this object (eg, "gramps_id", "primary_name.first_name", etc) Optional arguments: * WHERE-EXPRESSION - a matching expression, such as ("gramps_id", "=", "I0001"). These can be nested (see below)* ORDER-BY-LIST- a list of dot-separated field path strings, each paired with a sorting direction, for example [("gramps_id", "ASC")]* START-ROW - the row number on which to start. Default is 0, meaning start at beginning* LIMIT-ROW-COUNT - the limit of how many rows to return. Default is -1, meaning no limit
As an example, consider selecting the gramps_id from all people who have a surname of "Smith" and whose name begins with a "J", ordered by the gramps_id:
["gramps_id"],
where=["AND", [("primary_name.surname_list.0.surname", "=", "Smith"),
("primary_name.first_name", "LIKE", "J%")]],
order_by=[("gramps_id", "ASC")])
The parameters "start" and "limit" are used for paged selects. These will also return the total of the selection as if start or limit had not been given (see Result below). == API =WHERE-EXPRESSION === The where expression must be in one of these four forms (tuples or lists allowed): * None - no filter applied to data* (dot-separated field path string, COMPARISON-OPERATOR, value)* ["AND" | "OR", [WHERE-EXPRESSION, WHERE-EXPRESSION, ...]]* ["NOT", WHERE-EXPRESSION] COMPARISON-OPERATOR is one of: * "LIKE" - use with "%" wildcard* "="* "!=", or "<>"* "<* "<="* ">"* ">="* "IS"* "IS NOT"* "IN" Examples: * ("primary_name.first_name", "=", "Mary")* ["OR", [("primary_name.first_name", "=", "Mary"), ("primary_name.first_name", "LIKE", "Eliza%")]]* ["NOT", ("primary_name.first_name", "=", "Mary")] === ORDER-BY-LIST === The ORDER-BY-LIST is either None or is a list of dotted-field path strings paired with "ASC" or "DESC". Example: * [("gramps_id", "DESC")]* [("gramps_id", "DESC"), ("primary_name.first_name", "ASC")] == Result == The database.select() method will always return a Result. A result is a collection of all of the data (ie, it is not a generator). Results are a subclass of the Python list object, with additional properties:
There * total - total number of records (in the case of start or limit is given)* time - the time in seconds it too to collect the data* expanded - whether the data needed to be expanded (unpickled and primary objects created). BSDDB selects alway are three items expanded* query - the actual SQL query, if one can specify
* select fields* where* order_byThe goal is to always have a query, and to always have expanded be False. Those will be the fastest queries.
== Implementation ==
There are now two database backends: Berkeley DB (BSDDB), and [https://www.python.org/dev/peps/pep-0249/ Python's DB-API]. BSDDB is a data store with much of the database code written in Python, and DB-API is a common interface to the popular SQL engines. We have used BSDDB in Gramps for many years, but are now transitioning to DB-API.
With BSDDB, Gramps has a pipeline design when it comes to accessing the data. For example, consider getting the People for the flat view. First we get a cursor that iterates over the data. Then we sort it, on whatever criteria we have requested. Finally, we filter the data. The select method will always perform a linear search on fully expanded data.
In order to make this the select operation faster for DB-API, we need to know the filter information, and sort order when we ask for the data. With SQL we can simply add WHERE clauses and ORDER BY clauses to the basic SELECT statement. But these are only useful if we can have indexes on the relevant data.
This is made more difficult because Gramps uses a hierarchical representation of data. For example, we might wish to have the People data sorted by "surname, given" of the primary_name. But that information is actually in:
* person.primary_name.first_name
respectively. We could make special fields for these, and special indexes. But it would be much more flexible if we could create a variety of ad hoc queries on the flyGramps 5The BSDDB datastore doesn't have any schema, which means that it has no idea of 0 creates "gramps_idsecondary" fields and indexes in SQL for every str, int, or "primary_name" or any fieldbool data on a primary object. An idea of These secondary fields are known from the primary object's schema has been developed over the last few years. This makes possible the Database Differences Report, without having to write any field-specific code: the data knows its own structure.
The schema idea has been augmented with additional methods based on the idea of "fields". Now, you can ask a person object:
"Johnson"
Building on that, a db.select method has been added so that you can get all of those fields from all People, and sort and filter on all of the regular (string, int) fields. This works independently of SQL. However, if you re-implement that method for DB-API, and have the appropriate sql-fields, and sql-indexes, then you have an large speed-up for large data.or even:
So, our old system required a scan of all data, unpickling, creating objects, and sorting for any use >> person. If you have 100k records, that required processing all of themget_field("primary_name. With the new DB-API implementation, you can do that same query in a fraction of that timesurname_list. Views (written appropriatelysurname") will appear very quickly (milliseconds) regardless of the size of the database. ["Johnson", "Johansen", "Johnston"]
To interactively test it out, you can try these methods with a regular BSDDB database, but you won't see any speed enhancements. To see it work fastIn the last example, you'll need a DBAPI database built from the latest addons-source/DBAPIBackend. The new webapp takes search text and turns it into a SQL statement directly for seeing the speed and flexibility "surname" field was applied to each of the functionalitysurname_list items.
== Speed Tests ==
In this stringified Using he dotted-field path string field-based Select API, we would can write code as follows. Consider that we want to select the handle of all people whose surname is "Smith", given name starts with a "J", and ordered by gramps_id. We would write:
db.select("Person",
So, where we can access the data via SQL, we can get a speedup, the biggest will always be in the filter as it makes it so we don't have to load into Python many objects. We have linear code in many places that could benefit from using db.select().
 
Currently DB-API is automatically creating SQL fields and indexes for all regular (non-list, standard Python types) primary-object attributes (like gramps_id, privacy, etc.). This takes time and space. We may want to manage this a bit more carefully.

Navigation menu