Database abstraction

My recent work on the gramps database objects has led me to reflect upon their structure from an abstract data type point of view. I’m hoping that this blog entry will start a discussion that will ultimately lead to the development of a true ADT-based structure for gramps databases with multiple implementations.

Current structure using BSDDB

The current implementation of gramps uses BSDDB as the storage engine. This has served us well and provides an efficient back end with good Python wrappers to access it. (For those wanting more detail, see the official documentation.) Gramps today is tightly bound to BSDDB. That coupling is not ideal, since it would make porting to a different engine difficult. On the other hand, it is helpful to examine the current structure and use the insight gained to design good ADTs for future implementations.

As most readers will know, a gramps database is actually a collection of BSDDB databases of various types that live together in a subdirectory in some filesystem. There are eight primary BSDDBs, one for each primary object type (Person, Family, Event, Place, Source, MediaObject, Repository and Note). These are keyed by a program-generated hash that is called a handle internally. Mirroring these are eight secondary databases that are indexed by gramps id and point (via handle) to the primary objects. Additionally, there are databases to track cross-references and other things.
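The relationship between a primary database and its gramps id index can be pictured with a small sketch. Plain dictionaries stand in for the BSDDB tables here, and the function names, handle and record layout are made up for illustration; none of this is the actual gramps code.

```python
# Illustration only: plain dicts stand in for the BSDDB tables.
person_by_handle = {}       # primary table: handle -> person record
handle_by_gramps_id = {}    # secondary index: gramps id -> handle


def add_person(handle, gramps_id, record):
    """Store the record once and index it by gramps id."""
    person_by_handle[handle] = record
    handle_by_gramps_id[gramps_id] = handle


def get_person_by_gramps_id(gramps_id):
    """Follow the secondary index to the handle, then fetch the record."""
    return person_by_handle[handle_by_gramps_id[gramps_id]]


add_person("a1b2c3d4", "I0001", {"name": "Ada Example"})
print(get_person_by_gramps_id("I0001"))
```

The record is stored only once, under its handle; the secondary table holds nothing but the pointer back to it.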

Within the BSDDB world, the subdirectory is an important entity in its own right.  It is considered to be the environment within which the underlying databases function.  This environment provides centralized control over transactions and serialization among other things.

Requirements for gramps ADTs

Here is my initial take on requirements for ADTs for gramps:

  • High-level ADT
    • Groups underlying ADTs that together comprise a gramps database
    • Provides transaction, logging and serialization methods
  • Object-specific ADT
    • Handles data and methods for a gramps object type
    • Uses Python dictionary methods for access
    • Automatically performs appropriate transaction processing, logging and serialization using high-level ADT methods
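To make these requirements a little more concrete, here is a minimal sketch of how the two levels might fit together. The class names GrampsDatabase and ObjectTable are hypothetical, an in-memory dict stands in for the real storage, and transactions are left out to keep it short. It is one possible shape, not a proposal for the final design.

```python
import pickle
from collections.abc import MutableMapping


class GrampsDatabase:
    """High-level ADT (hypothetical name): groups the object tables."""

    def __init__(self, type_names):
        self.log = []  # stand-in for a real log
        self.tables = {name: ObjectTable(self, name) for name in type_names}

    def serialize(self, obj):
        return pickle.dumps(obj)

    def deserialize(self, data):
        return pickle.loads(data)

    def record(self, action, table_name, handle):
        self.log.append((action, table_name, handle))


class ObjectTable(MutableMapping):
    """Object-specific ADT: dictionary access for one object type."""

    def __init__(self, database, name):
        self._database = database
        self._name = name
        self._store = {}  # a real back end would replace this dict

    def __getitem__(self, handle):
        return self._database.deserialize(self._store[handle])

    def __setitem__(self, handle, obj):
        # Serialization and logging go through the high-level ADT.
        self._store[handle] = self._database.serialize(obj)
        self._database.record("put", self._name, handle)

    def __delitem__(self, handle):
        del self._store[handle]
        self._database.record("delete", self._name, handle)

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


db = GrampsDatabase(["Person", "Family"])
person = db.tables["Person"]
person["a1b2c3d4"] = {"name": "Ada Example"}
print(person["a1b2c3d4"], db.log)
```

Deriving from MutableMapping means that only five methods need to be written; get, items, update and the rest of the dictionary interface come for free.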

First attempt at implementation of ADTs

I’ve been building prototype ADT implementations that attempt to fulfill these requirements for the current BSDDB structure.   I’m working with six classes:

  1. MyEnv: Manages the BSDDB environment; provides transaction support
  2. GenDb: Generic database type implementing Python dictionary methods
  3. MyDb: BSDDB DB type (derives from bsddb.dbobj.DB and GenDb)
  4. MyDbShelf: BSDDB DBShelf type (derives from bsddb.dbobj.DBShelve and GenDb)
  5. MyTxn: Context manager for transactions (used by MyEnv)
  6. MyCursor: Context manager for cursors
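The idea behind a transaction context manager such as MyTxn can be sketched without BSDDB at all. The Txn class below is a hypothetical stand-in that works on a plain dict and rolls back by restoring a snapshot; a BSDDB version would presumably begin, commit and abort a real transaction through the environment instead.

```python
class Txn:
    """Hypothetical stand-in for MyTxn: commit on success, abort on error."""

    def __init__(self, store):
        self._store = store
        self._snapshot = None

    def __enter__(self):
        self._snapshot = dict(self._store)  # shallow copy to roll back to
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Abort: restore the contents saved at the start.
            self._store.clear()
            self._store.update(self._snapshot)
        return False  # never swallow the exception


people = {"h1": "Ada"}

with Txn(people):
    people["h2"] = "Grace"      # kept: the block finished normally

try:
    with Txn(people):
        people["h3"] = "Linus"
        raise ValueError("simulated failure")
except ValueError:
    pass                        # "h3" has been rolled back

print(people)
```

The attraction of the with statement is that the caller cannot forget to commit or abort; leaving the block does one or the other.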

Though I’m still in the (very) early stages with these classes, they have already allowed for some simplifications. For example, today the get_person_by_handle method calls get_by_handle specifying the actual database and handle. Within the new structure, this can simply be person[handle] if “person” is a MyDbShelf object for the person database. The details of how the handle is accessed and the person object returned are hidden by the class.

It should in principle be very easy to implement the same classes using regular Python dictionaries instead of BSDDB databases. This would allow for working with non-persistent databases entirely in RAM, a possible boon for testing new modules or isolating odd troubles.
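As a small illustration of the testing benefit, code written against the dictionary interface does not care what sits behind it. The function and the record layout below are invented for this example.

```python
def find_by_surname(person_table, surname):
    """Return handles of people with the given surname.

    Works with any mapping of handle -> record, persistent or not.
    """
    return sorted(
        handle
        for handle, record in person_table.items()
        if record["surname"] == surname
    )


# In a test, a plain dict can play the role of the person table.
fake_people = {
    "h1": {"surname": "Smith"},
    "h2": {"surname": "Jones"},
    "h3": {"surname": "Smith"},
}
print(find_by_surname(fake_people, "Smith"))
```

The same function could later be handed a BSDDB-backed table with no changes, provided that table supports items().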

More Discussion Needed

I hope this post spurs others to think more about defining ADTs for gramps with an eye to implementing them for other back ends. Please add your thoughts.

2 Comments

  • Benny

    Some points.

    1. not for 3.2 🙂

    2. The problem I mainly see is that the bsddb organization is based on primary objects, whereas every other backend will need many, many more tables, so we also need access methods for the secondary objects. However, it is impossible to provide this for the bsddb backend because of its limitations, which are coupled to its speed of access, which should be better than something based on mysql.

    What I mean is: attribute[1253] would have no meaning on top of a bsddb backend, but on top of sql this makes perfect sense (although it is not the most useful thing). More meaningfully, marker[4] would have meaning, and a query for all people with this marker makes sense. On bsddb this is a loop over the people table; in another implementation it is not.

    As a consequence, I believe we must work with gen.lib, and see that as ‘THE’ way to access the stored genealogy data. How this interacts with the backend is up to the implementation of the backend. Some logic of the filters should move there, so for marker we have a filter that selects on the chosen marker; how this is done should be more backend agnostic.

    We should also not reinvent the wheel. Access to a mysql backend already exists for python in many different frameworks; of these, the django model approach is a very nice one. So if we consider different backends, we should make bsddb fit into an existing scheme, not make a scheme for sql that fits our bsddb scheme, which we then have to support. The support part is really what bothers me in this.

    So in all, yes, a discussion is needed, but I would rather see as the way forward:

    1. in gep013-server branch, merge in the latest changes done to gen.db
    2. create a dbdjango.py method, that can work with an sql backend, and see how that can be achieved
    3. once we have an idea how 2 very different backends need to be implemented, see how we can offer the plugin writers a unified way so they need not bother about the database layer.

    Benny

  • Gerald Britton

    I see value in preserving the current setup of access based on primary objects. I also see that this is a simple concept to port to other backends, since we simply pickle the object and store it keyed by handle. Keeping that approach will make porting to other storage engines extremely easy.

    I do see value in keeping the abstractions separate from the implementations. So gen.lib for the abstractions and gen.db for the bsddb implementation, e.g. might be one approach, though not necessarily the optimal one. I can’t see that it makes much difference for plugins if the methods come from one place or the other, though.

    I think that the proper way forward is:

    1. Define ADTs for storing the primary objects
    2. Implement the ADTs for BSDDB, Dbdjango, MySQL and Python dictionary, as prototypes
    3. Expose the methods for accessing the ADTs for use in plugins, which would then not have to think about anything beyond the objects they are interested in.
