GEPS 017: Flexible gen.lib Interface

From Gramps
Revision as of 15:34, 15 October 2011 by Dsblank (Talk | contribs)


gen.lib is the Python interface for all of the objects in Gramps. Currently, it is not directly tied to any data storage mechanism, except for the implicit assumption that objects are created through an unserialize method for each object.

This proposal explores the possibility of making the creation of objects more general, and less tied to the particular unserializing process.

Update: After building a prototype, it was found to be too slow for general use. Rather, it seems to be better to cache the data as it appears when it comes from the BSDDB database (pickled, serialized versions of gen.lib objects). Thus, this proposal has been withdrawn.

The prototype used a combination of delayed evaluation and removal of gen.lib objects' properties. When a property was accessed, the delayed object was evaluated and the result was then set as the attribute.

Overview

Currently, the main database interface for getting an object looks like:

>>> db.get_person_from_handle(handle)

This uses the only existing manner of creating a person supported by gen.lib:

>>> Person().unserialize(data)

where data is a serialized (non-object) representation of a Person.

This has several issues:

  1. Person() is first initialized as a completely empty object
  2. it may unserialize data that isn't needed
  3. it only allows data to be created in this particular manner
  4. it can be very slow, specifically when unserializing primary objects containing many secondary objects or reference objects
  5. unserialize is directly linked to the bsddb table layout. As a consequence, databases with a different layout suffer a huge penalty: a sweep over a single table is not possible, because multiple tables must always be hit

This proposal would use an alternative gen.lib construction that avoids these problems.

As further evidence that there are problems with the current approach in Gramps, it suffices to look at src/gui/views/treemodels. For example, eventmodel.py contains:

def column_date(self, data):
    if data[COLUMN_DATE]:
        event = gen.lib.Event()
        event.unserialize(data)
        return DateHandler.get_date(event)
    return u''

In this code, data was obtained as raw data from the database: data = db.get_raw_event_data(handle). The model needs to know the position of the date in that raw data, COLUMN_DATE, which couples the database table layout to the view implementation. An event object is created only when a date is present, but in that case the overhead of making an event cannot be avoided. This is a very costly operation, because the entire event is initialized, including for example EventType(), NoteBase(), and so on, all to obtain only the date contained in the event object.
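The cost described above can be illustrated with a toy model. The classes below are hypothetical stand-ins, not the real gen.lib classes, and the tuple layout is assumed; the sketch only counts how many helper objects are built when a full event is unserialized just to read its date.

```python
# Toy stand-ins for gen.lib classes (hypothetical, for illustration only).
COLUMN_DATE = 1  # assumed position of the date in the raw tuple


class Counted(object):
    """Base class that counts how many instances were created."""
    created = 0

    def __init__(self):
        Counted.created += 1


class ToyEventType(Counted):
    pass


class ToyNoteBase(Counted):
    pass


class ToyEvent(Counted):
    def __init__(self):
        Counted.__init__(self)
        # A full event always builds its sub-objects, needed or not.
        self.type = ToyEventType()
        self.notes = ToyNoteBase()
        self.date = None

    def unserialize(self, data):
        self.handle, self.date = data[0], data[COLUMN_DATE]
        return self


raw = ("handle-1", "1890-05-01")

# Full route: several objects are created to read one field.
date_via_object = ToyEvent().unserialize(raw).date
objects_built = Counted.created

# Raw route: no objects, but the caller must know the tuple layout.
date_via_index = raw[COLUMN_DATE]
```

Both routes yield the same date; the difference is that the first pays for every sub-object, while the second ties the caller to the table layout.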

Possible Fixes

Overview

In the detailed mailing-list discussion [1], four possible solutions were discussed:

  1. If an alternative is needed, use something outside of gen.lib
  2. Using a lazy() wrapper to only evaluate what is necessary
  3. Explicit delayed unpickling
  4. Use an Engine inside each object to retrieve data when necessary

These come down to:

  1. Replicate
    Replicating gen.lib has the benefit of having zero impact on the current gen.lib. However, it would leave two separate code paths to maintain, and does nothing to address unnecessary unpickling in BSDDB. It also means gramps-connect and gramps proper will have no real code to share.
  2. Lazy Wrapper
    The lazy wrapper idea was shown to have some savings in postponing unserializing (see patch in bug report [2]). However, the requirement to wrap all data in lazy(), and the unintended side-effects were too great a cost.
  3. Explicit delayed unpickling
    Keep the data of a substructure in raw form until it needs to be unserialized. This is still based on pickling, which limits future approaches.
  4. Engine
    The best choice considered so far is to build an invisible engine into the gen.lib framework. It is detailed below.
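As an illustration of option 3, explicit delayed unpickling might look like the following minimal sketch. The DelayedPickle class and the record layout are assumptions made for this example; they do not come from the Gramps code base.

```python
import pickle


class DelayedPickle(object):
    """Holds pickled bytes and unpickles them on first access only."""

    def __init__(self, blob):
        self._blob = blob
        self._value = None
        self.loaded = False

    def get(self):
        if not self.loaded:
            self._value = pickle.loads(self._blob)
            self._blob = None  # the raw bytes are no longer needed
            self.loaded = True
        return self._value


# A primary record whose media list stays pickled until requested.
media_blob = pickle.dumps([("media-handle-1", False)])
person_record = {"private": False, "media_list": DelayedPickle(media_blob)}

before = person_record["media_list"].loaded
media = person_record["media_list"].get()
after = person_record["media_list"].loaded
```

The saving comes from never calling get() on substructures that a view does not display, but every caller has to know which fields are wrapped.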

A gen.lib Engine

Introduction

This proposal would use an alternative gen.lib construction, that avoids the problems listed in the introduction.

The core concept is that when using gen.lib on a database, an Engine object must be created, which will contain the methods needed to map database data to object attributes. All objects will have access to this Engine via a factory method.

Furthermore, all compound gen.lib objects will understand the concept of delayed access. That is, the object is not fully initialized on init. When a piece that is not yet initialized is needed (for example, the media list of a person object), the object first initializes this piece and then returns it.

It would provide:

  1. init of objects in one single call.
    • So Person() provides an empty Person object
    • Person(data) initializes an object, where data is the data about Person in the db, which can be interpreted by an Engine object to set the attributes
    • Person(source=pers) remains possible, to duplicate an existing object
  2. When using gen.lib on a database, one must set an Engine that gen.lib should use. The engine knows how data is stored in the database, and which fields in the objects correspond to it
  3. objects only set attributes that have no processing overhead at init. Other attributes are set only when they are needed, at which time they are further unpacked or fetched from the db, via the engine.
  4. unserialize/serialize are removed as methods of an object, and are moved to the engine
  5. get and set methods are removed, and replaced by attribute access, with the property method doing the delayed access as needed
  6. gen.lib will start with two engines: one for bsddb, and one for a django backend.
    1. BsddbEngine will be pure software. The engine will contain all serialize/unserialize methods now present in the objects themselves.
    2. DjangoEngine will have a pointer to the django models. When, for example, a person object needs access to its media_list, the DelayedObj will call the DjangoEngine to obtain the media list, which will use the sql mediareference table to return the list of all MediaRef data
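The division of labour between engines can be sketched as follows. Everything here is hypothetical: TupleEngine stands in for a bsddb-style engine, TableEngine for a django-style engine backed by a plain dictionary instead of real models, and EngineKeeper is reduced to a class attribute. The point is only that the same calling code works with either backend.

```python
class TupleEngine(object):
    """bsddb-like: everything arrives in one raw tuple."""

    def unpack_person(self, data):
        private, marker, raw_media = data
        return private, marker, raw_media

    def get_medialist(self, key):
        # The key already is the raw reference data; just wrap it.
        return [{"ref": handle} for handle in key]


class TableEngine(object):
    """Django-like: media references live in a separate table."""

    def __init__(self, mediaref_table):
        # Assumed shape: {(object type, handle): [media handles]}
        self.mediaref_table = mediaref_table

    def unpack_person(self, data):
        handle, private, marker = data
        # Store only what is needed to look the list up later.
        return private, marker, ("Person", handle)

    def get_medialist(self, key):
        return [{"ref": handle} for handle in self.mediaref_table.get(key, [])]


class EngineKeeper(object):
    """Process-wide holder for the engine gen.lib should use."""
    engine = None

    @classmethod
    def set_engine(cls, engine):
        cls.engine = engine


def load_media(data):
    """Unpack a person and resolve the media list via the active engine."""
    engine = EngineKeeper.engine
    private, marker, media_key = engine.unpack_person(data)
    return engine.get_medialist(media_key)


EngineKeeper.set_engine(TupleEngine())
from_tuple = load_media((False, 1, ("m1", "m2")))

EngineKeeper.set_engine(TableEngine({("Person", "p1"): ["m1", "m2"]}))
from_table = load_media(("p1", False, 1))
```

In this sketch only the engines know whether the media key is raw tuple data or a lookup key into another table.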

Suggested Implementation

No serialize/unserialize

Objects no longer have serialize/unserialize methods. These are present only in the engine of a database that needs them; in practice, that is the bsddb engine.

Example usage code on bsddb

def get_person_from_handle(self, handle):
    return Person(self.get_raw_person_data(handle))

The Person class will obtain the bsddb engine from the factory to unserialize this data. The engine is stored on the object to avoid calling the factory every time: obj.__engine stores the engine, and obj.engine makes it accessible. This is part of the DelayedAccess object API, from which all gen.lib objects will inherit. To store data:

 def commit_person(self, person, ...):
    ....
    db_data = person.engine.person_serialize()
    ...

This works because engine is a bsddb engine, and hence the person_serialize method exists.

DelayedAccess

All gen.lib objects know the concept of delayed access, using an engine to obtain the not yet initialized data.

class DelayAccessObj(object):
   """
   An object that supports delayed access of the data. 
   gen.lib objects are large constructs. Depending on the storage backend
   one can create objects of which part of the data is not yet retrieved or
   constructed for performance reasons. 
   On access of these parts, the data must be obtained or constructed. 
   
   The DelayAccessObj provides the infrastructure to obtain this. It holds:
   1. an engine which is used to obtain the missing data.
   """
   
   def __init__(self):
       self._engine = EngineKeeper.get_instance().engine

Note that the above should be done with a property, so that _engine is only looked up when it is requested and is still None. Note also that all gen.lib objects should perhaps use __slots__ to reduce their memory footprint.
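A minimal sketch of that note, assuming a simple singleton EngineKeeper (its internals are not specified in this proposal) and a made-up Note subclass:

```python
class EngineKeeper(object):
    """Hypothetical singleton that holds the active engine."""
    _instance = None

    def __init__(self):
        self.engine = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance


class DelayAccessObj(object):
    __slots__ = ("_engine",)

    def __init__(self):
        self._engine = None  # not looked up yet

    @property
    def engine(self):
        # Ask the keeper only on first use, then remember the answer.
        if self._engine is None:
            self._engine = EngineKeeper.get_instance().engine
        return self._engine


class Note(DelayAccessObj):
    __slots__ = ("text",)

    def __init__(self, text):
        DelayAccessObj.__init__(self)
        self.text = text


EngineKeeper.get_instance().engine = "toy-engine"
note = Note("hello")
looked_up_early = note._engine is not None
engine = note.engine
has_dict = hasattr(note, "__dict__")
```

Because every class in the chain declares __slots__, instances carry no per-object __dict__. With multiple inheritance this breaks down as soon as two bases both declare non-empty slots, which is one reason to treat __slots__ as optional.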

When attributes that are not yet initialized are needed, the engine is asked for the data. An example is the marker attribute of a person, which is a MarkerType() object. Consider the code fragment:

pers = db.get_person_from_handle(handle)
print pers.marker

This initializes a Person. In the new setup, Person has its simple attributes set, and the rest is handled by delayed access. In essence, this means that pers.private is already set to True or False in the __init__ of Person, but pers.marker is a property. Simplified, we have a setup such as:

 def __init__(self, data):
     DelayedAccess.__init__(self)
     (self.private, self.__marker, self.__media_list) = self._engine.unpack_person(data)

For bsddb, we will have for example: self.private = False, self.__marker = 1, self.__media_list the raw tupled mediareference data

For django, with mediaref in another table: self.private = False, self.__marker = 1, self.__media_list = ('Person', handle)

The aim should be clear: each engine unpacks the data it is passed in a way that allows delayed access of the attribute. The bsddb engine uses only the tuple data passed by the database table. The django engine, however, sets media_list to the value needed to obtain a media list from the media reference table.

Next, pers.marker or pers.media_list is called:

  @property
  def marker(self):
      if not isinstance(self.__marker, MarkerType):
          # delayed retrieval of marker from the engine using the key
          self.__marker = self._engine.get_markertype(self.__marker)
      return self.__marker

  @property
  def media_list(self):
      if not isinstance(self.__media_list, list):
          # delayed retrieval of media list from the engine using the key
          self.__media_list = self._engine.get_medialist(self.__media_list)
      return self.__media_list

So, as __marker does not yet hold a MarkerType, the engine is used to obtain the marker from the stored key. The same holds for __media_list. Note that media_list returns a list of MediaRef objects, which themselves use delayed access to unpack further as needed, so only minimal overhead is incurred.
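The delayed properties can be tried out end to end with a toy engine. In this sketch the engine is passed to Person explicitly rather than fetched from a factory, and ToyEngine, with its call log, is invented for the example.

```python
class MarkerType(object):
    def __init__(self, value):
        self.value = value


class ToyEngine(object):
    """Hypothetical engine that records how often it is asked for data."""

    def __init__(self):
        self.calls = []

    def unpack_person(self, data):
        return data  # (private, marker key, media key)

    def get_markertype(self, key):
        self.calls.append("marker")
        return MarkerType(key)

    def get_medialist(self, key):
        self.calls.append("media")
        return list(key)


class Person(object):
    def __init__(self, data, engine):
        self._engine = engine
        (self.private, self.__marker, self.__media_list) = \
            engine.unpack_person(data)

    @property
    def marker(self):
        if not isinstance(self.__marker, MarkerType):
            self.__marker = self._engine.get_markertype(self.__marker)
        return self.__marker

    @property
    def media_list(self):
        if not isinstance(self.__media_list, list):
            self.__media_list = self._engine.get_medialist(self.__media_list)
        return self.__media_list


engine = ToyEngine()
pers = Person((False, 1, ("m1", "m2")), engine)
calls_after_init = list(engine.calls)   # nothing resolved yet
first = pers.marker
second = pers.marker                    # served from the cached object
calls_after_access = list(engine.calls)
```

The engine is consulted once per attribute, on first access, and never for attributes that are not read.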

It is important to note here that media_list is in reality defined in the MediaBase() object, not in Person, as Person inherits from MediaBase. However, unpack_person must take this entire inheritance tree into account. This must be designed cleverly, allowing for the multiple inheritance used in gen.lib. Ideas?
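One possible idea, offered only as a sketch: let every class declare the fields it owns and collect them by walking the method resolution order. The _fields attribute and the Base helper are assumptions, not existing gen.lib API. Single-underscore names are used because double-underscore names are mangled per defining class, which makes them awkward to assign from shared code.

```python
class Base(object):
    _fields = ()

    @classmethod
    def all_fields(cls):
        """Collect field names over the inheritance tree, most derived first."""
        names = []
        for klass in cls.__mro__:
            # Look only at fields declared on this class itself.
            for name in klass.__dict__.get("_fields", ()):
                if name not in names:
                    names.append(name)
        return names

    def load(self, values):
        """Assign raw (possibly still delayed) values without processing."""
        for name, value in zip(self.all_fields(), values):
            setattr(self, name, value)

    def unpack(self):
        return tuple(getattr(self, name) for name in self.all_fields())


class PrivacyBase(Base):
    _fields = ("private",)


class MediaBase(Base):
    _fields = ("_media_list",)


class Person(PrivacyBase, MediaBase):
    _fields = ("_marker",)


field_order = Person.all_fields()
pers = Person()
pers.load((1, False, ("Person", "p1")))
```

With this layout, adding a field to MediaBase changes the tuple for every class that inherits from it without editing those classes, at the price of a loop in load and unpack that may be too slow for a hot path.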

Unpack and slots

To allow an object to be initialized from another object, the private/protected attributes must be copied over without extra processing, so that delayed access can continue in the new object. That is, we cannot access for example .marker in the other object; we need to assign __marker directly.

To achieve this, all non-property attributes are added to the __slots__ list of the object (this does not work well with multiple inheritance, so it is probably not an option), and an unpack method is created that lists them out for assignment. With the example above:

 def __init__(self, data=None, source=None):
     DelayedAccess.__init__(self)
     if source is not None:
         (self.private, self.__marker, self.__media_list) = source.unpack()
     else:
         (self.private, self.__marker, self.__media_list) = self._engine.unpack_person(data)

Where the unpack returns the private/protected variables:

 def unpack(self):
     return (self.private, self.__marker, self.__media_list)
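The benefit of copying through unpack can be checked with a small experiment. The sketch below is reduced to two attributes and takes the engine as an explicit argument; CountingEngine is a made-up helper that counts lookups.

```python
class CountingEngine(object):
    """Hypothetical engine that counts media list lookups."""

    def __init__(self):
        self.lookups = 0

    def unpack_person(self, data):
        return data

    def get_medialist(self, key):
        self.lookups += 1
        return list(key)


class Person(object):
    def __init__(self, data=None, source=None, engine=None):
        if source is not None:
            self._engine = source._engine
            (self.private, self.__media_list) = source.unpack()
        else:
            self._engine = engine
            (self.private, self.__media_list) = engine.unpack_person(data)

    def unpack(self):
        # Raw internal state, returned without touching the properties.
        return (self.private, self.__media_list)

    @property
    def media_list(self):
        if not isinstance(self.__media_list, list):
            self.__media_list = self._engine.get_medialist(self.__media_list)
        return self.__media_list


engine = CountingEngine()
original = Person(data=(True, ("m1",)), engine=engine)
duplicate = Person(source=original)
lookups_after_copy = engine.lookups      # copying resolved nothing
media = duplicate.media_list             # resolved on the copy only
```

Once an attribute has been resolved, unpack returns the resolved object itself, so a real implementation would still have to decide whether copies may share it.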

As in the previous section, it is important to note that media_list is in reality defined in the MediaBase() object, not in Person, as Person inherits from MediaBase. So the unpack method must take the entire inheritance tree into account. This must be designed cleverly, allowing for the multiple inheritance used in gen.lib. Ideas? We would want to avoid the situation where adding a field requires editing all inheriting objects because unpack needs to change everywhere. That is probably not a big deal, because the present serialize/unserialize already works like that. It is probably advantageous to use a construct like:

  self.__pack(source.unpack())

This needs to be designed cleverly because we want really fast __init__ and assignment. Consider the present serialize and unserialize in, for example, Address:

  def serialize(self):
      """
      Convert the object to a serialized tuple of data.
      """
      return (PrivacyBase.serialize(self),
              SourceBase.serialize(self),
              NoteBase.serialize(self),
              DateBase.serialize(self),
              LocationBase.serialize(self))

  def unserialize(self, data):
      """
      Convert a serialized tuple of data to an object.
      """
      (privacy, source_list, note_list, date, location) = data

      PrivacyBase.unserialize(self, privacy)
      SourceBase.unserialize(self, source_list)
      NoteBase.unserialize(self, note_list)
      DateBase.unserialize(self, date)
      LocationBase.unserialize(self, location)
      return self

In the worst case unpack needs to work likewise.


getters and setters

The typical get and set methods in gen.lib would be deprecated. In 3.3 they would print a deprecation warning; in 3.4 they would be removed completely.
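During the deprecation release, a getter could keep working while emitting a warning through the standard warnings module. The sketch below uses a getter named get_gender on a stripped-down Person purely as an example; the message text is an assumption.

```python
import warnings


class Person(object):
    def __init__(self):
        self.gender = 2

    def get_gender(self):
        """Deprecated getter kept for one release cycle."""
        warnings.warn("get_gender() is deprecated; use the gender attribute",
                      DeprecationWarning, stacklevel=2)
        return self.gender


# Capture the warning to show that the old call still works but is flagged.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = Person().get_gender()
```

Using stacklevel=2 makes the warning point at the caller of the getter, which is the line that needs to be changed.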

bsddb get_raw methods

The get_raw_person_data method and its siblings would become private/protected to the bsddb backend. They should not be used outside gen.db, so the code in the models will no longer depend on them, allowing for a backend based on another bsddb schema or another database.

Advantages

The advantages of this approach are:

  • the delayed access happens behind the scenes, via a standard, easy-to-understand mechanism. The hard part of obtaining data is all in the db code in gen.db and in the engine code for a db in gen.lib.
  • we can move more freely to another database schema. This might mean several things: adding bsddb tables, or using an sql backend. An upgrade of bsddb could even be done while still supporting normal reads of the old bsddb layout (so without an expensive upgrade before the data can be accessed). The only thing needed is to write a new engine for the new schema. As an example, suppose we add type tables to store all custom types in use; this change to bsddb can then be made without influencing how gen.lib works. In the present setup serialize/unserialize must be changed.
  • in the future, the engine could be used for more advanced things. For example, Person().obtain(name="McDonald") could be implemented, in which case obtain accesses the engine and performs the query. Note that this is not the aim of the change; it is just a remark that this is a possibility.


References

  1. Mailing list discussion
  2. Lazy experiment (patch)
  3. Blog post discussing ideas