Difference between revisions of "Database Formats"

From Gramps
m (Why no simultaneous access ?)
(Detailed Changes)
(39 intermediate revisions by 12 users not shown)
Line 1: Line 1:
[[Category:Developers/Reference]]
+
{{languages|Database Formats}}
 +
==History==
  
==History==
+
Gramps default data format has evolved over time. Each major change in format usually results in an increase in the major version number.
  
GRAMPS default data format has been evolving over time. Each major change in format usually results in an increase in the major version number.
+
* See also [[Gramps XML]]
  
===GRAMPS 1.0 and earlier===
+
===Gramps 1.0 and earlier===
  
GRAMPS 1.0 used XML as its default format. This format is portable and easily read by both computers and people. It has two major issues that caused problems with used as a default format.
+
Gramps 1.0 used XML as its default format. This format is portable and easily read by both computers and people. It has two major issues that caused problems when it was used as the default format.
  
* Slow to load and save. The entire file had to be parsed to load the data, and parsing XML is not fast. Similarly, to save any changes, the entire file had to be written. People with larger databases found the load and save times to be unusable.
+
* Slow to load and save. The entire file had to be parsed to load the data, and parsing XML is not fast. Similarly, to save any changes, the entire file had to be written. People with larger databases (thousands of persons) found the load and save times to be unusable.
  
* Consumes a lot of memory. The XML format required that all data be loaded and stored in memory. Larger databases could consume all memory on the system, bring the system to a virtual halt.
+
* Consumes a lot of memory. The XML format required that all data be loaded and stored in memory. Larger databases could consume all memory on the system and as a result bring the system to a virtual halt.
  
===GRAMPS 2.0===
+
===Gramps 2.0===
  
To solve the capacity issues, GRAMPS 2.0 switched to using the Berkeley database, using the ".grdb" extension to identify the file. All database information was stored in this file. This resolved both the load/save time issues and the memory consumption issues. Using a real database backend allowed us to only load the data into memory when we needed it.
+
To solve the capacity issues, Gramps 2.0 switched to using the Berkeley database, using the ".grdb" extension to identify the file. All database information was stored in this file. This resolved both the load/save time issues and the memory consumption issues. Using a real database backend allowed us to only load the data into memory when we needed it.
  
The grdb format was a significant step forward for GRAMPS. However, it was susceptible to data corruption. Since data commits were not atomic (all related changes occurred at once), data could get corrupted if an error occurred while a change was being made.
+
The grdb format was a significant step forward for Gramps. However, it was susceptible to data corruption. Since data commits were not atomic (all related changes saved at once), data could get corrupted if an error occurred while a change was being made.
  
===GRAMPS 2.2===
+
===Gramps 2.2===
  
GRAMPS 2.2 started using the transaction capability of the Berkeley database. This feature ensures that all related data is committed at once. So, if an error occurs while that data is being saved, the database remains intact. Either the entire set of changes makes it to the database, or none of the changes make it.
+
Gramps 2.2 started using the transaction capability of the Berkeley database. This feature ensures that all related data is committed at once. So, if an error occurs while that data is being saved, the database remains intact. Either the entire set of changes makes it to the database, or none of the changes make it.
  
The problem with the approach in 2.2 is that a single file (the grdb file) is no longer enough for the Berkeley database to handle the data. An "environment" directory is needed to store log files that make the transactions possible. We needed a place to keep these files so that the user would not delete them. We chose to store these in an environment directory under the ~/.gramps directory. We map the log files to the database using the path name of the original grdb file.
+
The problem with the approach in 2.2 is that a single file (the grdb file) is no longer enough for the Berkeley database to handle the data. An "environment" directory is needed to store the transactions log files. A place was needed to keep these files so that the user would not delete them. An environment directory under the ~/.gramps directory was chosen for this. The log files are mapped to the database using the path name of the original grdb file.
  
This works very well as long as the file is never moved. If the user renames the file, restores a backup of the file, or copies it to another machine, the file will no longer work, since it would no longer coorelate to the log files stored under ~/.gramps.
+
This works very well as long as the file is never moved. If the user renames the file, restores a backup of the file, or copies it to another machine, [[Recover_corrupted_family_tree#What_causes_this_corruption.3F_2|the file will no longer work]], since it would no longer correlate to the log files stored under ~/.gramps.
  
==The Future - GRAMPS 3.0==
+
==Gramps 3.0==
  
We take a new approach with GRAMPS 3.0. The grdb file is being replaced. While we still use the Berkeley database, the user will no longer see a file. The user will not open a file, but will instead open a symbolic database name. This name will map a directory under ~/.gramps that will contain all the needed database files.
+
A new approach was taken with Gramps 3.0. While the Berkeley database is still used, the user will no longer see a file. Instead of the actual database file, user opens a symbolic database name. This name will map to a subdirectory under ~/.gramps directory which contains all the needed database files.
  
Since all files will be in the same directory, advanced users can make a backup of the entire directory, preserving the entire data. New users, who may not be familiar with the Linux filesystem, will not have to worry about finding their database, since a new Family Tree Manager will replace the old Open File dialog.
+
Since all the files will be in the same directory, advanced users can make a backup of the entire directory, preserving the entire data. New users, who may not be familiar with the Linux filesystem, will not have to worry about finding their database, since a new Family Tree Manager will replace the old Open File dialog.
  
 
===Family Tree Manager===
 
===Family Tree Manager===
Line 37: Line 38:
 
The new Family Tree Manager replaces the File Open dialog. Version 3.0 does not work on files, but on Family Tree Databases. Since there is no file to open, a file open dialog makes no sense. The Open button has been replaced with a Family Trees button. Clicking this button brings up the Family Tree Manager, shown below.
 
The new Family Tree Manager replaces the File Open dialog. Version 3.0 does not work on files, but on Family Tree Databases. Since there is no file to open, a file open dialog makes no sense. The Open button has been replaced with a Family Trees button. Clicking this button brings up the Family Tree Manager, shown below.
  
[[Image:Dbmanager01.png]]
+
[[Image:ManageFamilyTrees-40.png]]
 +
 
 +
The Family Tree Manager allows the user to create a new database, rename an existing database, delete a database, or load a database. All databases appear in the list, so the user does not need to worry about where the databases are located. If a database is open, an icon will appear next to the name.
 +
 
 +
==== Versions ====
 +
 
 +
If the RCS revision control system is installed on your system, Gramps allows you to archive specified versions of your database. To save a version, open the Family Tree Manager and select the opened database. Simply clicking on the Archive button will save the current version to the revision control system. If a database has one or more saved versions, the databases appear as a tree view, with the available versions displayed under them.
 +
 
 +
[[Image:Dbmanager03.png]]
 +
 
 +
In this example, the ''Gramps Example'' database has two versions which have been saved. These versions are saved snapshots of your database contents.
 +
 
 +
The advantage of storing versions is that you can go back to one of these saved versions. To select a version to restore, select the version and click on the Restore button.
 +
 
 +
[[Image:FileManageFamilyTrees-ArchiveSelectToExtract-40.png|Selecting a version to restore]]
 +
 
 +
Gramps will extract the version into a new database. The database name is based on the original database name and the revision name.
 +
 
 +
[[Image:FileManageFamilyTrees-ArchiveExtractedVersionShown-40.png|Restored version]]
 +
 
 +
Versions may be deleted and renamed.
 +
 
 +
==== Multiple users ====
 +
 
 +
Unlike previous versions, Gramps 3.0 supports limited sharing of databases. Multiple users may edit the same database, just not at the same time. The Family Tree Manager will identify a database that is open for another user, and you will not be able to load the database until the other user has closed the database.
 +
 
 +
[[Image:Dbmanager06.png]]
  
The Family Tree Manager allows the user to create a new database, rename an existing database, delete a database, or load a database. All databases appear in the list, so the user does not need to worry about where the databases are located. If a database is open, and icon will appear next to the name.
+
==== Repairing a Corrupt Database ====
  
 
On the odd chance that database corruption occurs, the Family Tree Manager will show the corrupted file with an Error icon next to it. If this database is selected, a Repair button will appear (as seen below).
 
On the odd chance that database corruption occurs, the Family Tree Manager will show the corrupted file with an Error icon next to it. If this database is selected, a Repair button will appear (as seen below).
  
[[Image:Dbmanager02.png]]
+
[[Image:Dbmanager07.png]]
  
 
Clicking the repair button will rebuild the database from the backup files that are automatically created on exit.
 
Clicking the repair button will rebuild the database from the backup files that are automatically created on exit.
Line 49: Line 76:
 
===Automatic Backup Files===
 
===Automatic Backup Files===
  
To protect against file corruption problems in the Berkeley database, GRAMPS 3.0 will generate a backup file exit if any data has changed.  
+
To protect against file corruption problems in the Berkeley database, Gramps 3.0 will generate a backup file at exit if any data has changed.
 
+
Unlike 2.2, the backup files are not in XML format. The new backup files are a dump of the database tables. This allows the data to be saved quickly. One backup file exists for each primary table in the database. The backup files are not visible to the user, being held in the database directory.
+
Unlike in Gramps 2.2, the backup files are not in XML format. The new backup files are a dump of the database tables. This allows the data to be saved quickly. One backup file exists for each primary table in the database. The backup files are not visible to the user, being held in the database directory.
 
 
  
== Why no simultaneous access ?==
+
== Why no simultaneous access? ==
  
From time to time people want to use GRAMPS for collaborative research, and are then stopped as GRAMPS does not allow simultaneous access. That is, you can simultaneously access the database, but this typically results in corrupt data, destroying your database.  
+
From time to time people want to use Gramps for collaborative research, and are then stopped as Gramps does not allow simultaneous access. That is, you can simultaneously access the database, but this typically results in corrupt data, destroying your database.  
  
 
The motivation for this is the following:
 
The motivation for this is the following:
 
#We would need a server/client infrastructure, like eg MySQL. However, this is also why MySQL is non-trivial for most users to maintain. We get into system startup files, gramps daemons running, and a whole mess of other stuff. And when you consider that probably under 5% of the users would be interested in something like this, we have to wonder about our return on investment. Is our time more valuable working on other stuff?
 
#We would need a server/client infrastructure, like eg MySQL. However, this is also why MySQL is non-trivial for most users to maintain. We get into system startup files, gramps daemons running, and a whole mess of other stuff. And when you consider that probably under 5% of the users would be interested in something like this, we have to wonder about our return on investment. Is our time more valuable working on other stuff?
#Who to maintain this code? The core developers don't need this feature. Somebody should join the team to make this happen
+
#Who to maintain this code? The core developers don't need this feature. Somebody should join the team to make this happen.
  
Technically, BSDDB can be made to work with a server environment. However, in GRAMPS we give the <code>env.open()</code> the <code>DB_PRIVATE flag</code>. The docs say:
+
Technically, BSDDB can be made to work with a server environment. However, in Gramps we give the <code>env.open()</code> the <code>DB_PRIVATE flag</code>. The docs say:
;<code>DB_PRIVATE</code>:Specify that the environment will only be accessed by a single         process (although that process may be multithreaded). This flag has two effects on the Berkeley DB environment. First, all underlying data structures are allocated from per-process memory instead of from shared memory that is potentially accessible to more than a single process. Second, mutexes are only configured to work between threads.
+
;<code>DB_PRIVATE</code>:Specify that the environment will only be accessed by a single process (although that process may be multithreaded). This flag has two effects on the Berkeley DB environment. First, all underlying data structures are allocated from per-process memory instead of from shared memory that is potentially accessible to more than a single process. Second, mutexes are only configured to work between threads.
  
 
:This flag should not be specified if more than a single process is accessing the environment because it is likely to cause database corruption and unpredictable behavior. For example, if both a server application and the Berkeley DB utility db_stat are expected to access the environment, the DB_PRIVATE flag should not be specified.
 
:This flag should not be specified if more than a single process is accessing the environment because it is likely to cause database corruption and unpredictable behavior. For example, if both a server application and the Berkeley DB utility db_stat are expected to access the environment, the DB_PRIVATE flag should not be specified.
  
:Source: http://pybsddb.sourceforge.net/api_c/env_open.html
+
:Source: [http://pybsddb.sourceforge.net/api_c/env_open.html DB_ENV->open]
 +
 
 +
Note however that consecutive access from different places to the same underlying database is possible with Gramps 3.0, so a collaboration based on time sharing (using different hours to input data in Gramps) is possible.
 +
 
 +
== Detailed Changes ==
 +
 
 +
This table lists the database changes for each version.
 +
 
 +
{|{{prettytable}} 
 +
!Gramps
 +
!Database
 +
!Changes from previous version
 +
|-
 +
|Gramps 4.1 - master
 +
|17
 +
|
 +
* added Tags to Event, Place, Repository, Source, and Citation
 +
* added alternate names to Place
 +
* Source.data became SourceAttributes
 +
* Added optional support for checksum on Media object
 +
* Added PlaceRef and Place Hierarchies
 +
|-
 +
|Gramps 3.4 - 4.0
 +
|16
 +
|
 +
* converted SourceRef to new Citation object
 +
* Source and Citation objects gained a private flag
 +
|-
 +
|Gramps 3.1 - 3.3
 +
|15
 +
|
 +
* added Tags to Person, Family, and Note
 +
* added Surname list
 +
* removed Marker
 +
|-
 +
|Gramps 3.0
 +
|14
 +
|
 +
* added newyear to Dates
 +
* Replace plain text with StyledText in Notes
 +
|-
 +
|Gramps 2.x
 +
|13
 +
|
 +
* changed name formats
 +
|}
  
Note however that consecutive access from different places to the same underlying database will become possible with GRAMPS 3.0, so a collaboration based on time sharing (using different hours to input data in GRAMPS) will become possible.
+
==See also==
 +
* [[Gramps XML]]
 +
* [[Generate XML]]
 +
* [[GEDCOM]]
 +
* [[Gramps and GEDCOM]]
 +
* [[Gramps Old database]]
 +
 
 +
[[Category:Developers/General]]
 +
[[Category:Developers/Reference]]

Revision as of 23:05, 9 May 2015

History

The default data format used by Gramps has evolved over time. Each major change in format usually results in an increase in the major version number.

Gramps 1.0 and earlier

Gramps 1.0 used XML as its default format. This format is portable and easily read by both computers and people. However, it had two major issues that caused problems when it was used as the default format.

  • Slow to load and save. The entire file had to be parsed to load the data, and parsing XML is not fast. Similarly, to save any changes, the entire file had to be written. People with larger databases (thousands of persons) found the load and save times to be unusable.
  • Consumes a lot of memory. The XML format required that all data be loaded and stored in memory. Larger databases could consume all memory on the system and as a result bring the system to a virtual halt.

Gramps 2.0

To solve the capacity issues, Gramps 2.0 switched to using the Berkeley database, using the ".grdb" extension to identify the file. All database information was stored in this file. This resolved both the load/save time issues and the memory consumption issues. Using a real database backend allowed us to only load the data into memory when we needed it.

The grdb format was a significant step forward for Gramps. However, it was susceptible to data corruption. Data commits were not atomic (that is, related changes were not guaranteed to be saved all at once), so data could get corrupted if an error occurred while a change was being made.

Gramps 2.2

Gramps 2.2 started using the transaction capability of the Berkeley database. This feature ensures that all related data is committed at once. So, if an error occurs while that data is being saved, the database remains intact. Either the entire set of changes makes it to the database, or none of the changes make it.
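The all-or-nothing behaviour described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in, not the Berkeley DB transaction API that Gramps uses, and the table names and the save_family helper are hypothetical.

```python
import sqlite3


def save_family(conn, family_id, child_ids, fail=False):
    """Store a family and its children as one atomic change (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("INSERT INTO family (id) VALUES (?)", (family_id,))
            for child_id in child_ids:
                conn.execute(
                    "INSERT INTO child (family_id, person_id) VALUES (?, ?)",
                    (family_id, child_id),
                )
            if fail:
                # Stand-in for an error that happens halfway through a save.
                raise RuntimeError("simulated error while saving")
    except RuntimeError:
        return False
    return True


def count_rows(conn, table):
    # Table names are fixed literals from this sketch, never user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE family (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE child (family_id TEXT, person_id TEXT)")
conn.commit()

stored = save_family(conn, "F0001", ["I0001", "I0002"])        # whole change is kept
discarded = save_family(conn, "F0002", ["I0003"], fail=True)   # whole change is undone
```

After the failed save, neither the F0002 family row nor its child row is present, which is the property the transaction support is meant to provide.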

The problem with the approach in 2.2 is that a single file (the grdb file) is no longer enough for the Berkeley database to handle the data. An "environment" directory is needed to store the transaction log files. A place was needed to keep these files so that the user would not delete them. An environment directory under the ~/.gramps directory was chosen for this. The log files are mapped to the database using the path name of the original grdb file.

This works very well as long as the file is never moved. If the user renames the file, restores a backup of the file, or copies it to another machine, the file will no longer work, since it would no longer correlate to the log files stored under ~/.gramps.
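A small sketch shows why a path-based mapping breaks when the file moves. The exact scheme Gramps 2.2 used is not described here, so the hashing approach, the env_dir_for function, and the example paths below are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def env_dir_for(grdb_path, base="~/.gramps/env"):
    """Derive a log directory from the database file's path (hypothetical scheme)."""
    digest = hashlib.sha1(str(grdb_path).encode("utf-8")).hexdigest()[:12]
    return PurePosixPath(base) / digest


# The same file contents, reached through two different path names.
original = env_dir_for("/home/alice/trees/family.grdb")
renamed = env_dir_for("/home/alice/trees/family-2007.grdb")

# A renamed or copied file resolves to a different directory,
# so the log files written for the original path are not found.
logs_still_match = original == renamed
```

Any mapping keyed only on the path name has this weakness, whatever the details of the key derivation.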

Gramps 3.0

A new approach was taken with Gramps 3.0. While the Berkeley database is still used, the user no longer sees a file. Instead of opening the actual database file, the user opens a symbolic database name. This name maps to a subdirectory of the ~/.gramps directory that contains all the needed database files.

Since all the files are in the same directory, advanced users can back up the entire directory, preserving all of the data. New users, who may not be familiar with the Linux filesystem, do not have to worry about finding their database, since a new Family Tree Manager replaces the old Open File dialog.
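The mapping from a symbolic name to a directory could be sketched as follows. This is an illustration under stated assumptions, not the Gramps implementation: the name.txt file, the opaque directory names, and both helper functions are hypothetical.

```python
import os
import tempfile

NAME_FILE = "name.txt"  # assumed file holding the tree's display name


def create_family_tree(base_dir, name):
    """Create an empty tree directory and record its display name."""
    tree_dir = tempfile.mkdtemp(dir=base_dir)  # unique, opaque directory name
    with open(os.path.join(tree_dir, NAME_FILE), "w", encoding="utf-8") as handle:
        handle.write(name + "\n")
    return tree_dir


def list_family_trees(base_dir):
    """Return {display name: directory} for every tree found under base_dir."""
    trees = {}
    for entry in sorted(os.listdir(base_dir)):
        tree_dir = os.path.join(base_dir, entry)
        name_path = os.path.join(tree_dir, NAME_FILE)
        if os.path.isdir(tree_dir) and os.path.isfile(name_path):
            with open(name_path, encoding="utf-8") as handle:
                trees[handle.readline().strip()] = tree_dir
    return trees


base = tempfile.mkdtemp()  # stands in for the directory under ~/.gramps
created = create_family_tree(base, "Gramps Example")
trees = list_family_trees(base)
```

Because the display name lives inside the directory, renaming a tree only rewrites the name file, and copying the whole directory keeps the name and the data together.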

Family Tree Manager

The new Family Tree Manager replaces the File Open dialog. Version 3.0 does not work on files, but on Family Tree Databases. Since there is no file to open, a file open dialog makes no sense. The Open button has been replaced with a Family Trees button. Clicking this button brings up the Family Tree Manager, shown below.

ManageFamilyTrees-40.png

The Family Tree Manager allows the user to create a new database, rename an existing database, delete a database, or load a database. All databases appear in the list, so the user does not need to worry about where the databases are located. If a database is open, an icon will appear next to the name.

Versions

If the RCS revision control system is installed on your system, Gramps allows you to archive specified versions of your database. To save a version, open the Family Tree Manager and select the opened database. Simply clicking on the Archive button will save the current version to the revision control system. If a database has one or more saved versions, the databases appear as a tree view, with the available versions displayed under them.

Dbmanager03.png

In this example, the Gramps Example database has two versions which have been saved. These versions are saved snapshots of your database contents.

The advantage of storing versions is that you can go back to one of these saved versions. To select a version to restore, select the version and click on the Restore button.

Selecting a version to restore

Gramps will extract the version into a new database. The database name is based on the original database name and the revision name.

Restored version

Versions may be deleted and renamed.

Multiple users

Unlike previous versions, Gramps 3.0 supports limited sharing of databases. Multiple users may edit the same database, just not at the same time. The Family Tree Manager identifies a database that has been opened by another user, and you will not be able to load that database until the other user has closed it.

Dbmanager06.png
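One common way to enforce "one user at a time" is a lock file inside the database directory. The sketch below shows that general technique; the lock file name and the three helper functions are assumptions, not a description of how Gramps implements it.

```python
import os
import tempfile

LOCK_FILE = "lock"  # assumed name of the marker file


def acquire_lock(tree_dir, user):
    """Mark the tree as open; return True on success, False if already open."""
    lock_path = os.path.join(tree_dir, LOCK_FILE)
    try:
        # O_EXCL makes the creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(user)
    return True


def lock_owner(tree_dir):
    """Return the user recorded in the lock file, or None if the tree is closed."""
    try:
        with open(os.path.join(tree_dir, LOCK_FILE), encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        return None


def release_lock(tree_dir):
    """Remove the marker when the tree is closed."""
    os.remove(os.path.join(tree_dir, LOCK_FILE))


tree_dir = tempfile.mkdtemp()             # stands in for one database directory
first = acquire_lock(tree_dir, "alice")   # succeeds
second = acquire_lock(tree_dir, "bob")    # refused while alice has the tree open
```

Recording the user name in the lock file is what would let a manager dialog show who currently has the database open.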

Repairing a Corrupt Database

On the odd chance that database corruption occurs, the Family Tree Manager will show the corrupted file with an Error icon next to it. If this database is selected, a Repair button will appear (as seen below).

Dbmanager07.png

Clicking the repair button will rebuild the database from the backup files that are automatically created on exit.

Automatic Backup Files

To protect against file corruption problems in the Berkeley database, Gramps 3.0 will generate a backup file at exit if any data has changed.

Unlike in Gramps 2.2, the backup files are not in XML format. The new backup files are a dump of the database tables. This allows the data to be saved quickly. One backup file exists for each primary table in the database. The backup files are not visible to the user, being held in the database directory.
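The idea of one dump file per primary table, from which the database can later be rebuilt, can be sketched like this. The use of pickle, the ".backup" file names, and the short table list are assumptions chosen for illustration; the actual on-disk format is not specified here.

```python
import os
import pickle
import tempfile

PRIMARY_TABLES = ("person", "family", "event")  # illustrative subset only


def write_backup(tables, tree_dir):
    """Dump each primary table to its own file in the database directory."""
    for name in PRIMARY_TABLES:
        path = os.path.join(tree_dir, name + ".backup")
        with open(path, "wb") as handle:
            pickle.dump(tables.get(name, {}), handle, protocol=pickle.HIGHEST_PROTOCOL)


def restore_backup(tree_dir):
    """Rebuild the in-memory tables from the per-table dump files."""
    tables = {}
    for name in PRIMARY_TABLES:
        with open(os.path.join(tree_dir, name + ".backup"), "rb") as handle:
            tables[name] = pickle.load(handle)
    return tables


tree_dir = tempfile.mkdtemp()  # stands in for one database directory
data = {
    "person": {"I0001": ("Smith", "John")},
    "family": {"F0001": ("I0001", None)},
}
write_backup(data, tree_dir)
restored = restore_backup(tree_dir)
```

Dumping raw table contents avoids the cost of building and parsing XML, which matches the speed motivation given above.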

Why no simultaneous access?

From time to time people want to use Gramps for collaborative research and find that Gramps does not support simultaneous access. It is technically possible for two processes to access the database at the same time, but this typically corrupts the data and destroys your database.

The motivation for this is the following:

  1. We would need a server/client infrastructure, such as MySQL. However, this is also why MySQL is non-trivial for most users to maintain. We get into system startup files, Gramps daemons running, and a whole mess of other stuff. And when you consider that probably under 5% of the users would be interested in something like this, we have to wonder about our return on investment. Is our time more valuable working on other stuff?
  2. Who would maintain this code? The core developers don't need this feature. Somebody would have to join the team to make this happen.

Technically, BSDDB can be made to work with a server environment. However, in Gramps we pass the DB_PRIVATE flag to env.open(). The docs say:

DB_PRIVATE
Specify that the environment will only be accessed by a single process (although that process may be multithreaded). This flag has two effects on the Berkeley DB environment. First, all underlying data structures are allocated from per-process memory instead of from shared memory that is potentially accessible to more than a single process. Second, mutexes are only configured to work between threads.
This flag should not be specified if more than a single process is accessing the environment because it is likely to cause database corruption and unpredictable behavior. For example, if both a server application and the Berkeley DB utility db_stat are expected to access the environment, the DB_PRIVATE flag should not be specified.
Source: DB_ENV->open

Note however that consecutive access from different places to the same underlying database is possible with Gramps 3.0, so a collaboration based on time sharing (using different hours to input data in Gramps) is possible.

Detailed Changes

This table lists the database changes for each version.

Each entry gives the Gramps version, the database version, and the changes from the previous version.

Gramps 4.1 - master (database version 17)
  • added Tags to Event, Place, Repository, Source, and Citation
  • added alternate names to Place
  • Source.data became SourceAttributes
  • Added optional support for checksum on Media object
  • Added PlaceRef and Place Hierarchies

Gramps 3.4 - 4.0 (database version 16)
  • converted SourceRef to new Citation object
  • Source and Citation objects gained a private flag

Gramps 3.1 - 3.3 (database version 15)
  • added Tags to Person, Family, and Note
  • added Surname list
  • removed Marker

Gramps 3.0 (database version 14)
  • added newyear to Dates
  • Replace plain text with StyledText in Notes

Gramps 2.x (database version 13)
  • changed name formats
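The table can also be read as data. The following sketch records the listed changes per database version and reports what an upgrade between two versions would involve; the SCHEMA_CHANGES dictionary and the changes_between function are hypothetical names, and only the change descriptions come from the table.

```python
# Changes introduced by each database version, copied from the table above.
SCHEMA_CHANGES = {
    13: ["changed name formats"],
    14: [
        "added newyear to Dates",
        "Replace plain text with StyledText in Notes",
    ],
    15: [
        "added Tags to Person, Family, and Note",
        "added Surname list",
        "removed Marker",
    ],
    16: [
        "converted SourceRef to new Citation object",
        "Source and Citation objects gained a private flag",
    ],
    17: [
        "added Tags to Event, Place, Repository, Source, and Citation",
        "added alternate names to Place",
        "Source.data became SourceAttributes",
        "Added optional support for checksum on Media object",
        "Added PlaceRef and Place Hierarchies",
    ],
}


def changes_between(old_version, new_version):
    """List (version, change) pairs applied when upgrading old -> new."""
    return [
        (version, change)
        for version in range(old_version + 1, new_version + 1)
        for change in SCHEMA_CHANGES.get(version, [])
    ]


# Example: what changes when a version 15 database is opened by Gramps 3.4.
upgrade_steps = changes_between(15, 16)
```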

See also