GEPS 023: Storing data from large sources

From Gramps
Revision as of 04:27, 26 November 2011 by Romjerome (Talk | contribs) (Deferred)


Proposed changes to enhance the mechanism Gramps uses for storing data from ‘large’ sources.

See SVN and tarball.

User story (Problem that needs to be solved)

I have a book that details, on page 7:

“In the 1870s B moved to the town of BT. It was here that I's father K was born in 1860. By the time he was 30 he had married. His first child M was born there. Shortly afterwards his wife died and two years later he married G. M was 12 before her brother I appeared.”

So I wish to record B, K, and the fact that K was born in 1860, and married around 1890. K's children were M and I, M was born around 1890 and I was born around 1900. [Actually, from other sources he was born on 5 Dec 1902.] I need to record page 7 of this book as the source for all these pieces of information.

Some time later I decide I should record a transcript of the source text.

Some time later, I decide to scan that page of the book, and need to store the scan as the source.

Later still, I discover that page 212 of the same book details that I married W in 1946.

Now I wish to record W, and the marriage of W and I in 1946. The source for all this is page 212 of the book, and this time I record the scan against the source.

List various solutions

  • Record each page as a source reference with the book as a single source.
  • Record each page as a separate source (i.e. page 7 as one source and page 212 as a second source).
  • Modify Gramps to introduce Source Content that can be shared and can have media attachments and record each page as a Source Content.

Record each page as a source reference

The page number and the text from the page are stored in the Source Reference, while the Title and author of the book are stored in the shared Source.

This may be considered the natural approach, given the fact that the Source Reference headings include the page number, while the shared Source includes the Title and the Author etc.

This approach may be illustrated as follows (only some of the people and facts are shown here):

Separate source ref.gif

The problems with this approach are:

  • The Source Reference does not allow the Media scan to be stored.
  • The Source Reference is not shared; there is a separate instance for each place where it occurs (e.g. each event).

The media scan can be stored with the source, as shown in the figure above. However, when one looks at the source, it will not be clear which media object in the gallery relates to which source reference, except by some naming convention (the applicable media file may be obvious in this case, but not in others). Also, in the source reference editor, one cannot immediately see which scan relates to this particular page.

The fact that the source reference is not shared (and is not a separate object) means that any updates to the source reference will have to be done for each occurrence, rather than once. For example, when I want to add the transcript of the paragraph to the source reference, I need to find each source reference and update it individually. It is also difficult to find all the source references, because they are not listed in a separate tab.

Note that there is an argument that separate source references for the different events are preferable, because the exact text that relates to that particular event can be attached. For example, for the birth event for person K, one could attach: “…the town of BT. It was here that I's father K was born in 1860…”. There are two objections to this:

  • It is difficult to identify exactly which parts of the text are relevant to each event. Should I’s father be included in the source for K’s birth?
  • It is far too tedious and laborious to devise separate source texts for each event. Given that the original paragraph giving family history information (this is a genuine example) is quite short, it is much quicker and easier to include the whole paragraph in each reference.

When it comes to adding the scan, the only option I really have is to add it to the source itself, despite the fact that the scan only relates to one page.

An alternative way to record the media is that suggested in the tutorial “Recording UK Census data”. Here the media object has a source reference which contains the Date, Volume/Page and Confidence that the media is associated with. This has the advantage that the details are unambiguously associated with the media. However, the relevant media cannot be found simply from the events.

Separate source ref media.jpg

Record each page as a source

The page number and the text from the page, the book Title and the author of the book are all stored in the Source. The Source Reference does not hold any particular information.

This arrangement can be illustrated as here (again only some of the people and facts are shown here):

Separate source.gif

This arrangement has the advantage that the transcript and media scan can be directly attached to the relevant page.

However it has a number of problems:

  • the Volume/Page field in the source reference is not used for its standard purpose
  • the information about the book itself is duplicated many times in each source object
  • the approach rapidly becomes unmanageable if there are a large number of pages to be recorded for a single source (i.e. if the source is 'large')

Modify Gramps to introduce a Citation object that can be shared and can have media attachments and record each page as a Citation

A new object which we call here a Citation is introduced.

A CitationBase would replace the current SourceRef. It would not have any associated fields (just the reference to the Citation). The CitationBase plus the Citation would be equivalent to the SourceRef.

The Source object is unchanged.


The existing icon options in the Source tabs would be unchanged, but add existing source would bring up a treeview Source-Citation.

Add source citation.jpg

The treeview would be as follows:

Select source.jpg

The user would be able to select a Source as shown by the highlighting. In this case, the subsequent Citation dialogue would not be populated in the Citation area, only in the Source area. On the other hand, if a Citation were to be selected, then the subsequent Citation dialogue would be opened already populated with the shared Citation data.

The source reference editor would be changed to a Source-Citation editor as follows:

Source citation changes.jpg

Note that there is no need for a CitationBase editor, because this does not have any associated properties.

Recording the information about the people would proceed exactly as at present. The first source to be recorded would display an empty Citation editor, which would be completed in the normal way. Subsequent data would be linked to the same Citation object.

When it comes to recording the transcript of the source text, there would only be a single Citation object, to which the transcript would be added as a Note.

The scan of the page could be added as a Gallery item in the Citation object.

A new display category would be needed for the Citation, and it would be important to implement the merge facility for this category so that existing separate Citations could be merged.

There is no particular need to provide a means to make an existing Citation object refer to a different Source object. At present, it is not possible to change a source reference so as to preserve the information that has been input (such as date, Volume/Page, confidence or the links to notes) but to make it point to a different source. The source reference must be deleted and a new one created. Similarly, if one wanted to change a Citation to refer to a different source, the Citation would have to be deleted.


Either of the two ways of using the current Gramps features has difficulties. This is shown not only by the points made in the descriptions above, but also by the fact that there have been many discussions in the lists about how to use the features.

The new approach makes it simple to implement the given user story. The story focusses on recording information from a book, because that is a scenario which everyone can understand and relate to. However, it applies equally to other scenarios, like recording census data.

The new approach is a very simple extension to the current features from the user's point of view. Gramps can be used exactly as at present with no additional inputs required from the user (the workflow is completely unchanged). If a particular Citation is to be shared, then again there is no need for any additional inputs, just use of the feature to add an existing Citation, instead of creating a new one. Apart from the Citation category display there are no additional screens.

The new approach is somewhat similar to the subsource approach, in which a Subsource corresponds with a Citation. However, in order to avoid any extra complexity in the approach, it is proposed that the notes are kept to the 'subsource'/Citation only, and not to the reference to them.

Using subsources can remove the above problems, as a subsource would be a nucleus information set (one line in a census, one marriage act, ...) connected to a source.

Why not use the sourceref?

The sourceref cannot be used for this! The text of an entire source, or an image of an entire source is something you need to share between objects, so it must be a source, not a sourceref. The note in the sourceref should only be used to explain how the information of the source was used for the object.

However, it is true that a subsource can have a date/page/volume/number within the larger source. Note that in the census example this is repeated on *all* sourceref objects. In the case of a subsource, it would be entered once in the subsource, and then not be repeated in the sourceref. I don't care too much about this, as in the cases where a subsource is useful there is no problem mentioning this in the note, ... section.

Is sourceref useless then?

I don't think so. For sources from which little information is learnt, or real books (bibliography, diary, ...) the sourceref is ideal. However, for sources which contain many unrelated short subparts, each of which can lead to a large number of changes throughout your genealogical database, the sourceref is less useful, and a subsource is in order.

Note that adding a hierarchy of subsource objects would introduce much greater complexity, which is not warranted by the problems outlined.


The problem is still there when you discover a mistake in the source ref (page number), or want to add a note to these source references (e.g. a transcript of the original Latin text). Then you need to track down all the source refs that were copies, and change them all. This is why in my workflow I keep most notes in the source object, where the media files (scans) also live. This is why I think about a subsource implementation; it does not hurt the people who want to keep working as they do now. The main thing holding such a thing back is GEDCOM though. The more we deviate from that in GRAMPS, the more difficult it is to map our internal data to something that can e.g. be uploaded to websites logically as you want it. It is really hard to keep having to drag an old and dead standard with us :-(

The proposed approach is directly compatible with GEDCOM; variants like hierarchical sources would not be.


There have been a large number of posts about this topic on the Gramps mailing lists. A selection is shown below.

  1. database issues sourceref references (2005) among other things suggests "We want to allow in the future that the sourceref can have a media coupled to it"
  2. sources subsources and sourceref (2007) proposes subsources between sources and objects
  3. GEDCOM and sourceref (2007) too much information in sourceref/citation or problem summarised as need to manually edit every independent source reference to change a page number
  4. Sources and sourceref (2007) second proposal is for "sourceref is made a primary object that an object can share, but is unique to a source source 1 -----> n sourceref n <----> n object (person, attribute, event, ...)"
  5. Media and attributes in data model for Gramps 3.0 (2007)
  6. local gallery tab in source reference (2007)
  7. nested sources - sourceref - one big source (2007)
  8. sources vs. repositories (2007) "Some genealogy apps allow to subdivide a source in pieces, eg divide a book in chapters. GRAMPS does _not_ have this at the moment"
  9. Sources, media and galleries: how to tie it all up "as intended" (2008)
  10. Medias, sources and sourcerefs (2009) concerned with relationship between media and sourceref
  11. Source references names and notes (2009) "have to stop using my sourceref approach and just put everything as a source, which I find clumsy and lacking in elegance. Since sourcerefs can't be shared, only copied, it is extremely difficult to keep track of them."
  12. Storing data from large sources (2010)
  13. Sharing sources (2010) actually he is talking about sharing source references (i.e. similar to the current proposal but with different terminology)

Other changes

Some other related changes have been suggested here:

  1. Use the Data key-value pairs field to store Publication data in the Source, and split the Volume/Page in the Citation, and add a couple of extra fields. This seems to be related to changing the data stored in a Citation as part of GEPS 018: Evidence style sources. It may be better to wait till GEPS 018 is resolved.
  2. Adding deduction content to the CitationRef, namely a type, confidence argument and set of notes, and a global confidence field to the Source. This is related to the BetterGEDCOM and methodology proposals.


The design is based on the 'Other changes' not being carried out at this time.

The first is not applied because the approach to providing the information for GEPS 018: Evidence style sources has not yet been decided.

The second is not applied because it would be better as the subject of a separate GEPS.

The design is intended to have minimal change from the existing user interface. Aunt Martha should be able to continue to use Gramps just as at present. Only if a user wants to take advantage of the ability to share Citations should the user need to be aware of the changes.

Database changes

The Source PrimaryObject is unchanged.

A new Citation PrimaryObject has the following content:

  1 RefBase  --> Source
  1 Gramps Id
  1 Confidence (5 values)
  1 Volume/Page
  1 Log Date (The date that this data was entered into the original source document)
  n Information (key-value pairs, current Data)
  n NoteIds
  n MediaRef (Region, Src, attr, notes)  --> Media
  1 Private
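
As an illustration only, the field list above could be modelled roughly as follows. This is a minimal sketch, not the Gramps implementation: the class and attribute names are assumptions chosen to mirror the list, confidence is represented as an integer 0 to 4 simply because the proposal specifies five values, and the MediaRef shown carries only a subset of the fields mentioned.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MediaRef:
    """Hypothetical reference to a Media object (subset of the listed fields)."""
    media_handle: str                                  # --> Media
    region: Optional[Tuple[int, int, int, int]] = None
    note_ids: List[str] = field(default_factory=list)


@dataclass
class Citation:
    """Hypothetical Citation record mirroring the field list above."""
    source_handle: str                                 # RefBase --> Source (exactly one)
    gramps_id: str
    confidence: int = 2                                # one of five values, assumed 0..4
    volume_page: str = ""
    log_date: str = ""
    information: Dict[str, str] = field(default_factory=dict)  # key-value pairs
    note_ids: List[str] = field(default_factory=list)
    media_refs: List[MediaRef] = field(default_factory=list)
    private: bool = False

    def __post_init__(self):
        # A Citation always refers to one and only one Source.
        if not self.source_handle:
            raise ValueError("a Citation must refer to a Source")
        if self.confidence not in range(5):
            raise ValueError("confidence must be one of five values")


# Page 7 of the book, with its scan attached to the Citation rather than the Source.
page7 = Citation(source_handle="S0001", gramps_id="C0001", volume_page="p. 7")
page7.media_refs.append(MediaRef(media_handle="M0001"))
```

With this shape, the transcript and the scan from the user story attach to the page-level object, while the book's title and author stay on the Source.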

A Citation object always refers to one and only one Source object. Therefore when creating a new Citation object, one first chooses a Source object to refer to. When deleting a Source object, all Citations that refer to that Source object must first be deleted, and references to each of those Citation objects must be deleted before the Citation object itself.
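
The deletion order described above can be sketched with plain dictionaries standing in for the database. The layout of `db` and the function name `delete_source` are assumptions made for this example; only the ordering (references first, then Citations, then the Source) comes from the text.

```python
def delete_source(db, source_handle):
    """Delete a Source after removing its Citations and all references to them."""
    doomed = [handle for handle, citation in db["citations"].items()
              if citation["source"] == source_handle]
    for citation_handle in doomed:
        # 1. Drop every reference to the Citation from the objects that cite it.
        for obj in db["objects"].values():
            obj["citation_list"] = [h for h in obj["citation_list"]
                                    if h != citation_handle]
        # 2. Only then delete the Citation itself.
        del db["citations"][citation_handle]
    # 3. Finally delete the Source.
    del db["sources"][source_handle]


db = {
    "sources": {"S1": {"title": "The book"}, "S2": {"title": "Census"}},
    "citations": {"C1": {"source": "S1", "page": "7"},
                  "C2": {"source": "S1", "page": "212"},
                  "C3": {"source": "S2", "page": "3"}},
    "objects": {"E1": {"citation_list": ["C1", "C3"]},
                "E2": {"citation_list": ["C2"]}},
}
delete_source(db, "S1")
```

After the call, only the Census source and its Citation remain, and the events no longer point at the deleted Citations.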

A new CitationBase object is simply a list of references to Citations, with no attributes. This object is analogous to the NoteBase and TagBase objects.
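
A CitationBase of this kind might look like the mixin below. This is a sketch under assumptions: the method names are hypothetical and merely modelled on the idea of a handle list; the `Event` class is a stand-in for any object that can carry citations.

```python
class CitationBase:
    """Mixin holding a list of Citation handles, with no attributes of its own."""

    def __init__(self):
        self.citation_list = []

    def add_citation(self, handle):
        # Sharing a Citation means several objects hold the same handle.
        if handle not in self.citation_list:
            self.citation_list.append(handle)

    def remove_citation(self, handle):
        """Remove the handle if present; report whether anything changed."""
        if handle in self.citation_list:
            self.citation_list.remove(handle)
            return True
        return False

    def has_citation_reference(self, handle):
        return handle in self.citation_list


class Event(CitationBase):
    """Stand-in for a primary object that can be cited."""

    def __init__(self, description):
        super().__init__()
        self.description = description


birth = Event("Birth of K")
marriage = Event("Marriage of K")
birth.add_citation("C0001")       # page 7 of the book
marriage.add_citation("C0001")    # the same shared Citation
```

Because both events hold the same handle, adding the transcript or scan to Citation C0001 once makes it visible from both.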

The proxies will also need to be updated.


When upgrading from an old database version all objects that have a SourceRef need to be changed to the new format. The primary objects Person, Family, Event, Media and Place contain SourceRef objects. These Primary objects also contain secondary objects which have SourceRef objects.

Repository (Repositories themselves do not have SourceRefs)

Each old SourceRef object should be used to create a new Citation record. The old SourceRef will be replaced by a new CitationBase.

Because upgrading an old database version is automatic, the program should not prejudge the user's intention for similar SourceRefs. Therefore, SourceRefs should only be merged if they have the same Volume/Page, Date, Confidence and source and all Notes refer to shared copies of the same Notes (this will be the case where a SourceRef had been created by dragging and dropping on the Clipboard). It would be convenient if there were a separate Gramplet to automatically merge Citation objects on less stringent and probably configurable criteria. Such a Gramplet would need to merge Notes into the merged Citation.

   Upgrade needs to process every SourceRef in primary objects and secondary objects.
   for each SourceRef:
       assemble Volume/Page, Date, Confidence and SourceId and all NoteIds
       for each Citation:
           if Volume/Page, Date, Confidence, SourceId and all NoteIds are the same:
               use the existing Citation
       if no match, then create a new Citation
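
The upgrade loop above could be written in Python roughly as follows. This is a sketch, not the actual upgrade code: SourceRefs are represented as plain dictionaries with assumed key names, the Citation ids are generated with a made-up scheme, and a dictionary lookup replaces the inner "for each Citation" scan (the matching rule is the same). Notes are compared as a set of ids, on the assumption that order does not matter.

```python
def upgrade_sourcerefs(sourcerefs):
    """Turn SourceRef dicts into shared Citations.

    Returns (citations, links): citations maps each new id to its data, and
    links gives, for each input SourceRef in order, the id that replaces it.
    """
    citations = {}
    by_key = {}
    links = []
    for ref in sourcerefs:
        # Strict criteria: every field and the full set of notes must match.
        key = (ref["page"], ref["date"], ref["confidence"], ref["source_id"],
               frozenset(ref["note_ids"]))
        citation_id = by_key.get(key)
        if citation_id is None:
            # No match: create a new Citation.
            citation_id = "C%04d" % (len(citations) + 1)
            citations[citation_id] = dict(ref)
            by_key[key] = citation_id
        links.append(citation_id)
    return citations, links


refs = [
    {"page": "p. 7", "date": "", "confidence": 2, "source_id": "S1", "note_ids": ["N1"]},
    {"page": "p. 7", "date": "", "confidence": 2, "source_id": "S1", "note_ids": ["N1"]},
    {"page": "p. 7", "date": "", "confidence": 2, "source_id": "S1", "note_ids": ["N2"]},
    {"page": "p. 212", "date": "", "confidence": 2, "source_id": "S1", "note_ids": []},
]
citations, links = upgrade_sourcerefs(refs)
```

Here the first two SourceRefs share a Note and collapse into one Citation, while the third keeps its own Citation because its Note differs.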

Should the criteria for matching Citations be weaker, for example just the Volume/Page, Date, Confidence and source matching, with Notes from each SourceRef being added into the Citation?
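
To make the weaker alternative concrete, the sketch below merges Citations on a configurable set of fields and pools their Notes, as a separate merge Gramplet might. The function name, the dictionary keys and the default field tuple are assumptions for illustration.

```python
def merge_citations(citations, match_fields=("page", "date", "confidence", "source_id")):
    """Merge citations that agree on match_fields, pooling their note ids."""
    merged = {}
    for citation in citations:
        key = tuple(citation[name] for name in match_fields)
        if key not in merged:
            # First citation with this key: copy it, including its note list.
            merged[key] = dict(citation, note_ids=list(citation["note_ids"]))
        else:
            # Later matches only contribute notes not already present.
            for note_id in citation["note_ids"]:
                if note_id not in merged[key]["note_ids"]:
                    merged[key]["note_ids"].append(note_id)
    return list(merged.values())


candidates = [
    {"page": "p. 7", "date": "", "confidence": 2, "source_id": "S1", "note_ids": ["N1"]},
    {"page": "p. 7", "date": "", "confidence": 2, "source_id": "S1", "note_ids": ["N2"]},
    {"page": "p. 7", "date": "", "confidence": 3, "source_id": "S1", "note_ids": ["N3"]},
]
strict = merge_citations(candidates)
loose = merge_citations(candidates, match_fields=("page", "source_id"))
```

With the default fields, the differing Confidence keeps the third Citation separate; dropping Confidence from `match_fields` merges all three. In the loose case the merged record simply keeps the first Citation's Confidence, which is one reason such merging is better left to an explicit user action than to the automatic upgrade.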


The following formats will need to be updated: Gramps XML, GEDCOM, CSV, GeneWeb.

Gramps XML

This will need a new <Citations> section with <Citation> entries.

Gramps should be able to import both old and new versions of the Gramps XML. If a <sourceref> tag appears outside of a <Citation> entry then it will indicate an old version.
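
The version check described above could be sketched as below. The element names follow the wording of this section (`Citation`, `sourceref`) and are assumptions about the eventual schema; the sample documents are invented and omit the XML namespace that a real Gramps XML file carries.

```python
import xml.etree.ElementTree as ET


def is_old_format(xml_text):
    """Return True if any <sourceref> occurs outside a <Citation> entry."""
    root = ET.fromstring(xml_text)
    total = sum(1 for _ in root.iter("sourceref"))
    inside = sum(1 for citation in root.iter("Citation")
                 for _ in citation.iter("sourceref"))
    # Any sourceref not accounted for by a Citation marks the old layout.
    return total > inside


OLD = "<database><events><event><sourceref hlink='S1'/></event></events></database>"
NEW = ("<database><Citations><Citation handle='C1'><sourceref hlink='S1'/>"
       "</Citation></Citations></database>")

old_detected = is_old_format(OLD)
new_detected = is_old_format(NEW)
```

An importer could use such a check to decide whether to run the SourceRef-to-Citation conversion on the imported data.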


  n @<XREF:SOUR>@ SOUR {1:1}                              
    +1 DATA {0:1}                                         
      +2 EVEN <EVENTS_RECORDED> {0:M}                    
        +3 DATE <DATE_PERIOD> {0:1}                     
        +3 PLAC <SOURCE_JURISDICTION_PLACE> {0:1}       
      +2 AGNC <RESPONSIBLE_AGENCY> {0:1}                
      +2 <<NOTE_STRUCTURE>> {0:M}                       
Not supported. See feature request 1371. Discussion in Role and event tags...
If the publication data is changed to key-value pairs, then on import, this GEDCOM field should be stored as a predefined key (e.g. PUBL), and on export all the values should probably be concatenated with comma separators.
    +1 TEXT <TEXT_FROM_SOURCE> {0:1}
Not directly supported. Note that in Gramps, one would store the text from source in a note. On import, the 'text from source' should probably be stored as the contents of a Source:NoteId. On export this GEDCOM field would not be output.
Contents stored as the content of a Source:RepoRef
Not supported for data interchange (discussion in REFN strategy)
Not supported for data interchange. The Source:gramps_id can be considered to be the AUTOMATED_RECORD_ID
    +1 <<CHANGE_DATE>> {0:1}
Source:change (automatically maintained by Gramps)
    +1 <<NOTE_STRUCTURE>> {0:M}
Contents stored as the content of a Source:NoteId
    +1 <<MULTIMEDIA_LINK>> {0:M}
Contents stored as the contents of a Source:MediaRef
Not supported Source:Global Confidence

  n SOUR @<XREF:SOUR>@ {1:1} p.27
    +1 PAGE <WHERE_WITHIN_SOURCE> {0:1} p.64
    +1 EVEN <EVENT_TYPE_CITED_FROM> {0:1} p.49
      +2 ROLE <ROLE_IN_EVENT> {0:1} p.61
Not supported. See feature requests 2918 and 2924 (which are mostly duplicates of each other).
    +1 DATA {0:1}
      +2 DATE <ENTRY_RECORDING_DATE> {0:1} p.48
Citation:Log Date
      +2 TEXT <TEXT_FROM_SOURCE> {0:M} p.63
        +3 [CONC|CONT] <TEXT_FROM_SOURCE> {0:M}
Not directly supported. Note that in Gramps, one would store the text from source in a note. On import, the 'text from source' should probably be stored as the contents of a Citation:NoteId. On export this GEDCOM field would not be output.
    +1 <<MULTIMEDIA_LINK>> {0:M} p.37, 26
Contents stored as the contents of a Citation:MediaRef
    +1 <<NOTE_STRUCTURE>> {0:M} p.37
Contents stored as the contents of a Citation:NoteId. On export all the Data:Value pairs should probably be concatenated with comma separators into another separate note. On export, the fields in the CitationRef should probably be output as notes.
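
As a rough illustration of the export side of this mapping, the sketch below emits the GEDCOM lines for one citation and concatenates the key-value pairs into a separate note with comma separators. The function name, its parameters and the "key: value" note format are assumptions; media links and ordinary notes are left out to keep the example short.

```python
def citation_to_gedcom(level, source_id, page, log_date, information):
    """Return GEDCOM lines for a source citation starting at the given level."""
    lines = ["%d SOUR @%s@" % (level, source_id)]
    if page:
        lines.append("%d PAGE %s" % (level + 1, page))
    if log_date:
        # Citation:Log Date maps to DATA / DATE.
        lines.append("%d DATA" % (level + 1))
        lines.append("%d DATE %s" % (level + 2, log_date))
    if information:
        # All key-value pairs go into one extra note, comma separated.
        note = ", ".join("%s: %s" % (key, value)
                         for key, value in information.items())
        lines.append("%d NOTE %s" % (level + 1, note))
    return lines


exported = citation_to_gedcom(2, "S0001", "p. 7", "", {"Film": "1234", "Item": "2"})
```

For the example call, `exported` holds the SOUR line, a PAGE line and a single NOTE line containing both pairs.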

User Interface changes

Source model and view

The existing source model and the source view in the navigator are retained unchanged. This minimises the user interface changes. Source objects can be created, edited and deleted as at present.

Citation model

New CitationTreeModel and CitationListModel are introduced. The models encompass both Source objects and Citation objects. The tree model should have lines for Sources with a disclosure triangle, and subsidiary lines for Citations. Fields for Source should include Title, Author, ID, Abbreviation, Publication information and date last changed. Fields for Citation should include Volume/Page, ID, Date, Confidence and date last changed. The models should allow either a Source or a Citation to be selected, with the returned value indicating which was chosen.
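
The two-level structure can be sketched without any GUI toolkit. The code below is only an illustration of the idea (Source rows with Citation rows beneath them, and a selection that reports which kind was picked); the row layout and function names are assumptions and do not reflect the actual model classes.

```python
def build_citation_tree(sources, citations):
    """Return a list of (source_row, [citation_rows]) sorted by Source title."""
    tree = []
    for handle, source in sorted(sources.items(), key=lambda item: item[1]["title"]):
        # Citation rows keep the order in which they were stored.
        children = [("citation", c_handle, citation["page"])
                    for c_handle, citation in citations.items()
                    if citation["source"] == handle]
        tree.append((("source", handle, source["title"]), children))
    return tree


def selected_kind(row):
    """The first element of a row tells the caller what was selected."""
    return row[0]


sources = {"S2": {"title": "Census 1881"}, "S1": {"title": "A family history"}}
citations = {"C1": {"source": "S1", "page": "p. 7"},
             "C2": {"source": "S1", "page": "p. 212"},
             "C3": {"source": "S2", "page": "Folio 3"}}
tree = build_citation_tree(sources, citations)
book_row, book_pages = tree[0]
```

A selector built on such a structure can open the editor with only the Source area populated when `selected_kind` returns "source", and with both areas populated when it returns "citation".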

This arrangement would allow sorting by Volume/Page (in case this were useful for some users - e.g. if Title is Birth certificate or Marriage certificate, and Volume/Page is the name of the individual).

Citation Selector

This uses the new CitationTreeModel.

Select source.jpg

The display for the source selector comprises Title/Page and Id.

Citation View

This uses the new citation models.

Add buttons to select Source view or Source Tree view. The default display for either source view comprises Title/Page, Id, Date and Confidence.

There are several possible objectives to add/edit:

  1. Edit a Citation (Citation view: select Citation, click Edit, - allows Citation and Source to be changed),
  2. Edit a Source (Source view: select Source, click Edit - only allows Source to be changed),
  3. Add a new Source (Source view: click Add - allows Source to be added),
  4. Add a new Citation to an existing Source (Citation view: click Add; select the source - allows new Citation, and allows the Source to be changed).
  5. Add a new citation to a new source (Source view: add the Source; then Citation view: add the Citation)

In the Citation view:

  • If the "Add a new Citation" button is clicked: Bring up the Source selector. When a Source is chosen, bring up the Source-Citation editor with the specified Source populated. On clicking OK, store the new Citation and if changed, update the Source object.
  • If the "Edit" button is clicked, and a Citation is highlighted, the Source-Citation editor should allow both the Source and Citation to be changed.
  • If the "Edit" button is clicked, and a Source is highlighted: Do nothing
  • If the "Remove" button is clicked, then the highlighted object and all objects that reference it should be removed.

Editor Source tabs

CitationEmbedList replaces all occurrences of SourceEmbedList. The default fields displayed and the buttons etc. remain unchanged, except that the ID field contains the Citation ID rather than the Source ID.

The same Source-Citation editor is used as for editing a Citation.

If the "Create and add a new citation" button is clicked, then allow the creation of both Source and Citation objects. This is consistent with the current model, where "Create and add a new source" adds a new Source object and creates the current sourceref. It is distinct from the "Add an existing source" which allows either an existing Source or an existing Citation and associated Source to be selected. Bring up the Source-Citation editor with all fields empty. On clicking OK, save both a new Source and a new Citation. Error if the source is not entered. Link the CitationBase to the new Citation.

If the "Add an existing source" (Share) button is clicked, the citation selector is displayed. The user will select either a Source or a Citation. Pre-populate the editor according to what has been selected. On clicking OK, save either a new or an updated Citation; save the Source if it was updated (actually, the Source seems to be re-saved even if it was not changed, which may affect the last changed date incorrectly). Link the CitationBase to the new Citation.

If the "Remove" button is clicked, then the highlighted Citation should be removed, together with the links to the citation. The Source is not removed.


There are two separate 'citation/source' editors:

Add source 1.png
Source citation changes.jpg
editcitation. When used with a Citation reference that already has data, this would not show the warning signs in the top half.

editsource is used from

  • Source view when selecting a source and clicking the "Edit" button,
  • Source view when clicking the "Add" button.

Existing editsource is unchanged.

editcitation is used from

  • Citation view when selecting a Citation and clicking the "Edit" button,
  • Citation view when clicking the "Add a new citation" button (following the display of the source selector, and selection of a source),
  • Editor Source tabs (CitationEmbedList) when clicking the 'Create and add a new citation' button.
  • Editor Source tabs (CitationEmbedList) when clicking the 'Edit the selected citation' button.
  • Editor Source tabs (CitationEmbedList) when clicking the 'Add an existing citation' button.

This is similar to the current editsourceref. On clicking OK,

  • if a Citation was passed in, update the Citation and the linked Source,
  • if nothing was passed in: if the Citation is blank, store the new Source; otherwise store a new Citation and Source,
  • if a Source is passed in, add a new Citation and update the Source.
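
The three cases handled on clicking OK can be sketched as a single function over a dictionary-based stand-in for the database. Everything here (the function name `on_ok`, the handle numbering, treating a Citation as blank when all its fields are empty) is an assumption for illustration, not the editcitation code.

```python
def on_ok(db, source_data, citation_data, citation_handle=None, source_handle=None):
    """Commit editor contents; return (source_handle, citation_handle or None)."""
    if citation_handle is not None:
        # A Citation was passed in: update it and its linked Source.
        source_handle = db["citations"][citation_handle]["source"]
        db["sources"][source_handle].update(source_data)
        db["citations"][citation_handle].update(citation_data)
        return source_handle, citation_handle
    if source_handle is None:
        # Nothing was passed in: a new Source is always stored.
        source_handle = "S%04d" % (len(db["sources"]) + 1)
        db["sources"][source_handle] = dict(source_data)
        if not any(citation_data.values()):
            return source_handle, None        # blank Citation: Source only
    else:
        # A Source was passed in: update it, then add a new Citation.
        db["sources"][source_handle].update(source_data)
    citation_handle = "C%04d" % (len(db["citations"]) + 1)
    db["citations"][citation_handle] = dict(citation_data, source=source_handle)
    return source_handle, citation_handle


db = {"sources": {}, "citations": {}}
first = on_ok(db, {"title": "The book"}, {"page": ""})
second = on_ok(db, {"title": "The book"}, {"page": "p. 7"}, source_handle=first[0])
third = on_ok(db, {"title": "The Book, 2nd ed."}, {"page": "p. 8"},
              citation_handle=second[1])
```

The three calls exercise, in turn, the "nothing passed in with a blank Citation", "Source passed in" and "Citation passed in" branches.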


Do we need extra rules to match a Citation?


Reports access Source References through the Bibliography and Endnotes functionality. This allows the changes to be made in a single place.

Some changes will be needed in the Narrative Web and Simple Database Access functionality.

Known Issues

The following are still to be done before merge into trunk

  • src/gen/
  • src/plugins/textreport/,, Update for use of bibliography
  • src/plugins/export/ Change references to SourceRef to Citation
  • src/ Change references to SourceRef to Citation
  • add a flat sources view
  • check the sorting on the 'Date' column in the Sources and Citations views.

The following need to be changed or removed as they are no longer used (I have not removed them yet because I am still referring to the old code for checking purposes)

src/lib/ remove srcref
src/lib/ remove
src/lib/ remove
src/gui/editors/displaytabs/ remove
src/gui/editors/displaytabs/ remove
src/gui/editors/ remove editsourceref
src/gui/editors/ remove

I believe that all the known issues below can be deferred till after merging GEPS023 into trunk.


  • Count at bottom right corner of window in citation tree view is incorrect.
  • Add ID code for citations to the preferences dialog.
  • Checks and warnings on 'Save' and 'Cancel' in EditCitation are not quite right.


  • src/plugins/import/ Needs to be updated for citations
  • src/plugins/export/ Needs to be updated for citations
  • src/plugins/import/ Needs changing for citations, but I can't do this as I don't have access to the application : I should be able to do that --Romjerome 01:57, 26 November 2011 (MST)
  • src/plugins/export/ Needs changing for citations, but I can't do this as I don't have access to the application : I should be able to do that --Romjerome 01:57, 26 November 2011 (MST)
  • src/plugins/import/ Needs changing for citations, but I can't do this as I don't have access to the application
  • src/plugins/tool/ This has been changed so that it runs to some extent, but it needs further enhancement to specifically test citation data
  • src/plugins/tool/ This has been changed so that it runs, but it needs further enhancement to specifically verify citation data
  • src/webapp/grampsdb/  ??
  • src/webapp/grampsdb/  ??
  • src/webapp/grampsdb/  ??
  • SimpleAccess, SimpleDoc and SimpleTable Needs to be changed for citations, but I am not yet sure how to test these.
  • po/gramps.pot New files that need translation to be added [1]


  • The icon for citation might need to be changed to more closely meet the Tango guidelines.
  • Should the default views be changed (e.g. remove the separate source and citation views and provide flat and tree presentations as separate presentations within a single view)? Also, should there be a flat sources view?
  • In the citationtreeview you can't filter on citation criteria (e.g. confidence) - This is because of the way the views work, the 'Source view' (or citationtreeview) is based on Sources only, so sorting, searching and filtering are just based on sources. If you want to filter the citations, then you have to use the citations view, and the citations filter in that view.
  • src/gen/proxy/ I have updated this for citations, but it does not appear to be used, so I have not tested it
  • src/plugins/export/ Not changed for citations because it does not appear to be used
  • src/plugins/import/ Don't plan to change this as it appears to be outdated - should it be removed?
  • src/test/GrampsDb/ ?? Is this actually used? could be done after merge into citations
  • src/lib/test/ I don't plan on upgrading this because it does not seem to be used
  • src/Merge/ I don't plan on upgrading this because it does not seem to be used

The test modules have not been updated because of the following comment, made in reply to the remark "The unit tests seem pretty skimpy":

Yes, somebody started it and it got sidetracked. We hardly have time to develop, so unit testing is not something that is in the DNA of our community. Personally I believe that for the GUI it is too much work to maintain for a small devel community. For the library stuff, I would love to see something easy. However, I never run unit tests, and never heard how to actually run those that are present. The only thing I do for certain parts is add tests at the bottom of the python file that run when the script is executed stand alone (see eg During my time at Gramps only those tests have actually caught anything; I never heard of the existing unit tests actually catching any bugs. Don't get me wrong, I take testing of my own code very seriously, but I have not yet seen anybody propose a unit test system for the Gramps codebase that is sufficiently developer friendly that anybody actually runs the tests.



Bug tracker

1022: Sources dialog hangs for 2 minutes after opening
2918: Add Gedcom Source Citation Fields
4491 (Feature request): Matching source and quality level
4913 (Feature request): Additional Event Filters

Other interfaces