GEPS 023: Storing data from large sources
Proposed changes to enhance Gramps by improving the mechanism for storing data from ‘large’ sources
- 1 User story (Problem that needs to be solved)
- 2 List various solutions
- 3 Discussion
- 4 Design
User story (Problem that needs to be solved)
I have a book that details, on page 7:
“In the 1870s B moved to the town of BT. It was here that I's father K was born in 1860. By the time he was 30 he had married. His first child M was born there. Shortly afterwards his wife died and two years later he married G. M was 12 before her brother I appeared.”
So I wish to record B and K, together with the facts that K was born in 1860 and married around 1890. K's children were M and I; M was born around 1890 and I was born around 1900. [Actually, from other sources he was born on 5 Dec 1902.] I need to record page 7 of this book as the source for all these pieces of information.
Some time later I decide I should record a transcript of the source text.
Some time later, I decide to scan that page of the book, and need to store the scan as the source.
Later still, I discover that page 212 of the same book details that I married W in 1946.
Now I wish to record W, and the marriage of W and I in 1946. The source for all this is page 212 of the book, and this time I record the scan against the source.
List various solutions
- Record each page as a source reference with the book as a single source.
- Record each page as a separate source (i.e. page 7 as one source and page 212 as a second source).
- Modify Gramps to introduce SourceCitation that can be shared and can have media attachments and record each page as a SourceCitation.
Record each page as a source reference
The page number and the text from the page are stored in the Source Reference, while the Title and author of the book are stored in the shared Source.
This may be considered the natural approach, given that the Source Reference headings include the page number, while the shared Source includes the Title, the Author etc.
This approach may be illustrated as follows (only some of the people and facts are shown here):
The problems with this approach are:
- The Source Reference does not allow the Media scan to be stored.
- The Source Reference is not shared; there is a separate instance for each place where it occurs (e.g. each event).
The media scan can be stored with the source, as shown in the figure above. However, when one looks at the source, it will not be clear which media object in the gallery relates to which source reference, except by some naming convention (the applicable media file may be obvious in this case, but not in others). Also, in the source reference editor, one cannot immediately see which scan relates to this particular page.
The fact that the source reference is not shared (and is not a separate object) means that any updates to the source reference will have to be done for each occurrence, rather than once. For example, when I want to add the transcript of the paragraph to the source reference, I need to find each source reference and update it individually. It is also difficult to find all the source references, because they are not listed in a separate tab.
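The update problem can be sketched in a few lines of Python. This is only an illustrative model, not Gramps code: the class names `Source`, `SourceRef` and `Event`, their fields, and the placeholder title and author are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """The shared Source: holds book-level details (hypothetical model)."""
    title: str
    author: str


@dataclass
class SourceRef:
    """A per-use reference: not shared, so each event owns its own copy."""
    source: Source
    page: str
    text: str = ""


@dataclass
class Event:
    description: str
    source_refs: List[SourceRef] = field(default_factory=list)


# Placeholder book details; the real title and author are not given here.
book = Source(title="The book", author="(author)")
events = [
    Event("Birth of K", [SourceRef(book, "7")]),
    Event("Marriage of K", [SourceRef(book, "7")]),
    Event("Birth of M", [SourceRef(book, "7")]),
]

# Adding the transcript to one reference leaves the other copies untouched.
events[0].source_refs[0].text = "transcript of page 7"
updated_once = [ref.text for ev in events for ref in ev.source_refs]

# So every occurrence has to be found and edited individually.
for ev in events:
    for ref in ev.source_refs:
        if ref.source is book and ref.page == "7":
            ref.text = "transcript of page 7"
updated_all = [ref.text for ev in events for ref in ev.source_refs]
```

After the first assignment only one of the three references carries the transcript; the loop is the manual "find each one and update it" step described above.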
Note that there is an argument that separate source references for the different events are preferable, because the exact text that relates to that particular event can be attached. For example, for the birth event for person K, one could attach: “…the town of BT. It was here that I's father K was born in 1860…”. There are two objections to this:
- It is difficult to identify exactly which parts of the text are relevant to each event. Should I’s father be included in the source for K’s birth?
- It is far too tedious and laborious to devise separate source texts for each event. Given that the original paragraph giving family history information (this is a genuine example) is quite short, it is much quicker and easier to include the whole paragraph in each reference.
When it comes to adding the scan, the only option I really have is to add it to the source itself, despite the fact that the scan only relates to one page.
An alternative way to record the media is the one suggested in the tutorial “Recording UK Census data”. Here the media object has a source reference which contains the Date, Volume/Page and Confidence that the media is associated with. This has the advantage that the details are unambiguously associated with the media. However, the relevant media cannot be found simply from the events.
Record each page as a source
The page number and the text from the page, the book Title and the author of the book are all stored in the Source. The Source Reference does not hold any particular information.
This arrangement can be illustrated as follows (again only some of the people and facts are shown here):
This arrangement has the advantage that the transcript and media scan can be directly attached to the relevant page.
However, it has a number of problems:
- the Volume/Page field in the source reference is not used for its standard purpose
- the information about the book itself is duplicated many times in each source object
- the approach rapidly becomes unmanageable if there are a large number of pages to be recorded for a single source (i.e. if the source is 'large')
Modify Gramps to introduce SourceCitation
A new object (called here a SourceCitation, although it could equally be called SourceContent) is introduced.
A SourceCitationRef would replace the current SourceRef. The SourceCitationRef would not have any associated fields (just the reference to the SourceCitation).
The Source object is unchanged.
The existing icon options in the Source tabs would be augmented by adding an ‘Add an existing SourceCitation’ option.
The source reference editor would be changed to a source citation editor as follows:
Note that there is no need for a SourceCitationRef editor, because this does not have any associated properties.
Recording the information about the people would proceed exactly as at present. The first source to be recorded would display an empty Source Citation editor, which would be completed in the normal way. Subsequent data would be linked to the same Source Citation object.
When it comes to recording the transcript of the source text, there would only be a single Source Citation object to which the transcript would be added as a Note.
The scan of the page could be added as a Gallery item in the Source Citation object.
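How the user story would play out with a shared object can be sketched as follows. This is a minimal model under stated assumptions, not the proposed implementation: the `SourceCitation` and `Event` classes, the handle strings such as `C0001`, and the scan file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SourceCitation:
    """Hypothetical shared object: one per cited page of a source."""
    source_title: str
    page: str
    notes: List[str] = field(default_factory=list)  # e.g. the transcript
    media: List[str] = field(default_factory=list)  # e.g. scan file names


@dataclass
class Event:
    description: str
    # A SourceCitationRef has no fields of its own, so a bare handle suffices.
    citation_handles: List[str] = field(default_factory=list)


citations: Dict[str, SourceCitation] = {
    "C0001": SourceCitation("The book", "7"),
}
events = [
    Event("Birth of K", ["C0001"]),
    Event("Marriage of K", ["C0001"]),
    Event("Birth of M", ["C0001"]),
]

# Some time later: the transcript and the scan are each added exactly once.
citations["C0001"].notes.append("Transcript of the paragraph on page 7")
citations["C0001"].media.append("page7_scan.png")

# Later still: page 212 becomes a second citation of the same book.
citations["C0002"] = SourceCitation("The book", "212", media=["page212_scan.png"])
events.append(Event("Marriage of I and W", ["C0002"]))


def media_for(event: Event) -> List[str]:
    """Return the scans reachable from an event through its citations."""
    return [m for h in event.citation_handles for m in citations[h].media]
```

Because every event on page 7 points at the same object, `media_for` finds the page 7 scan from any of them, and the page 212 scan only from the 1946 marriage.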
A new display category would be needed for the Source Citation, and it would be important to implement the merge facility for this category so that existing separate Source Citations could be merged.
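One possible shape for such a merge is sketched below, using plain dictionaries. It is an assumption about how a merge might behave (notes and media are pooled, and every reference to the dropped citation is re-pointed); the function name `merge_citations` and the handles are hypothetical.

```python
from typing import Dict, List


def merge_citations(citations: Dict[str, dict],
                    events: Dict[str, List[str]],
                    keep: str, drop: str) -> None:
    """Fold citation `drop` into `keep` and re-point every event that used it."""
    if keep == drop:
        raise ValueError("cannot merge a citation with itself")
    kept, dropped = citations[keep], citations.pop(drop)
    # Pool notes and media, skipping items the kept citation already has.
    for key in ("notes", "media"):
        for item in dropped[key]:
            if item not in kept[key]:
                kept[key].append(item)
    # Replace the dropped handle everywhere, without duplicating the kept one.
    for handles in events.values():
        merged: List[str] = []
        for handle in handles:
            handle = keep if handle == drop else handle
            if handle not in merged:
                merged.append(handle)
        handles[:] = merged


# Two citations that were created separately for the same page.
citations = {
    "C0001": {"page": "7", "notes": ["transcript"], "media": []},
    "C0002": {"page": "7", "notes": [], "media": ["page7_scan.png"]},
}
events = {
    "Birth of K": ["C0001"],
    "Birth of M": ["C0002"],
    "Marriage of K": ["C0001", "C0002"],
}
merge_citations(citations, events, keep="C0001", drop="C0002")
```

After the call only `C0001` remains, it carries both the transcript and the scan, and all three events refer to it.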
Discussion
Both ways of using the current Gramps features have difficulties. This is shown not only by the points made in the descriptions above, but also by the fact that there have been many discussions on the lists about how to use the features.
The new approach makes it simple to implement the given user story. The story focusses on recording information from a book, because that is a scenario which everyone can understand and relate to. However, it applies equally to other scenarios, like recording census data.
The new approach is a very simple extension to the current features from the user's point of view. Gramps can be used exactly as at present with no additional inputs required from the user (the workflow is completely unchanged). If a particular Source Citation is to be shared, then again there is no need for any additional inputs, just use of the feature to add an existing Source Citation, instead of creating a new one. Apart from the Source Citation category display there are no additional screens.
Design
The SourceContent/SourceCitation object has the following content:
SourceContent
- 1 Source (GrampsID)
- 1 Confidence (5 values)
- 1 Volume/Page
- 1 LogDate (The date that this data was entered into the original source document)
- n Information (key-value pairs, current Data)
- n NoteIds
- n MediaRef (Region, Src, attr, notes) --> Media
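The content listed above could be modelled roughly as follows. This is a sketch only: the field names, the types, the five confidence level names, and the reading of 'Src' as a list of source identifiers are assumptions, since the proposal fixes only the items and their cardinalities.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class Confidence(IntEnum):
    """Five confidence values; these names are assumed for illustration."""
    VERY_LOW = 0
    LOW = 1
    NORMAL = 2
    HIGH = 3
    VERY_HIGH = 4


@dataclass
class MediaRef:
    """Reference to a Media object (Region, Src, attr, notes)."""
    media_id: str
    region: Optional[Tuple[int, int, int, int]] = None  # assumed corner format
    src: List[str] = field(default_factory=list)  # 'Src': assumed source ids
    attributes: Dict[str, str] = field(default_factory=dict)
    note_ids: List[str] = field(default_factory=list)


@dataclass
class SourceContent:
    source_id: str                        # 1 Source (GrampsID)
    volume_page: str                      # 1 Volume/Page
    confidence: Confidence = Confidence.NORMAL  # 1 Confidence
    log_date: str = ""                    # 1 LogDate
    information: Dict[str, str] = field(default_factory=dict)  # n key-value pairs
    note_ids: List[str] = field(default_factory=list)          # n NoteIds
    media_refs: List[MediaRef] = field(default_factory=list)   # n MediaRef


# Page 7 of the book, with its scan attached (identifiers are made up).
page7 = SourceContent(source_id="S0001", volume_page="page 7")
page7.media_refs.append(MediaRef(media_id="O0001"))
page7.note_ids.append("N0001")
```

Holding only identifiers for the Source, Notes and Media keeps the object small and lets many events share it.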
Some means would be needed to make an existing SourceContent object refer to a different Source object. In order to avoid complicating the workflow for normal data input, this could perhaps be achieved just from the source content category screens.
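Because events would refer only to the shared object, re-pointing it at a different Source would be a single change. The sketch below illustrates this; the `change_source` helper, the identifiers and the source titles are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceContent:
    source_id: str
    volume_page: str


# Hypothetical Source table keyed by GrampsID.
sources: Dict[str, str] = {
    "S0001": "The book (first entry)",
    "S0002": "The book (corrected entry)",
}

page7 = SourceContent("S0001", "7")
# Both events share the same SourceContent object.
event_contents: Dict[str, List[SourceContent]] = {
    "Birth of K": [page7],
    "Birth of M": [page7],
}


def change_source(content: SourceContent, new_source_id: str) -> None:
    """Point an existing SourceContent at a different, known Source."""
    if new_source_id not in sources:
        raise KeyError(f"unknown source: {new_source_id}")
    content.source_id = new_source_id


change_source(page7, "S0002")
```

Every event that cites page 7 now resolves to `S0002`, without any of the events being edited.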