GEPS 023: Storing data from large sources

Revision as of 16:34, 3 December 2010

Proposed changes to enhance Gramps by improving the mechanism for storing data from ‘large’ sources

User story (Problem that needs to be solved)

I have a book that details, on page 7:

“In the 1870s B moved to the town of BT. It was here that I's father K was born in 1860. By the time he was 30 he had married. His first child M was born there. Shortly afterwards his wife died and two years later he married G. M was 12 before her brother I appeared.”

So I wish to record B, K, and the fact that K was born in 1860, and married around 1890. K's children were M and I; M was born around 1890 and I was born around 1900. [Actually, from other sources he was born on 5 Dec 1902.] I need to record page 7 of this book as the source for all these pieces of information.

Some time later I decide I should record a transcript of the source text.

Some time later, I decide to scan that page of the book, and need to store the scan as the source.

Later still, I discover that page 212 of the same book details that I married W in 1946.

Now I wish to record W, and the marriage of W and I in 1946. The source for all this is page 212 of the book, and this time I record the scan against the source.

List various solutions

  • Record each page as a source reference with the book as a single source.
  • Record each page as a separate source (i.e. page 7 as one source and page 212 as a second source).
  • Modify Gramps to introduce Source Content that can be shared and can have media attachments and record each page as a Source Content.

Record each page as a source reference

The page number and the text from the page are stored in the Source Reference, while the Title and author of the book are stored in the shared Source.

This may be considered the natural approach, given the fact that the Source Reference headings include the page number, while the shared Source includes the Title and the Author etc.

This approach may be illustrated as follows (only some of the people and facts are shown here):

Separate source ref.gif

The problems with this approach are

  • The Source Reference does not allow the Media scan to be stored.
  • The Source Reference is not shared; there is a separate instance for each place where it occurs (e.g. each event).

The media scan can be stored with the source, as shown in the figure above. However, when one looks at the source, it will not be clear which media object in the gallery relates to which source reference, except by some naming convention (the applicable media file may be obvious in this case, but not in others). Also, in the source reference editor, one cannot immediately see which scan relates to this particular page.

The fact that the source reference is not shared (and is not a separate object) means that any updates to the source reference will have to be done for each occurrence, rather than once. For example, when I want to add the transcript of the paragraph to the source reference, I need to find each source reference and update it individually. It is also difficult to find all the source references, because they are not listed in a separate tab.
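The cost of unshared references can be sketched in Python. This is only an illustration: the `SourceRef` and `Event` classes below are simplified stand-ins rather than Gramps's actual classes, and the IDs and transcript text are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceRef:
    """Hypothetical unshared source reference: one copy per event."""
    source_id: str
    page: str
    notes: List[str] = field(default_factory=list)


@dataclass
class Event:
    description: str
    source_refs: List[SourceRef] = field(default_factory=list)


def add_transcript_everywhere(events, source_id, page, transcript):
    """Append the transcript to every matching reference.

    Returns the number of separate references that had to be updated.
    """
    updated = 0
    for event in events:
        for ref in event.source_refs:
            if ref.source_id == source_id and ref.page == page:
                ref.notes.append(transcript)
                updated += 1
    return updated


# Each event carries its own copy of the reference to page 7.
events = [
    Event("Birth of K", [SourceRef("S0001", "7")]),
    Event("Marriage of K", [SourceRef("S0001", "7")]),
    Event("Birth of M", [SourceRef("S0001", "7")]),
    Event("Marriage of I and W", [SourceRef("S0001", "212")]),
]

count = add_transcript_everywhere(events, "S0001", "7", "Transcript of page 7")
```

Here `count` is the number of individual edits a user would otherwise make by hand, one per event that cites page 7.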

Note that there is an argument that separate source references for the different events are preferable, because the exact text that relates to that particular event can be attached. For example, for the birth event for person K, one could attach: “…the town of BT. It was here that I's father K was born in 1860…”. There are two objections to this:

  • It is difficult to identify exactly which parts of the text are relevant to each event. Should I’s father be included in the source for K’s birth?
  • It is far too tedious and laborious to devise separate source texts for each event. Given that the original paragraph giving family history information (this is a genuine example) is quite short, it is much quicker and easier to include the whole paragraph in each reference.

When it comes to adding the scan, the only option I really have is to add it to the source itself, despite the fact that the scan only relates to one page.

An alternative way to record the media is that suggested in the tutorial “Recording UK Census data”. Here the media object has a source reference which contains the Date, Volume/Page and Confidence details that the media is associated with. This has the advantage that the details are unambiguously associated with the media. However, the relevant media cannot be found simply from the events.

Separate source ref media.jpg

Record each page as a source

The page number and the text from the page, the book Title and the author of the book are all stored in the Source. The Source Reference does not hold any particular information.

This arrangement can be illustrated as here (again only some of the people and facts are shown here):

Separate source.gif

This arrangement has the advantage that the transcript and media scan can be directly attached to the relevant page.

However it has a number of problems:

  • the Volume/Page field in the source reference is not used for its standard purpose
  • the information about the book itself is duplicated many times, once in each source object
  • the approach rapidly becomes unmanageable if there are a large number of pages to be recorded for a single source (i.e. if the source is 'large')

Modify Gramps to introduce a Source Content object that can be shared and can have media attachments and record each page as a Source Content

A new object which we call here a Source Content is introduced.

A SourceContentRef would replace the current SourceRef. The SourceContentRef would not have any associated fields (just the reference to the Source Content).

The Source object is unchanged.
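The relationship between the three objects might be sketched as follows. All class and field names here are assumptions made for illustration (the proposal does not fix an implementation), and the book title, author and file name are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """The unchanged Source: information about the book as a whole."""
    title: str
    author: str


@dataclass
class SourceContent:
    """Hypothetical shared object: one per page (or other unit) of a Source."""
    source: Source
    volume_page: str
    notes: List[str] = field(default_factory=list)
    media: List[str] = field(default_factory=list)


@dataclass
class SourceContentRef:
    """Carries no fields of its own, only the link to the Source Content."""
    content: SourceContent


@dataclass
class Event:
    description: str
    refs: List[SourceContentRef] = field(default_factory=list)


book = Source(title="The book", author="Unknown")
page_7 = SourceContent(source=book, volume_page="7")
page_212 = SourceContent(source=book, volume_page="212")

birth_k = Event("Birth of K", [SourceContentRef(page_7)])
birth_m = Event("Birth of M", [SourceContentRef(page_7)])
marriage_i_w = Event("Marriage of I and W", [SourceContentRef(page_212)])

# The transcript and the scan are added once, to the shared object.
page_7.notes.append("Transcript of page 7")
page_7.media.append("page7_scan.png")
```

Because both birth events point at the same `page_7` object, the transcript and scan are visible from each of them without any further edits, while the marriage event on page 212 is unaffected.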

Newproposal.jpg

The existing icon options in the Source tabs would be unchanged, but the 'add existing source' option would bring up a Source-SourceContent treeview.

Add source citation.jpg

The treeview would be as follows:

Select source.jpg

The user would be able to select a Source as shown by the highlighting. In this case, the subsequent SourceContent dialogue would not be populated in the SourceContent area, only in the Source area. On the other hand, if a SourceContent were to be selected, then the subsequent SourceContent dialogue would be opened already populated with the shared SourceContent data.

The source reference editor would be changed to a Source-SourceContent editor as follows:

Source citation changes.jpg

Note that there is no need for a SourceContentRef editor, because this does not have any associated properties.

Recording the information about the people would proceed exactly as at present. The first source to be recorded would display an empty Source Content editor, which would be completed in the normal way. Subsequent data would be linked to the same Source Content object.

When it comes to recording the transcript of the source text, there would only be a single Source Content object to which the transcript would be added as a Note.

The scan of the page could be added as a Gallery item in the Source Content object.

A new display category would be needed for the Source Content, and it would be important to implement the merge facility for this category so that existing separate Source Contents could be merged.
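A merge of two Source Contents could work roughly as in the sketch below. The classes, the `merge_source_contents` function and the rule that both objects must belong to the same Source are all assumptions for illustration; the proposal itself does not specify the merge behaviour.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceContent:
    """Hypothetical Source Content, reduced to what the merge needs."""
    source_id: str
    volume_page: str
    notes: List[str] = field(default_factory=list)
    media: List[str] = field(default_factory=list)


@dataclass
class SourceContentRef:
    content: SourceContent


def merge_source_contents(primary, duplicate, refs):
    """Fold `duplicate` into `primary` and repoint references to it.

    Returns the number of references that were repointed.
    """
    # Assumed rule: only Source Contents of the same Source can be merged.
    if primary.source_id != duplicate.source_id:
        raise ValueError("Source Contents belong to different Sources")
    # Carry over notes and media that the primary does not already have.
    for note in duplicate.notes:
        if note not in primary.notes:
            primary.notes.append(note)
    for item in duplicate.media:
        if item not in primary.media:
            primary.media.append(item)
    # Repoint every reference that used the duplicate.
    repointed = 0
    for ref in refs:
        if ref.content is duplicate:
            ref.content = primary
            repointed += 1
    return repointed


first = SourceContent("S0001", "7", notes=["Transcript of page 7"])
second = SourceContent("S0001", "7", media=["page7_scan.png"])
refs = [SourceContentRef(first), SourceContentRef(second), SourceContentRef(second)]

repointed = merge_source_contents(first, second, refs)
```

After the merge, every reference points at `first`, which now holds both the transcript and the scan.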

There is no particular need to provide a means to make an existing Source Content object refer to a different Source object. At present, it is not possible to change a source reference so as to preserve the information that has been input (such as date, Volume/Page, confidence or the links to notes) but to make it point to a different source. The source reference must be deleted and a new one created. Similarly, if one wanted to change a Source Content to refer to a different source, the Source Content would have to be deleted.

Discussion

Either of the two ways of using the current Gramps features has difficulties. This is shown not only by the points made in the descriptions above, but also by the fact that there have been many discussions in the lists about how to use the features.

The new approach makes it simple to implement the given user story. The story focusses on recording information from a book, because that is a scenario which everyone can understand and relate to. However, it also applies equally to other scenarios, like recording census data.

The new approach is a very simple extension to the current features from the user's point of view. Gramps can be used exactly as at present with no additional inputs required from the user (the workflow is completely unchanged). If a particular Source Content is to be shared, then again there is no need for any additional inputs, just use of the feature to add an existing Source Content, instead of creating a new one. Apart from the Source Content category display there are no additional screens.

The new approach is somewhat similar to the subsource approach in http://gramps.1791082.n4.nabble.com/sources-subsources-and-sourceref-td1794804.html. Subsource corresponds to SourceContent. However, in order to avoid any extra complexity in the approach, it is proposed that the notes are attached to the 'subsource'/Source Content only, and not to the reference to it.

Using subsources can remove the above problems, as a subsource would be a nucleus information set: one line in a census, one marriage act, ...., and that connected to a source.

Why not use the sourceref?

The sourceref cannot be used for this! The text of an entire source, or an image of an entire source is something you need to share between objects, so it must be a source, not a sourceref. The note in the sourceref should only be used to explain how the information of the source was used for the object.

However, it is true that a subsource can have a date/page/volume/number within the larger source. Note that in the census example this is repeated on *all* sourceref objects. In the case of a subsource, it would be entered once in the subsource, and then not be repeated in the sourceref. I don't care too much for this, as in the case where a subsource is useful there is no problem with mentioning this in the note, .... section.

Is sourceref useless then?

I don't think so. For sources from which little information is learnt, or real books (bibliography, diary, ...), the sourceref is ideal. However, for sources which contain many unrelated short subparts that can each lead to a large number of changes throughout your genealogical database, the sourceref is less useful, and a subsource is in order.

Note that adding a hierarchy of subsource objects would introduce much greater complexity, which is not warranted by the problems outlined.

Also, http://gramps.1791082.n4.nabble.com/Source-references-names-and-notes-td1813805.html#a1813818

The problem is still there when you discover a mistake in the source ref (page number), or want to add a note to these source references (e.g. a transcript of the original Latin text). Then you need to track down all the source refs that were copies, and change them all. This is why in my workflow I keep most notes in the source object, where the media files (scans) also live. This is why I think about a subsource implementation; it does not hurt the people who want to keep working as they do now. The main thing holding such a thing back is GEDCOM though. The more we deviate from that in GRAMPS, the more difficult it is to map our internal data to something that can e.g. be uploaded to websites logically as you want it. It is really hard to keep having to drag an old and dead standard with us :-(

The proposed approach is directly compatible with GEDCOM; variants like hierarchical sources would not be.
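As a rough illustration of that compatibility, the fields of a Source Content line up with the tags of a GEDCOM 5.5 source citation (`SOUR`, `PAGE`, and `DATE` and `TEXT` under `DATA`). The function below is a hypothetical sketch, not the Gramps exporter; it handles single-line text only and leaves out confidence, notes and media. The cross-reference `S0001` is a made-up ID.

```python
def citation_lines(level, source_xref, volume_page="", log_date="", text=""):
    """Render one Source Content as GEDCOM 5.5 source citation lines.

    `level` is the GEDCOM level of the SOUR line, e.g. 2 when the
    citation sits under a level-1 event such as BIRT.
    """
    lines = [f"{level} SOUR @{source_xref}@"]
    if volume_page:
        lines.append(f"{level + 1} PAGE {volume_page}")
    # DATE and TEXT are subordinate to a DATA line in the citation.
    if log_date or text:
        lines.append(f"{level + 1} DATA")
        if log_date:
            lines.append(f"{level + 2} DATE {log_date}")
        if text:
            lines.append(f"{level + 2} TEXT {text}")
    return lines


page_only = citation_lines(2, "S0001", volume_page="Page 7")
with_text = citation_lines(2, "S0001", volume_page="Page 7",
                           text="Transcript of page 7")
```

Since every citation of the same Source Content would be rendered from the same object, the exported citations stay identical wherever that page is cited.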

Design

The SourceContent object has the following content:

SourceContent
   1 Source (GrampsID)
   1 Confidence (5 values)
   1 Volume/Page
   1 LogDate (The date that this data was entered into the original source document)
   n Information (key-value pairs, current Data)
   n NoteIds
   n MediaRef (Region, Src, attr, notes)  --> Media
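
The layout above could be expressed in Python along the following lines. This is a sketch under stated assumptions: the class and field names, the use of plain strings for IDs and dates, and the names of the five confidence levels are illustrative choices, not the eventual Gramps implementation.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List, Optional, Tuple


class Confidence(IntEnum):
    """The five confidence values (names assumed for this sketch)."""
    VERY_LOW = 0
    LOW = 1
    NORMAL = 2
    HIGH = 3
    VERY_HIGH = 4


@dataclass
class MediaRef:
    """Reference to a Media object, optionally limited to a region."""
    src: str  # ID of the referenced Media object
    region: Optional[Tuple[int, int, int, int]] = None
    attributes: Dict[str, str] = field(default_factory=dict)
    note_ids: List[str] = field(default_factory=list)


@dataclass
class SourceContent:
    source_id: str                     # 1 Source (GrampsID)
    confidence: Confidence = Confidence.NORMAL
    volume_page: str = ""              # 1 Volume/Page
    log_date: str = ""                 # 1 LogDate
    information: Dict[str, str] = field(default_factory=dict)  # n key-value pairs
    note_ids: List[str] = field(default_factory=list)          # n NoteIds
    media_refs: List[MediaRef] = field(default_factory=list)   # n MediaRef


# Page 7 of the book, with one transcript note and one scan (made-up IDs).
page_7 = SourceContent(
    source_id="S0001",
    volume_page="7",
    note_ids=["N0001"],
    media_refs=[MediaRef(src="O0001")],
)
```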


tbd