GEPS 023: Storing data from large sources

Revision as of 10:31, 8 December 2010 by Kulath (talk | contribs) (restore noteid to source)

Proposed changes to enhance Gramps by improving the mechanism for storing data from ‘large’ sources

User story (Problem that needs to be solved)

I have a book that details, on page 7:

“In the 1870s B moved to the town of BT. It was here that I's father K was born in 1860. By the time he was 30 he had married. His first child M was born there. Shortly afterwards his wife died and two years later he married G. M was 12 before her brother I appeared.”

So I wish to record B, K, and the fact that K was born in 1860, and married around 1890. K's children were M and I, M was born around 1890 and I was born around 1900. [Actually, from other sources he was born on 5 Dec 1902.] I need to record page 7 of this book as the source for all these pieces of information.

Some time later I decide I should record a transcript of the source text.

Some time later, I decide to scan that page of the book, and need to store the scan as the source.

Later still, I discover that page 212 of the same book details that I married W in 1946.

Now I wish to record W, and the marriage of W and I in 1946. The source for all this is page 212 of the book, and this time I record the scan against the source.

List various solutions

  • Record each page as a source reference with the book as a single source.
  • Record each page as a separate source (i.e. page 7 as one source and page 212 as a second source).
  • Modify Gramps to introduce Source Content that can be shared and can have media attachments and record each page as a Source Content.

Record each page as a source reference

The page number and the text from the page are stored in the Source Reference, while the Title and author of the book are stored in the shared Source.

This may be considered the natural approach, given the fact that the Source Reference headings include the page number, while the shared Source includes the Title and the Author etc.

This approach may be illustrated as follows (only some of the people and facts are shown here):

Separate source ref.gif

The problems with this approach are

  • The Source Reference does not allow the Media scan to be stored.
  • The Source Reference is not shared; there is a separate instance for each place where it occurs (e.g. each event).

The media scan can be stored with the source, as shown in the figure above. However, when one looks at the source, it will not be clear which media object in the gallery relates to which source reference, except by some naming convention (the applicable media file may be obvious in this case, but not in others). Also, in the source reference editor, one cannot immediately see which scan relates to this particular page.

The fact that the source reference is not shared (and is not a separate object) means that any updates to the source reference will have to be done for each occurrence, rather than once. For example, when I want to add the transcript of the paragraph to the source reference, I need to find each source reference and update it individually. It is also difficult to find all the source references, because they are not listed in a separate tab.
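The update problem can be sketched in a few lines of Python. This is a toy model, not Gramps code: the `SourceRef` and `Event` classes, their fields and the `add_note_everywhere` helper are hypothetical stand-ins, chosen only to show that each event holds its own copy of the reference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceRef:
    """Toy stand-in for a non-shared source reference (hypothetical fields)."""
    source_title: str
    page: str
    notes: List[str] = field(default_factory=list)


@dataclass
class Event:
    """Toy stand-in for an event that embeds its own source references."""
    description: str
    source_refs: List[SourceRef] = field(default_factory=list)


def add_note_everywhere(events, source_title, page, note):
    """Add a note to every matching reference; return how many were touched."""
    touched = 0
    for event in events:
        for ref in event.source_refs:
            if ref.source_title == source_title and ref.page == page:
                ref.notes.append(note)
                touched += 1
    return touched


# Each event carries its own, separate SourceRef instance for page 7.
events = [
    Event("Birth of K", [SourceRef("The book", "7")]),
    Event("Marriage of K", [SourceRef("The book", "7")]),
    Event("Birth of M", [SourceRef("The book", "7")]),
]

# Adding the transcript means finding and editing every copy.
updated = add_note_everywhere(events, "The book", "7", "Transcript of page 7")
print(updated)
```

With three events citing page 7, three separate references have to be located and edited; a shared object would need a single edit.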

Note that there is an argument that separate source references for the different events are preferable, because the exact text that relates to that particular event can be attached. For example, for the birth event for person K, one could attach: “…the town of BT. It was here that I's father K was born in 1860…”. There are two objections to this:

  • It is difficult to identify exactly which parts of the text are relevant to each event. Should I’s father be included in the source for K’s birth?
  • It is far too tedious and laborious to devise separate source texts for each event. Given that the original paragraph giving family history information (this is a genuine example) is quite short, it is much quicker and easier to include the whole paragraph in each reference.

When it comes to adding the scan, the only option I really have is to add it to the source itself, despite the fact that the scan only relates to one page.

An alternative way to record the media is that suggested in the tutorial “Recording UK Census data”. Here the media object has a source reference which contains the details of the Date, Volume/Page and Confidence that the media is associated with. This has the advantage that the details are unambiguously associated with the media. However, the relevant media cannot be found simply from the events.

Separate source ref media.jpg

Record each page as a source

The page number and the text from the page, the book Title and the author of the book are all stored in the Source. The Source Reference does not hold any particular information.

This arrangement can be illustrated as here (again only some of the people and facts are shown here):

Separate source.gif

This arrangement has the advantage that the transcript and media scan can be directly attached to the relevant page.

However it has a number of problems:

  • the Volume/Page field in the source reference is not used for its standard purpose
  • the information about the book itself is duplicated many times in each source object
  • the approach rapidly becomes unmanageable if there are a large number of pages to be recorded for a single source (i.e. if the source is 'large')

Modify Gramps to introduce a Source Content object that can be shared and can have media attachments and record each page as a Source Content

A new object which we call here a Source Content is introduced.

A SourceContentRef would replace the current SourceRef. The SourceContentRef would not have any associated fields (just the reference to the Source Content).

The Source object is unchanged.

Newproposal.jpg

The existing icon options in the Source tabs would be unchanged, but the 'add existing source' option would bring up a Source-SourceContent treeview.

Add source citation.jpg

The treeview would be as follows:

Select source.jpg

The user would be able to select a Source as shown by the highlighting. In this case, the subsequent SourceContent dialogue would not be populated in the SourceContent area, only in the Source area. On the other hand, if a SourceContent were to be selected, then the subsequent SourceContent dialogue would be opened already populated with the shared SourceContent data.

The source reference editor would be changed to a Source-SourceContent editor as follows:

Source citation changes.jpg

Note that there is no need for a SourceContentRef editor, because this does not have any associated properties.

Recording the information about the people would proceed exactly as at present. The first source to be recorded would display an empty Source Content editor, which would be completed in the normal way. Subsequent data would be linked to the same Source Content object.

When it comes to recording the transcript of the source text, there would only be a single Source Content object to which the transcript would be added as a Note.

The scan of the page could be added as a Gallery item in the Source Content object.

A new display category would be needed for the Source Content, and it would be important to implement the merge facility for this category so that existing separate Source Contents could be merged.

There is no particular need to provide a means to make an existing Source Content object refer to a different Source object. At present, it is not possible to change a source reference so as to preserve the information that has been input (such as date, Volume/Page, confidence or the links to notes) but to make it point to a different source. The source reference must be deleted and a new one created. Similarly, if one wanted to change a Source Content to refer to a different source, the Source Content would have to be deleted.

Discussion

Either of the two ways of using the current Gramps features has difficulties. This is shown not only by the points made in the descriptions above, but also by the fact that there have been many discussions in the lists about how to use the features.

The new approach makes it simple to implement the given user story. The story focusses on recording information from a book, because that is a scenario which everyone can understand and relate to. However, it applies equally to other scenarios, like recording census data.

The new approach is a very simple extension to the current features from the user's point of view. Gramps can be used exactly as at present with no additional inputs required from the user (the workflow is completely unchanged). If a particular Source Content is to be shared, then again there is no need for any additional inputs, just use of the feature to add an existing Source Content, instead of creating a new one. Apart from the Source Content category display there are no additional screens.

The new approach is somewhat similar to the subsource approach in http://gramps.1791082.n4.nabble.com/sources-subsources-and-sourceref-td1794804.html. Subsource corresponds with SourceContent. However, in order to avoid any extra complexity in the approach, it is proposed that the notes are kept to the 'subsource'/SourceContent only, and not to the reference to them.

Using subsources can remove the above problems, as a subsource would be a nucleus information set: one line in a census, one marriage act, ...., and that connected to a source.

Why not use the sourceref?

The sourceref cannot be used for this! The text of an entire source, or an image of an entire source is something you need to share between objects, so it must be a source, not a sourceref. The note in the sourceref should only be used to explain how the information of the source was used for the object.

However, it is true that a subsource can have a date/page/volume/number within the larger source. Note that in the census example this is repeated on *all* sourceref objects. In the case of a subsource, it would be entered once in the subsource, and then not be repeated in the sourceref. I don't care too much for this as in the case where subsource is useful, there is no problem of mentioning this in the note, .... section.

Is sourceref useless then?

I don't think so. For sources of which little information is learnt, or real books (bibliography, diary, ...) the sourceref is ideal. However, for sources which contain many unrelated short subparts that each can lead to a large number of changes throughout your genealogical database, the sourceref is less useful, and a subsource is in order.

Note that adding a hierarchy of subsource objects would introduce much greater complexity, which is not warranted by the problems outlined.

Also, http://gramps.1791082.n4.nabble.com/Source-references-names-and-notes-td1813805.html#a1813818

The problem is still there when you discover a mistake in the source ref (page number), or want to add a note to these source references (e.g. a transcript of the original Latin text). Then you need to track down all source refs that were copies, and change them all. This is why in my workflow I keep most notes in the source object, where the media files (scans) also live. This is why I think about a subsource implementation; it does not hurt the people who want to keep working as they do now. The main thing holding such a thing back is GEDCOM though. The more we deviate from that in GRAMPS, the more difficult it is to map our internal data to something that can e.g. be uploaded to websites logically as you want it. It is really hard to keep having to drag an old and dead standard with us :-(

The proposed approach is directly compatible with GEDCOM; variants like hierarchical sources would not be.

Design

Database changes

A Global Confidence field is added to the Source object:

Source
  1 Title
  1 Author
  1 Gramps ID
  1 Abbreviation
  1 Publication Information
  1 Global Confidence
  n Publication Data (key value pairs, eg Publication Date, Publisher, ...)
  n NoteIds
  n MediaRef (Region, Src, attr, notes)  --> Media
  n RepoRef (Type, Callnumber)           --> Repository

A new SourceContent object has the following content:

SourceContent
  1 SourceRef  --> Source
  1 Confidence (5 values)
  1 Volume/Page
  1 Log Date (The date that this data was entered into the original source document)
  n Information (key-value pairs, current Data)
  n NoteIds
  n MediaRef (Region, Src, attr, notes)  --> Media

Should this be a Primary object or a Table object?

A new SourceContentRef object replaces existing SourceRef objects. Benny suggested that this object should hold the argument by which the validity of the SourceContent is deduced. The object has the following content:

SourceContentRef
  1 Type: Transcript or Deduction
  1 Deduction Confidence (5 values)
  1 Argument (one line string)
  n Note

Should this be a separate change?
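The three records above can be sketched as Python dataclasses to show how they link together. This is only an illustration of the proposed structure, not the Gramps object model: the attribute names, the use of plain strings for note and media handles, and the 0 to 4 range with a default of 2 for the five confidence values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """The book as a whole; unchanged apart from the global confidence."""
    title: str
    author: str = ""
    gramps_id: str = ""
    abbreviation: str = ""
    publication_info: str = ""
    global_confidence: int = 2          # assumed scale 0..4
    publication_data: Dict[str, str] = field(default_factory=dict)
    note_ids: List[str] = field(default_factory=list)
    media_refs: List[str] = field(default_factory=list)   # simplified handles
    repo_refs: List[str] = field(default_factory=list)    # simplified handles


@dataclass
class SourceContent:
    """One shareable piece of a source, e.g. a single page of the book."""
    source: Source
    confidence: int = 2                 # assumed scale 0..4
    volume_page: str = ""
    log_date: str = ""
    information: Dict[str, str] = field(default_factory=dict)
    note_ids: List[str] = field(default_factory=list)
    media_refs: List[str] = field(default_factory=list)   # simplified handles


@dataclass
class SourceContentRef:
    """Link from an event (or other object) to a shared SourceContent."""
    content: SourceContent
    ref_type: str = "Transcript"        # or "Deduction"
    deduction_confidence: int = 2       # assumed scale 0..4
    argument: str = ""
    notes: List[str] = field(default_factory=list)


# The user story: two facts cite the same page 7 record.
book = Source(title="The book")
page7 = SourceContent(source=book, volume_page="7")
birth_ref = SourceContentRef(page7)
marriage_ref = SourceContentRef(page7, ref_type="Deduction",
                                argument="Married by the age of 30")

# The transcript note and the scan are attached once, to the shared record.
page7.note_ids.append("N0001")
page7.media_refs.append("page7-scan.jpg")
```

Because both references point at the same `page7` object, the note and the scan added afterwards are visible from either event without any further edits.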

Upgrade

When upgrading from an old database version, all objects that have a SourceRef need to be changed to the new format. The primary objects Person, Family, Event, Media and Place contain SourceRef objects. They also contain secondary objects that have SourceRef objects of their own:

Person
 Name
 Address
 Attribute
 PersonRef
 MediaRef
  Attribute
 LdsOrd
Family
 Attribute
 ChildRef
 MediaRef
  Attribute
 LdsOrd
Event
 Attribute
 MediaRef
  Attribute
MediaObject
 Attribute
Place
 MediaRef
  Attribute
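Finding every SourceRef during the upgrade amounts to a recursive walk over this hierarchy. A minimal sketch, assuming a toy `Node` class (hypothetical, not a Gramps class) in which each object simply lists its own source references and its child objects:

```python
class Node:
    """Toy object: a kind name, its own source references, and child objects."""

    def __init__(self, kind, source_refs=None, children=None):
        self.kind = kind
        self.source_refs = list(source_refs or [])
        self.children = list(children or [])


def iter_source_ref_holders(node, path=()):
    """Yield (path, node) for the node and every descendant, depth first."""
    here = path + (node.kind,)
    yield here, node
    for child in node.children:
        yield from iter_source_ref_holders(child, here)


# A person with a reference of its own, one on a Name, and one on an
# Attribute nested inside a MediaRef.
person = Node("Person", ["ref-a"], [
    Node("Name", ["ref-b"]),
    Node("MediaRef", [], [Node("Attribute", ["ref-c"])]),
])

found = [("/".join(path), ref)
         for path, node in iter_source_ref_holders(person)
         for ref in node.source_refs]
print(found)
```

The walk reaches nested objects such as the Attribute inside a MediaRef, which are easy to miss if only the primary objects are inspected.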

Each old SourceRef object should be used to create a new SourceContent record. The old SourceRef will be replaced by a new SourceContentRef.

Should all old SourceRefs with the same Volume/Page create a single new SourceContent record or should Date and Confidence also be the same? Should Notes on old SourceRefs be merged into the new SourceContent record?
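The effect of the two grouping choices can be compared with a small sketch. It is not the upgrade code: `OldSourceRef`, its fields and `build_source_contents` are hypothetical. Merging notes into the new record, and taking Date and Confidence from the first reference seen when they are not part of the key, are assumptions made only so that the example runs.

```python
from collections import namedtuple

# Hypothetical flattened view of an old source reference.
OldSourceRef = namedtuple("OldSourceRef",
                          "source_id page date confidence notes")


def build_source_contents(old_refs, key_fields=("source_id", "page")):
    """Group old references into SourceContent-like records.

    Returns (contents, assignment): contents maps each grouping key to a dict
    describing the new record; assignment gives, for each old reference in
    order, the key of the record that replaces it.
    """
    contents = {}
    assignment = []
    for ref in old_refs:
        key = tuple(getattr(ref, name) for name in key_fields)
        record = contents.get(key)
        if record is None:
            # Assumption: the first reference seen supplies date/confidence.
            record = {"source_id": ref.source_id, "page": ref.page,
                      "date": ref.date, "confidence": ref.confidence,
                      "notes": []}
            contents[key] = record
        # Assumption: notes are merged, skipping exact duplicates.
        for note in ref.notes:
            if note not in record["notes"]:
                record["notes"].append(note)
        assignment.append(key)
    return contents, assignment


old_refs = [
    OldSourceRef("S1", "7", "", 2, ("transcript",)),
    OldSourceRef("S1", "7", "", 3, ()),
    OldSourceRef("S1", "212", "", 2, ()),
]

loose, loose_map = build_source_contents(old_refs)
strict, strict_map = build_source_contents(
    old_refs, key_fields=("source_id", "page", "date", "confidence"))
print(len(loose), len(strict))
```

Keying on Volume/Page alone merges the two page 7 references into one record, while also requiring Date and Confidence to match keeps them apart, so the stricter key leaves more records for the user to merge by hand later.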

Import/Export

The following formats will need to be updated: Gramps XML, GEDCOM, CSV, GeneWeb.

Gramps XML

This will need a new <sourcecontents> section with <sourcecontent> entries.

Gramps should be able to import both old and new versions of the Gramps XML. If a <sourceref> tag appears outside of a <sourcecontent> entry then it will indicate an old version.
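The version test described above could look roughly like the following sketch, which uses Python's standard `xml.etree.ElementTree`. The two sample documents are drastically simplified and hypothetical: real Gramps XML uses a namespace and many more elements, and the `<sourcecontentref>` element name is an assumption.

```python
import xml.etree.ElementTree as ET


def is_old_format(xml_text):
    """Return True if any <sourceref> occurs outside a <sourcecontent>."""
    root = ET.fromstring(xml_text)

    def visit(element, inside_content):
        if element.tag == "sourceref" and not inside_content:
            return True
        inside = inside_content or element.tag == "sourcecontent"
        return any(visit(child, inside) for child in element)

    return visit(root, False)


# Old layout: the event refers to the source directly.
OLD = ("<database><events><event>"
       "<sourceref hlink='S1'/>"
       "</event></events></database>")

# New layout: <sourceref> only appears inside a <sourcecontent> entry.
NEW = ("<database><sourcecontents><sourcecontent>"
       "<sourceref hlink='S1'/>"
       "</sourcecontent></sourcecontents>"
       "<events><event><sourcecontentref hlink='SC1'/></event></events>"
       "</database>")

print(is_old_format(OLD), is_old_format(NEW))
```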

GEDCOM

User Interface changes

Source Selector

This should be changed to include a Source-SourceContent hierarchy.

Source View

This should be changed to include a Source-SourceContent hierarchy.

What should happen if the "Add" button is clicked? (Just add a Source, or a Source and SourceContent?)

  • If the "Edit" button is clicked, and a Source is highlighted, the editor should allow only the Source to be changed.
  • If the "Edit" button is clicked, and a SourceContent is highlighted, the editor should allow both the Source and SourceContent to be changed.
  • If the "Remove" button is clicked, then the highlighted object and all objects that reference it should be removed.

Editor Source tabs

Should these be hierarchical or flat?

If the "Add" button is clicked, what should happen?

  • Should the user be able to create a new Source and SourceContent?
  • Should the user select a Source from a list and then enter a new SourceContent?
  • Do we need an extra button ("Add new SourceContent" and "Add new Source and SourceContent")?

If the "Share" button is clicked, then the user should be allowed to select a SourceContent using the Source selector. Should they be allowed to select a Source only?

If the "Remove" button is clicked, then the highlighted SourceContent should be removed.

Rules

Do we need extra rules to match a SourceContent?

Reports

Reports access Source References through the Bibliography and Endnotes functionality. This allows the changes to be made in a single place.

Some changes will be needed in the Narrative Web and Simple Database Access functionality.