GEPS 018: Evidence style sources

From Gramps

Background

Many users, particularly if they aren't experienced researchers, may have difficulty abstracting the details from the wide variety of source types they encounter into the 4 fields that Gramps provides. Worse, the 4 fields aren't really adequate to capture all of the possible source information and redisplay it in well-formatted footnotes or endnotes in a report or reference links in a web page.

Elizabeth Shown Mills is an eminent American genealogist who has written extensively about collecting, analyzing, and citing evidence in genealogical research and publications, including the books Evidence! Citation and Analysis for the Family Historian and an expanded version, Evidence Explained: Citing Family History Sources from Artifacts to Cyberspace. While most readers focus on the formats of the citations provided in the books, in reality every publisher has a style guide and Evidence Explained isn't used by any of them. The real value in these books is Mills's explanation of how to effectively analyze the evidence and how to integrate the many pieces of evidence (and Mills is well known for taking the "reasonably exhaustive search" requirement of the BCG's Genealogical Proof Standard to the absolute limit) into a well supported conclusion.

Citation styles are the concern of published material, and will differ both for the medium and for the publisher. So long as the necessary information of creator, title, enclosing work (e.g. for magazine or journal articles), publisher (if published) or repository (if not), date, and details (like page number) is available in the citation, the style isn't very important to the reader. Publishers want all of their publications to have a consistent style and issue style manuals to help authors prepare their work.

For a computer program like GRAMPS, the goals should be to collect all of the necessary information noted above in a way that is easy for users to enter, to support evidence analysis and comparison to create "proof arguments", and to link those proof arguments to the genealogical conclusions in the database.

GRAMPS's present data structure maps directly to the SOUR and SOUR_CITATION structures in the GEDCOM 5.5 standard, and the source entry form maps directly to the data structure. While it's possible to cram everything needed for a good citation into those three fields, parsing the information back out to actually create a citation is unnecessarily challenging.

Bibliography Data Formats

  • BibTeX has emerged as a common format (for interchange at least) among bibliography and reference management tools and offers a much richer set of available fields.
  • The U.S. Library of Congress has published the Metadata Object Description Schema, an XML schema for encoding library catalog data. That wouldn't be very interesting except that BibUtils uses it as an intermediate format for converting between a variety of bibliography file standards.
  • Zotero, Mendeley, and Papers use Citation Style Language (CSL), an XML schema, at least as an import/export medium. (Zotero uses a relational database for its actual storage.)
  • Thomson Reuters EndNote is easily the most popular commercial reference management program. It uses a proprietary file format which has nevertheless been reverse-engineered many times so that bibliographies can be easily exchanged between EndNote and other reference managers.
  • Most of the major commercial genealogy programs use a proprietary relational schema for storage of citation data. These fall into three broad categories: binary (similar to GRAMPS's key/value schema, where a citation is composed of several records each having a key/value pair and the program's logic parses the keys to display the citation in the desired format), single-table (where a database tuple is defined which contains the maximum needed fields, each of which is assigned a value according to a parsing scheme in the program's logic), and multiple-table (where different citation types are stored in tables with tuple schemas which reflect the requirements of each). As so often in programming, each has costs and benefits with respect to

Further Reading

Elizabeth Shown Mills has a website that includes sample text pages from Evidence Explained, sample QuickCheck Models and a forum that includes exhaustive discussions of citation issues.

John Yates has, with Mills's permission, encoded the elements of the specific examples in Evidence Explained: Two Computer Ready Parametrizations of "Evidence Style" Historical Sources.

A simpler template system is "Simple Citations", see the templates.

See also :

Citations and bibliography search engines:

As an example of discussions of use of Evidence Explained we may consider citation for the UK Census. Anatomy of a Census Page provides quite a good general illustration of the use of Class, Piece, Folio and Page (though it shows only the 1891 census). UK Census Citations shows how this scheme applies to other censuses.

On a RootsWeb mailing list, ESM proposes the following reference:

1861 census of England, Middlesex, Shoreditch, Haggerstone East,
Haggerstone St. Mary, folio 5, page 4, household 16, William Loe;
PRO HO 9/249, The National Archives, Kew, Surrey, UK;
via Family History microfilm 542,590;
imaged in the database "1861 England Census," _Ancestry.com_
(www.ancestry.com : 2 August 2009).

There is a subsequent extensive discussion that may (or may not?) suggest changing (part of) the reference to

Office for National Statistics, London; Census - England & Wales - 1861;
RG09/3753 Folio 12 Page 17, Schedule No. 86; Durham, Newbottle, District 6;
The National Archives, Kew, Richmond, Surrey, TW9 4DU.

In a separate discussion in the Evidence Explained forum, 'Ann' asks:

EE page 303 shows an example for the 1841 census.

This is my start for my citation;

1841 census of England, Warwickshire, [city], [parish], folio 6, lines 5-15,
William Vero household;
digital image Ancestry.com, (http://www.ancestry.com : accessed 24 September 2004);
citing PRO HO 107/1127/10

'AdrianB' proposes:

1841 census of England,
Warwickshire, Atherston, William Vero household;
digital image Ancestry.com, (http://www.ancestry.com : accessed 24 September 2004);
citing The National Archives of the UK, HO 107 piece 1127 book 10, folio 6, p. 4, lines 5-15

(because the folio and page are part of the TNA reference) which ESM calls workable but queries whether there is enough geographical information, while 'Ann' comes up with the following reference:

1841 Census of England,
Warwickshire, Atherston Township, Hemlingford Hundred, Mancetter Parish, Enumeration District (ED) 1,
folio 6, p. 4, lines 5-15, Long Street, William Vero household;
digital image, Ancestry.com, (http://www.ancestry.com : accessed 24 September 2004);
citing PRO HO 107/1127/10.

(line breaks inserted for easy comparison).

Needed

For this we need:

  1. fix of bug 2332 [done]
  2. Convert [1] to a format usable in Gramps: source types, source attribute types, ..., and the Evidence business logic (no templates needed, all business logic) [partially done]
  3. Adapt GUI to allow Evidence style sources input. Is a database change needed? Don't think so at the moment.

Storing the data

  1. Data is stored as SrcAttribute (key,value) pairs in Source and Citation.
  2. To decide:
    1. In Source, do we keep "Author" and "Pub Info"? These could also be stored in Source Attributes and extracted from them to show in an overview. There is already a type AUTHOR. As Pub Info goes to GEDCOM, this could be type GEDCOM_PUB_INFO: if present it is used, otherwise it is generated.
    2. In Citation, do we keep "Date" and "Volume/Page"? As for Source, everything can be in the Citation Attributes. We can store which attributes are typically dates and allow Date Editor input. Storage would be plain text though. Is this a problem somewhere in the code?


If we decide "yes" to above, then source and citation objects must be changed, upgrade must be done.
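
The attribute-based storage discussed above could look roughly like the following Python sketch. It uses simplified stand-in classes, not the real Gramps objects. The AUTHOR and GEDCOM_PUB_INFO keys come from the text; PUB_PLACE, PUBLISHER and PUB_DATE are hypothetical attribute types invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SrcAttribute:
    """Stand-in for a (key, value) source attribute; not the Gramps class."""
    key: str
    value: str


@dataclass
class Source:
    attributes: list = field(default_factory=list)

    def get(self, key, default=""):
        """Return the value of the first attribute with this key."""
        for attr in self.attributes:
            if attr.key == key:
                return attr.value
        return default

    def pub_info(self):
        """Use GEDCOM_PUB_INFO if present, otherwise generate it."""
        explicit = self.get("GEDCOM_PUB_INFO")
        if explicit:
            return explicit
        # Hypothetical attribute keys, assumed only for this sketch.
        parts = [self.get(k) for k in ("PUB_PLACE", "PUBLISHER", "PUB_DATE")]
        return ", ".join(p for p in parts if p)


source = Source([
    SrcAttribute("AUTHOR", "Jane Doe"),
    SrcAttribute("PUB_PLACE", "Exampletown"),
    SrcAttribute("PUBLISHER", "Example Press"),
    SrcAttribute("PUB_DATE", "1999"),
])
print(source.pub_info())
```

With this arrangement the overview and the GEDCOM export would both call pub_info(), so a user who imported an old Pub Info string keeps it verbatim, while new sources get a generated one.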

Also:

  1. In Source, what to do with Title? It becomes like the Description of an Event: somewhat redundant, yet it is a field in GEDCOM. Ideally it is used if given, and otherwise generated from the source attributes.
  2. Abbreviation is for the storage in your _local_ archive, so as to allow easy retrieval. We need to make this clearer in the user interface.

GUI ideas

  • Don't use a wizard (Nick)
  • Benny:
  1. Instead of the tab 'General' for source and citation, we show the tab 'Overview', which would have only the few editable fields that make sense, and then show the important things concisely.
  2. For a new citation/source, the user starts on a new 'Definition' (??) tab. Here he can give the source type. Setting a source type generates the fields needed as per the template definition. Note that some people have already asked for such a setup in some other editors, with a nicer overview layout for objects that are not new.
  3. I would like to enable some copy paste function though on the Definition tab. So, I would like to offer some mechanism to quickly copy paste or select existing parts of title/pub info (for users fixing imported gedcom or old gramps sources), and to import a bibtex and select fields from that. Perhaps a bottom part with buttons, or drag and drop to a top part with the actual fields? Need to try some GUI ideas for how to do this.
  4. In the definition tab, if entry is in a table, a column for author and pubinfo can be added with a checkbox to indicate what to use. The idea here is that we don't need our old Title, Author and Pub Info, but we do need to make clear to users what would be exported to Gedcom, as that is important. In Overview this could be shown in a Gedcom section.
  • Nick:
  1. have a "Preview" tab to show a preview of the Gedcom output as well as the F, S and L format
  2. can we not store dates as value that is date object for SrcAttribute?

In branch - testing of ideas

An Overview window. (Screenshot: Gep18 02.png)

A Cited In tab shows all citations and where they are used. (Screenshot: Gep18 03.png)

From that tab, different citations can be loaded. The top part then becomes the citation editor. Not finished yet in following screenshot, citation template attributes still to be added. (Screenshot: Gep18 04.png)

A Template tab allows selection of templates, and generates fields needed. These fields are stored to the attributes as the user types. Following screenshot is not yet finished version. Default citation fields will be added, as well as short versions as needed. (Screenshot: Gep18 05.png)

Working in the GEP 18 branch. To experiment with.

Ideas:

  • a Cited In tab, showing in a treeview the objects and secondary objects that use the source.

Old data of GEP

Entering source information

There are three broad alternatives for entering source data into a program, and GRAMPS should support all three:

  • Form based: The traditional keyboard data entry method. The fields can be fixed or flexible: The former is easier on the developers, the latter more helpful to users, especially if they are inexperienced.
  • Import: GRAMPS should be able to import (and export) source data from regular reference managers like Zotero, Pybliographer, and BibTeX. The C code in BibUtils could be adapted to speed development.
  • Parsing: It's becoming more common for the large reference websites to provide a ready-made citation on the webpage along with the data or image being presented. (Google Books even offers to download it as a BibTeX file, but that's unfortunately not yet common). It would be very helpful if the user could just paste this citation into a block and GRAMPS took care of parsing it into the appropriate database fields. Experienced users might find typing into the parsing text-entry to be a faster way of inputting source data than using form based input.

Further Discussion of Form-based Input

When the end user cites a source for information, they would be prompted with a window where they would select a main type and drill down through subtypes, as in the first few columns of the table presentation I've given. Once it is selected, the user will be prompted for the required (and perhaps optional) fields specific for that type of source reference.

The user would select the type of the source, and fill in the fields, for L (biblio list), F (full citation), and S (short citation) at citation time. The templates I've provided would be in pop up menus for the user to select.

comment: popup is not very user friendly, better would be a wizard button on the source editor, this lets you define the source, asks for fields, and shows the automatic citation markup based on the templates at the bottom while user adds fields. On Save, all this data is saved in the attributes as needed. To investigate if a new field is needed on source editor.
comment: The fields to input for the source are the same regardless of how the citation is formatted for output. It's the output template's job to select and order the fields, provide common abbreviations, and so on.
comment: The formats that Mrs. Mills provides in Evidence Explained are examples, suitable for personal use. But she's published very widely as well as having edited a journal for many years and knows better than most that everyone has their own style. There should be a facility for customizing the output formats to suit the user or whoever is publishing the work.

Generating citation in reports

Then, when generating a report that contains citations, the markup needs to be applied to the fields according to the specifications in the table method or template method I've provided (e.g. substitute the variables, italicize, embed with the proper punctuation, etc.). Remove optional variables (and their punctuation) if the variable was not input. Remove privacy fields unless a privacy flag is turned on, so that things like home addresses and phone numbers of people aren't put in reports unless you "force" it.

And the first time a citation is encountered in a report, use the Full version (F). The second and succeeding times use the Short (S) version. And when a bibliography is called for, use the L (List) template for that.
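
The report rules above (substitute variables, drop empty optional fields with their punctuation, hide private fields unless forced, Full on first use and Short afterwards) can be sketched in Python as follows. The template layout, field names and the Part and ReportCitations classes are assumptions made for illustration, not the format of the proposed templates, and italics are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """One template element: a field plus the punctuation that travels with it."""
    field: str
    prefix: str = ""
    suffix: str = ""
    optional: bool = False
    private: bool = False


# Hypothetical templates for one source type, keyed by F, S and L.
TEMPLATES = {
    "F": [
        Part("author", suffix=", "),
        Part("title"),
        Part("publisher", prefix=" (", suffix=")", optional=True),
        Part("page", prefix=", p. ", optional=True),
        Part("address", prefix="; ", optional=True, private=True),
    ],
    "S": [
        Part("short_title"),
        Part("page", prefix=", p. ", optional=True),
    ],
    "L": [
        Part("author", suffix=". "),
        Part("title"),
        Part("publisher", prefix=". ", optional=True),
    ],
}


def render(style, fields, include_private=False):
    """Fill one template, skipping private and empty optional parts."""
    pieces = []
    for part in TEMPLATES[style]:
        if part.private and not include_private:
            continue  # drop private data unless the user forces it
        value = fields.get(part.field, "")
        if not value:
            if part.optional:
                continue  # drop the field together with its punctuation
            value = "[" + part.field + "?]"  # flag a missing required field
        pieces.append(part.prefix + value + part.suffix)
    return "".join(pieces) + "."


class ReportCitations:
    """Tracks which sources a report has already cited."""

    def __init__(self):
        self.seen = set()

    def cite(self, source_id, fields):
        # Full form the first time, short form on every later use.
        style = "S" if source_id in self.seen else "F"
        self.seen.add(source_id)
        return render(style, fields)


fields = {"author": "Jane Doe", "title": "An Example",
          "short_title": "Example", "page": "12",
          "address": "1 Main Street"}
report = ReportCitations()
print(report.cite("S0001", fields))
print(report.cite("S0001", fields))
print(render("L", fields))
```

Keeping the punctuation inside each Part is one way to make "remove the variable and its punctuation" a single step; a user-defined style would then only need to supply a different list of parts.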

Template definition

The templates would be stored in an internal database, as would the completed citations for storage and retrieval.

But, these would only be a (good) starting set. Part of the beauty of this parametrization is that the end user can use the language of the mark up in this table or template to define his own source style, punctuation, field quoted or italicized, etc. So in essence, any source output style can be accommodated, and is under full control of the end user. Evidence Style templates can be supplied as a starting set, not the only set. New Evidence Styles can be added, old ones deleted or modified, as the user wishes.


Proposed changes

User Interface

Instead of a one size fits all editor window, a template-controlled window (or wizard, but wizards get annoying for experienced users) would display a set of fields tailored to the selected source type.

A Multiline widget could be available for displaying the long-form citation as one filled in the fields; pasting or typing into the Multiline would be parsed and would change the field entries. This would speed entry for expert users and could be used by anyone to paste in citations provided by database websites like Ancestry.com.
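
As a rough illustration of the parsing idea, the Python sketch below pulls a few fields out of a pasted census citation with regular expressions. The field names and patterns are assumptions chosen to fit the 1841 census examples quoted earlier on this page; a real parser would need a pattern set per source type and would have to cope with far more variation.

```python
import re

# Hypothetical field names mapped to patterns; tuned to the census examples only.
PATTERNS = {
    "folio": re.compile(r"\bfolio\s+(\d+)", re.IGNORECASE),
    "page": re.compile(r"\b(?:p\.|page)\s*(\d+)", re.IGNORECASE),
    "lines": re.compile(r"\blines?\s+(\d+(?:-\d+)?)", re.IGNORECASE),
    "url": re.compile(r"(https?://\S+)"),
    "accessed": re.compile(r"accessed\s+(\d{1,2}\s+\w+\s+\d{4})"),
    "archive_ref": re.compile(r"citing\s+(.+?)\.?$"),
}


def parse_citation(text):
    """Return a dict of the fields that could be recognised in the text."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(flat)
        if match:
            found[name] = match.group(1)
    return found


pasted = """1841 Census of England,
Warwickshire, Atherston Township, Hemlingford Hundred, Mancetter Parish, Enumeration District (ED) 1,
folio 6, p. 4, lines 5-15, Long Street, William Vero household;
digital image, Ancestry.com, (http://www.ancestry.com : accessed 24 September 2004);
citing PRO HO 107/1127/10."""

for name, value in parse_citation(pasted).items():
    print(name, "=", value)
```

Fields that are not recognised are simply absent from the result, so the editor could leave them blank for the user to fill in by hand.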

Import/Export

Continued support of GEDCOM and GrampsXML is of course a given. A BibTeX, MODS, or Citation Style Language import/export of source data would allow easy interchange with bibliography managers like Zotero or EndNote.
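
For a sense of how small a first BibTeX export could be, here is a minimal Python sketch that writes one entry from a dictionary of fields. The function name and the sample values are hypothetical, and the sketch does not escape special characters or handle name formatting, both of which a real exporter would need.

```python
def to_bibtex(entry_type, cite_key, fields):
    """Format one BibTeX entry; empty fields are left out."""
    lines = ["@%s{%s," % (entry_type, cite_key)]
    for name, value in fields.items():
        if value:
            lines.append("  %s = {%s}," % (name, value))
    lines.append("}")
    return "\n".join(lines)


entry = to_bibtex("book", "doe1999", {
    "author": "Doe, Jane",
    "title": "An Example",
    "publisher": "Example Press",
    "year": "1999",
    "note": "",
})
print(entry)
```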

Storage

Several approaches are available for storage of this more elaborate source data:

  • Keep the existing GEDCOM-based arrangement, mapping the elements other than Author, Title, Id, and Abbreviation into a fixed-format string in Pub Info. This has the advantage of maintaining easy GEDCOM export, but packing and parsing the pub-info string carries a computational cost.
  • Keep the existing GEDCOM-based arrangement, leaving the Pub Info string empty and storing everything beyond Author and Title in key-value tuples in a separate table.
  • Change the database to accommodate a fixed number of fields, the data for which are controlled by a template description. Fields would be assigned by priority, with a certain number being common to all (e.g., Title and Creator). This model affords the easiest querying and fastest record assembly (many fewer joins), but wastes database space when not all fields are required for a source type.
  • Change to a pure binary-relational data structure, where each source datum is stored in a named tuple and the structure and mapping of each format is controlled by a template table. This is on the one hand the most flexible and storage-efficient, but on the other greatly complicates querying, as multiple joins are required to construct a record.
  • Some combination of the last two, where the most common fields (perhaps Source-Type, Title, Creator/Agency, Publisher/Periodical, Location/URI, and Date) are stored in a single record and the remaining are stored in a separate key/value table -- perhaps the same one as used by the Data tab in the Source Editor window, though flagged so as not to appear there.
  • Store a formatted string. This is space-efficient and eminently flexible, but requires formatting the string on storage and parsing it for any manipulation.

The present author favors a combination fixed table with mapped common fields and an auxiliary key/value table for additional elements which are less commonly used. An important consideration is the ability of users to fashion their own source types: Mrs. Mills has provided templates for those types she thinks researchers are most likely to encounter, but could not possibly have provided for every available source-type.
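
The favored combination (a fixed record for the common fields plus an auxiliary key/value table) can be sketched in Python as follows. SQLite is used here only because it ships with Python and makes the table layout explicit; it is an assumption of this sketch, not a statement about the Gramps storage backend, and the table and column names are invented for illustration.

```python
import sqlite3

# Fields assumed common to every source type; everything else is key/value.
COMMON = ("source_type", "title", "creator", "date")


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, "
               "source_type TEXT, title TEXT, creator TEXT, date TEXT)")
    db.execute("CREATE TABLE source_extra (source_id INTEGER, key TEXT, "
               "value TEXT, PRIMARY KEY (source_id, key))")
    return db


def save_source(db, fields):
    """Split the fields between the fixed record and the key/value table."""
    common = [fields.get(name) for name in COMMON]
    cur = db.execute("INSERT INTO source (source_type, title, creator, date) "
                     "VALUES (?, ?, ?, ?)", common)
    source_id = cur.lastrowid
    extras = [(source_id, k, v) for k, v in fields.items() if k not in COMMON]
    db.executemany("INSERT INTO source_extra VALUES (?, ?, ?)", extras)
    return source_id


def load_source(db, source_id):
    """Reassemble one source (the id must exist) from both tables."""
    row = db.execute("SELECT source_type, title, creator, date "
                     "FROM source WHERE id = ?", (source_id,)).fetchone()
    fields = {k: v for k, v in zip(COMMON, row) if v is not None}
    rows = db.execute("SELECT key, value FROM source_extra "
                      "WHERE source_id = ?", (source_id,))
    fields.update(dict(rows))
    return fields


db = open_store()
census = {"source_type": "Census", "title": "1841 census of England",
          "piece": "1127", "book": "10", "folio": "6"}
sid = save_source(db, census)
print(load_source(db, sid))
```

Listing or searching sources by title or type touches only the fixed table, while a user-defined source type can add any number of extra keys without a schema change.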

Additionally, some source-types require additional information which doesn't fit well into a database field format. One example is records in private hands: Mrs. Mills recommends including the provenance of the record showing that it indeed indicates the individuals claimed. Such a provenance is best recorded in a source note. She also stresses the importance of a detailed evaluation of the source by the researcher, and this evaluation is also best recorded in a source note.

Proposed Report changes

Reports use the new citation style, using templates to build the citation.

References

  1. Original Users Mailing list discussion: Evidence Explained Style Sources
  2. ISO 2709/MARC support?