Difference between revisions of "Place completion tool"
Revision as of 06:30, 21 December 2008
A tool to bring the places in your GRAMPS database in accordance with the GRAMPS requirements: batch add country, county; look-up latitude-longitude; set description (title); ...
There is a version for GRAMPS 3.0.1, as well as an older version that works with GRAMPS 2.2.5+. Both are beta versions; see Download.
- 1 Place Completion tool
- 2 Design Specification
- 3 Manual
- 4 Example
- 5 Advanced Usage
- 6 Troubleshooting
- 7 Download
- 8 Patches
Place Completion tool
This tool helps you fill in place attributes such as county and country, by letting you select the places you want to work on and then apply changes to all of them with one button click.
Design Specification
The general aims are:
- Place/Location is a newer concept in GRAMPS. Many older databases only have a Place title field which is a descriptive text containing city, state, country. This should be parsed to insert the values in the correct attribute fields.
- Latitude and longitude are important to show data on a map. However, looking up this data on the internet by hand is slow and time consuming. The tool lets you search the free resources available on the net.
- Setting of an attribute of a set of places in one go. Eg you give a
- Conversion of latitude and longitude to a fixed data format. On import one might obtain latitude and longitude in several different formats. A conversion tool to store them all in the same format is useful.
- Construction of a uniform title/description field, from the data in the place object
Manual
The place completion tool offers a lot of functionality. This manual should help you to understand how it works.
The place completion tool can look up latitude/longitude for you, add county information (USA), and more. For some of this functionality, you must download data files for the countries you are interested in. Right now you have three options:
- Download GeoNames country files. These are freely available. GeoNames files parse fastest, so this is the recommended format.
- Download GeoNames USA state files. These are freely available. This is recommended for USA searches, as the USA country file contains many duplicate names, which can be avoided by searching state by state. The state files also contain county information.
- Download GNS Geonet country files (not available for the USA). These are freely available via ftp.
Watch out, some of these downloads are VERY large, especially USA data. Only download what you need!
Note: GeoNames lists popular places under their English names, so eg municipalities in Italy will be found, but Roma will not, as it is listed as Rome. To find such places you need to search the localised variants of the name (see below).
DO NOT BETA TEST WITH YOUR RESEARCH DATA. EXPORT DATA FIRST TO HAVE A BACKUP.
Starting the tool
You will find the plugin under 'Tools > Utilities > Interactive place completion'
The dialog explained
The Dialog consists of 4 parts:
Part 1. Selection of places
First you need to choose with which places you want to work. You can use several methods to define your places:
- Use a place filter. You can use two preset filters: All places, which returns all places, and No Latitude/Longitude given, which returns all places for which the latitude or the longitude is not set. You can also create a custom place filter in the place view, test it with the filter sidebar, and then use it in this tool. All custom filters you made will be available.
- To avoid having to create a filter for every city, county, ... in your data, you can set the country, state, county, city or parish of the places you want to search for. This works just like the filter sidebar in the places view.
- Use a latitude, longitude rectangle. Eg, suppose you have the latitude and longitude of all places in the UK, and now want to set the state attribute to Wales for all places in Wales. You can look at a map, note down the latitude and longitude of the centre of Wales, as well as roughly the width and height of a rectangle around it. This selects all places in Wales (and some in England), so you can set the state information much faster.
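The rectangle selection can be pictured with a small sketch. This is not the tool's own code: the `Place` class, the `in_rectangle` helper, and the assumption that width and height are given in decimal degrees are all made up for illustration, and the coordinates are rough values.

```python
from dataclasses import dataclass


@dataclass
class Place:
    """Hypothetical stand-in for a GRAMPS place with decimal coordinates."""
    title: str
    lat: float
    lon: float


def in_rectangle(place, centre_lat, centre_lon, height, width):
    """Return True if the place lies inside the rectangle around the centre.

    Assumption: height and width are expressed in decimal degrees.
    """
    return (abs(place.lat - centre_lat) <= height / 2.0
            and abs(place.lon - centre_lon) <= width / 2.0)


# Rough, illustrative coordinates only.
places = [
    Place("Cardiff", 51.48, -3.18),
    Place("Aberystwyth", 52.42, -4.08),
    Place("London", 51.51, -0.13),
]

# A rough rectangle around Wales: centre 52.3 N, 3.7 W, 2.4 degrees high, 2.8 wide.
selected = [p.title for p in places if in_rectangle(p, 52.3, -3.7, 2.4, 2.8)]
print(selected)
```

A rectangle is a coarse selection, which is why some places across the border can end up in the result and should be checked before applying changes.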
Part 2. Completion of Places
- The first possibility is to look up the latitude and longitude of your places in a data file. For this you must have downloaded the necessary resources, see the section above. You can select the file you want to search with a file dialog, and set how this data must be parsed. The following parsing options are available:
  - GeoNames country file, city search: use the city attribute to look for lat/lon in a GeoNames country file. This is the fastest search.
  - GeoNames country file, city localized variants search: use the city attribute to look for lat/lon in a GeoNames country file using the localised (non-English) known names in the GeoNames file. Eg, Roma will be found with this option (as Roma is the Italian local variant of the English name Rome).
  - GeoNames country file, title begin, general search: use the start of the title field to search in a GeoNames file. Here, start means everything before the first comma. This allows you to find landmarks, squares, ... . Eg, if the title of your place is Piazza Navona, Rome, this search will find the latitude and longitude of this famous square in Rome.
  - GeoNames USA state file, city search: searching the USA country file is of little use, as it takes a long time and most names occur several times. It is therefore better to search state by state. If a USA state file is selected for the search, you must select this option. The city attribute is used for the search.
  - GNS Geonet country file, city search: use the city attribute to search in a GNS file (slower than the GeoNames search!).
  - GNS Geonet country file, title begin search: use the start of the title of a place to search in a GNS file. Here, start means everything before the first comma.
- A second option is to parse some existing data in your places.
  - You can parse the title attribute to extract information from it. Eg a title like Albany, NY can be used to set the city attribute to Albany and the state attribute to NY.
  - You can set the title of all the selected places to a uniform format. This is interesting if, due to imports, you have different styles for the title field, which can be annoying in reports. At the moment there are two options:
    - Set title field to City[, State]: the title of your places will contain the city, and if the state field is present, the state will be appended after a comma.
    - Set title field to Titlestart[, City][, State]: the present start of your title will be kept. If this start is not the city, then the city will be appended. If the state is present, the state will be appended as well. An example: suppose your title is Piazza Navona, Italy, the city is Rome and the state is Lazio. Using this option would change the title attribute into Piazza Navona, Rome, Lazio.
  - Convert latitude and longitude to a uniform format. Again due to import or copy/paste, you might have latitude and longitude entered in different formats. This is annoying in reports. This option allows you to set the lat/lon of all selected places to one form. The options are:
    - All in degree notation: use the classical degree notation with degrees, minutes and seconds.
    - All in decimal notation: use the decimal system to denote lat/lon.
    - Correct -50° in 50°S: a common error is to use a minus sign in the classical degree notation, which is wrong, and which GRAMPS will not be able to interpret. With this option this error is looked for and corrected.
- A third option is to set attributes of all selected places. You can set the country, state, county, parish, zip/postal code and city attributes of all places in one sweep.
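The three latitude/longitude conversions described above can be sketched as follows. This is a minimal illustration and not the tool's implementation: the function names, the exact degree string layout (such as 50°30'00"S), and rounding to whole seconds are assumptions made for the example.

```python
import re

# Assumed degree notation: degrees, optional minutes and seconds, then N/S/E/W.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d+)°\s*(?:(?P<min>\d+)'\s*)?"""
    r"""(?:(?P<sec>\d+(?:\.\d+)?)"\s*)?(?P<hem>[NSEW])\s*$"""
)


def decimal_to_dms(value, is_latitude):
    """Convert a signed decimal value to degree notation, rounded to seconds."""
    if is_latitude:
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%d°%02d'%02d\"%s" % (degrees, minutes, seconds, hemisphere)


def dms_to_decimal(text):
    """Convert degree notation to a signed decimal value, or None if unparsable."""
    match = _DMS.match(text)
    if match is None:
        return None
    value = (int(match.group("deg"))
             + int(match.group("min") or 0) / 60.0
             + float(match.group("sec") or 0) / 3600.0)
    return -value if match.group("hem") in "SW" else value


def fix_negative_degree(text, is_latitude):
    """Rewrite the erroneous form -50° as 50°S (or 50°W for a longitude)."""
    text = text.strip()
    if text.startswith("-") and "°" in text:
        return text[1:] + ("S" if is_latitude else "W")
    return text


print(decimal_to_dms(-50.5, True))
print(dms_to_decimal("50°30'00\"S"))
print(fix_negative_degree("-50°", True))
```

The sketch does not handle a value that carries both a minus sign and a hemisphere letter; such entries would need a manual check.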
Part 3. Overview of the results
After having entered all data in Part 1 and 2, click Find to let GRAMPS work out all changes that would be made. This part of the dialog shows those changes.
All selected places are shown. If a place would be changed, every change is listed as a separate subentry of that place.
If the change will overwrite an existing entry, the subentry is shown in orange.
TO AVOID PROBLEMS, GO OVER ALL CHANGES QUICKLY, AND CHECK ALL ENTRIES IN ORANGE!
The following actions are possible in the result screen:
- press delete to delete the entry, making sure that this change will not occur. You can delete the place entry to delete all its changes, or select one subentry to delete only that specific change.
- double-click on an entry to open the place dialog. If you double-click on the place entry, all changes will be preentered. If you double-click on a subentry, only this specific change will be preentered in the place dialog.
- press tab to open Google Maps in a browser window. Pressing tab on a subentry showing a new lat/lon entry will open Google Maps at this new lat/lon position. Pressing tab on the top place entry will open Google Maps at the old lat/lon position or, if that is not known, search on the title/city field.
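The choice between a lat/lon position and a text search can be sketched as below. This is only an illustration: the `maps_url` and `open_in_browser` helpers are hypothetical, and the URL layout follows the present-day Google Maps search URL scheme, which is an assumption and not necessarily what the tool itself opens.

```python
import webbrowser
from urllib.parse import urlencode


def maps_url(lat=None, lon=None, fallback_text=""):
    """Build a map search URL from coordinates, or from text if none are known."""
    if lat is not None and lon is not None:
        query = "%s,%s" % (lat, lon)
    else:
        # No coordinates known: fall back to the title/city text.
        query = fallback_text
    return "https://www.google.com/maps/search/?" + urlencode(
        {"api": 1, "query": query})


def open_in_browser(url):
    """Open the URL in the default browser (not called in this sketch)."""
    webbrowser.open(url)


print(maps_url(31.31, -85.85))
print(maps_url(fallback_text="Enterprise, AL"))
```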
Part 4. Actions
After you have checked the changes in Part 3, you can now apply them all with one button click, by clicking the Apply button.
Clicking Help will bring you to this page, clicking Close will close the window and clicking Google Maps when an entry is selected in the results field has the same effect as pressing tab on an entry (see above).
Example
Open the example file from the examples in which latitude and longitude are empty: example.gramps.
We will now show how the places in this file can be completed. The best thing to do is to create a new family tree (.grdb), give it a name, and import the example.gramps file. This file has 852 places, which would mean a lot of manual edits if you did not use this tool!
Now, open the place view. You will see all places are of the form:
- Aberdeen, WA
This value is the Place Name attribute (the title or description of the place).
Step 1: City and State data
Our first step will be to split this field into a City value (here Aberdeen), and a State value (here WA).
We open the Place completion tool, select All Places, and parse the title as City [,|.] State. Click on Find, quickly scan the data to check that all looks ok, and then click on Apply. You are notified that 851 places are updated. This is one less than the number of places. Indeed, one place has a different type of title: Puerto Rico has no state information.
Step 2: Look-up latitude and longitude
We have downloaded the GeoNames data files for the USA states, and will now use them to complete the latitude and longitude of the data. At the same time, this will fill in the county field.
For this step we select All Places with State=AK. In the second part of the window we specify that we want to search in the AK_DECI.txt file downloaded from GeoNames, using the parsing method: GeoNames USA state file, city search.
Note that if you want to change AK into Alaska, this is possible. Just set state=Alaska in the set attributes section of the window.
Do this now for all the states. Always check for duplicates. Eg, for state AL, going over the changes, we encounter the city Enterprise twice.
We see that the first time 'Enterprise' is found, it is in county Coffee at lat/lon:31.31/-85.85. The second hit is for county Chilton with lat/lon:32.73/-86.62.
You can now use the Google Maps button (or press the TAB key) while the lat/lon subentry is selected to see where this city is in both cases. From this it will be clear, for example, that the second is a hamlet, not really a city, while the first is a real city. So now, select the second lat/lon entry, and delete it by pressing the DEL key. Do the same for the second county entry.
In case Google Maps did not allow you to determine which is the correct city, you can double-click on the city to open the Place Dialog (Warning: this will preenter the data of the Place Completion tool, so hit cancel here if you want to exit without making these changes). In this dialog the references tab allows you to navigate to all events coupled to this place. This will give you extra information you might use to decide which of the two places found is the correct one.
Step 3: Problem entries
While updating all places in step 2, you will have noticed some errors in the state information: some places have a dubious state, eg OH-AL.
You can obtain these places by choosing All Places and setting the state search box to -. Clicking Find will give you all these problem places. You can use Google Maps or the place dialog to sort them out. You can also use the USA country GeoNames file to search for these places in the entire USA. You will need sufficient memory for this, or you will obtain a MemoryError (see below)!
Step 4: Lat/Lon not found
After the above, still some 45 places have no latitude/longitude found. You can now select these places by setting the Place filter to 'No Latitude/Longitude', which will find you all places with no coordinates.
It will be clear that many of those can be quickly corrected. Some are abbreviations: eg the city field contains St.George, which should be Saint George. Others are double names: eg Waterloo-Cedar Falls, IA means Waterloo near Cedar Falls. Changing the city to Waterloo, redoing the search and checking with Google Maps will quickly show which coordinates for Waterloo are needed.
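The two manual corrections mentioned here (expanding St. and splitting double names) could be prepared with a small helper before redoing the search. This is a hypothetical sketch, not a feature of the tool; `search_candidates` is an invented name and the rules cover only these two cases.

```python
import re


def search_candidates(city):
    """Return city names worth trying in a lookup, most specific first."""
    # Expand the abbreviation "St." (with or without a following space).
    expanded = re.sub(r"\bSt\.\s*", "Saint ", city).strip()
    candidates = [expanded]
    # A double name such as Waterloo-Cedar Falls: also try each part alone.
    if "-" in expanded:
        candidates.extend(
            part.strip() for part in expanded.split("-") if part.strip())
    return candidates


print(search_candidates("St.George"))
print(search_candidates("Waterloo-Cedar Falls"))
```

The full name stays first in the list, so a city whose real name contains a hyphen is still tried unchanged.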
Advanced Usage
This section is for advanced users who know regular expressions.
The parsing fields have entry fields allowing you to give your own parsing. Parsing uses regular expressions. You can use this to parse your title, and to parse a lat/lon file in your own way. For reference, here is an overview of the parsing codes used for the predefined parses:
Parse title details
The following regular expressions are used, where for brevity we use some variables defined below.
Note: Those new to Python and regular expressions should first review the Python Regular Expression HOWTO.
- "City [,|.] State" is parsed by : r'\s*(?P<'+city_translated +r'>.+?)\s*[.,]\s*(?P<'+state_translated +r'>.+?)\s*$'
- "City [,|.] Country" is parsed by : r'\s*(?P<'+city_translated +r'>.+?)\s*[.,]\s*(?P<'+country_translated +r'>.+?)\s*$'
- "City (Country)" is parsed by : r'\s*(?P<'+city_translated +r'>.*?)\s*\(\s*(?P<'+country_translated +r'>[^\)]+)\s*\)\s*$'
- "City" is parsed by : r'\s*(?P<'+city_translated +r'>.*?)\s*$'
Here the variables used are:
lat_translated = _('lat')
lon_translated = _('lon')
city_translated = _('city')
county_translated = _('county')
state_translated = _('state')
country_translated = _('country')
You can use any of these variables as a group name; the tool will recognise them and use the matched values for the corresponding place attributes.
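To see how such a title regex behaves, the "City [,|.] State" expression can be tried outside GRAMPS. The sketch below is written for a current Python 3; it assumes an untranslated setup, so plain strings stand in for the `_('city')` and `_('state')` lookups, and `parse_title` is a helper name invented for the example.

```python
import re

# Assumption: English locale, so the translated group names are unchanged.
city_translated = "city"
state_translated = "state"

CITY_STATE = (r'\s*(?P<' + city_translated + r'>.+?)\s*[.,]\s*(?P<'
              + state_translated + r'>.+?)\s*$')


def parse_title(title):
    """Return the named groups found in the title, or None if it does not match."""
    match = re.match(CITY_STATE, title)
    return match.groupdict() if match else None


print(parse_title("Albany, NY"))
print(parse_title("Puerto Rico"))
print(parse_title("St. George, UT"))
```

Because the separator class also accepts a full stop and the city group is non-greedy, a title such as St. George, UT is split at the first full stop, giving city St and state George, UT. It is worth scanning the results for such cases before applying.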
Lat/Lon lookup parsing
For the regex of lat/lon lookup, you need to indicate which data must be replaced with existing place attributes for the search, as well as indicate which regex groups must be extracted.
- "GeoNames country file, city search" is parsed with: r'\t'+CITY_transl +r'\t[^\t]*\t[^\t]*\t' +latgr + r'[\d+-][^\t]*)\t' + longr + r'[\d+-][^\t]*)\tP'
- "GeoNames country file, city localized variants search" is parsed with: r'[\t,]'+CITY_transl+r'[,\t][^\t\d]*\t?' +latgr + r'[\d+-][^\t]*)\t' + longr + r'[\d+-][^\t]*)\tP'
- "GeoNames country file, title begin, general search" is parsed with: r'\t'+TITLEBEGIN_transl +r'\t[^\t]*\t[^\t]*\t' +latgr + r'[\d+-][^\t]*)\t' + longr + r'[\d+-][^\t]*)\t[PSTV]'
- "GeoNames USA state file, city search" is parsed with: r'\t'+CITY_transl+r'\tPopulated Place\t[^\t]*\t[^\t]*\t' + countygr + r'[^\t]*)' + r'\t[^\t]*\t[^\t]*\t[^\t]*\t' +latgr + r'[\d+-][^\t]*)\t' + longr + r'[\d+-][^\t]*)'
- "GNS Geonet country file, city search" is parsed with: r'\t'+latgr+r'[\d+-][^\t]*)\t'+longr+r'[\d+-][^\t]*)' + r'\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\tP\t[^\t]*\t[^\t]*' + r'\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*' r'\t[^\t]*\t[^\t]*\t[^\t]*' + r'\t'+CITY_transl+r'\t[^\t]*\t[^\t\n]+$'
- "GNS Geonet country file, title begin search" is parsed with: r'\t'+latgr+r'[\d+-][^\t]*)\t'+longr+r'[\d+-][^\t]*)'+ r'\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[PLSTV]\t[^\t]*\t[^\t]*'+ r'\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*' + r'\t[^\t]*\t[^\t]*\t[^\t]*' + r'\t'+TITLEBEGIN_transl+r'\t[^\t]*\t[^\t\n]+$'
- Read of mediawiki CSV dump. This reads the files on  (for more information, see http://meta.wikimedia.org/wiki/WikiProjects_Geographical_coordinates) (Contribution by nomeata)
For extraction of data you can use the same group names as in title parsing, so eg latgr in the above stands for: r'(?P<'+lat_translated +r'>' .
The values used for searching in the file, eg CITY_transl, are written as: _('CITY'). You can use as substitution values: _('CITY'), _('TITLE'), _('TITLEBEGIN'), _('STATE'), _('PARISH').
The tool reads the given regex, replaces the substitution strings with the values from the place object, does the search, and extracts the named regex groups from the result.
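The read, substitute, search and extract cycle can be sketched for the "GeoNames country file, city search" expression. This is an approximation for illustration, not the tool's code: the `lookup` helper is invented, escaping the city with `re.escape` is a choice made here, the group names assume an untranslated setup, and the data line is a hand-written sample in the GeoNames column layout with illustrative values.

```python
import re

# Assumption: English locale, so the group and substitution names are unchanged.
latgr = r'(?P<lat>'
longr = r'(?P<lon>'
CITY_transl = 'CITY'

TEMPLATE = (r'\t' + CITY_transl + r'\t[^\t]*\t[^\t]*\t'
            + latgr + r'[\d+-][^\t]*)\t' + longr + r'[\d+-][^\t]*)\tP')


def lookup(template, city, data):
    """Substitute the city into the template and return (lat, lon) or None."""
    pattern = template.replace(CITY_transl, re.escape(city))
    match = re.search(pattern, data)
    return (match.group('lat'), match.group('lon')) if match else None


# Hand-written sample line: id, name, asciiname, alternate names, lat, lon, class, ...
data = "0000001\tRome\tRome\tRoma,Rom\t41.89\t12.51\tP\tPPLC\tIT\n"

print(lookup(TEMPLATE, "Rome", data))
print(lookup(TEMPLATE, "Roma", data))
```

The second call finds nothing because Roma only appears in the alternate names column, which is the situation the localized variants search is meant for.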
Troubleshooting
Non UTF-8 latitude/longitude file
The place completion tool expects the input files for location lookup to be in unicode (utf-8). If this is not the case, you will get the error:
File "/home/benny/programms/gramps/gramps2/src/plugins/PlaceCompletion.py", line 851, in load_latlon_file
    self.latlonfile_datastr = infile.read()
File "/usr/lib/python2.4/codecs.py", line 481, in read
    return self.reader.read(size)
File "/usr/lib/python2.4/codecs.py", line 293, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1610092-1610094: invalid data
Note that the Place Completion tool catches these errors and shows you an information box. After this, the tool will attempt to read the file with utf-8 (unicode), ignoring errors. This might give good results, but will of course fail to produce results on non-unicode encoded files.
In the above example it is clear the problem is confined to a few bytes, so you can correct this manually: open the file with eg the KHexEdit Binary Editor, go to the specified position (offset 1610092), and replace the offending bytes with spaces.
If the file is completely non-unicode, you will have to convert it to unicode with a tool before using it in the place completion tool.
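One way to do such a conversion is a few lines of Python. This is a sketch under assumptions: `convert_to_utf8` is an invented helper, the source encoding (latin-1 here) is a guess that you must replace with the real encoding of your file, and the sample line is made up so the block can run on its own.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Copy a text file line by line, re-encoding it as UTF-8.

    Assumption: src_encoding is the real encoding of the source file.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as infile, \
            open(dst_path, "w", encoding="utf-8", newline="") as outfile:
        for line in infile:
            outfile.write(line)


# Self-contained demonstration on a made-up one-line file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "places_latin1.txt")
    dst = os.path.join(tmp, "places_utf8.txt")
    with open(src, "wb") as handle:
        handle.write("Liège\t50.63\t5.57\n".encode("latin-1"))
    convert_to_utf8(src, dst)
    with open(dst, "rb") as handle:
        converted = handle.read()

print(converted.decode("utf-8"))
```

Converting with the wrong source encoding does not raise an error for latin-1, it silently produces wrong characters, so check a few accented names in the result.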
The tool might fail with the error:
    self.latlonfile_datastr = infile.read()
File "/usr/lib/python2.4/codecs.py", line 481, in read
    return self.reader.read(size)
File "/usr/lib/python2.4/codecs.py", line 293, in read
    newchars, decodedbytes = self.decode(data, self.errors)
MemoryError
The tool has to load the data file for latitude/longitude searching into memory. For large files like USA.txt, this might be impossible if you have limited memory. You can try closing as many of the programs running alongside GRAMPS as possible, and then try the tool again.
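Another possible workaround, which is not part of the tool, is to shrink the data file before loading it. The sketch below assumes a GeoNames country file in which the seventh tab-separated column holds the feature class, and keeps only populated places (class P). The helper name and the sample lines are invented for illustration.

```python
import os
import tempfile


def keep_populated_places(src_path, dst_path):
    """Copy only the lines whose feature class column is P; return the count."""
    kept = 0
    with open(src_path, "r", encoding="utf-8", newline="") as infile, \
            open(dst_path, "w", encoding="utf-8", newline="") as outfile:
        for line in infile:  # streams the file, never holds it all in memory
            fields = line.split("\t")
            if len(fields) > 6 and fields[6] == "P":
                outfile.write(line)
                kept += 1
    return kept


# Self-contained demonstration on two made-up lines.
sample = ("0000001\tRome\tRome\tRoma\t41.89\t12.51\tP\tPPLC\tIT\n"
          "0000002\tMonte Mario\tMonte Mario\t\t41.92\t12.45\tT\tHLL\tIT\n")
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "country.txt")
    dst = os.path.join(tmp, "country_P.txt")
    with open(src, "w", encoding="utf-8", newline="") as handle:
        handle.write(sample)
    kept = keep_populated_places(src, dst)
    with open(dst, "r", encoding="utf-8", newline="") as handle:
        remaining = handle.read()

print(kept)
```

A file reduced this way only suits the city searches; the title begin searches also match other feature classes and would miss entries.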
Download
You can download the beta version:
- For Gramps 3.0.x: placecompletion_1_1.tar.gz
- For Gramps 2.2.5+: placecompletion_1_0.tar.gz
Extract the two files in this archive and put both the .glade and the .py file in the plugins directory. On Linux:
- local install: place in
- global install: place_to_gramps_install/src/plugins
DO NOT BETA TEST WITH YOUR RESEARCH DATA. EXPORT DATA FIRST TO HAVE A BACKUP, THEN RUN THE TOOL
Patches
Parsing place title
For France, some practical rules could be useful when entering places. We need:
- the city name + INSEE code (optional). This code is unique and identifies a commune with certainty (together with the county, district, township and municipality). It identifies a commune reliably even if the commune has changed its name. This code is used in archives. Using the postal code is not advisable ...
- a subdivision: identifies a parish or a named locality within a municipality
- the state (optional) or county, although this is already contained in the INSEE code
- the country (optional). Ideally the country should always be entered. Understandably, this is tedious. One option is to leave out the country if the genealogy is mostly in one country, and to enter it only for events outside that main country. Everyone will appreciate it.
If you have a place name like: City, code, state, country, you could try this patch on the current Place completion tool to parse the place title into the location fields:
1080a1081
+ code_translated = _('zip')
1091a1093
+ codegr = r'(?P<'+code_translated +r'>'
1100a1103,1106
+ ("citycodestatecountry", "City[,|.] Code[,|.] State[,|.] Country",
+ _("City[,|.] Code[,|.] State[,|.] Country")
+ , r'\s*'+citygr+r'.*?)\s*[.,]\s*'+codegr+r'.*?)\s*[.,]\s*'
+ +stategr+r'.*?)\s*[.,]\s*'+countrygr+r'.*?)\s*$'),
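The regex added by this patch can be tried on its own before patching the tool. The sketch below assumes an untranslated setup, so the group names are written out as plain strings in place of the `_()` lookups, and the sample title is made up for illustration.

```python
import re

# Assumption: English locale, so the translated group names are unchanged.
citygr = r'(?P<city>'
codegr = r'(?P<zip>'
stategr = r'(?P<state>'
countrygr = r'(?P<country>'

CITY_CODE_STATE_COUNTRY = (
    r'\s*' + citygr + r'.*?)\s*[.,]\s*' + codegr + r'.*?)\s*[.,]\s*'
    + stategr + r'.*?)\s*[.,]\s*' + countrygr + r'.*?)\s*$')

# Made-up sample title in the form City, code, state, country.
match = re.match(CITY_CODE_STATE_COUNTRY, "Lyon, 69123, Rhône, France")
fields = match.groupdict() if match else None
print(fields)
```

A title with fewer than three separators does not match at all, so places entered without a code or country are left untouched by this parse.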