

  1. How much duplicate botanical data is available for digitization reuse?
Íñigo Granzow-de la Cerda¹ & Ben Anhalt²
TDWG 2011, New Orleans, 17 Oct.
¹ Autonomous University of Barcelona; ² Biodiversity Institute, University of Kansas

  2. Addressing a major challenge to herbarium digitization: the cost of specimen data acquisition
Mainly caused by:
• Populating locality information
• Georeferencing
Addressed by:
• Minimizing the number of fields populated, but ... how far are we willing to go in minimizing data acquisition and still remain useful?

  3. Addressing a major challenge to herbarium digitization: the cost of data acquisition
Alternatively: HARVESTING data that have already been acquired elsewhere.

  4. Addressing a major challenge to herbarium digitization: the cost of data acquisition
• Historically, plant collectors have distributed (and still distribute) duplicates of specimens among peer institutions (for identification/verification by specialists, specimen exchanges, etc.)
• Institutions undertake digitization projects independently of each other
• So specimens get databased regardless of whether their duplicates have already been digitized by one or more peer institutions

  5. Addressing the main challenge to herbarium digitization
• The overall amount of specimen duplication is largely unknown
• Searching for duplicates has been attempted through the FilteredPush network architecture
• We have developed a more rudimentary but more direct way of doing it: SGR (Scatter, Gather, and Reconcile of Specimen data)

  6. What can metadata do for you?
• After decades of databasing efforts, it is likely that a significant volume of specimen duplication has accumulated in aggregated metadata, like GBIF.
• Redundancy is good, despite the perceived inefficiency of repeatedly capturing data from the same specimen
• Latecomers to the databasing process have an advantage: many of their specimens (if duplicates) have already been digitized by someone else

  7. What can metadata do for you?
GBIF can serve your data acquisition needs in two ways:
1. Data from duplicate specimens already acquired by prior databasing effort(s) can be used as a source for populating part of your own records, harvesting data specimen by specimen, field by field
• This includes the fields that are most expensive to acquire (e.g. locality data and assigning geocoordinates); a sketch of such field-level harvesting follows below
2. Helping identify which of your specimens are UNIQUE (absent from existing metadata). These are the most valuable specimens in your collection, the ones to be prioritized for full digitization.
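As a rough illustration of the first mode, here is a minimal Python sketch of field-level harvesting. The flat-dict record layout keyed by Darwin Core terms, the field list, and the function name are illustrative assumptions, not the actual SGR implementation:

```python
# Sketch of harvesting expensive fields from a matched GBIF duplicate.
# The record layout (flat dicts keyed by Darwin Core terms) and the
# field list below are illustrative assumptions.

EXPENSIVE_FIELDS = [
    "locality", "stateProvince", "country",
    "decimalLatitude", "decimalLongitude",
]

def harvest_fields(own_record, duplicate):
    """Copy costly-to-acquire values from a matched duplicate into our
    skeletal record, never overwriting data we already hold."""
    enriched = dict(own_record)
    for field in EXPENSIVE_FIELDS:
        if not enriched.get(field) and duplicate.get(field):
            enriched[field] = duplicate[field]
    return enriched
```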

  8. What is the level of duplication out there?
SOURCE: 5 botanical databases among the 13 with the most records in GBIF (July 2011 release), global in scope and sufficiently DarwinCore-compliant for the fields of interest.

Institution / project database: # recs. in GBIF
• BfN/NetPhyD (Bundesamt für Naturschutz / Netzwerk Phytodiversität Deutschland): 3,916,545
• MO Tropicos: 3,741,903
• O Oslo Bot. Mus. VXL: 1,217,931
• ANTHOS (GBIF Spain / Fundación Biodiversidad): 1,118,715
• NY Herbarium: 924,217
• S S-Vascular: 658,511
• NSW Royal Bot. Gdn., Sydney: 595,642
• NHN Nationaal Herbarium Nederland: 588,872
• KNA Plant: 538,851
• MNHN Paris: 509,255
• O V: 490,550
• US Botany: 424,133
• K RBG Kew Herbarium: 359,134
(31,343,738 records in all, across ca. 650 botanical collections with > 1,000 records in GBIF)

  9. What is the level of duplication out there?
SOURCE: sample data sets, each consisting of 30-60K botanical records, belonging to 4-5 of 6 selected countries, chosen such that:
• # recs. had to be > 2k and < 35k for any given country
• 3 countries with highly diverse floras (MX, CR, ZA)
• 2 large in size (CDN, AU)
• 1 relatively small, with a rich but less well-collected flora (GY)
• countries to which any of the institutions belonged were excluded

Records per institution and country:
Institution                        Australia   Canada  Costa Rica   Guyana   Mexico  S. Africa    Total
S (Naturh. Riksmus. Stockholm)         8,400    4,700           -        -    6,100     15,600   34,800
MO (MOBot / Tropicos)                  5,500   15,300           -   20,600        -     17,400   58,800
K (Royal Bot. Garden Kew)             12,200        -       2,200    4,500   13,200     25,000   57,100
NY (NY Bot. Garden)                    3,700        -       3,600    3,400   23,500          -   34,200
US (Smithsonian Inst.)                 3,100        -      10,600    3,600   32,700          -   50,000
Total                                 32,900   20,000      16,400   32,100   75,500     58,000

  10. What is the level of duplication out there?
TARGET: a Lucene index of GBIF records.
The SGR search algorithm was run for each source dataset, targeting the fields:
• Collector name
• Collector's field number
• Collection date
• Taxon name
This generated a matching index (sketched below).
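The Python sketch below mimics that matching step with exact lookups over normalized composite keys; the real SGR matches against a Lucene index with fuzzy scoring, and the Darwin Core field names are my assumed equivalents of the four fields listed above:

```python
import re
from collections import defaultdict

# The four match fields from the slide, expressed as Darwin Core terms
# (an assumption; SGR itself uses a Lucene index with fuzzy scoring
# rather than the exact normalized keys used here).
MATCH_FIELDS = ("recordedBy", "recordNumber", "eventDate", "scientificName")

def match_key(rec):
    """Collapse the four match fields into one normalized key."""
    norm = lambda s: re.sub(r"[^a-z0-9]", "", (s or "").lower())
    return "|".join(norm(rec.get(f)) for f in MATCH_FIELDS)

def build_index(target_records):
    """Index the target (e.g. GBIF) records by composite match key."""
    index = defaultdict(list)
    for rec in target_records:
        index[match_key(rec)].append(rec)
    return index

def find_duplicates(source_records, index):
    """Yield each source record with its candidate duplicates."""
    for rec in source_records:
        yield rec, index.get(match_key(rec), [])
```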

  11. What is the level of duplication out there?
Institution                       Total recs.   % duplication   Multiplicity of duplicates, average
S (Naturh. Riksmus. Stockholm)         34,800           10.4%   2.62
MO (MOBot / Tropicos)                  58,800           12.1%   1.4
K (Royal Bot. Garden Kew)              57,100           15.4%   1.94
NY (NY Bot. Garden)                    34,200           34.8%   2.43
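Assuming "% duplication" is the share of source records with at least one GBIF match and "multiplicity" the average number of copies found per matched record (my reading of the table, not a stated definition), the two statistics could be derived like this:

```python
def duplication_stats(match_counts):
    """match_counts: for each source record, how many GBIF duplicates
    were found. Returns (% duplication, average multiplicity)."""
    matched = [n for n in match_counts if n > 0]
    pct = 100.0 * len(matched) / len(match_counts)
    avg_multiplicity = sum(matched) / len(matched) if matched else 0.0
    return pct, avg_multiplicity
```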

  12. What is the level of duplication out there?
• Why is matching not any higher? Matches are often missed because datasets from some collections lack data for key fields, such as:
  • Collector name (US)
  • Collector's # (MNHN/P, SANBI, NHN)
  • Collection date (DUKE, HBG)
  • Some attach the coll. # to the coll. name (LD)
• However, multiple copies exist (sometimes > 10) for many specimens, including intra-collection duplication: an average of 1.4 to 2.6 duplicates across target records

  13. Maximizing SGR functionality
• Even in the absence of full matches (e.g., because collector field #s differ), essential data can still be harvested from records with identical localities (including georeferences); see the sketch below
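One way such a fallback could work, sketched under the assumption that records by the same collector on the same date usually share a locality (the relaxed key choice is illustrative, not the published SGR behavior):

```python
def harvest_locality(source_rec, index_by_collector_date):
    """Fallback when the full four-field match fails (e.g. collector
    numbers differ): reuse the locality and georeference from another
    record by the same collector on the same date."""
    key = (source_rec.get("recordedBy"), source_rec.get("eventDate"))
    for cand in index_by_collector_date.get(key, []):
        if cand.get("decimalLatitude") and cand.get("decimalLongitude"):
            return {f: cand.get(f) for f in
                    ("locality", "decimalLatitude", "decimalLongitude")}
    return None
```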

  14. Maximizing SGR functionality
• The matching algorithm will improve and become smarter (e.g. by running consecutive analyses with different algorithms to minimize missed matches and false matches, respectively)
• Data quality of what goes into GBIF will improve (maximizing fully DwC-compliant data on key fields)

  15. Maximizing SGR functionality
A caveat:
• In order to benefit from SGR, collections need to be pre-catalogued first: generate a skeletal database that includes 4-6 minimal fields to act as a source for SGR (you still want to know, roughly, what you have in your collection); a minimal example follows below
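A pre-catalogue can be as small as one CSV row per specimen. A minimal sketch, assuming the five fields below are the ones chosen (the slide allows 4-6, and the exact field choice is an assumption):

```python
import csv

# Minimal skeletal fields: an identifier plus the four SGR match
# fields, expressed as Darwin Core terms (field choice is illustrative).
SKELETAL_FIELDS = ["catalogNumber", "recordedBy", "recordNumber",
                   "eventDate", "scientificName"]

def write_skeletal_catalogue(records, path):
    """Write the pre-catalogue as CSV, one row per specimen; extra
    keys in the input dicts are ignored."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=SKELETAL_FIELDS,
                                extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```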

  16. Maximizing SGR functionality
• The main contribution of SGR is not just that it helps populate records in your collection
• Its real contribution to science is that it allows you to identify specimens that have no duplicates anywhere else, so that resources can be prioritized toward populating those records and incorporating them into existing metadata

  17. Thank you
And thanks to:
• National Science Foundation – BRC
• Biodiversity Institute, University of Kansas
• MNH University of Michigan Herbarium
