
Tabular Data Extraction: Epidemiology Table Classification and Factor Alignment



  1. Tabular Data Extraction: Epidemiology Table Classification and Factor Alignment ● Garrick Sherman

  2. Last semester... ● Worked with Dr. Andrew Leakey ○ Plant biologist ○ The effects of carbon dioxide on photosynthesis ● Data is “locked away” ● Goals ○ Extract data from articles ○ Keep data associated with articles ○ Add structure to data

  3. Last semester... ● Look for a set of search terms ● Parse HTML tables into CSV files ○ Also extract table captions ● Identify columns, “subtables,” and captions about the search terms
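
The column-identification step on this slide could be sketched roughly as follows. This is only an illustration, not the original code: it assumes a table has already been parsed into rows of strings, and the helper names (`matching_columns`, `extract_columns`), the substring-matching rule, and the sample table are all hypothetical.

```ruby
# Hypothetical helper: return the indices of columns whose header
# mentions any of the search terms (case-insensitive substring match).
def matching_columns(rows, search_terms)
  header = rows.first || []
  terms = search_terms.map(&:downcase)
  header.each_index.select do |i|
    cell = header[i].to_s.downcase
    terms.any? { |t| cell.include?(t) }
  end
end

# Hypothetical helper: keep only the selected columns from every row.
def extract_columns(rows, indices)
  rows.map { |row| row.values_at(*indices) }
end

# Made-up table for illustration only; not data from the project.
table = [
  ["Species", "CO2 (ppm)", "Photosynthesis rate", "Notes"],
  ["Soybean", "550", "24.1", "field"],
  ["Maize", "550", "31.7", "chamber"]
]

cols = matching_columns(table, ["co2", "photosynthesis"])
subset = extract_columns(table, cols)
```

Matching on the header row alone is the simplest possible rule; a real implementation would likely also need to look at captions and row labels, as the slide suggests.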

  4. Last semester... ● Column-based ○ 53 columns extracted ○ Recall: 0.1130 or 0.6279 ○ Precision: 0.3774 or 0.5094 ● Subtables ○ 23 extracted ○ Recall: 0.1356 ○ Precision: 0.3158
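
For reference, recall and precision figures like those above are conventionally computed from the overlap between what was extracted and what was judged relevant. The sketch below assumes that standard definition; the function name and the sample identifiers are hypothetical, and the slide does not say how its two alternative values were obtained.

```ruby
# Standard definitions:
#   precision = |extracted AND relevant| / |extracted|
#   recall    = |extracted AND relevant| / |relevant|
def precision_recall(extracted, relevant)
  true_positives = (extracted & relevant).size
  precision = extracted.empty? ? 0.0 : true_positives.to_f / extracted.size
  recall = relevant.empty? ? 0.0 : true_positives.to_f / relevant.size
  { precision: precision, recall: recall }
end

# Made-up column identifiers for illustration only.
extracted = %w[c1 c2 c3 c4]
relevant = %w[c2 c4 c5 c6 c7]
scores = precision_recall(extracted, relevant)
```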

  5. This semester... ● Epidemiology journals ○ 11 high impact breast cancer journals ■ e.g. British Medical Journal, Cancer, International Journal of Breast Cancer, etc. ● Classify table as containing summary sample characteristics ● Align factors ○ e.g. “Marital Status” and “Married”

  6. {"Age at diagnosis (years):"=> {"<40"=>["9 (146)", "5 (12)"], "40-49"=>["26 (437)", "29 (71)"], "50-59"=>["37 (631)", "39 (94)"], "≥60"=>["29 (500)", "27 (66)"]}, "Marital status:"=> {"Living with partner"=>["79 (180)"], "Living alone"=>["21 (48)"]}, "Metropolitan classification:"=> {"Metropolitan area"=>["59 (1001)", "49 (119)"], "Non-metropolitan area"=>["41 (711)", "51 (124)"]}, ….
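
A nested hash of this shape (factor, then option, then cell values) is straightforward to walk. The sketch below copies two factors from the slide and shows one way to list factor names with their options and to split a cell into its parts. It assumes each cell reads "percent (count)", which the slide does not state; `parse_cell` is a hypothetical helper.

```ruby
# Two factors copied from the slide above.
table = {
  "Marital status:" => {
    "Living with partner" => ["79 (180)"],
    "Living alone" => ["21 (48)"]
  },
  "Metropolitan classification:" => {
    "Metropolitan area" => ["59 (1001)", "49 (119)"],
    "Non-metropolitan area" => ["41 (711)", "51 (124)"]
  }
}

# Assumption: a cell looks like "percent (count)". Returns nil otherwise.
def parse_cell(cell)
  match = cell.match(/\A\s*(\d+(?:\.\d+)?)\s*\((\d+)\)\s*\z/)
  return nil unless match
  { percent: match[1].to_f, count: match[2].to_i }
end

# Collect each factor name (trailing colon removed) with its option labels.
factors = table.map do |factor, options|
  { name: factor.sub(/:\s*\z/, ""), options: options.keys }
end

first_cell = parse_cell(table["Marital status:"]["Living with partner"].first)
```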

  7. This semester... ● Goal: ○ Automated metadata extraction ○ Faceted search ■ Find studies of related populations

  8. Dataset ● First table ○ ~1,500 first tables ○ Train: 1,001 ○ Test: 497 ● NXML format ● Fresh codebase ○ But same table parsing approaches

  9. Training ● Manual annotation ○ Classify based on first 10 lines (or more, if needed) and caption ○ Final tally: ■ 41.36% sample characteristics ■ 58.64% other ● Would certainly be improved with domain expertise

  10. Classification ● Information gain ○ Tokens from factor and options
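
Information gain for a token can be computed as the drop in label entropy after splitting the tables by whether the token is present. The sketch below assumes binary token presence and a simple lowercase word tokenizer. The slide does not give these details, so the function names, the tokenizer, and the toy examples are all hypothetical.

```ruby
# Shannon entropy (in bits) of a list of class labels.
def entropy(labels)
  total = labels.size.to_f
  return 0.0 if total.zero?
  labels.tally.values.sum do |count|
    prob = count / total
    -prob * Math.log2(prob)
  end
end

# Assumed tokenizer: lowercase alphabetic runs.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Information gain of splitting docs on presence of a token.
def information_gain(docs, token)
  present, absent = docs.partition { |d| d[:tokens].include?(token) }
  remainder = [present, absent].sum do |part|
    (part.size.to_f / docs.size) * entropy(part.map { |d| d[:label] })
  end
  entropy(docs.map { |d| d[:label] }) - remainder
end

# Toy examples for illustration only; not the annotated training set.
docs = [
  { tokens: tokenize("Marital status Married Single"), label: :characteristics },
  { tokens: tokenize("Age at diagnosis years"), label: :characteristics },
  { tokens: tokenize("Hazard ratio CI"), label: :other },
  { tokens: tokenize("Odds ratio adjusted"), label: :other }
]

vocabulary = docs.flat_map { |d| d[:tokens] }.uniq
ranked = vocabulary.sort_by { |t| -information_gain(docs, t) }
```

Tokens at the front of `ranked` separate the two classes best on this toy data and would be the natural candidates to keep as classifier features.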

  11. Classification ● Test results: ○ 177 predicted positive ■ Random sample of 50 ■ Precision: 85.71% ○ 300 predicted negative ■ Random sample of 50 ■ Precision: 76.00%

  12. Factor Alignment ● Alignment approaches ○ Literal ○ Percentage-based ○ Name-inclusive ● Evaluation ○ Choose 10 randomly, calculate precision ○ Report average precision ○ Has some drawbacks

  13. Factor Alignment ● #1 (N = 20) ○ Histology ■ Ductal, Lobular, Other ○ Morphological type ■ Ductal, Lobular, Other, Unknown ○ Histological type ■ Ductal, Lobular, Other, NA ● #2 ○ Histological type ■ Ductal, Lobular, Ductulolobular, Medullary ○ Histology ■ Ductal, Lobular, Medullary
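
The slides name three alignment approaches (literal, percentage-based, name-inclusive) without defining them. The sketch below is one plausible reading, offered purely as an assumption: literal means identical option sets, percentage-based means the shared fraction of options reaches a threshold, and name-inclusive adds the words of the factor name to the compared set. The threshold of 0.6 and all function names are hypothetical; the factor names and options come from the slide above.

```ruby
require "set"

# Lowercase and trim labels so comparison ignores case and spacing.
def normalize(labels)
  labels.map { |label| label.strip.downcase }.to_set
end

# Shared items as a fraction of the larger set.
def overlap(a_set, b_set)
  return 0.0 if a_set.empty? || b_set.empty?
  (a_set & b_set).size.to_f / [a_set.size, b_set.size].max
end

# Assumed "literal" rule: option sets must be identical.
def literal_match?(a, b)
  normalize(a[:options]) == normalize(b[:options])
end

# Assumed "percentage-based" rule: enough options are shared.
def percentage_match?(a, b, threshold = 0.6)
  overlap(normalize(a[:options]), normalize(b[:options])) >= threshold
end

# Assumed "name-inclusive" rule: factor-name words count as items too.
def name_inclusive_match?(a, b, threshold = 0.6)
  a_set = normalize(a[:options] + a[:name].split)
  b_set = normalize(b[:options] + b[:name].split)
  overlap(a_set, b_set) >= threshold
end

histology = { name: "Histology", options: %w[Ductal Lobular Other] }
morphological = { name: "Morphological type", options: %w[Ductal Lobular Other Unknown] }
histological = { name: "Histological type", options: %w[Ductal Lobular Other NA] }

literal = literal_match?(histology, morphological)
by_percentage = percentage_match?(histology, morphological)
by_name = name_inclusive_match?(histological, morphological)
```

Under these assumed rules, "Histology" and "Morphological type" fail the literal test but share three of four options, and the shared word "type" helps pair "Histological type" with "Morphological type".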

  14. Factor Alignment ● Results ○ Literal: 0.9167 ○ Percentage: 0.8624 ○ Name-based: 0.9500

  15. Conclusion ● The Naive Bayes classifier works well because the features are largely independent ● Simple methods of factor alignment are effective ● Automated approaches can help resolve table structure and contents ● Potential applications for faceted search
