Tabular Data Extraction Epidemiology Table Classification and Factor Alignment Garrick Sherman
Last semester... ● Worked with Dr. Andrew Leakey ○ Plant biologist ○ The effects of carbon dioxide on photosynthesis ● Data is “locked away” ● Goals ○ Extract data from articles ○ Keep data associated with articles ○ Add structure to data
Last semester... ● Look for a set of search terms ● Parse HTML tables into CSV files ○ Also extract table captions ● Identify columns, “subtables,” and captions about the search terms
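The table-to-CSV step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code from the project; the names `TableToRows`, `parse_table`, `rows_to_csv`, and `columns_matching` are hypothetical, and matching search terms against header cells by substring is an assumption about how columns were identified.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the caption and cell text of the first <table> in a page."""

    def __init__(self):
        super().__init__()
        self.caption = ""
        self.rows = []
        self._row = None          # cells of the row being read, or None
        self._cell = None         # text fragments of the cell being read, or None
        self._in_caption = False
        self._done = False        # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "caption":
            self._in_caption = True
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._done:
            return
        if self._in_caption:
            self.caption += data
        elif self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "caption":
            self._in_caption = False
        elif tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._done = True


def parse_table(html):
    """Return (caption, rows) for the first table in an HTML string."""
    parser = TableToRows()
    parser.feed(html)
    return parser.caption.strip(), parser.rows


def rows_to_csv(rows):
    """Serialise rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def columns_matching(rows, search_terms):
    """Indices of columns whose header cell contains any search term (assumed rule)."""
    header = rows[0] if rows else []
    terms = [t.lower() for t in search_terms]
    return [i for i, h in enumerate(header) if any(t in h.lower() for t in terms)]
```

Keeping the caption alongside the rows preserves the link between the extracted data and the article it came from, which is one of the stated goals.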
Last semester... ● Column-based ○ 53 columns extracted ○ Recall: 0.1130 or 0.6279 ○ Precision: 0.3774 or 0.5094 ● Subtables ○ 23 extracted ○ Recall: 0.1356 ○ Precision: 0.3158
This semester... ● Epidemiology journals ○ 11 high impact breast cancer journals ■ e.g. British Medical Journal, Cancer, International Journal of Breast Cancer, etc. ● Classify table as containing summary sample characteristics ● Align factors ○ e.g. “Marital Status” and “Married”
{"Age at diagnosis (years):"=> {"<40"=>["9 (146)", "5 (12)"], "40-49"=>["26 (437)", "29 (71)"], "50-59"=>["37 (631)", "39 (94)"], "≥60"=>["29 (500)", "27 (66)"]}, "Marital status:"=> {"Living with partner"=>["79 (180)"], "Living alone"=>["21 (48)"]}, "Metropolitan classification:"=> {"Metropolitan area"=>["59 (1001)", "49 (119)"], "Non-metropolitan area"=>["41 (711)", "51 (124)"]}, ….
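The cell strings in the structure above appear to follow a "percentage (count)" pattern, since the first numbers in each column of a factor sum to roughly 100. Under that assumption, a small Python sketch can turn the nested mapping into numeric pairs. The names `parse_cell` and `parse_factors` are hypothetical, and the deck does not say how the original code handled these values.

```python
import re

# Assumed cell layout: a percentage followed by a count in parentheses, e.g. "9 (146)".
CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(\s*(\d+)\s*\)\s*$")


def parse_cell(cell):
    """Return (percentage, count), or None if the cell does not fit the assumed layout."""
    match = CELL.match(cell)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def parse_factors(table):
    """Map factor -> option -> list of parsed cells, dropping the trailing colon on factor names."""
    return {
        factor.rstrip(":").strip(): {
            option: [parse_cell(cell) for cell in cells]
            for option, cells in options.items()
        }
        for factor, options in table.items()
    }


# Two entries copied from the slide.
sample = {
    "Marital status:": {
        "Living with partner": ["79 (180)"],
        "Living alone": ["21 (48)"],
    },
    "Metropolitan classification:": {
        "Metropolitan area": ["59 (1001)", "49 (119)"],
        "Non-metropolitan area": ["41 (711)", "51 (124)"],
    },
}
parsed = parse_factors(sample)
```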
This semester... ● Goal: ○ Automated metadata extraction ○ Faceted search ■ Find studies of related populations
Dataset ● First table ○ ~1,500 first tables ○ Train: 1,001 ○ Test: 497 ● NXML format ● Fresh codebase ○ But same table parsing approaches
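Pulling the first table out of an NXML article might look like the sketch below. It assumes the usual JATS-style layout, in which each table sits inside a `table-wrap` element with an optional `caption` child; the function name `first_table` is hypothetical and this is not the project's parser. Real NXML files may declare entities that `xml.etree.ElementTree` cannot resolve, so treat it as a starting point only.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def first_table(nxml_text):
    """Return (caption, rows) for the first table-wrap in the article, or None if there is none."""
    root = ET.fromstring(nxml_text)
    wrap = root.find(".//table-wrap")
    if wrap is None:
        return None
    caption_el = wrap.find("caption")
    caption = _text(caption_el) if caption_el is not None else ""
    rows = []
    for tr in wrap.iter("tr"):
        # Header and body cells are treated alike here.
        rows.append([_text(cell) for cell in tr if cell.tag in ("th", "td")])
    return caption, rows
```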
Training ● Manual annotation ○ Classify based on first 10 lines (or more, if needed) and caption ○ Final tally: ■ 41.36% sample characteristics ■ 58.64% other ● Would certainly be improved with domain expertise
Classification ● Information gain ○ Tokens from factor and options
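Information gain for a token can be computed as the drop in label entropy once tables are split by whether they contain that token. The sketch below assumes each table is represented as a set of tokens drawn from its factor names and option labels, with a binary label (sample characteristics or other); the function names are hypothetical and the deck does not describe the exact feature pipeline.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, token):
    """Entropy of the labels minus their entropy after splitting on presence of the token."""
    with_token = [y for d, y in zip(docs, labels) if token in d]
    without_token = [y for d, y in zip(docs, labels) if token not in d]
    total = len(labels)
    conditional = (
        (len(with_token) / total) * entropy(with_token)
        + (len(without_token) / total) * entropy(without_token)
    )
    return entropy(labels) - conditional


def rank_tokens(docs, labels):
    """Vocabulary sorted by decreasing information gain, ties broken alphabetically."""
    vocabulary = set().union(*docs)
    return sorted(vocabulary, key=lambda t: (-information_gain(docs, labels, t), t))
```

Tokens at the top of the ranking are the ones that best separate the two classes, so a classifier can be restricted to them.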
Classification ● Test results: ○ 177 predicted positive ■ Random sample of 50 ■ Precision: 85.71% ○ 300 predicted negative ■ Random sample of 50 ■ Precision: 76.00%
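The precision figures above are estimates from hand-checked samples rather than from the full test set. A minimal sketch of that arithmetic, with the hypothetical helper `sample_precision`, is:

```python
def sample_precision(judgements):
    """Fraction of sampled predictions that a reviewer judged correct."""
    if not judgements:
        raise ValueError("empty sample")
    return sum(judgements) / len(judgements)


# 76.00% on a sample of 50 corresponds to 38 correct predictions.
negative_sample = [True] * 38 + [False] * 12
print(f"{sample_precision(negative_sample):.2%}")  # 76.00%
```

By the same arithmetic, 85.71% is not a whole number of tables out of 50. It equals 6/7 (for example 42 of 49), so the effective sample for the positive class may have been smaller than 50; the slide does not say.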
Factor Alignment ● Alignment approaches ○ Literal ○ Percentage-based ○ Name-inclusive ● Evaluation ○ Choose 10 randomly, calculate precision ○ Report average precision ○ Has some drawbacks
Factor Alignment ● #1 ○ Histology (N = 20) ■ Ductal, Lobular, Other ○ Morphological type ■ Ductal, Lobular, Other, Unknown ○ Histological type ■ Ductal, Lobular, Other, NA ● #2 ○ Histological type ■ Ductal, Lobular, Ductulolobular, Medullary ○ Histology ■ Ductal, Lobular, Medullary
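The deck names three alignment approaches but does not define them, so the sketch below rests on guesses: "literal" is taken to mean identical option sets, "percentage-based" a Jaccard overlap of option sets at or above a threshold, and "name-inclusive" the same overlap with the words of the factor name added. The function names and the 0.6 threshold are assumptions for illustration only.

```python
def normalise(items):
    """Lower-case and strip each label so comparisons ignore case and padding."""
    return {item.strip().lower() for item in items}


def literal_match(options_a, options_b):
    """Assumed 'literal' rule: the two option sets are identical."""
    return normalise(options_a) == normalise(options_b)


def overlap(items_a, items_b):
    """Jaccard overlap between two collections of labels."""
    a, b = normalise(items_a), normalise(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def percentage_match(options_a, options_b, threshold=0.6):
    """Assumed 'percentage-based' rule: overlap of options meets the threshold."""
    return overlap(options_a, options_b) >= threshold


def name_inclusive_match(name_a, options_a, name_b, options_b, threshold=0.6):
    """Assumed 'name-inclusive' rule: factor-name words count alongside the options."""
    tokens_a = list(options_a) + name_a.split()
    tokens_b = list(options_b) + name_b.split()
    return overlap(tokens_a, tokens_b) >= threshold


# Factors from group #1 on the slide.
histology = ["Ductal", "Lobular", "Other"]
morphological = ["Ductal", "Lobular", "Other", "Unknown"]
print(literal_match(histology, morphological))     # False
print(percentage_match(histology, morphological))  # True
```

Under these assumptions the two factors fail the literal test but share three of four distinct options, which is enough for the overlap rule to align them.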
Factor Alignment ● Results (average precision) ○ Literal: 0.9167 ○ Percentage-based: 0.8624 ○ Name-inclusive: 0.9500
Conclusion ● Naive Bayes classifier works well, plausibly because the token features are close to independent ● Simple methods of factor alignment are effective ● Automated approaches can help resolve table structure and contents ● Potential applications for faceted search