Tabular Data Extraction: Epidemiology Table Classification and Factor Alignment, by Garrick Sherman
Last semester... ● Worked with Dr. Andrew Leakey ○ Plant biologist ○ The effects of carbon dioxide on photosynthesis ● Data is “locked away” ● Goals ○ Extract data from articles ○ Keep data associated with articles ○ Add structure to data
Last semester... ● Look for a set of search terms ● Parse HTML tables into CSV files ○ Also extract table captions ● Identify columns, “subtables,” and captions about the search terms
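The HTML-to-CSV step above can be sketched with Python's standard library alone. This is a minimal illustration, not the project's actual parser: the class name `TableToRows`, the function `html_table_to_csv`, and the sample table (including its values) are hypothetical, and nested tables or `rowspan`/`colspan` cells are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of every <tr> plus the text of any <caption>."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows, each a list of cell strings
        self.caption = ""         # caption text, if the table has one
        self._row = None          # row currently being built
        self._cell = None         # text fragments of the current cell
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "caption":
            self._in_caption = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
        elif self._in_caption:
            self.caption += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell is a single clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "caption":
            self._in_caption = False


def html_table_to_csv(html):
    """Return (caption, csv_text) for the table markup in `html`."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return parser.caption.strip(), out.getvalue()


# Made-up sample table, for illustration only.
sample = """
<table>
  <caption>Table 1. Example measurements</caption>
  <tr><th>Species</th><th>CO2 (ppm)</th><th>Rate</th></tr>
  <tr><td>Soybean</td><td>550</td><td>24.1</td></tr>
</table>
"""
caption, csv_text = html_table_to_csv(sample)
```

Keeping the caption alongside the rows matters here because the later steps search captions as well as columns for the search terms.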
Last semester... ● Column-based ○ 53 columns extracted ○ Recall: 0.1130 or 0.6279 ○ Precision: 0.3774 or 0.5094 ● Subtables ○ 23 extracted ○ Recall: 0.1356 ○ Precision: 0.3158
This semester... ● Epidemiology journals ○ 11 high-impact breast cancer journals ■ e.g. British Medical Journal, Cancer, International Journal of Breast Cancer ● Classify each table as containing summary sample characteristics or not ● Align factors ○ e.g. “Marital Status” and “Married”
{"Age at diagnosis (years):"=> {"<40"=>["9 (146)", "5 (12)"], "40-49"=>["26 (437)", "29 (71)"], "50-59"=>["37 (631)", "39 (94)"], "≥60"=>["29 (500)", "27 (66)"]}, "Marital status:"=> {"Living with partner"=>["79 (180)"], "Living alone"=>["21 (48)"]}, "Metropolitan classification:"=> {"Metropolitan area"=>["59 (1001)", "49 (119)"], "Non-metropolitan area"=>["41 (711)", "51 (124)"]}, ….
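One way a nested structure like the one above could be built from parsed table rows is sketched below. It rests on an assumption that is not stated in the slides: a row whose value cells are all empty starts a new factor, and the rows that follow are that factor's options. The function name `group_rows_by_factor` is hypothetical, and the column positions of the values in the sample rows are guessed.

```python
def group_rows_by_factor(rows):
    """Group table rows into {factor: {option: [cell values]}}.

    Assumption (not from the slides): a row with no non-empty value cells
    is a factor heading; subsequent rows are options of that factor.
    """
    factors = {}
    current = None
    for row in rows:
        if not row:
            continue  # skip completely empty rows
        label, *values = row
        values = [v for v in values if v.strip()]  # drop empty cells
        if not values:
            current = factors.setdefault(label.strip(), {})
        elif current is not None:
            current[label.strip()] = values
    return factors


# Rows reusing values from the example structure; cell positions are assumed.
rows = [
    ["Age at diagnosis (years):", "", ""],
    ["<40", "9 (146)", "5 (12)"],
    ["40-49", "26 (437)", "29 (71)"],
    ["Marital status:", "", ""],
    ["Living with partner", "", "79 (180)"],
    ["Living alone", "", "21 (48)"],
]
table = group_rows_by_factor(rows)
```

Dropping empty cells is what yields single-element lists for options reported in only one column, as with the marital status entries.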
This semester... ● Goal: ○ Automated metadata extraction ○ Faceted search ■ Find studies of related populations
Dataset ● First table ○ ~1,500 first tables ○ Train: 1,001 ○ Test: 497 ● NXML format ● Fresh codebase ○ But same table parsing approaches
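Pulling the first table out of an NXML article might look like the sketch below. It assumes the JATS-style `<table-wrap>` element with an optional `<caption>` child, which is how NXML articles normally mark up tables; the function name `first_table` and the sample document are hypothetical. Real files that declare a DOCTYPE with custom entities may need extra handling before `ElementTree` will parse them.

```python
import xml.etree.ElementTree as ET


def _text(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def first_table(nxml_text):
    """Return (caption, rows) for the first <table-wrap>, or None if absent."""
    root = ET.fromstring(nxml_text)
    wrap = root.find(".//table-wrap")
    if wrap is None:
        return None
    caption_el = wrap.find("caption")
    caption = _text(caption_el) if caption_el is not None else ""
    rows = []
    for tr in wrap.iter("tr"):
        # Only direct td/th children count as cells of this row.
        rows.append([_text(cell) for cell in tr if cell.tag in ("td", "th")])
    return caption, rows


# Made-up article fragment, for illustration only.
sample_nxml = """
<article><body><sec>
  <table-wrap id="T1">
    <label>Table 1</label>
    <caption><p>Characteristics of the study sample</p></caption>
    <table>
      <thead><tr><th>Characteristic</th><th>Cases</th></tr></thead>
      <tbody><tr><td>Age &lt;40</td><td>9 (146)</td></tr></tbody>
    </table>
  </table-wrap>
</sec></body></article>
"""
caption, rows = first_table(sample_nxml)
```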
Training ● Manual annotation ○ Classify based on first 10 lines (or more, if needed) and caption ○ Final tally: ■ 41.36% sample characteristics ■ 58.64% other ● Would certainly be improved with domain expertise
Classification ● Information gain ○ Computed over tokens from factor names and their options
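Information gain for a single token can be computed as the class entropy minus the expected class entropy after splitting tables on whether the token is present. The sketch below shows that calculation on a toy example; the token sets and the labels `"sample"` and `"other"` are made up for illustration and are not drawn from the annotated dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, token):
    """H(class) minus the expected H(class) after splitting on token presence."""
    with_token = [y for d, y in zip(docs, labels) if token in d]
    without = [y for d, y in zip(docs, labels) if token not in d]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_token, without)
        if part  # an empty side contributes nothing
    )
    return entropy(labels) - remainder


# Toy tables represented as sets of tokens (hypothetical data).
docs = [
    {"age", "marital", "status"},
    {"age", "education"},
    {"hazard", "ratio", "age"},
    {"odds", "ratio"},
]
labels = ["sample", "sample", "other", "other"]

gain_ratio = information_gain(docs, labels, "ratio")  # splits the classes perfectly
gain_age = information_gain(docs, labels, "age")      # appears in both classes
```

A token that separates the classes cleanly scores the full class entropy, while a token spread across both classes scores much lower, which is why ranking tokens by this value is a common way to pick features.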
Classification ● Test results: ○ 177 predicted positive ■ Random sample of 50 ■ Precision: 85.71% ○ 300 predicted negative ■ Random sample of 50 ■ Precision on the negative class (negative predictive value): 76.00%
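The conclusion of this deck names Naive Bayes as the classifier. A minimal sketch of a multinomial Naive Bayes model over table tokens is shown below, assuming add-one smoothing and a bag-of-tokens representation; neither detail is confirmed by the slides, and the class `NaiveBayes` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # tables per class
        self.token_counts = defaultdict(Counter)   # token frequencies per class
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training tables (hypothetical tokens and labels).
train_docs = [
    ["age", "marital", "status"],
    ["age", "education", "smoking"],
    ["hazard", "ratio", "interval"],
    ["odds", "ratio", "interval"],
]
train_labels = ["sample", "sample", "other", "other"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["marital", "status", "education"])
```

Working in log space avoids underflow when a table contributes many tokens, and smoothing keeps a single unseen token from zeroing out a class.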
Factor Alignment ● Alignment approaches ○ Literal ○ Percentage-based ○ Name-inclusive ● Evaluation ○ Choose 10 randomly, calculate precision ○ Report average precision ○ Has some drawbacks
Factor Alignment ● #1 (N = 20) ○ Histology ■ Ductal, Lobular, Other ○ Morphological type ■ Ductal, Lobular, Other, Unknown ○ Histological type ■ Ductal, Lobular, Other, NA ● #2 ○ Histological type ■ Ductal, Lobular, Ductulolobular, Medullary ○ Histology ■ Ductal, Lobular, Medullary
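The slides name three alignment approaches (literal, percentage-based, name-inclusive) without defining them. The sketch below is one plausible reading, not the author's implementation: literal matching requires identical option sets, percentage-based matching requires a minimum overlap between option sets, and name-inclusive matching also counts the words of the factor name. All function names and the default `threshold` of 0.75 are assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def literal_match(options_a, options_b):
    """Align only when the two option sets are identical after normalization."""
    return {normalize(o) for o in options_a} == {normalize(o) for o in options_b}


def overlap(items_a, items_b):
    """Fraction of the smaller set that also appears in the other set (0 to 1)."""
    a = {normalize(i) for i in items_a}
    b = {normalize(i) for i in items_b}
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def percentage_match(options_a, options_b, threshold=0.75):
    """Align when enough of the options are shared (threshold is assumed)."""
    return overlap(options_a, options_b) >= threshold


def name_inclusive_match(name_a, options_a, name_b, options_b, threshold=0.75):
    """Like percentage_match, but words of the factor name count as extra items."""
    a = list(options_a) + normalize(name_a).split()
    b = list(options_b) + normalize(name_b).split()
    return overlap(a, b) >= threshold


# Option sets taken from the histology example in this deck.
histology = ["Ductal", "Lobular", "Other"]
morphological_type = ["Ductal", "Lobular", "Other", "Unknown"]

is_literal = literal_match(histology, morphological_type)
is_percentage = percentage_match(histology, morphological_type)
```

Under this reading, the two histology factors fail the literal test because one has an extra "Unknown" option, but pass the percentage-based test because every option of the smaller set appears in the larger one.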
Factor Alignment ● Results (average precision) ○ Literal: 0.9167 ○ Percentage-based: 0.8624 ○ Name-inclusive: 0.9500
Conclusion ● Naive Bayes classifier works well, plausibly because the token features are close to conditionally independent given the class ● Simple methods of factor alignment are effective ● Automated approaches can help resolve table structure and contents ● Potential applications for faceted search