Text Mining in Clinical Domain: Dealing with Noise Authors: Hoang Nguyen, Jon Patrick Source: KDD’16 Advisor: Jia-Ling Koh Speaker: Avon Yu Date: 2018/12/4
Outline • Introduction • Method • Experiment • Conclusion
Introduction • Motivation • High level of noise in clinical corpora. • Unknown words (e.g. misspellings, acronyms, abbreviations) • Non-words (clinical scores and measures, e.g. BP140/65, HR 72…) • Ungrammatical sentences • Labelled data is costly and still often contains errors and inconsistencies. • Imbalanced data distribution.
Introduction • Goal • Introduce a general clinical data mining architecture with the potential to address these challenges using: • Pre-processing system (proof-reading) • Interactive model development • Active learning
Introduction • Framework
Outline • Introduction • Method • Experiment • Conclusion
Method • Standardisation • Ring-fencing tokeniser • A Finite State Recognizer (FSR) uses training examples to recognize token patterns constituting a score or measurement that requires standardisation.
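The idea of ring-fencing scores and measures can be illustrated with a small sketch. It assumes hand-written regular expressions in place of the trained Finite State Recognizer, and the two patterns and the canonical `NAME=value` output form are hypothetical choices made for illustration only.

```python
import re

# Hypothetical patterns and output templates. The recognizer described in the
# talk is learned from training examples; these hand-written rules only mimic it.
MEASURE_PATTERNS = [
    (re.compile(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE), r"BP=\1/\2"),
    (re.compile(r"\bHR\s*(\d{2,3})\b", re.IGNORECASE), r"HR=\1"),
]


def standardise(text):
    """Rewrite each recognised score or measure into one canonical form."""
    for pattern, template in MEASURE_PATTERNS:
        text = pattern.sub(template, text)
    return text


def tokenise(text):
    """Split on whitespace after standardising, so each measure stays one token."""
    return standardise(text).split()
```

With these assumed rules, a measure such as `HR 72` is kept together as a single token instead of being split into a word and a number.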
Method • Normalisation & Clinical Concept Recognition • The Lexicon Management System (LMS) stores the accumulated lexical knowledge and contains categorizations of spelling errors, acronyms and non-word tokens. • Dictionaries for English and medical terms
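A lexicon lookup of this kind might be sketched as follows. The entries, the category names and the `normalise` function are all invented for illustration; the actual Lexicon Management System is not described in enough detail here to reproduce.

```python
# Toy lexicon: every entry and category label below is an assumption.
LEXICON = {
    "pt": ("abbreviation", "patient"),
    "hx": ("abbreviation", "history"),
    "ct": ("acronym", "computed tomography"),
    "carcinmoa": ("misspelling", "carcinoma"),
}

# Stand-in for the English and medical dictionaries.
KNOWN_WORDS = {"patient", "history", "carcinoma", "of", "the", "lung"}


def normalise(token):
    """Return (normalised form, category) for a single token."""
    key = token.lower()
    if key in LEXICON:
        category, canonical = LEXICON[key]
        return canonical, category
    if key in KNOWN_WORDS:
        return key, "known"
    # Tokens found nowhere are left as they are and flagged for review.
    return token, "unknown"
```

Tokens that fall through to `"unknown"` are the ones a human would need to categorise before they can be added to the lexicon.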
Method • Iterative Model Development • The model is evaluated and the algorithm is revised in a feedback process to produce a more accurate result.
Method • Iterative Model Development • Feature selection: • Bag of words (BOW) • Proof-reading • Ring-fencing • Lemma • Medical term and gazetteer • Bag of tags (BOT) • Context feature • Negation and modality
Method • Iterative Model Development • The new model is delivered to the Visual Annotator (VA), where manual correction is performed with the support of an annotation validation tool.
Method • Active Learning • The learner queries the most informative instances to retrain the model instead of making a random selection.
Method • Active Learning • Pool-based active learning
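A generic pool-based loop can be sketched as below. The function name and its callback parameters (`oracle`, `train`, `query`) are assumptions introduced for this example; any query strategy can be plugged in through `query`.

```python
def pool_based_active_learning(labelled, pool, oracle, train, query, rounds):
    """Generic pool-based active learning loop (illustrative sketch).

    labelled: list of (x, y) pairs used to train the first model.
    pool:     list of unlabelled instances x.
    oracle:   callable x -> label, standing in for the human annotator.
    train:    callable list of (x, y) -> model.
    query:    callable (model, pool) -> index of the instance to label next.
    """
    labelled = list(labelled)
    pool = list(pool)
    model = train(labelled)
    for _ in range(rounds):
        if not pool:
            break
        index = query(model, pool)         # pick the most informative instance
        x = pool.pop(index)                # remove it from the unlabelled pool
        labelled.append((x, oracle(x)))    # ask the annotator for its label
        model = train(labelled)            # retrain on the enlarged set
    return model, labelled
```

Keeping the query strategy as a parameter means the same loop serves every strategy; only the `query` callback changes.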
Method • Active Learning • Simple AL • Data within the margin is less imbalanced than the entire data set.
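Assuming "Simple" refers to the margin-based strategy that queries the unlabelled instance nearest the SVM decision boundary, the selection step might look like the sketch below. The linear model given by `weights` and `bias` is a stand-in for a trained SVM, and both function names are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score of x under a linear model w.x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def simple_query(weights, bias, pool):
    """Index of the pool instance closest to the decision boundary."""
    return min(
        range(len(pool)),
        key=lambda i: abs(decision_value(weights, bias, pool[i])),
    )
```

Because the query always comes from the region around the boundary, the instances sent for labelling are drawn from the part of the data that the slide describes as less imbalanced.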
Method • Active Learning • Self Confident • Chooses the next example to be labelled so that, when it is added to the training data, the future generalization error probability is minimized. • Log-loss function:
Method • Active Learning • Kernel Farthest-First • The most informative instance is the instance in the unseen pool that is farthest from the current training set.
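A farthest-first query in kernel space can be sketched as follows. It assumes that the distance from an instance to the training set is the minimum distance to any training point, and it uses an RBF kernel with an arbitrary `gamma`; both choices, and all function names, are assumptions made for this example.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel; the gamma value is an arbitrary choice for the sketch."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_distance(x, z, kernel=rbf_kernel):
    """Distance in the kernel-induced feature space."""
    squared = kernel(x, x) + kernel(z, z) - 2.0 * kernel(x, z)
    return math.sqrt(max(squared, 0.0))  # guard against tiny negative rounding


def kff_query(train_points, pool, kernel=rbf_kernel):
    """Index of the pool instance farthest from the current training set."""
    def distance_to_set(x):
        return min(kernel_distance(x, t, kernel) for t in train_points)

    return max(range(len(pool)), key=lambda i: distance_to_set(pool[i]))
```

The distance uses the identity d(x, z)^2 = K(x, x) + K(z, z) - 2K(x, z), so no explicit feature map is needed.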
Method • Active Learning • Balanced Exploration and Exploitation (Balance-EE) • A combination of Simple and KFF • The probability p for exploration will be updated as:
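The choice between the two strategies can be sketched as a biased coin flip. The sketch assumes that exploration corresponds to KFF and exploitation to Simple, passed in as the hypothetical callbacks `explore` and `exploit`. The update rule for p is not reproduced in the text, so it is deliberately left out rather than guessed.

```python
import random


def balance_ee_query(p, explore, exploit, rng=random):
    """Explore with probability p, otherwise exploit.

    explore, exploit: zero-argument callables returning a pool index
    (assumed here to wrap a KFF query and a Simple query respectively).
    rng: any object with a random() method returning a float in [0, 1).
    """
    if rng.random() < p:
        return "explore", explore()
    return "exploit", exploit()
```

A caller would adjust `p` between calls according to the update rule, which would be a separate step outside this function.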
Outline • Introduction • Method • Experiment • Conclusion
Experiment • Dataset: • All reports provided in a year’s data collection by the imaging services in Australia. • A sample of 16,472 reports was drawn from Lake Imaging and assigned to the cancer (4,784 reports) or non-cancer (11,688 reports) class by the cancer registry.
Experiment • Descriptor (De) • morphology, topography, cell type, ... • Entity (En) • subject of the report • Linguistic (Li) • lexical polarity, normality and modifier • Radiologist’s coding (Ra) • cancer stage, TNM • Structure (St) • heading tags
Experiment
Experiment • The evaluation of the reportability classifier presented here was executed independently at the Cancer Registry. • The final version is based on two ML algorithms: Conditional Random Fields (CRFs) and SVMs. • ‘Sensitivity’ is the recall of the positive class (reportable). • ‘Specificity’ is the recall of the negative class (non-reportable).
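The two measures can be computed from true and predicted labels as in this short sketch. The function name and the convention that label 1 means "reportable" are assumptions for the example, and the labels in the usage note are made up, not results from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Recall of the positive class and recall of the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Return 0.0 when a class is absent instead of dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

For example, with the made-up labels `y_true = [1, 1, 1, 0, 0, 0, 0, 1]` and `y_pred = [1, 0, 1, 0, 0, 1, 0, 1]`, three of four reportable and three of four non-reportable items are recovered, so both values are 0.75.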
Outline • Introduction • Method • Experiment • Conclusion
Conclusion • Presents a general system for text mining in the clinical domain, with a focus on dealing with multiple frequent kinds of noise. • Can dramatically reduce the human effort needed to identify relevant reports in the large imaging pool for further investigation of cancer. • The classifier is built on a large real-world dataset and achieves high performance in filtering relevant reports.