Introduction Material and Methods Results and Discussion Conclusion and Perspectives
Comparative study between expert and non expert biomedical writings: their morphology and semantics
Jolanta Chmielik (1), Natalia Grabar (1, 2)
(1) INSERM UMRS 872, eq. 20, Université René Descartes, Paris, France
(2) DIH-HEGP - APHP - 20 rue Leblanc - Paris 15
MIE 2009 (29/08/2009 - 02/09/2009)
Outline
  1. Introduction
  2. Material and Methods
  3. Results and Discussion
  4. Conclusion and Perspectives
Introduction
  - Internet: favoured place for searching medical and health information (Fox, 2006)
  - Quality of health information: HON, CISMeF
  - Technical heterogeneity: expert and non expert documents co-exist, with a negative effect on communication between medical professionals and patients (AMA, 1999; McCray, 2005)
  - Distinction of discourses: HON, CISMeF, GoogleCoop health; categorization remains manual
  ⇒ Propose criteria for the automatic distinction of discourses, to guide users (especially non experts) towards appropriate sources of information
  Hypothesis: the morpho-semantic level provides relevant criteria
Similar work
  - Readability formulae: mean length of words and sentences (Flesch, 1948; Gunning, 1973; Björnsson et al., 1979)
  - Readability formulae and medical vocabulary (Kokkinakis & Gronostaj, 2006)
  - Supervised learning with different features (Poprat et al., 2006; Zheng et al., 2002; Grabar et al., 2007; Goeuriot et al., 2007; Miller et al., 2007)
  - Combination of various features:
    - linguistic features, readability, salient lexicon (Wang, 2006)
    - readability, frequent grammatical categories, familiarity of terms (Zeng-Treitler et al., 2007)
  - More detailed linguistic analysis of discourses:
    - Consumer Health Vocabulary: aligned expert and non expert vocabularies in English (Zeng et al., 2006; Zeng & Tse, 2006)
    - analysis of the syntactic level (Zeng-Treitler et al., 2007)
    - acquisition and alignment of expert/non expert paraphrases in French (Deléger & Zweigenbaum, 2008)
Objectives
  Objectives:
  - Analyze and exploit the morpho-semantic level of documents
  - Define salient features for the distinction of expert and non expert medical discourses
  - Facilitate the automatic recognition of discourses
  Framework of the study:
  - Language: French
  - Source of documents: CISMeF portal, www.cismef.org
  - Three specialties: cardiology, pneumology, diabetes
  - Three discourses: expert, didactic, non expert
Building the corpora
  Source of documents: CISMeF
  - Over 43,000 health and medical documents in French
  - Various characterizations of documents:
    1. documents accessible through their URL links
    2. documents indexed with MeSH keywords: cardiology, pneumology, diabetes
    3. documents profiled according to their discourse: for students, professionals, patients
  Preparing the corpus:
  - Automatic downloading of documents
  - Filtering and selection of HTML and XML files
  - Conversion to raw text format
  - Application of NLP tools for accessing the morpho-semantic level of documents
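The filtering and conversion steps listed above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which tools were used, and the names `select_markup_files`, `TextExtractor` and `html_to_text` are hypothetical helpers built on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


def select_markup_files(names):
    """Keep only file names that look like HTML or XML (assumed extensions)."""
    return [n for n in names if n.lower().endswith((".html", ".htm", ".xml"))]


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by the markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Convert one HTML document (as a string) to raw text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `html_to_text("<p>Le diab&egrave;te</p>")` yields the raw string with the character reference decoded, which is the kind of input a tagger expects.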
Building the corpora
  Size of corpora

  Specialty     Number of documents               Number of occurrences
                expert   didactic   non expert    expert    didactic   non expert
  Cardiology    1,583    205        143           942,409   449,765    157,382
  Pneumology    742      127        134           600,524   213,379    96,559
  Diabetes      213      23         52            181,039   44,847     29,817

  - Cardiology provides the largest number of medical documents
  - Expert corpora are the most complete (the largest in both documents and occurrences)
Accessing the morpho-semantic level
  Applied NLP tools
  - TreeTagger: morpho-syntactic tagging (Schmid, 1994)
      est VER:pres être
      antiinflammatoire PRO:POS antiinflammatoire
  - FLEMM: lemmatizer and morphological checker (Namer, 2000)
      est VER(pres):3p:s:pst:ind être:3g
      antiinflammatoire NOM: :s antiinflammatoire
  - DériF: morpho-semantic analyzer (Namer, 2003)
      angioblastique/ADJ
      [ [ angi N* ] [ blast N* ] ique ADJ ]
      (angioblastique/ADJ, [angi,N*]:blast/N*)
      Qui est en relation avec cellule embryonnaire et vaisseau
      (Which is in relation with embryonic cell and vessel)
      Constituants = /angi/blast/ique
      Type = anatomie
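To make the tool outputs above easier to exploit, here is a minimal parsing sketch. It assumes that the tagger output has three tab-separated columns (form, tag, lemma) and that the analyzer output contains a field written exactly as `Constituants = /angi/blast/ique`, as on the slide; the real file layouts may differ. `TaggedToken`, `parse_tagged_line` and `parse_constituents` are hypothetical names, not part of TreeTagger or DériF.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    form: str
    tag: str
    lemma: str


def parse_tagged_line(line):
    """Split one tagger output line, assumed to be 'form<TAB>tag<TAB>lemma'."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated columns, got {len(parts)}")
    return TaggedToken(*parts)


def parse_constituents(field):
    """Extract constituents from a field such as 'Constituants = /angi/blast/ique'."""
    label, _, value = field.partition("=")
    if label.strip() != "Constituants":
        raise ValueError("not a 'Constituants' field")
    # Drop the empty string produced by the leading slash.
    return [part for part in value.strip().split("/") if part]
```

Applied to the slide's example, `parse_constituents("Constituants = /angi/blast/ique")` returns the two bases and the suffix as separate strings, which is the unit the rest of the study counts.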
Accessing the morpho-semantic level
  Selection of bases
  For each specialty and discourse, the most productive bases are selected:
  - size of morphological families
  - number of lexemes formed with a given base:
    cardio- (cardie-, carde-): 57 (cardio-did); 26 (cardio-exp); 20 (cardio-nexp)
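The selection criterion above (size of the morphological family of each base) can be sketched as a simple count of distinct lexemes per base. The input format, the `VARIANTS` table and the function names are assumptions made for illustration; only the grouping of cardie- and carde- under cardio- comes from the slide, and the analyses in the usage note are hand-made toy data, not DériF output.

```python
from collections import defaultdict

# Variant forms mapped to one canonical base (the cardio- grouping is from the slide).
VARIANTS = {"cardie": "cardio", "carde": "cardio"}


def family_sizes(analyses, variants=VARIANTS):
    """Count distinct lexemes per base.

    analyses: iterable of (lexeme, list_of_bases) pairs.
    """
    families = defaultdict(set)
    for lexeme, bases in analyses:
        for base in bases:
            families[variants.get(base, base)].add(lexeme)
    return {base: len(members) for base, members in families.items()}


def most_productive(sizes, n):
    """Return the n bases with the largest families (ties broken alphabetically)."""
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))[:n]
```

With toy input such as `[("cardiologie", ["cardio"]), ("cardiopathie", ["cardio", "pathie"]), ("myocarde", ["myo", "carde"])]`, the three lexemes fall into one cardio- family because `carde` is mapped to `cardio` before counting.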
Studying and contrasting discourses
  Two approaches for contrasting expert, didactic and non expert discourses:
  1. Productivity of bases within morphological families: size of morphological families (the more productive a base is, the larger its family)
  2. Frequencies of lexemes within corpora
  Two values for features:
  - Raw values: the exact number of constructed lexemes
  - Normalized values: normalization by the size of the corresponding corpus
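The difference between raw and normalized values can be illustrated with figures already given in the slides: the family sizes reported for cardio- (read here as belonging to the cardiology didactic, expert and non expert corpora) and the cardiology corpus sizes. The slides do not state the unit used for corpus size; dividing by the number of occurrences is an assumption of this sketch, and `normalize` is a hypothetical helper.

```python
# Raw family sizes for the base cardio- (from the slide on selection of bases).
RAW_FAMILY_SIZE = {"expert": 26, "didactic": 57, "non expert": 20}

# Cardiology corpus sizes in occurrences (from the corpus size table).
# Assumption: "size of corpus" means number of occurrences, not documents.
CORPUS_SIZE = {"expert": 942_409, "didactic": 449_765, "non expert": 157_382}


def normalize(raw, sizes):
    """Divide each raw value by the size of the corresponding corpus."""
    return {discourse: raw[discourse] / sizes[discourse] for discourse in raw}


normalized = normalize(RAW_FAMILY_SIZE, CORPUS_SIZE)
for discourse, value in normalized.items():
    print(f"{discourse:>10}: raw={RAW_FAMILY_SIZE[discourse]:>2}  normalized={value:.2e}")
```

Under this assumption the expert family is larger than the non expert one in raw terms (26 vs 20) but smaller once divided by corpus size, since the expert corpus is about six times larger.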
Results and Discussion
  1. Selected morphological material
  2. Productivity of bases
  3. Frequencies of bases
Selected morphological material
  Total number of selected bases: n = 45
  - 38 suppletive bases:
    - 32 Greek bases: angi(o)-
    - 6 Latin bases: artério-
  - 7 autonomous bases: bronches (bronchus), bactérie (bacterium), ...
  2,295 lexemes constructed with these 45 bases
Selected morphological material
  Limitations of the NLP tools
  TreeTagger:
  - Web-related problems:
    - concatenated words: diabétiquesaide, diabétiquedidier, bronchopathieschronique
    - encoding: diabÃ¨tique
    - misspelling: diabèéte, diadétique, cardiomoypathie
  - Conversion not processed: { diabétique/Adj, diabétique/N }
  DériF:
  - Lexemes missing in the reference lexicon and not processed: épinéphrine, rétéplase, phosphodiestérase
  - Some morphological components not processed: -logue: neurologue, diabétologue, ...
  - Erroneous morpho-semantic analyses: gymnase: enzyme du nu (enzyme of the nude)
  - Non grouped morphological bases: hém(o)-, hém(a)-, hémato-, -èm-
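The encoding problem listed above typically arises when UTF-8 bytes are decoded as Latin-1. The sketch below shows one common way to undo that specific mistake before tagging; it is an illustration only, not something the slides describe, and `repair_mojibake` is a hypothetical helper.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 error; return the text unchanged otherwise."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not representable in Latin-1, or not valid UTF-8: leave the text as is.
        return text
```

A correctly encoded word such as `diabétique` fails the UTF-8 decoding step and is returned unchanged, so the function can be applied to every token without damaging clean text. It does not address the concatenation or misspelling problems.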
Productivity of bases
  [Figure: productivity of bases per specialty (diabetes, cardiology, pneumology) and discourse (expert, didactic, non expert); (a) raw values, (b) normalized values]
  - Didactic: the largest families (except diabetes); didactic documents ⇒ more diversified vocabulary
  - Expert vs non expert:
    (a) raw size of families: expert > non expert
    (b) normalized size of families: expert < non expert
  - Among the most productive morphological families:
    - hém(o)-: relatif au sang (relating to blood)
    - -pathie: en relation avec une maladie (related to a disease)
    - -ite: une maladie inflammatoire (an inflammatory disease)
Frequencies of bases
  [Figure: normalized frequency of bases per specialty (diabetes, cardiology, pneumology) and discourse (expert, didactic, non expert)]
  - Frequencies of bases within morphological families: didactic > non expert > expert
  - Important differences: didactic vs expert, didactic vs non expert