Automatic text classification and extraction of entities and their properties from the text

Automatic text classification and extraction of entities and their properties from the text. Anton Kolonin, Webstructor project, 2015, SIBIRCON/SibMedInfo


  1. Automatic text classification and extraction of entities and their properties from the text. Anton Kolonin, Webstructor project, 2015, SIBIRCON/SibMedInfo. http://www.webstructor.net/

  2. Unified approach: Different cases
     [Diagram: one news passage annotated as several cases at once. Recoverable labels: Category: “Healthcare”; Brand: Tylenol; Entity (Case): “Treatment: Substance: Healing anxiety with Tylenol”; Diagnosis: Anxiety; Effect: positive, significantly; Reliability: medium; Reporter: Daniel Randles; relations IS and HAS among the tokens “tylenol”, “acetaminophen” and “placebo”.]
     Source text: “Here’s the Tylenol twist: Before they began writing, half of each group received acetaminophen while the other half swallowed a placebo. Even among those people who wrote about death, the Tylenol takers set bail at roughly $300—a sign that acetaminophen may significantly reduce feelings of existential anxiety, explains study lead author Daniel Randles, a PhD candidate in UBC’s department of psychology.”
     Highlighted statement: “acetaminophen may significantly reduce feelings of existential anxiety, explains study lead author Daniel Randles.”

  3. Classification with feature vectors: Training process
     [Diagram: Text, Token, Feature, Category and Domain entities linked by TextToken, FeatureToken, TextFeature, CategoryFeature and Text Category relations. Inputs: item descriptions; keywords and phrases; multiple categories and domain-specific attributes. Processes: Feature instantiation, Category instantiation, Category Feature inference.]
     There are three sub-processes contributing to the learning process.
     The first process is Category instantiation, which takes the attributes defined for texts in the training corpus (either encoded in the text as tags or taken from the respective database table fields) and creates categories for them, given the domain indicated by the attribute.
     The second process is Feature instantiation, which takes the text in the training corpus and decomposes it into tokens and features according to the parser, tokenizer and feature builder, depending on the implementation.
     The two processes above are independent of each other, but both precede the third process, Category Feature inference. It employs statistics to infer the relevance of the features encountered in the texts to the categories associated with those texts.
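The three training sub-processes above can be sketched roughly as follows. This is a minimal illustration, not the Webcat implementation: the regular-expression tokenizer, the use of single tokens as features, and the relevance measure (the share of a feature's documents that belong to a category) are all assumptions made for the example, and the names `tokenize` and `train` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(corpus):
    """Learn feature weights from (text, category) pairs.

    Returns {category: {feature: weight}}, where the weight is the share of
    documents containing the feature that carry this category (an assumed
    statistic, chosen only to keep the sketch short).
    """
    per_category = defaultdict(Counter)  # category -> feature -> document count
    totals = Counter()                   # feature -> document count overall

    for text, category in corpus:
        # Category instantiation: a category exists once its label is seen.
        # Feature instantiation: every distinct token becomes a feature.
        for feature in set(tokenize(text)):
            per_category[category][feature] += 1
            totals[feature] += 1

    # Category Feature inference: relate each feature to each category.
    return {
        category: {f: n / totals[f] for f, n in counts.items()}
        for category, counts in per_category.items()
    }
```

With a two-document corpus such as `[("Tylenol reduces anxiety", "Healthcare"), ("Stocks fall on anxiety", "Finance")]`, the token "tylenol" is fully specific to "Healthcare", while "anxiety" is split evenly between the two categories.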

  4. Classification with feature vectors: Recognition process
     [Diagram: Text, Token, Feature, Category and Domain entities linked by TextToken, FeatureToken, TextFeature, CategoryFeature and Text Category relations. Processes: Feature detection, Text Category inference. Output: categories assigned to particular descriptions, based on category assignments learned from the training corpus.]
     There are two sub-processes contributing to the rule-applying process; the process flow diagram depicts the dependencies between the sub-processes and the data.
     The first process is Feature detection, which takes the text in novel data and decomposes it into tokens and features according to the parser, tokenizer and feature builder, depending on the implementation. This process is similar to Feature instantiation during learning, with one key difference: only the features instantiated earlier, during learning, can be detected; no new features are instantiated.
     The second process is Text Category inference. It employs statistics to infer the relevance of texts to categories through the features detected in the texts and learned for those categories.
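The recognition step can be sketched in the same spirit. The example below is only an assumption about how such a step might look: the hand-written `MODEL` stands in for weights learned during training, the additive scoring is one simple choice among many, and `detect_features` and `classify` are hypothetical names.

```python
import re

# Hypothetical learned model: {category: {feature: weight}}.
MODEL = {
    "Healthcare": {"tylenol": 1.0, "placebo": 1.0, "anxiety": 0.5},
    "Finance": {"stocks": 1.0, "anxiety": 0.5},
}


def detect_features(text, known_features):
    """Feature detection: keep only tokens already learned as features."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t in known_features]


def classify(text, model):
    """Text Category inference: sum the weights of detected features.

    Returns (category, score) pairs, best match first.
    """
    known = {f for weights in model.values() for f in weights}
    features = set(detect_features(text, known))
    scores = {
        category: sum(weights.get(f, 0.0) for f in features)
        for category, weights in model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Tokens that were never seen in training, such as "ibuprofen" here, are silently ignored rather than turned into new features, which mirrors the key difference from the training process described on the slide.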

  5. Webcat: Plain user interface

  6. Webcat: Expert user interface

  7. Webcat: Extracting entity properties

  8. Webcat: Complex patterns and rules
     - Sparse N-gram: “tylenol significantly reduce feelings of existential anxiety” = “tylenol ... reduce ... anxiety”
     - Priority on order: “reduce ... feelings” is more important than “reduce AND feelings”
     - Boolean ranking: “acetaminophen AND tylenol” is more important than “placebo”, regardless of statistics
     - Contextual scoping: “tylenol” implies that “may” is a reliability measure, not a month of the year
     - Compression of vector space: “disease OR illness” is much faster than “disease OR illness OR ail OR blast OR sick” (even if a little less accurate)
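The difference between a sparse N-gram and a plain conjunction can be illustrated with a short sketch. It assumes that a sparse N-gram means "these terms, in this order, with arbitrary gaps", which is how the slide's example reads; the function names are hypothetical.

```python
def matches_sparse_ngram(pattern, tokens):
    """True if the pattern terms occur in the tokens in order, gaps allowed."""
    remaining = iter(tokens)
    # Each `term in remaining` consumes the iterator up to the match,
    # so later terms can only match later positions.
    return all(term in remaining for term in pattern)


def matches_all(pattern, tokens):
    """True if every pattern term occurs somewhere, in any order (AND)."""
    return set(pattern) <= set(tokens)


TOKENS = "tylenol significantly reduce feelings of existential anxiety".split()
```

Here `["tylenol", "reduce", "anxiety"]` matches `TOKENS` under both functions, while the reversed pair `["feelings", "reduce"]` satisfies only the order-insensitive AND check. A ranking that prefers ordered matches could score the first kind of hit higher than the second.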

  9. Accuracy: Human vs. Computer: Sources of errors
     - 76%-96%: Software matches Human
     - 1-8%: Software performs better than Human
     - 1-8%: Human typos in training corpus
     - 1-8%: Lack of data in training corpus
     - 1-8%: Software lacks Human-level intelligence

  10. Statistical «fuzzy» learning vs. «rigid» patterns and rules
     [Chart: Accuracy (0% to 100%) plotted against Situations (Novel to Familiar), with one curve for the approach based on rules and patterns and one for the approach based on statistics.]

  11. Finding entities with properties: Hierarchical patterns
     [Diagram: a tree of AND and OR nodes. Top-level node types: VERB, NOUN. Concept nodes: “disease”, “positive effect”, “negative effect”, “drug”, “procedure”. Leaf tokens: cold, head, existential, frustration, anxiety; reduce, reduces, treat, treats; aspirin, acetylsalicylic acid, tylenol, acetaminophen.]
     Example text: “acetaminophen may significantly reduce feelings of existential anxiety”
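A hierarchy of AND and OR nodes like the one on this slide can be evaluated recursively. The sketch below is an illustration only: the exact tree on the slide cannot be recovered from the transcript, so the grouping of leaf tokens into `DRUG`, `POSITIVE_EFFECT` and `DISEASE` is a guess, and the tuple encoding and the `evaluate` function are hypothetical.

```python
def evaluate(node, tokens):
    """Evaluate a nested ("AND" | "OR", child, ...) tree against a token set.

    A plain string is a leaf and matches if it is one of the tokens.
    """
    if isinstance(node, str):
        return node in tokens
    operator, *children = node
    results = (evaluate(child, tokens) for child in children)
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")


# Assumed grouping of the slide's leaf tokens into concepts.
DRUG = ("OR", "aspirin", "tylenol", "acetaminophen")
POSITIVE_EFFECT = ("OR", "reduce", "reduces", "treat", "treats")
DISEASE = ("OR", ("AND", "head", "cold"), "anxiety")
TREATMENT_CASE = ("AND", DRUG, POSITIVE_EFFECT, DISEASE)

SENTENCE = "acetaminophen may significantly reduce feelings of existential anxiety"
```

Calling `evaluate(TREATMENT_CASE, set(SENTENCE.split()))` succeeds because each of the three concepts is satisfied by at least one token of the sentence.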

  12. Hierarchical patterns: Definition
     <pattern> := <token> | <regexp> | <variable> | <set>
     <set> := <conjunctive-set> | <N-gram> | <syn-set>
     <conjunctive-set> := ( <pattern> * )
     <N-gram> := [ <pattern> * ]
     <syn-set> := { <pattern> * }
     Examples
     Pattern: {[$description catheter] [$coating coating] [$inner-diameter {diameter inner-diameter}] [$tip tip] [$pattern pattern]}
     Text: X Convey Guiding Catheter. Unique hydrophilic coating. Small atraumatic soft tip. Ultra-thin 1 × 2 flat wire braid pattern
     Result: { coating : 'hydrophilic', description : 'convey guiding', pattern : 'ultra-thin 1 × 2 flat wire braid', tip : 'soft' }
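The bracket grammar above is small enough to parse with a short recursive function. The sketch below covers only tokens, `$variables` and the three kinds of set; it does not handle `<regexp>` and it does not attempt matching or property extraction, whose exact semantics the slide does not spell out. The tuple-based tree and the name `parse_pattern` are assumptions for illustration.

```python
import re

CLOSERS = {"(": ")", "[": "]", "{": "}"}
KINDS = {"(": "conjunctive-set", "[": "N-gram", "{": "syn-set"}


def parse_pattern(source):
    """Parse a pattern string into a nested (kind, value) tree."""
    # Brackets are single-character lexemes; anything else is a word.
    lexemes = re.findall(r"[()\[\]{}]|[^\s()\[\]{}]+", source)
    position = 0

    def parse():
        nonlocal position
        if position >= len(lexemes):
            raise ValueError("unexpected end of pattern")
        lexeme = lexemes[position]
        position += 1
        if lexeme in CLOSERS:
            children = []
            while position < len(lexemes) and lexemes[position] != CLOSERS[lexeme]:
                children.append(parse())
            if position >= len(lexemes):
                raise ValueError(f"missing {CLOSERS[lexeme]!r}")
            position += 1  # consume the closing bracket
            return (KINDS[lexeme], children)
        if lexeme in CLOSERS.values():
            raise ValueError(f"unexpected {lexeme!r}")
        if lexeme.startswith("$"):
            return ("variable", lexeme[1:])
        return ("token", lexeme)

    tree = parse()
    if position != len(lexemes):
        raise ValueError("trailing input after pattern")
    return tree
```

Parsing `"{[$tip tip] {diameter inner-diameter}}"` yields a syn-set whose children are an N-gram (the variable `tip` followed by the token `tip`) and a nested syn-set of two tokens. A matcher could then walk this tree in the same recursive way.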

  13. Goal: Learn patterns experientially
     [Diagram: cognitive architecture centred on “I” (Russian: Я). Recoverable labels: Big Picture; Abstract Cognition; Linguistic Cognition; Objective Cognition; Social Cognition; Act document; Act action; r (г); E lower; E upper; Emotional Perception; Audial Perception; Visual Perception; Tactile Perception; Olfactory Perception.]

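  14. Part-of-speech tagging? Need full semantic context to be precise
     Russian examples: “Косой косой косарь” and “Косой с косой косой косил на косе косо”. Almost every word is built on the same root: “косой” can mean crooked, cross-eyed, drunk (colloquial) or serve as a nickname; “коса” can mean a scythe, a braid or a spit of land; “косарь” is a mower; “косил” is “mowed”; “косо” is “crookedly”.
     Questions used on the slide to tell the readings apart:
     - What kind (state of intoxication)?
     - What kind (property of eyesight)?
     - Who (name, nickname)?
     - Who (profession)?
     - Did what?
     - With what (accompanying object)?
     - With what (instrument)?
     - Where?
     - How?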

  15. Current implementation

  16. Thank you for your attention! Anton Kolonin, Webstructor project, 2015, SIBIRCON/SibMedInfo
