> Overview
● Goals of TM
● How does TM work?
● Components
● History
● Problems particular to biomedical texts
● Examples
22/01/08 Tamara Polajnar > BRC > U Glasgow
> The Goal of Text Mining
The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity.

Action    Inter1  Inter2
activate  Ras     gfr
bind      ras     raf-1 kinase
> How does TM work?
[Figure: the text mining pipeline, with boxes labelled Text collection, Text Selection (IR, classification), Information Extraction, Knowledge Management, and Visualisation, plus External Knowledge]
> How does TM work?
TM integrates:
● Information retrieval, to retrieve a high proportion of the relevant documents (high recall).
● Text categorisation, for higher-precision document selection.
● Named entity recognition, to identify relevant proteins, genes, cellular components, processes, etc.
● Information extraction, to explain the relationships between the entities.
● Knowledge management and visualisation, to store and access the results.
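> A Note on Evaluation
A – set of retrieved documents
B – set of relevant documents
tp (true positives) – documents in both A and B; fp (false positives) – documents in A but not in B; fn (false negatives) – documents in B but not in A
precision = tp / (tp + fp)
recall = tp / (tp + fn)
F-measure = 2 * precision * recall / (precision + recall)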
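The three measures can be computed directly from the retrieved set A and the relevant set B. The following is a minimal sketch; the function name `evaluate` and the document ids are made up for illustration and are not part of the slides.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_measure) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical document ids, for illustration only.
A = {"d1", "d2", "d3", "d4"}   # retrieved documents
B = {"d2", "d3", "d5"}         # relevant documents
precision, recall, f_measure = evaluate(A, B)
```

Here two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents were found (recall 2/3).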
> Information Retrieval
● IR is used to manage and access vast numbers of documents.
● Many IR systems index documents and create dictionaries which associate words with documents.
● Others also cluster documents based on topic.
● Search engines retrieve many documents and rank them according to the query. Precision drops off as you look further down the ranked list.
● IR systems in general have high recall but low precision.
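The idea of a dictionary that associates words with documents can be sketched as an inverted index. This is only an illustration under simple assumptions: words are split on whitespace, and documents are ranked by how many distinct query words they contain. Real IR systems use more refined weighting. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical documents, for illustration only.
docs = {
    "d1": "Ras binds Raf-1 kinase",
    "d2": "growth factor receptors activate Ras",
    "d3": "kinase inhibitors in cancer therapy",
}
index = build_index(docs)
ranking = search(index, "Ras kinase")
```

The query returns every document that mentions either word (high recall), but only the top-ranked one mentions both, which is why precision falls further down the list.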
> Medline
● A database of citations and abstracts of articles published in major peer-reviewed journals since the 1950s.
● Each entry contains bibliographic information.
● Some entries contain abstracts, MeSH terms, citations...
● It is available for download for TM in XML format.
> IR for Biomedical Texts
● IR is key for biomedical research.
● Search engines are the main way of looking for journal articles.
● PubMed/Medline citations:
  2008: 16,880,015
  2007: 16,120,074
  2006: 15,433,668
> Classification
● Classification is used to put documents into two or more classes.
● The classes are learned from examples, so manually compiled training data is usually needed.
● Clustering can be done without training data, but the meaning of the clusters has to be determined afterwards.
● Classification usually has much higher precision than IR while also maintaining high recall (depending on the data).
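Learning classes from labelled examples can be illustrated with a small naive Bayes text classifier. Naive Bayes is chosen here only as a simple, well-known technique; the slides do not name a particular classifier. The training sentences, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labelled training data.
examples = [
    ("ras binds raf kinase", "interaction"),
    ("receptor activates ras protein", "interaction"),
    ("patients received standard therapy", "clinical"),
    ("therapy improved patient survival", "clinical"),
]
model = train(examples)
label_a = classify(model, "kinase binds receptor")
label_b = classify(model, "standard therapy")
```

With so little training data the result depends entirely on word overlap, which is why the slide stresses the need for a manually compiled training set.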
> Information Extraction
● Information extraction is a process by which knowledge contained in unstructured text is translated into predicate forms which can be used for reasoning.
● IE usually involves several layers of processing, including named entity recognition, parsing, and pattern recognition.
● In general, IE systems are manually engineered for particular problems.
● IE usually has high precision, but may miss a lot of information that is presented in a form which was not observed before.
> The small G-Protein Ras is activated by many growth factor receptors and binds to the Raf-1 kinase with high affinity
[Figure: parse tree of the sentence, with nodes labelled using the legend below]
Legend: AP – adjective phrase; NP – noun phrase; PP – prepositional phrase; VP – verb phrase; Adj – adjective; Conj – conjunction; Det – determiner; N – noun; Prep – preposition; V – verb
Chunked sentence: The small [G-Protein] Ras is activated by many [growth factor] receptors and binds to the [Raf-1 kinase] with high affinity
Roles marked on the slide: subject, verb, object, verb, object
Extracted predicates: activate(g-protein ras, gfr), bind(ras, raf-1 kinase)
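The step from the sentence to the two predicates can be imitated with a dictionary lookup followed by hand-written patterns. This is a rough sketch of a manually engineered extractor, not the method used to produce the slide: the entity dictionary, the placeholder tokens, and the two regular expressions are all assumptions made for this one sentence. The argument order follows the slide, with the grammatical subject first.

```python
import re

# Hypothetical entity dictionary: surface form -> normalised name.
ENTITIES = {
    "G-Protein Ras": "ras",
    "growth factor receptors": "gfr",
    "Raf-1 kinase": "raf-1 kinase",
}


def tag_entities(sentence):
    """Replace known entity mentions with placeholder tokens (ENT0, ENT1, ...)."""
    names = {}
    for i, (surface, name) in enumerate(ENTITIES.items()):
        token = f"ENT{i}"
        if surface in sentence:
            sentence = sentence.replace(surface, token)
            names[token] = name
    return sentence, names


def extract(sentence):
    """Return (action, inter1, inter2) tuples matched by two fixed patterns."""
    tagged, names = tag_entities(sentence)
    facts = []
    # Pattern 1: "X is activated by [modifiers] Y"
    passive = re.search(r"(ENT\d+) is activated by (?:\w+ )*?(ENT\d+)", tagged)
    if passive:
        subject = passive.group(1)
        facts.append(("activate", names[subject], names[passive.group(2)]))
        # Pattern 2: a conjoined verb phrase sharing the same subject.
        binds = re.search(r"and binds to (?:\w+ )*?(ENT\d+)", tagged)
        if binds:
            facts.append(("bind", names[subject], names[binds.group(1)]))
    return facts


sentence = ("The small G-Protein Ras is activated by many growth factor "
            "receptors and binds to the Raf-1 kinase with high affinity.")
facts = extract(sentence)
```

A sentence phrased differently, for example in the active voice, would yield nothing, which illustrates why such systems are precise but miss unseen formulations.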
> A Little History
● The goal of natural language understanding has driven research in computational linguistics since the first computers.
● In the 1940s the behaviourist theory prevailed. Early work showed that language can be closely approximated by counting the frequencies (probabilities) of words and outputting words with similar frequencies, which supported this view.
● In the 1950s this view was challenged by rationalists. Rationalists believe that human language is innate and only the syntactic rules of a language are learned, whereas empiricists believe that human language is learned entirely through exposure.
> Shannon
● In his 1948 paper A Mathematical Theory of Communication, Claude Shannon showed that English can be approximated by statistical processes:
● THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
From the Random Surrealism Shakespearian generator:
To debug a kilt, or not to debug a kilt, that is the question!
Is this a roadworks sign which I see before me, the doorbell toward my lecturer? Come, let me charge headlong at thee.
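A statistical approximation of this kind can be sketched with a word bigram model: each word is chosen at random from the words that followed the previous word in some training text. The code below is a minimal illustration; the toy corpus and the function names are hypothetical, and a realistic model would be trained on far more text.

```python
import random
from collections import defaultdict


def train_bigrams(text):
    """Record, for each word, every word that follows it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, following in zip(words, words[1:]):
        # Repeated entries preserve the observed frequencies.
        followers[current].append(following)
    return followers


def generate(followers, start, length, seed=0):
    """Generate up to `length` words, each drawn from the followers of the last."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < length and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)


# Toy corpus, for illustration only.
corpus = ("the ras protein binds the raf kinase and "
          "the raf kinase activates the mek kinase")
followers = train_bigrams(corpus)
sample = generate(followers, "the", 8)
```

Every pair of adjacent words in the output was seen in the training text, so the result is locally plausible even though the sentence as a whole carries no intended meaning.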
> Chomsky
● In 1956 Noam Chomsky revolutionised the study of linguistics by formalising grammars and by organising languages in a hierarchical structure according to their capabilities. Chomsky's ideas influenced much of theoretical computer science, including automata theory and programming languages.
● Chomsky postulated that human beings are born with an innate ability to process language, and that different languages differ only in their syntactic surface structure, while the semantics of language, the deep structure, is common to all people.
● The following sentences share the same deep structure:
● I am reading this slide. This slide is being read by me.
> And Then...
● Rationalists attempted to write down all the rules governing language, while empiricists concentrated on developing representative statistical models.
● These approaches developed separately for several decades, each achieving individual success, but neither providing a comprehensive solution.
● In the 1990s the two approaches started coming together, resulting in statistical learning of rules and linguistic improvements to statistical methods. This made natural language processing much more practical.
> NLP vs. Text Mining
● Natural Language Processing is a research field in computing science, sometimes also called Computational Linguistics or Language Engineering.
● It comprises all research involving computers and human languages, whether in speech or text form.
● Text Mining is a combination of tools developed within NLP research which are aimed at extracting specific information from large textual corpora.
● The term text mining is almost exclusively used for biological applications and is meant to allude to data mining.
> Some Applications of NLP
● Machine translation
● Question answering
● Information extraction
● Speech generation
● Speech processing
● Dialogue systems
● Author identification
● Natural language querying
> NLP for Biology
● There are some specific challenges in biological texts which have interested the NLP community.
● There is also a real need and use for the tools developed for biology, which is also driving the research in text mining.
> NLP Tools in Text Mining
● Tokenisation
● Named entity recognition
● Classification
● Shallow parsing
● Full parsing
> Current Approaches to NLP
● In general, most approaches are a combination of rule-based and stochastic methods.
● Tokenisation, named entity recognition, and classification are generally done using statistical methods, but may employ dictionaries.
● Parsing usually combines grammar rules with probabilities that each rule will apply.
> Tokenisation
What is a word?
● A sequence of letters, all lowercase, with a first capital letter, or all capitals? But consider: mRNA, hnRNP
● A sequence of letters? But consider: P26, RGL3, Nore1
● A sequence of letters and numbers? But consider: Raf-1, JRE-IL6
● Any sequence of symbols which ends with a space or a punctuation mark? But consider: H7-sensitive, PI3K-induced, 5'-triphosphatase
● Usually each application will have its own definition of what a word is.
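The effect of each definition can be seen by applying it to the same text. The sketch below uses three simple rules written as regular expressions or string operations; the sample sentence is made up from terms on the slide, and the rules are illustrative assumptions rather than the tokeniser of any particular system.

```python
import re

# Sample sentence assembled from terms on the slide, for illustration only.
text = "PI3K-induced Raf-1 activates mRNA of Nore1."

# Definition 1: a word is a run of letters only.
letters_only = re.findall(r"[A-Za-z]+", text)

# Definition 2: a word is a run of letters and digits.
alphanumeric = re.findall(r"[A-Za-z0-9]+", text)

# Definition 3: split on whitespace, then strip trailing sentence punctuation.
whitespace = [token.rstrip(".,;:!?") for token in text.split()]

print(letters_only)
# ['PI', 'K', 'induced', 'Raf', 'activates', 'mRNA', 'of', 'Nore']
print(alphanumeric)
# ['PI3K', 'induced', 'Raf', '1', 'activates', 'mRNA', 'of', 'Nore1']
print(whitespace)
# ['PI3K-induced', 'Raf-1', 'activates', 'mRNA', 'of', 'Nore1']
```

Only the third rule keeps Raf-1 and PI3K-induced whole, but it would also keep any other hyphenated or punctuated string intact, which may or may not suit a given application.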