text stream processing

Text Stream Processing - Dunja Mladenić, Artificial Intelligence Laboratory - PDF document



  1. 19.6.2012
     Text Stream Processing
     Dunja Mladenić, Marko Grobelnik, Blaž Fortuna, Delia Rusu
     Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
     ailab.ijs.si

     Introduction to text streams
     • What are text streams
     • Properties of text streams
     • Motivation
     • Pre-processing of text streams
     • Text quality
     Text stream processing
     • Topic detection
     • Entity, event and fact extraction and resolution
     • Word sense disambiguation
     • Summarization
     • Sentiment analysis
     • Social network analysis
     Concluding remarks
     • Key literature overview
     • Further publicly available tools
     • Conclusions
     • Questions and discussion

  2. Introduction to Text Streams
     • What are text streams
     • Properties of text streams
     • Motivation
     • Pre-processing of text streams
     • Text quality

     What are data streams
     • Continuously arriving data, usually in real time
     • Dealing with streams is often easy, but it gets hard when the data stream is intensive and complex operations on the data are required
     • In such situations, usually:
       • the volume of data is too big to be stored
       • the data can be scanned thoroughly only once
       • the data is highly non-stationary (its properties change over time), so approximation and adaptation are key to success
     • A typical solution is therefore not to store the observed data explicitly, but to keep it in an aggregate form that still allows the required operations to be executed
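The idea of keeping a stream in aggregate form rather than storing it can be illustrated with a small sketch. It is not taken from the slides: it assumes the required operations are the mean and variance of a numeric stream (for example, document lengths) and uses Welford's one-pass update. The class name `RunningStats` and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-size aggregate of a numeric stream (Welford's algorithm)."""
    n: int = 0          # number of items seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Each item is looked at once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


# Hypothetical stream of document lengths, processed in a single pass.
stats = RunningStats()
for length in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(length)
```

The aggregate occupies three numbers no matter how long the stream runs. For a non-stationary stream one would probably add some form of decay or windowing, which this sketch leaves out.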

  3. Stream processing
     Who works with real-time data processing?
     • “Stream Mining” (a subfield of “Data Mining”) deals with mining data streams in different scenarios, in relation to machine learning and databases: http://en.wikipedia.org/wiki/Data_stream_mining
     • “Complex Event Processing” is a research area concerned with discovering complex events from simple ones by inference, statistics, etc.: http://en.wikipedia.org/wiki/Complex_Event_Processing

     Motivation for stream processing
     Why would one need (near) real-time information processing? Because time and reaction speed correlate with many target quantities, e.g.:
     • on the stock exchange, with earnings
     • in controlling, with quality of service
     • in fraud detection, with safety
     In general: Reaction Speed == Value. If our systems react fast, we create new value.

  4. What are text streams
     • A continuous, often rapid, ordered sequence of texts
     • Text information arriving continuously over time in the form of a data stream
     • News and similar regular reports: news articles, online comments on news, online traffic reports, internal company reports, web searches, scientific papers, patents
     • Social media: discussion forums (e.g., Twitter, Facebook), short messages on phones or computers, chat, transcripts of phone conversations, blogs, e-mails
     • Demo: http://newsfeed.ijs.si

     NewsFeed

  5. Properties of text streams
     • Produced at a high rate over time
     • Can be read only once or a small number of times (due to the rate and/or overall volume)
     • Challenging for computing and storage capabilities: approaches must be efficient and scalable
     • Strong temporal dimension
     • Modularity over time and sources (topic, sentiment, …)

     Example task: evolution of research topics and communities over time
     • Based on time-stamped research publication titles and authors
     • Observe which topics/communities shrank, which emerged, which split over time, when the turning points occurred, …
     • TimeFall: monitoring dynamic, temporally evolving graphs and streams
       • based on Minimum Description Length (MDL)
       • finds good cut-points in time and stitches together the communities: a good cut-point leads to a shorter description length
       • fast and efficient incremental algorithm, scales to large datasets, easily parallelizable

  6. Example task: evolution of research topics and communities over time
     • Given: n time-stamped events (e.g., papers), each related to several of m items (e.g., title words and/or author names)
     • Find cluster patterns and summarize their evolution in time
     [Figure: paper, word, and word-cluster matrices over the time steps 1990, 1991, 1992, with steps numbered 1-5]

     TimeFall on 12 million medical publications from PubMed MEDLINE over 40 years
     • Scales linearly with the product of the initial time point blocks and the number of non-zeros in the matrix
     J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, M. Grobelnik. Monitoring Network Evolution using MDL. International Conference on Data Engineering (ICDE 2008).

  7. Pre-processing text stream
     • Basic text pre-processing, including removing stop words and applying stemming
     • Representing text for internal processing
       • Splitting into units (e.g., sentences or words)
       • Mapping to an internal representation (e.g., feature vectors of words, vectors of ontology concepts)
     • Pre-processing for aligning/merging text streams
       • Time-wise alignment of multiple text streams: coordinated text streams (appearing over the same time window, e.g., news)
       • Content alignment, possibly across different languages

     Example: stop words
     The city hosts a great number of religious buildings, many of them dating back to medieval times.

  8. Example: stemming
     city hosts great number religious buildings, many them dating back medieval times.
     hosts -> host, religious -> religi, buildings -> build, dating -> date, medieval -> mediev, times -> time

     Example: feature vector of words
     city host great number religi build, many them date back mediev time.
     Splitting into units of words: (city, host, great, number, religi, build, many, them, date, back, mediev, time)
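The three pre-processing steps in the example (stop-word removal, stemming, splitting into a word feature vector) can be sketched as follows. This is only an illustration under stated assumptions: `STOP_WORDS` is the smallest list that reproduces the slide's output, and `STEMS` is a hypothetical lookup table copied from the slide that stands in for a real stemming algorithm.

```python
import re
from collections import Counter

# Assumption: minimal stop-word list that reproduces the slide's example.
STOP_WORDS = {"the", "a", "of", "to"}

# Assumption: hand-written stem table taken from the slide; a real system
# would use a stemming algorithm instead of a lookup.
STEMS = {
    "hosts": "host",
    "religious": "religi",
    "buildings": "build",
    "dating": "date",
    "medieval": "mediev",
    "times": "time",
}


def preprocess(text):
    """Lowercase, split into words, drop stop words, map words to stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [STEMS.get(t, t) for t in tokens]


def feature_vector(tokens):
    """Bag-of-words representation: stem -> number of occurrences."""
    return Counter(tokens)


sentence = ("The city hosts a great number of religious buildings, "
            "many of them dating back to medieval times.")
tokens = preprocess(sentence)
vector = feature_vector(tokens)
```

A sparse mapping from word to count is used here because most vocabulary entries are absent from any single document.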

  9. Text Quality
     Factors:
     • Vocabulary use
     • Grammatical and fluent sentences
     • Structure and coherence
     • Non-redundant information
     • Referential clarity, e.g. proper usage of pronouns
     Models of text quality:
     • Global coherence: overall document organization
     • Local coherence: adjacent sentences
     • Language model based approaches

  10. Text Stream Processing
      [Diagram: pipeline from the Web through a Web Crawler and Text Pre-Processing to the processing components (Topic Detection, Summarization, Sentiment Analysis, Information Extraction, Social Network Analysis, Word Sense Disambiguation), producing Text Stream Processing Results]

      Topic Detection
      [Figure: topic labels Religion and Art]

  11. Topic Detection
      • Supervised techniques
        • The data is labeled with predefined topics
        • Machine learning algorithms are used to predict the labels of unseen data
      • Unsupervised techniques
        • Identify patterns and structure within the dataset
        • Clustering: grouping data sharing similar topics
        • Statistical methods: probabilistic topic modeling

      Probabilistic Topic Modeling
      • Topic: a probability distribution over words in a fixed vocabulary
      • Given an input corpus containing a number of documents, each consisting of a sequence of words, the goal is to find useful sets of topics

  12. Latent Dirichlet Allocation
      • Documents can have multiple topics
      [Figure: topic labels Religion and Art]
      D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

      LDA Generative Process
      • A topic is a distribution over words
      • A document is a mixture of corpus-level topics
      • Each word is drawn from one of the corpus-level topics
      For each document, generate the words:
      1. Randomly choose a distribution over the topics
      2. For each word in the document:
         a) Randomly choose a topic from the distribution over topics (step 1)
         b) Randomly choose a word from the corresponding distribution over the vocabulary
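The generative steps above can be written out as a short simulation. This is a sketch, not the authors' code: the two topics reuse the Religion and Art labels from the slides, but their word probabilities, the symmetric Dirichlet parameter `alpha`, and the function names are all assumptions made for illustration. In LDA the per-document distribution over topics is drawn from a Dirichlet distribution, which is sampled here by normalizing gamma draws.

```python
import random

# Assumption: toy corpus-level topics; the word probabilities are invented.
TOPICS = {
    "religion": {"church": 0.5, "monastery": 0.3, "medieval": 0.2},
    "art": {"painting": 0.5, "gallery": 0.3, "medieval": 0.2},
}


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    names = list(topics)
    # Step 1: randomly choose a distribution over the topics.
    theta = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the distribution over topics.
        topic = rng.choices(names, weights=theta, k=1)[0]
        # Step 2b: choose a word from that topic's distribution over words.
        dist = topics[topic]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        words.append(word)
    return theta, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
theta, words = generate_document(TOPICS, alpha=0.5, n_words=10, rng=rng)
```

Topic modeling in practice runs in the opposite direction: given only the words, it infers the topics and the per-document mixtures that plausibly generated them.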
