Mining Complex Data
Chris Williams, School of Informatics, University of Edinburgh

• Mining the WWW (content, structure, usage)
• Retrieval by Content (RBC) for text, images
• Text mining
• Automatic Recommender Systems
• Mining image data
• Time series and sequence data
• Data Mining: Summary

Reading: HMS chapter 14

Web Mining

• Mining content
  – Retrieval by content (RBC)
  – Document classification
• Mining web structure
  – Hubs and authorities (Kleinberg)
  – PageRank (Page, Brin et al), combined with RBC
• Web usage – access patterns of users (Cadez et al)
• Computational problem for the web is big!

PageRank

• Each page u has a set of forward links F_u (to other pages) and a set of backward links B_u
• A simple count of |B_u| is not sufficient: it does not take into account the relative importance of the linking pages
• Simple rank:
  r(u) = c \sum_{v \in B_u} r(v) / |F_v|
• Let A_{uv} = 1/|F_u| if there is an edge from u to v, and 0 otherwise. Vector form of the equation is r = cAr
• Eigenvector equation: find the dominant eigenvector, which can be found by the power method
• Simple rank is confused by "rank sinks", e.g. two pages that point to each other but to no other pages. If someone makes a link to one of these pages it will accumulate rank during the iteration
• Fix by using a source of rank e:
  r' = cAr' + ce, with |r'|_1 = 1
• Equivalent eigenvector problem: r' = c(A + e 1^T) r'
• e often taken as uniform, but can be "personalized" (a small power-method sketch follows below)
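The following is a minimal power-method sketch of the rank iteration above; it is not from the slides. It builds A with A_{uv} = 1/|F_u| as defined above and iterates r' = cA^T r' + ce, renormalising so that |r'|_1 = 1 (the transpose puts the update in the r(u) = c \sum_{v \in B_u} r(v)/|F_v| form). The function name, the damping value c = 0.85 and the toy link list are assumptions for illustration only.

```python
import numpy as np

# Minimal power-method sketch for the PageRank update above (toy example).
# A[u, v] = 1/|F_u| if page u links to page v, else 0; the iteration
# r' = c A^T r' + c e, renormalised to |r'|_1 = 1, is power iteration on
# c(A^T + e 1^T), i.e. the "source of rank" fix for rank sinks.

def pagerank(links, n_pages, c=0.85, n_iter=100, tol=1e-10):
    """links: list of (u, v) pairs meaning page u links to page v."""
    A = np.zeros((n_pages, n_pages))
    for u, v in links:
        A[u, v] = 1.0
    out_deg = A.sum(axis=1)
    A[out_deg > 0] /= out_deg[out_deg > 0, None]   # rows of A sum to 1

    e = np.ones(n_pages) / n_pages                  # uniform source of rank
    r = np.ones(n_pages) / n_pages                  # initial rank vector
    for _ in range(n_iter):
        r_new = c * (A.T @ r) + c * e               # r' = cA^T r' + ce
        r_new /= r_new.sum()                        # keep |r'|_1 = 1
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Pages 0 and 1 point only at each other (a rank sink); the source term e
# stops them from soaking up all the rank flowing in from page 2.
print(pagerank([(0, 1), (1, 0), (2, 0)], n_pages=3))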
Text Data Mining

• Can't wait for complete natural language understanding solution, mapping from text to semantics!
• Some example applications
  – Classify newswire stories, email messages
  – Predict if pre-assigned key phrases apply to a given document
  – Rank candidate key phrases based on features (e.g. frequency, closeness to start of document)
  – Information retrieval
  – Labelling information in text, e.g. names in documents
  – Probabilistic parsing of bibliographic references

Retrieval by Content (RBC)

• Documents: any segment of structured text
• Terms: words, word pairs, phrases
• Represent each document by which terms it contains
• Use TF-IDF weighting (Salton and Buckley, 1988)
• Term frequency:
  f_{ij} = count_{ij} / max_l count_{lj}
  where count_{ij} is the number of occurrences of term i in document j
• Inverse document frequency:
  idf_i = \log(N / n_i)
  where N is the total number of documents and n_i is the number that contain term i
• Weights: w_{ij} = f_{ij} · idf_i; w_j is the vector of w_{ij}'s for document j
• Measure similarity between document and query using cosine distance
  sim(d_j, q) = (w_j · q) / (\sqrt{w_j · w_j} \sqrt{q · q})
• q has 1's for terms in the query, 0's elsewhere
• Evaluation in terms of precision and recall
• Latent Semantic Indexing (LSI): measure similarity in a low-dimensional space found by PCA on the document-term matrix

Relevance Feedback

• If the user knew all relevant documents R and irrelevant documents NR, the ideal query is
  q_opt = (1/|R|) \sum_{j \in R} w_j − (1/|NR|) \sum_{j \in NR} w_j
• Rocchio's algorithm: adjust by labelling a small set of returned documents as R', NR':
  q_new = α q_current + (β/|R'|) \sum_{j \in R'} w_j − (γ/|NR'|) \sum_{j \in NR'} w_j
• Parameters α, β, γ chosen heuristically (a small TF-IDF and Rocchio sketch follows below)
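As a concrete illustration of the TF-IDF weighting, cosine similarity and Rocchio update above, here is a small self-contained sketch. The toy documents, the query and the α, β, γ values are invented for illustration; the slides only say the Rocchio parameters are chosen heuristically.

```python
import math
from collections import Counter

# Toy sketch of TF-IDF weighting, cosine similarity and the Rocchio update;
# documents, query and Rocchio parameters are invented for illustration.

docs = [["data", "mining", "of", "web", "data"],
        ["text", "mining", "and", "retrieval"],
        ["image", "retrieval", "by", "content"]]
query_terms = {"text", "mining"}

vocab = sorted({t for d in docs for t in d})
N = len(docs)
n = {t: sum(t in d for d in docs) for t in vocab}   # n_i: docs containing term i

def tfidf(doc):
    """w_ij = f_ij * idf_i, with f_ij = count_ij / max_l count_lj and idf_i = log(N/n_i)."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return [counts[t] / max_count * math.log(N / n[t]) for t in vocab]

def cosine(w, q):
    dot = sum(wi * qi for wi, qi in zip(w, q))
    norm = math.sqrt(sum(wi * wi for wi in w)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm > 0 else 0.0

def rocchio(q_cur, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + (beta/|R'|) sum_{R'} w_j - (gamma/|NR'|) sum_{NR'} w_j."""
    q_new = [alpha * qi for qi in q_cur]
    for w in rel:
        q_new = [qi + beta * wi / len(rel) for qi, wi in zip(q_new, w)]
    for w in nonrel:
        q_new = [qi - gamma * wi / len(nonrel) for qi, wi in zip(q_new, w)]
    return q_new

q = [1.0 if t in query_terms else 0.0 for t in vocab]   # 1's for query terms
weights = [tfidf(d) for d in docs]
print([round(cosine(w, q), 3) for w in weights])        # rank documents against the query

# Feedback: document 1 marked relevant, document 2 non-relevant.
q = rocchio(q, [weights[1]], [weights[2]])
print([round(cosine(w, q), 3) for w in weights])        # re-ranked after feedback
```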
Mining Image Data

• As with text, can't wait for full AI solution to the image understanding problem
• Example problems
  – Classification of regions/objects (e.g. astronomy)
  – Retrieval
• Example retrieval system: QBIC (IBM), Query by Image Content
• Measure similarity using
  – Global colour vector
  – Colour histogram (sketch at the end of this section)
  – 3-d texture feature vector
  – 20-d shape feature for objects

Paper Presentations

• Further examples of data mining of complex data will be found in the student paper presentations

Automatic Recommender Systems

• Collaborative filtering: how can knowledge of what other people liked/disliked help you make your choice?
• Example domains: movies, groceries
• Data is sparse like/dislike data for each person
• Empirical Analysis of Predictive Algorithms for Collaborative Filtering (Breese, Heckerman and Kadie, 1998) compared
  – Memory-based methods (correlation, vector similarity; sketch at the end of this section)
  – Cluster models
  – Bayes Net models
• Found that Bayes Nets and correlation methods worked best

Time Series and Sequences

• Time series but also other sequences, e.g. DNA, proteins
• Predictive time-series modelling (e.g. financial, environmental modelling)
• Similarity search in sequences (define similarity!)
• Finding frequent episodes (in a window of length lwin) from sequences; uses an APRIORI-style algorithm
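As a concrete illustration of the colour-histogram similarity listed on the Mining Image Data slide, here is a small sketch. The images, the bin count and the use of histogram intersection as the similarity measure are assumptions for illustration; they are not details of the QBIC system.

```python
import numpy as np

# Illustrative sketch of colour-histogram similarity for image retrieval;
# the toy images, bin count and histogram-intersection measure are assumptions.

def colour_histogram(image, bins=8):
    """Normalised joint histogram over R, G, B for an HxWx3 uint8 image."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return np.minimum(h1, h2).sum()

rng = np.random.default_rng(0)
query = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)              # toy query image
database = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(3)]

hq = colour_histogram(query)
scores = [histogram_intersection(hq, colour_histogram(im)) for im in database]
print(np.argsort(scores)[::-1])   # database images ranked by colour similarity
```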
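The slides give no formulas for the memory-based methods, so the following is only a hedged sketch of a correlation-based collaborative filtering predictor in the spirit of Breese, Heckerman and Kadie (1998): the active user's mean vote is adjusted by a Pearson-correlation-weighted sum of the other users' deviations from their own mean votes. The ratings dictionary, user names and item names are invented for illustration.

```python
import math

# Hedged sketch of a memory-based (correlation) collaborative filtering
# predictor; the ratings, users and items below are invented for illustration.

ratings = {
    "ann":  {"m1": 5, "m2": 3, "m3": 4},
    "bob":  {"m1": 4, "m2": 2, "m4": 5},
    "carl": {"m2": 5, "m3": 2, "m4": 1},
}

def mean_vote(user):
    votes = ratings[user].values()
    return sum(votes) / len(votes)

def pearson(a, u):
    """Correlation between users a and u over the items both have voted on."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    ma, mu = mean_vote(a), mean_vote(u)
    num = sum((ratings[a][i] - ma) * (ratings[u][i] - mu) for i in common)
    da = math.sqrt(sum((ratings[a][i] - ma) ** 2 for i in common))
    du = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    return num / (da * du) if da > 0 and du > 0 else 0.0

def predict(active, item):
    """Active user's mean vote plus a correlation-weighted sum of other
    users' deviations from their own means on this item."""
    others = [u for u in ratings if u != active and item in ratings[u]]
    weights = {u: pearson(active, u) for u in others}
    norm = sum(abs(w) for w in weights.values())
    if norm == 0:
        return mean_vote(active)
    deviation = sum(weights[u] * (ratings[u][item] - mean_vote(u)) for u in others)
    return mean_vote(active) + deviation / norm

print(round(predict("ann", "m4"), 2))   # predicted vote for an unseen item
```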
Data Mining Tasks

• Visualizing and Exploring Data (incl Association Rules)
• Data Preprocessing
• Descriptive Modelling
• Predictive Modelling: Classification and Regression

Some Issues in Data Mining

• Mining methodology and user interaction
  – e.g. Incorporation of background knowledge
  – e.g. Handling noise and incomplete data
• Performance and scalability
• Diversity of data types
  – Handling relational and complex types of data
  – Mining information from heterogeneous databases and WWW
• Applications, social impacts

Datamining and KDD

• What is data mining? "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." (Hand, Mannila, Smyth)
• "[Data mining is the] extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases." (Han)
• KDD: Knowledge Discovery in Databases
• "We are drowning in information, but starving for knowledge" (John Naisbett)
• [Figure from Han and Kamber: the KDD process, from databases and flat files through cleaning and integration, the data warehouse, selection and transformation, and data mining, to pattern evaluation and knowledge presentation]