Leman Akoglu
Stony Brook University
http://www.cs.stonybrook.edu/~cse590

• Users write product reviews on many online sites: Yelp, Amazon, TripAdvisor
• Data of the form: <userID, productID, review-text, timestamp, rating>

Tasks:
• How to find fake reviews and reviewers?
• What strange behaviors do fake reviewers have?
• Can you use the network to find anomalies?

Data:
• Amazon: http://liu.cs.uic.edu/download/data/
• Yelp: http://www.yelp.com/academic_dataset

Fall 2014, CSE 590 - Data Mining meets Graph Mining

• IPs communicating with other IPs
• <IP1, IP2, #bytes, protocol, time>
• Simulated data, over ~10 days

Tasks:
• How to find events?
• How to pinpoint culprits?
• How can you explain the anomalies?
• How to model the time series?

Data:
• “Challenge” network (with subtle anomalies)
• Found at: http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/challenge_network.zip
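One possible starting point for the "find events" and "pinpoint culprits" questions is to aggregate records of the form <IP1, IP2, #bytes, protocol, time> into per-source, per-window byte totals and flag windows that deviate strongly from that source's own average. The sketch below is only a baseline under stated assumptions: the function names, the one-hour window, the z-score threshold of 3, and the assumption that time is given in seconds are all illustrative choices, not part of the challenge data specification.

```python
from collections import defaultdict
from statistics import mean, pstdev


def bytes_per_window(records, window=3600):
    """Sum bytes per (source IP, window index).

    records: iterable of (ip1, ip2, nbytes, protocol, time_in_seconds).
    The tuple layout and the seconds unit are assumptions for this sketch.
    """
    totals = defaultdict(int)
    for ip1, _ip2, nbytes, _protocol, t in records:
        totals[(ip1, int(t // window))] += nbytes
    return totals


def flag_windows(totals, threshold=3.0):
    """Return sorted (ip, window) pairs whose volume lies more than
    `threshold` population standard deviations above that IP's mean.
    Windows with no traffic are not represented and so are ignored."""
    per_ip = defaultdict(list)
    for (ip, w), b in totals.items():
        per_ip[ip].append((w, b))
    flagged = []
    for ip, series in per_ip.items():
        values = [b for _, b in series]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # constant traffic: nothing stands out
        flagged += [(ip, w) for w, b in series if (b - mu) / sigma > threshold]
    return sorted(flagged)
```

A z-score per source is easy to explain, but it will likely miss the "subtle" anomalies mentioned above; it is meant as a first pass before modeling the time series properly.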

• Consider a large news corpus over time, like all USA Today articles over many years, or an opinion platform like Twitter/blogs/forums

Tasks:
• How can we find sentiment (+/-) associated with locations (e.g. Pittsburgh), people (e.g. Obama), and organizations (e.g. IBM)?
  - Exploit a senti-graph (bipartite): nodes-1: words, nodes-2: entities
  - Exploit sentiment associated with words (e.g. bankrupt, success, etc.)
• How does sentiment change over time?

Data:
• Collect your own data. Newspapers sell their database for several hundred dollars.
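The word-entity senti-graph can be sketched as a bipartite co-occurrence count, with an entity's score taken as the count-weighted average polarity of the sentiment words it co-occurs with. Everything named here is an assumption made for illustration: the tiny lexicon, whitespace tokenization, sentence-level co-occurrence, and single-token entity names. A real project would need a proper lexicon and entity recognizer.

```python
from collections import defaultdict


def build_senti_graph(sentences, entities, lexicon):
    """Build bipartite edges: edges[entity][word] = co-occurrence count.

    Only words present in `lexicon` become word nodes. Entities are
    matched as single lowercase tokens (a simplifying assumption).
    """
    edges = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        present = [e for e in entities if e.lower() in tokens]
        for entity in present:
            for word in tokens:
                if word in lexicon:
                    edges[entity][word] += 1
    return edges


def entity_sentiment(edges, lexicon):
    """Score each entity by the weighted mean polarity of its word neighbors."""
    scores = {}
    for entity, words in edges.items():
        total = sum(words.values())
        scores[entity] = sum(lexicon[w] * c for w, c in words.items()) / total
    return scores


# Hypothetical toy lexicon using the example words from the slide.
toy_lexicon = {"bankrupt": -1.0, "success": 1.0}
```

Bucketing the sentences by date before building the graph would give one score per entity per period, which is one way to approach the "change over time" question.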

• Consider Q&A sites where people ask and/or answer questions. Some sites are focused: e.g. MathOverflow, StackOverflow

Tasks:
• How to automatically identify experts?
• How to detect whether a user is about to leave?
• How to estimate quality of answers/questions?
• How to estimate the response times to questions?

Data:
• StackOverflow data available online: http://blog.stackoverflow.com/category/cc-wiki-dump/

• Consider the time series of Internet trends, like memes, stock prices, or #online searches.

Tasks:
• What do these time series look like? How to spot change-points?
• Can we characterize the time series (e.g. shape, distribution) so as to differentiate/classify ‘rumor’-based trends from ‘serious’ trends/topics?
  - e.g., searches on celebrities vs home sale prices?

Data:
• Google trends data available to download: http://www.google.com/trends/explore#cmpt=q
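As a first take on spotting change-points, one simple option is a least-squares split: try every cut position and keep the one where two constant-mean segments fit the series best. This is only one of many change-point methods and is not necessarily what the course intends; the function names are hypothetical, and the sketch finds a single change-point in the mean only.

```python
def sse(values):
    """Sum of squared deviations of `values` from their mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)


def best_single_changepoint(series):
    """Return the index k (1 <= k < n) such that fitting separate means to
    series[:k] and series[k:] gives the smallest total squared error.
    Quadratic time; fine for short series, too slow for very long ones."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    best_k, best_cost = None, float("inf")
    for k in range(1, n):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# A level shift after the fourth point is found at index 4.
print(best_single_changepoint([1, 1, 1, 1, 5, 5, 5, 5]))  # prints 4
```

Applying the same split recursively to each segment is a common way to extend this to several change-points, at the cost of needing a stopping rule.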

• Consider online platforms where people share ‘stuff’, e.g. pictures, links, videos, such as Reddit
• Several questions one can ask:

Tasks:
• How do upvoted posts differ from downvoted ones? (control for the same shared link)
• What makes a user more engaged with such sites? (regular vs sporadic users)
• How to characterize the life-span of a post?

Data:
• Collect your own data: http://www.reddit.com/

• Consider very special-topic online sites for specific groups of people, e.g. people interested in gardening, wine, etc., chess players, moms, etc.

Tasks:
• What topics are being talked about?
• How often do they ask questions? What are they about? What types of questions are discussed most? (type: recommendation, opinion, etc.)
• What are the most mentioned feelings/reasons/words for specific concepts like ‘opening’, ‘divorce’, or ‘wine storage’?

Data:
• Collect your own data: http://www.youbemom.com/forum/all

• Brain networks of 114 human subjects
  - Nodes: brain regions
  - Edges: connection strengths (weighted)
• Small graphs: 70x70 nodes (regions)
• Big graphs: ~2M nodes

Tasks:
• Classify human:
  1. high math vs normal
  2. creative vs normal
  3. male vs female
  etc.
  - Using/finding discriminative patterns

Data: http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/brainnetworks.rar

• Consider a political forum where users discuss several issues:
  - e.g.: abortion, creation, gay rights, guns, healthcare, re-election of Obama
• Opinions: “in-favor” or “opposed” (signed edges)

Tasks:
• Given a user u and a new issue i, predict u’s opinion on i (use network)
• Anomalies? Spam? Conflicts?

Data: Can crawl
• http://www.politicalforum.com/forum.php
• http://www.politicsforum.org/forum/
• Wikipedia?

• Consider a who-trusts-whom network
  - Users decide whether to “trust” or “not-trust” each other. (signed edges)

Task:
• Given a user i and a user j, predict whether i trusts j and vice versa (use network)

Data: Epinions.com at http://snap.stanford.edu/data/soc-sign-epinions.html

• Consider a who-follows-whom Twitter network

Tasks:
• Given a user i and a user j, predict whether i and j follow each other (use network)
• Find community structure
  - Measure quality of communities (conductance, modularity)
  - How dense are they? Are they well separated?
  - What size are they? Communities-within-communities?

Data: http://an.kaist.ac.kr/traces/WWW2010.html
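The two quality measures named above have standard definitions that are short enough to compute directly. For an undirected graph with m edges, the conductance of a node set S is cut(S, V\S) / min(vol(S), vol(V\S)), and the modularity of a partition is the sum over communities c of (L_c / m) - (d_c / 2m)^2, where L_c is the number of edges inside c and d_c is the total degree of c. The sketch below assumes the follow graph has been turned into an undirected, unweighted edge list without duplicates; that conversion and the function names are choices made for this illustration.

```python
def conductance(edges, community):
    """Conductance of `community` in an undirected edge list:
    (edges leaving the set) / min(volume inside, volume outside)."""
    s = set(community)
    cut = vol_s = vol_rest = 0
    for u, v in edges:
        in_u, in_v = u in s, v in s
        vol_s += in_u + in_v                 # edge endpoints inside the set
        vol_rest += (not in_u) + (not in_v)  # edge endpoints outside the set
        cut += in_u != in_v                  # edge crosses the boundary
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 0.0


def modularity(edges, communities):
    """Newman modularity of a partition (list of node sets covering all nodes)."""
    m = len(edges)
    label = {node: idx for idx, c in enumerate(communities) for node in c}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Two triangles joined by a single bridge edge (3, 4).
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(conductance(toy_edges, {1, 2, 3}))             # 1/7
print(modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}]))  # 5/14
```

Low conductance and high modularity both indicate well-separated communities; on the toy graph each triangle has one outgoing edge against a volume of 7.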

• Available from: https://nycopendata.socrata.com/browse
• Types of data include
  - Electric consumption by zipcode
  - Emergency (911) or community-concern (311) calls by zipcode
  - Restaurant inspections
  - Noise complaints by zipcode
  - …
• Tasks:
  - Find anomalies/fraud/events
  - Summarize the data and visualize

• Given photos and tags of what people are wearing (temporal data)
  - Find trends (association rules: what is being worn with what)
  - How do these trends change over time, if at all?
  - What determines the #likes of a photo? (e.g., content, popularity, #friends)
• Data:
  - http://www.cs.stonybrook.edu/~leman/courses/14CSE590/data/chictopia.tar
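For the "what is being worn with what" question, the simplest association rules are pairwise rules A -> B scored by support (fraction of photos containing both tags) and confidence (fraction of photos with A that also contain B). The sketch below is a brute-force version limited to pairs, assuming each photo is represented as a collection of tag strings; the function name, thresholds, and example tags are hypothetical and do not reflect the actual format of the dataset.

```python
from collections import Counter
from itertools import permutations


def pair_rules(photos, min_support=0.5, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples
    for all ordered tag pairs meeting both thresholds."""
    n = len(photos)
    item_count = Counter()
    pair_count = Counter()
    for tags in photos:
        unique = set(tags)  # count each tag at most once per photo
        item_count.update(unique)
        pair_count.update(permutations(sorted(unique), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        confidence = count / item_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical tags for four photos.
toy_photos = [
    {"jeans", "boots"},
    {"jeans", "boots", "scarf"},
    {"jeans", "sneakers"},
    {"dress", "boots"},
]
```

Running the same function on photos grouped by month or season is one way to see whether the rules change over time. For larger itemsets an Apriori-style or FP-growth implementation would be the usual choice.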

• KDD is the premier data mining conference
• Every year there is a competition: http://www.sigkdd.org/kddcup/index.php
• KDD-Cup 2010 - Student performance evaluation
  KDD-Cup 2009 - Customer relationship prediction
  KDD-Cup 2008 - Breast cancer
  KDD-Cup 2007 - Consumer recommendations
  KDD-Cup 2006 - Pulmonary embolisms detection from image data
  ...
• Similarly check out Kaggle: http://www.kaggle.com/

• MapReduce (MR): Distributed compute environment
• Hadoop: open-source version of MR
  - Can install on local machine: http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf

Tasks:
• How to partition a very large graph?
  Goal:
  - Many within-partition edges, few cross-edges
• How to find single-source shortest paths?
  - Given node i, find all shortest paths to other nodes
  - For weighted, directed graphs
  - Modification: with upper-bound on shortest path distance
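Before attempting a MapReduce version, it can help to have a single-machine reference to check results against. The sketch below uses Dijkstra's algorithm with a distance cutoff, which covers the weighted, directed case and the upper-bound modification. It is not a MapReduce implementation; it assumes non-negative edge weights and an adjacency-list dictionary, and it returns shortest distances rather than the paths themselves. The function name and graph format are illustrative.

```python
import heapq


def bounded_sssp(graph, source, max_dist=float("inf")):
    """Shortest distances from `source` to every node reachable within
    `max_dist`, for a directed graph given as {u: [(v, weight), ...]}.
    Assumes non-negative weights (required by Dijkstra's algorithm)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for v, w in graph.get(u, []):
            nd = d + w
            # Skip anything beyond the bound or no better than what we have.
            if nd <= max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


toy_graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
}
print(bounded_sssp(toy_graph, "a"))              # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
print(bounded_sssp(toy_graph, "a", max_dist=3))  # {'a': 0, 'b': 1, 'c': 3}
```

A distributed version typically replaces the priority queue with rounds of parallel edge relaxation (Bellman-Ford style), where the bound also limits how many rounds are needed.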

• http://snap.stanford.edu/class/cs224w-2010/datasetsInfo.html
• http://www.stanford.edu/class/cs341/data.html
• http://snap.stanford.edu/data/

Don’t feel limited by these ideas/datasets.
You can come up with your own ideas and collect interesting datasets :)
