Wrapping It Up Jilles Vreeken 31 July 2015
What did we do?
Introduction, Patterns, Correlation and Causation, (Subjective) Interestingness, Graphs, Wrap-up + <ask-me-anything>
Take Home: Overall
Overview of the hot topics in data mining that Jilles thinks are cool: a strongly biased sample, by interest and available time. I wanted to give a general picture of what data mining is, what makes it special, and what’s currently happening at the edge of human knowledge.
Key Take-Home Message
Data mining is descriptive, not predictive. The goal is to give you insight into your data and to offer (parts of) candidate hypotheses. What you do with those is up to you.
Take Home: Information Theory
Exploratory data analysis: wandering around your data, looking for interesting things, without being asked questions you cannot know the answer to. Questions like: What distribution should we assume? How many clusters/factors/patterns do you want? Please parameterize this Bayesian network?
Take Home: Patterns
Pattern mining aims to provide simple descriptions of the structures that your data exhibits locally. Mining patterns is easy. Mining interesting patterns that are significant, and doing so without redundancy, not so much.
Take Home: Information Theory
Information Theory is a branch of statistics concerned with measuring information. Information = reduction of uncertainty. Uncertainty can be quantified in bits. Everything new you learn about your data allows you to compress it better.
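As a small illustration of "uncertainty can be quantified in bits", the sketch below computes the empirical Shannon entropy of a discrete sample and the conditional entropy after observing a second variable. The function names `entropy_bits` and `conditional_entropy_bits` are hypothetical helpers chosen for this example, and the plug-in (frequency-count) estimator is an assumption, not necessarily the estimator used in the lectures.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Empirical Shannon entropy H(X), in bits, of a sequence of discrete values."""
    values = list(values)
    n = len(values)
    counts = Counter(values)
    # H(X) = -sum_x p(x) * log2 p(x), with p(x) estimated by relative frequency.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy_bits(xs, ys):
    """Empirical H(X | Y) = H(X, Y) - H(Y), in bits."""
    return entropy_bits(zip(xs, ys)) - entropy_bits(ys)


# A fair coin carries one bit of uncertainty per observation.
print(entropy_bits("aabb"))
# If Y determines X completely, learning Y removes all uncertainty about X.
print(conditional_entropy_bits([0, 0, 1, 1], ["n", "n", "y", "y"]))
```

The drop from H(X) to H(X | Y) is the information gained, which is also the number of bits per symbol an ideal compressor could save by knowing Y.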
Take Home: Correlations
Correlations can be spurious and deceiving. Mutual information is a strong notion of interaction. Based on Shannon entropy, MI is hard to compute for continuous-valued data without making assumptions on the distribution. Based on cumulative entropy, MI can detect non-linear correlations without requiring assumptions.
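To make the contrast between linear correlation and mutual information concrete, here is a minimal sketch for discrete data using Shannon entropy. It is not the cumulative-entropy variant mentioned above; the helper names (`entropy_bits`, `mutual_information_bits`, `pearson`) and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Empirical Shannon entropy, in bits, of a sequence of discrete values."""
    values = list(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information_bits(xs, ys):
    """Empirical I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(zip(xs, ys))


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy data: y is a deterministic but non-linear (quadratic) function of x.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

print("Pearson correlation:", pearson(xs, ys))
print("Mutual information (bits):", mutual_information_bits(xs, ys))
```

On this toy sample the linear correlation vanishes although y is fully determined by x, while the mutual information is clearly positive.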
Take Home: Causation
Causality is a difficult concept. Standard probabilistic approaches based on likelihood cannot detect causal direction between pairs. Additive noise models and information-theoretic measures can. Oh, and storks cause babies.
Take Home: Interestingness
Interestingness is ultimately subjective. Still, to have algorithms that can find potentially interesting things, we somehow need to formalize it.
Take Home: Graph Mining
Most graph mining approaches are global and predictive: ‘Explain everything in one go’. Real graphs are too complex for that. Taking a local and descriptive approach allows for more detailed results, richer problems, easier formalization, and efficient solutions. Very little has been done so far, so there are many cool open problems.
Take Home: Dynamic Data
Data is rarely static, even though many algorithms expect that. Streaming algorithms work when data is too big to fit anywhere, while dynamic algorithms aim to adjust the answer with the changing data.
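As one simple instance of the streaming idea, the sketch below maintains a running mean and variance in a single pass with constant memory, so the data never has to be stored. It uses Welford's online algorithm, which is swapped in here purely as an illustration and was not necessarily covered in the course; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the summary; the value itself is not kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than two values."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

The summary is updated per arriving value, which also hints at the dynamic setting: the answer is adjusted incrementally rather than recomputed from scratch.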
Take Home: Assignments
“What the hell was he thinking??” I wanted you to learn to read scientific papers without getting lost in details, quickly forming high-level pictures of complex ideas; to read critically, seeing through scientific sales pitches; and to show independent thinking, making ideas your own. I was not disappointed.
Take Home: TADA
Data analysis is important and upcoming, but still very young. It aims to tackle impossible problems, such as finding interesting things in enormous search spaces. It is a weird mix of theory and practice: it likes to be foundational, yet is not afraid of ad hoc. And, not unimportant, it’s lots of fun.
Exam dates
The Exam
type: oral
when: August 3rd and 4th
time: individual
where: E1.7 room 3.01
what: all material discussed in the lectures, plus one assignment (your choice) per topic
The Re-Exam
type: oral
when: September 28th
time: individual
where: E1.7 room 3.01
Evaluation: I did not like
“Class should end in time :)”
“The amount of time necessary for every assignment.”
Evaluation: Suggestions
“More motivated slide” More details on the why?
“Bit heavy course for 5 ECTS” Yes.
“More practical follow-up to implement/text ideas” Maybe…
“Discuss assignments the day it is brought online” We can do that.
Things to do
Master thesis projects
in principle: yes!
in practice: depending on background, motivation, interests, and grades, plus on whether I have time
interested? mail me
Student Research Assistant (HiWi) positions
in principle: maybe…
in practice: depends on background, grades, and in particular your motivation and interests
interested? mail me, include CV and grades
Sample Topics
Causality
- did X cause Y?
- mining causal graphs
- what’s the cause of this?
- predicting the future
Graphs
- characterising viruses
- realistic graph generators
- interesting subgraphs
- comparing graphs
Useful Patterns
- tell me… about this
- privacy & data generation
- pattern-based indexing
- noise reduction
Rich Data & Text
- pattern-based topic models
- grammar & compression
- rich MaxEnt modelling
- outliers in rich data
Good Reads
- Data Analysis: a Bayesian Tutorial, D.S. Sivia & J. Skilling (very good, but skip the MaxEnt stuff)
- Elements of Information Theory, Thomas Cover & Joy Thomas (very good textbook)
- The Information, James Gleick (great light reading)
Teach us More!
Well, ok… let me advertise Information Retrieval and Data Mining, together with Gerhard Weikum (Core Lecture, 9 ECTS).
In addition, Hoang Vu and Mario will likely teach one or two courses next semester. Options include:
- Causal Inference (seminar+lectures)
- Mining High Dimensional Data (seminar+lectures)
- Mining (Correlated) Patterns (seminar+lectures)
Question Time!
Privacy & Data Mining “What is your opinion on privacy preserving data mining? Have you ever worked with it? Do you think it is useful, or does it somehow contradicts 'the spirit' of data mining?”
Text Mining “Have you ever worked with text mining? Do you think considering grammar is necessary, or is mere statistics enough?”
Big Data “Does Big Data exist?” “How big is Big Data?” “When is the data Big enough? Is more data always better?”
Mining Massive Data
MapReduce, Hadoop, Bigtable, Cassandra, Spark, Dremel, etc., etc. Engineering or science? Essentially tricks, not magic, that work well for certain specific problems. For KDD 2014, at least 25 out of 150 presentations will be specifically aimed at ‘large scale’ stuff.
Mining the Cloud “How about data analytics in the cloud?”
Social Network Analysis
Many, many, many papers about social network analysis. So far: lots of statistics, not much ‘mining’. That is, most are about how to model a graph probabilistically, or how to fit a given distribution. The Elephant in the Room: what is the ‘graph’ distribution? Nobody knows. Yet.
Your Question Here!
Conclusions
This concludes TADA’15. I hope you enjoyed the ride.
Thank you!