TADA! Topics Algorithmic Data Analysis Jilles Vreeken 24 April 2015
Question of the Course What are the hot topics in data mining that are cool*? * and important to know
Question of the Course How can we extract novel knowledge and insight from large data?
Organization This is an advanced lecture, with lectures, and reading, and assignments. Beware! This lecture will be well-worth its 5 ECTS
I’m not afraid! You will be, you will be. Yes… I’m not afraid. You will be. You will be.
Organization This is an advanced lecture, with lectures, and reading, and assignments. Beware! This lecture will be well-worth its 5 ECTS: a lot of reading, a lot of thinking; it’ll take quite some effort, but you’ll learn a lot
Reading Materials We’ll mainly consider scientific articles. All will be available on the website: directly accessible from the MPI network, or using a login/password that you can get by email
Lectures Meetings that cover the basic topics. Format: ‘sit, listen, shut up interact’. Required reading is announced on the website; read at your own convenience but, strongly preferred, before the lecture
Exam Type: tba, most likely oral. Day and place: tba, most likely in early August. Grading: final grade will be based on final exam and assignments
Assignments: general 4 assignments. Grading scale: fail, pass, excellent. You may fail on one assignment; two fails and you fail the course. Every excellent gives 1/3 bonus point on final exam grade, with a maximum of 1 full point. You must pass the final exam to pass the course
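The assignment rules above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not an official grading tool: the function name `assignment_outcome` and its return shape are hypothetical, and how the bonus is applied to the exam grade is not specified on the slide, so the sketch stops at computing the bonus.

```python
from fractions import Fraction

VALID_RESULTS = {"fail", "pass", "excellent"}


def assignment_outcome(results):
    """Return (still_eligible, bonus) for a list of assignment results.

    Hypothetical helper; it encodes only the rules stated on the slide:
    - two or more fails means failing the course,
    - each excellent is worth 1/3 bonus point, capped at 1 full point.
    Passing the final exam is a separate requirement not modelled here.
    """
    for result in results:
        if result not in VALID_RESULTS:
            raise ValueError(f"unknown result: {result!r}")

    fails = results.count("fail")
    excellents = results.count("excellent")

    still_eligible = fails < 2
    bonus = min(Fraction(excellents, 3), Fraction(1))
    return still_eligible, bonus


if __name__ == "__main__":
    eligible, bonus = assignment_outcome(["excellent", "pass", "fail", "excellent"])
    print(eligible, bonus)
```

Using `Fraction` keeps the thirds exact, so three or four excellents both come out at exactly one full point.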
Assignments: requirements To be written in proper academic-style English. Use proper citations: you are given sources, you are encouraged to find additional sources, all sources must be mentioned. Plagiarism: instant fail (at best)
Assignments: format Return assignment reports as PDF files by email: no .doc(x), .odt, .rtf, .txt, .xml, .html, .pages, .ps, .eps, .etc. No page limit! Probably most will need 3 to 5 pages; more is not necessarily better. Reports must clearly state on the first page: name, matriculation number, email address, and topic
Assignments: returning Assignment reports are to be returned by email to tada@mpi-inf.mpg.de. Deadline is at 1400 hours on the stated day. NO delays, no excuses; time is based on the mail time stamp. Submissions that I receive before the deadline day I will ACK
Assignments: grading Assignments are not for repeating what papers say; perhaps surprisingly, I have already read the papers. You are expected to critically discuss the sources, build connections, point out differences, provide new insights, etc. Some assignments are marked as hard: this is because they are, and this will be taken into account when grading
News & Updates Urgent and personal messages by email everything else via the website
Question of the Course How can we extract novel knowledge and insight from large data?
1st Paradigm: Empirical Science For thousands of years, science was empirical: describing natural phenomena
2nd Paradigm: Theoretical Science The last few hundred years, science was theoretical: used models, generalizations, made predictions
3rd Paradigm: Computational Science The last decades, science was computational: complex models simulating complex phenomena
4th Paradigm: Data-Intensive Science Interesting phenomena are too complex to come up with good hypotheses. We need to unify theory, experimentation, and simulation: capture data, mine hypotheses, inspect and evaluate, generate extra data to select the best ones, iterate. An iterative procedure between world and model, with the scientist in the middle
Power laws
Shopping Data Which products are often bought together?
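As a rough illustration of the question on this slide, the sketch below counts how often pairs of products occur in the same basket and keeps the pairs that reach a minimum count. The baskets are made-up toy data, and `frequent_pairs` and `min_support` are hypothetical names; this is plain pair counting under those assumptions, not a method taken from the lecture.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count co-occurring product pairs and keep those seen at least min_support times."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Toy baskets, invented purely for illustration.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["beer", "chips"],
    ["bread", "butter", "chips"],
]

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Counting all pairs grows quadratically with basket size, so this only suits small examples; it is meant to make the question concrete rather than to scale.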
Train Delays Which trains are delayed because of other trains?
Drug Discovery What part of the molecule makes the drug work?
More patterns than you can shake a stick at
Pattern-based Modelling [Figure: summary of JMLR abstract database; labels: Mining Algorithm; stemmed terms shown: support vector machin svm, associ rule mine, nearest neighbor, frequent itemset mine, naïv bay, linear discrimin analysi lda, cluster high dimension, state art, frequent pattern mine algorithm, synthet real]
Summarising Which sales characterise your customers?
Summarising
Jilles Vreeken’s Professional Network as of April 21, 2015
Google Flu
Quite Healthy
Patient Deceased
Big Data, Bigger Data, Biggest Data
No model is perfect
Science has lots of data, not the tools to analyse it
Social Science & the Web
Astronomy Sloan Sky Survey: 100TB between 2000 and 2008; 1 billion objects: 260M galaxies, 260M stars. Non-trivial analysis: currently impossible
With Your Help! Maybe!
Recommend
More recommend