
The Hitch-Hackers Guide to Data Science ... or what I wish I’d known



  1. Science Data Acquisition Machines ToolBox Conclusion The Hitch-Hackers Guide to Data Science ... or what I wish I’d known when I was younger Jaroslav Vážný Masaryk University / Astronomical Institute / Gauss Algorithmic / 4comfort.cz 3 April 2014 Jaroslav Vážný Practical approach

  2. Outline: 1 Science, 2 Data Acquisition, 3 Machines, 4 ToolBox, 5 Conclusion

  3. What is Science? "The whole of science is nothing more than a refinement of everyday thinking." (Albert Einstein)

  4. More than Science: Mistakes/Feedback. No pain no gain. Pain == gain? Everything is hard until someone makes it easy.

  5. MOOC == new era? https://www.khanacademy.org/ https://www.coursera.org/ https://www.udacity.com/ https://www.edx.org/

  6. Reproducibility: http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://nbviewer.ipython.org/ http://pdos.csail.mit.edu/scigen/ ;-)

  7. We are all humans

  8. We are all humans/animals

  9. We are all humans/animals/idiots

  10. Probability: Test your intuition! Roll a die: five times you got a 6. What is P(6)? The Monty Hall problem. Show examples in IPython!
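The Monty Hall problem named on this slide can be checked by simulation. The sketch below is not taken from the talk; it assumes the standard rules (three doors, one car, the host always opens a door that hides no car and was not picked), and the function names `play` and `win_rate` are made up for illustration.

```python
import random


def play(switch, rng):
    """Play one round; return True if the final pick hides the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is still closed and not yet picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the winning probability over many simulated rounds."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("stay  :", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Under these rules the estimate for switching should settle near 2/3 and for staying near 1/3, which is the result most people's intuition gets wrong.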

  11. Bayes’s theorem: Suppose the probability (for anyone) of having AIDS is P(AIDS) = 0.001, so P(no AIDS) = 0.999. Consider an AIDS test whose result is + or -: P(+|AIDS) = 0.98, P(-|AIDS) = 0.02, P(+|no AIDS) = 0.03, P(-|no AIDS) = 0.97.

  12. Bayes’s theorem solution:
      $$P(\mathrm{AIDS} \mid +) = \frac{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS})}{P(+ \mid \mathrm{AIDS})\,P(\mathrm{AIDS}) + P(+ \mid \mathrm{no\ AIDS})\,P(\mathrm{no\ AIDS})} = \frac{0.98 \times 0.001}{0.98 \times 0.001 + 0.03 \times 0.999} \approx 0.032$$
      Your viewpoint: my degree of belief that I have AIDS is 3.2%. Your doctor’s viewpoint: 3.2% of people like this will have AIDS.
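The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using only the numbers given on the slides; the function name `posterior` and its parameter names are illustrative, not from the talk.

```python
def posterior(prior, p_pos_given_sick, p_pos_given_healthy):
    """Bayes's theorem: P(sick | +) from the prior and the two test rates."""
    # Total probability of a positive result (true positives + false positives).
    evidence = p_pos_given_sick * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_sick * prior / evidence


# P(AIDS) = 0.001, P(+|AIDS) = 0.98, P(+|no AIDS) = 0.03
p = posterior(prior=0.001, p_pos_given_sick=0.98, p_pos_given_healthy=0.03)
print(f"P(AIDS | +) = {p:.3f}")  # prints 0.032
```

The posterior stays small because the false positives from the large healthy population (0.03 × 0.999) far outnumber the true positives (0.98 × 0.001).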

  13. We are all humans/animals/idiots/liars

  14. Data Avalanche? Large Synoptic Survey Telescope: 20 TB per night; 60 PB for the raw data (after 10 years); 15 PB for the catalog database; the total data volume after processing will be several hundred PB. CERN: 1 PB per day.

  15. Sloan Digital Sky Survey. Why is it important? Lots of data (>10^6 objects); perfect documentation; tools to access the data. Where can I learn it? http://www.sdss3.org/

  16. Virtual Observatory. Why is it important? Uniform access to astronomy data; based on Web standards; many tools with VO support (Topcat, Aladin, Tapsh). Where can I learn it? http://physics.muni.cz/~vazny/wiki/index.php/Diploma_work

  17. What is Machine Learning? (Data astrology) / Data Mining / Artificial Intelligence

  18. Supervised Machine Learning (diagram). Training: text, documents, images, etc. are turned into feature vectors, which together with labels are fed to a machine learning algorithm to produce a predictive model. Prediction: a new text, document, image, etc. is turned into a feature vector, and the predictive model returns the expected label.
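The train-then-predict flow in this diagram can be sketched with a deliberately tiny classifier. The slides do not say which algorithm was used; the nearest-centroid rule below is a stand-in chosen because it fits in a few lines of standard-library Python, and the feature values and the labels "blue" and "red" are invented for illustration.

```python
from collections import defaultdict


def fit(feature_vectors, labels):
    """Training step: the 'model' is the mean feature vector of each label."""
    groups = defaultdict(list)
    for vector, label in zip(feature_vectors, labels):
        groups[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }


def predict(model, vector):
    """Prediction step: return the label whose centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up training set: two features per object, one label per object.
X_train = [(0.2, 0.1), (0.3, 0.2), (1.1, 0.9), (1.0, 1.2)]
y_train = ["blue", "blue", "red", "red"]

model = fit(X_train, y_train)
print(predict(model, (0.25, 0.15)))  # a new, unlabeled feature vector
print(predict(model, (1.2, 1.0)))
```

In practice a library such as scikit-learn, which the deck lists under data exploration, would replace `fit` and `predict`, but the shape of the workflow is the same: labeled feature vectors in, a model out, then labels for new feature vectors.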

  19. Overfit/underfit
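The slide itself only carries a figure, so the following is an illustrative sketch rather than the author's example. It uses k-nearest-neighbour regression on synthetic noisy sine data (both are assumptions, as are all names) to show the two failure modes: k = 1 memorises the training set, while k equal to the training-set size predicts the same average everywhere.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model on the given data set."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def sample(n, rng):
    """Draw n noisy observations of sin(x) on [0, 2*pi] (synthetic data)."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 2 * math.pi)
        points.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return points


rng = random.Random(0)
train, test = sample(40, rng), sample(200, rng)

for k in (1, 5, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):.3f}  "
          f"test MSE={mse(train, test, k):.3f}")
```

With k = 1 the training error is exactly zero because every point is its own nearest neighbour, which says nothing about error on new data (overfit). With k = 40 the model ignores x entirely (underfit). Comparing training and test error across k is one simple way to look for a setting in between.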

  20. Unsupervised Machine Learning (diagram). Training: text, documents, images, etc. are turned into feature vectors and fed to a machine learning algorithm, without labels, to produce a predictive model. Prediction: a new text, document, image, etc. is turned into a feature vector, and the model returns a likelihood, a cluster ID, or a better representation.
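To make the "cluster ID" output concrete, here is a minimal one-dimensional k-means sketch. The slides do not name a specific algorithm, so k-means is an assumption, as are the function name `kmeans_1d`, the simple evenly spaced initialisation (which needs k >= 2), and the made-up feature values.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values into k groups; return (labels, centroids)."""
    ordered = sorted(values)
    # Start with centroids spread evenly over the sorted data (requires k >= 2).
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value gets the ID of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # nothing moved, so the clustering is stable
            break
        centroids = updated
    return labels, centroids


# Made-up single feature (for example one colour index) for six objects.
feature = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
labels, centroids = kmeans_1d(feature, k=2)
print(labels)
print(centroids)
```

No labels are supplied anywhere; the algorithm only groups objects whose feature values lie close together, and it is up to the analyst to decide what each cluster means.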

  21. Star spectrum

  22. Example of feature extraction

  23. Example: Decision Tree
      ug <= 0.663668
      |   gr <= -0.191208: 1 (7.0)
      |   gr > -0.191208: 3 (104.0/5.0)
      ug > 0.663668
      |   ri <= 0.285854: 1 (88.0/5.0)
      |   ri > 0.285854
      |   |   ri <= 0.314657
      |   |   |   gr <= 0.692108: 2 (6.0)
      |   |   |   gr > 0.692108: 1 (3.0)
      |   |   ri > 0.314657: 2 (90.0/2.0)
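A learned decision tree is just a chain of threshold tests, so the printout on this slide can be transcribed directly into code. The thresholds and class numbers below are copied from the slide; the function name `classify` is made up, and the reading of `ug`, `gr` and `ri` as colour-index features is an assumption, since the slide does not define them or say what classes 1, 2 and 3 stand for.

```python
def classify(ug, gr, ri):
    """Walk the tree from the slide and return the predicted class (1, 2 or 3)."""
    if ug <= 0.663668:
        # Left branch: the decision depends only on gr.
        return 1 if gr <= -0.191208 else 3
    # Right branch (ug > 0.663668): split on ri first.
    if ri <= 0.285854:
        return 1
    if ri <= 0.314657:
        # Narrow ri band: gr decides between classes 2 and 1.
        return 2 if gr <= 0.692108 else 1
    return 2


# Hypothetical feature values, chosen only to exercise different leaves.
print(classify(ug=0.5, gr=0.0, ri=0.1))
print(classify(ug=1.0, gr=0.5, ri=0.4))
```

Being able to read the model as plain if/else rules is the main practical appeal of decision trees compared with less transparent classifiers.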

  24. Example: Support Vector Machine

  25. Data exploration: http://ipython.org/ http://scikit-learn.org/stable/ http://pandas.pydata.org/

  26. Development: https://github.com/ Tests. Funny hat. https://www.python.org/

  27. References: http://ipython.org/ http://www.greenteapress.com/thinkstats/ http://www.greenteapress.com/thinkpython/ http://scikit-learn.org/stable/ http://pandas.pydata.org/ http://jakevdp.github.io/blog/2013/10/26/big-data-brain-drain/ http://www.galaxyzoo.org/ http://www.planethunters.org/ http://www.sdss3.org/

  28. Discussion
