Analyzing 750 billion events and 46 TB of code: What you can learn from GitHub's shared data on BigQuery


  1. Analyzing 750 billion events and 46 TB of code: What you can learn from GitHub's shared data on BigQuery. Felipe Hoffa, Developer Advocate, @felipehoffa

  7. DATA

  8. Who wants to analyze GitHub?

  9. Project maintainers
     - Popularity
       - Who and how?
     - Change management:
       - New APIs?
       - Breaking changes?
     - Is my project healthy?
       - Issues closed on time?
       - Community participation?

  10. Project users
      - What other projects to follow?
      - Requesting features
        - Data-based requests
        - Effective phrasing

  11. Project choosers
      - Is this project popular?
      - Is this project healthy?
      - Is this project well adopted?
      - Related projects?

  12. Data lovers
      - Data integrators
      - You
      - Me :)

  13. 3 main datasets:
      - GitHub Archive
        - 8.7 billion events
        - Hourly updates
      - GHTorrent
        - These events annotated
        - Real-time updates
      - GitHub repos on BigQuery
        - 46 TB of code

  14. Google BigQuery

  15. Google BigQuery
      • Fast: terabytes in seconds
      • Simple: SQL
      • Scalable: From bytes to petabytes
      • No CAPEX: Always on
      • Interoperable: Tableau, R, Python...
      • Instant sharing
      • Free monthly quota

  16. Top projects by stars 2016?

  18. Really?

  20. I got stars! What else did they star?

  22. How did they find me? Hacker News?

  25. Project health
      - Projects with most issues
      - Projects with most people filing issues
      - Projects with most engagement
      - Best projects at closing issues
      - Best phrasing for issue closing

  28. Even text analysis?

  30. So where's the code?

  35. Rules to analyze [bigquery-public-data:github_repos.contents]
      • Text files <1MB
      • One copy of each unique file
      • JOIN with [github_repos.files] for paths
      • Don't JOIN with [github_repos.files] to get contents*path.
      • Extract first, analyze later
      • [github_repos.sample_contents] -> 10% of contents, top projects, 1 sample path.
      • Only open source projects - https://developer.github.com/v3/licenses/
      • Some projects missing - why?

  36. Top Java imports growth, 2013-16

  38. Requesting a feature for Go

  41. Beyond regex: static code analysis with UDFs

  46. Spaces vs Tabs - GitHub on BigQuery edition. The rules:
      • Data source: GitHub files stored in BigQuery.
      • Stars matter: We’ll only consider the top 400,000 repositories, ranked by the number of stars they got on GitHub during the period Jan-May 2016.
      • No small files: Files need to have at least 10 lines that start with a space or a tab.
      • No duplicates: Duplicate files only have one vote, regardless of how many repos they live in.
      • One vote per file: Some files use a mix of spaces and tabs. We’ll count each file on the side of the method it uses more.
      • Top languages: We’ll look into files with the extensions .java, .h, .js, .c, .php, .html, .cs, .json, .py, .cpp, .xml, .rb, .cc, .go.

  47. Spaces vs Tabs - Extract

  48. Spaces vs Tabs - Apply the rules

  49. Spaces vs Tabs - Results
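
The Spaces vs Tabs rules from slide 46 can be sketched in plain Python. This is a minimal illustration, not the query shown in the talk: the function names `vote` and `tally` are hypothetical, files are passed in as in-memory `(path, content)` pairs rather than read from BigQuery, duplicates are detected by comparing content directly, ties cast no vote (the slide does not say how ties are handled), and the "top 400,000 repositories" filter is assumed to have been applied upstream.

```python
import os
from collections import Counter, defaultdict

# Extensions listed on slide 46.
EXTENSIONS = {".java", ".h", ".js", ".c", ".php", ".html", ".cs",
              ".json", ".py", ".cpp", ".xml", ".rb", ".cc", ".go"}
# "No small files": at least 10 lines starting with a space or a tab.
MIN_INDENTED_LINES = 10


def vote(content):
    """Return 'spaces', 'tabs', or None if the file casts no vote."""
    lines = content.splitlines()
    spaces = sum(1 for line in lines if line.startswith(" "))
    tabs = sum(1 for line in lines if line.startswith("\t"))
    if spaces + tabs < MIN_INDENTED_LINES:
        return None  # too few indented lines
    if spaces == tabs:
        return None  # assumption: ties are not counted
    return "spaces" if spaces > tabs else "tabs"


def tally(files):
    """Count one vote per unique file, grouped by extension.

    files: iterable of (path, content) pairs (hypothetical input shape).
    """
    seen = set()
    results = defaultdict(Counter)
    for path, content in files:
        ext = os.path.splitext(path)[1]
        if ext not in EXTENSIONS or content in seen:
            continue  # unsupported language, or a duplicate file
        seen.add(content)
        side = vote(content)
        if side is not None:
            results[ext][side] += 1
    return results
```

Calling `tally` on a list of `(path, content)` pairs returns a mapping from extension to a `Counter` of `"spaces"` and `"tabs"` votes, which mirrors the per-language breakdown the rules describe.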

  50. Who wants to analyze GitHub?
      - Project maintainers
      - Project users
      - Project choosers
      - Data lovers
      - YOU!

  51. GitHub

  52. Way more:

  54. Questions? Rate me?
      - News: reddit.com/r/bigquery
      - Ask: stackoverflow.com
      - Felipe Hoffa, @felipehoffa
      - bit.ly/bqfeedback

  58. 2016 top imports vs 2010 top imports
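
The import comparisons on slides 36 and 58 follow the "extract first, analyze later" rule from slide 35. The sketch below shows one way the extraction step might look for Java sources in plain Python. It is an assumption, not the query from the talk: the regular expression, the `count_imports` name, and the choice to count each import once per file are all illustrative, and a regex like this can be fooled by imports inside block comments or strings.

```python
import re
from collections import Counter

# Matches lines such as "import java.util.List;",
# "import java.io.*;" and "import static org.junit.Assert.assertEquals;".
# The optional ".*" wildcard suffix is dropped from the captured name.
IMPORT_RE = re.compile(
    r"^[ \t]*import[ \t]+(?:static[ \t]+)?([\w.]+?)(?:\.\*)?[ \t]*;",
    re.MULTILINE,
)


def count_imports(sources):
    """Count how many source files import each name.

    sources: iterable of Java file contents as strings (hypothetical input).
    Each imported name is counted at most once per file.
    """
    counts = Counter()
    for source in sources:
        counts.update(set(IMPORT_RE.findall(source)))
    return counts
```

Running `count_imports` separately on files from two different years and comparing the `most_common()` lists gives the kind of side-by-side ranking these slides refer to.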
