
Data Cleaning & Integration, Duen Horng (Polo) Chau, Georgia Tech



  1. Data Cleaning & Integration
     Duen Horng (Polo) Chau, Georgia Tech
     CSE6242 / CX4242, Aug 28, 2014
     Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

  2. Last Time
     Big data analytics building blocks: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination
     Data collection & simple data storage
     • Why SQLite?
     • Simplicity: nothing to install/maintain, database in a single file
     • Popular: cross-platform, cross-device
     • SQL basics (create table, join, create index, etc.)

  3. Data Cleaning 
 How dirty is real data?

  4. Data Cleaners
     Watch videos
     • Open Refine (previously Google Refine)
     • Data Wrangler (research at Stanford)
     Write down
     • Examples of data dirtiness
     • Tool’s features demo-ed (or that you like)
     Will collectively summarize similarities and differences afterwards
     Open Refine: http://openrefine.org
     Data Wrangler: http://vis.stanford.edu/wrangler/

  5. How dirty is real data? Examples
     • duplicates
     • empty rows
     • abbreviations (different kinds)
     • differences in scale / inconsistent descriptions / units sometimes included
     • typos
     • missing values
     • trailing spaces
     • incomplete cells
     • synonyms for the same thing
     • skewed distributions (outliers)
     • bad formatting / not in relational format (not in the expected format)
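
A few of the problems listed above (trailing spaces, empty rows, abbreviation variants, missing values, duplicates) can be handled with a short script. The following is a minimal sketch in plain Python, not what Open Refine or Data Wrangler do internally; the sample rows, the `STATE_ALIASES` mapping, and the function name `clean_rows` are all made up for illustration.

```python
# Hypothetical sample rows; names and values are invented for illustration.
raw_rows = [
    {"name": "Smith ", "state": "GA"},      # trailing space
    {"name": "Smith", "state": "GA"},       # duplicate once the space is stripped
    {"name": "", "state": ""},              # empty row
    {"name": "Johnson", "state": "N.Y."},   # abbreviation variant
    {"name": "Obama", "state": None},       # missing value
]

# Assumed mapping from abbreviation variants to one canonical form.
STATE_ALIASES = {"N.Y.": "NY", "Ga.": "GA"}


def clean_rows(rows):
    """Strip whitespace, drop empty rows, normalize abbreviations, drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Trim stray whitespace; treat None as an empty string while cleaning.
        name = (row.get("name") or "").strip()
        state = (row.get("state") or "").strip()
        if not name and not state:
            continue  # skip rows with no content at all
        state = STATE_ALIASES.get(state, state)
        record = (name, state or None)  # keep missing values explicit as None
        if record in seen:
            continue  # skip exact duplicates of an already-kept row
        seen.add(record)
        cleaned.append({"name": record[0], "state": record[1]})
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

Missing values are kept as `None` rather than guessed, so a later analysis step can decide how to treat them.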

  6. How are the tools similar or different?
     G = Google Refine, W = Data Wrangler
     • [G + W] can track changes (can undo, redo, roll back)
     • [G] aggregation of similar-spelling items
     • [W] can import through copy and paste
     • [G] can import data through URL
     • [W] generate code/scripts
     • [G + W] can do value transformations (e.g., log)
     • [G] can do clustering
     • [W] can build graphs/charts
     • [W] can learn from your actions
     • [G + W] do sorting
     • [G + W] your data is “safe” (desktop app)
     • [W] calculated fields (similar to Excel)
     • [G] overview of data values (e.g., histogram/distribution plot)
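
The "aggregation of similar-spelling items" feature noted above can be approximated by grouping values that share a normalized key. The sketch below is loosely in the spirit of key-collision clustering and is not Google Refine's actual implementation; the function names `fingerprint` and `cluster_by_fingerprint` and the sample values are assumptions for illustration.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: lowercase, drop punctuation, sort unique tokens."""
    lowered = value.strip().lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values whose fingerprints collide; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical column values with inconsistent spelling.
names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "Stanford", "Georgia  Tech."]
clusters = cluster_by_fingerprint(names)
for group in clusters:
    print(group)
```

This only catches variants that differ in case, punctuation, spacing, or word order; real typos would need an edit-distance or phonetic method instead.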

  7. Note: the videos only show some of the tools’ features. Try them out.
     Google Refine: http://code.google.com/p/google-refine/
     Data Wrangler: http://vis.stanford.edu/wrangler/

  8. Data Integration

  9. Course Overview: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination

  10. What is Data Integration? Why is it Important?

  11. Data Integration
      Combining data from different sources to provide the user with a unified view
      As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges
      How to help people effectively leverage multiple data sources?
      (People: analysts, researchers, practitioners, etc.)

  12. Examples of businesses based on data integration

  13. Mashup

  14. More Examples?
      • [FREE] Mint: accounting app, integrates multiple accounts (credit card, bank, etc.), can parse receipts
      • Google News
      • Crime mapping
      • Feedly
      • apps that check gas prices, coupons
      • Zillow / Trulia / Redfin
      • IMDb (movie database)
      • Coin: combines multiple credit cards
      • eBay

  15. More Examples?
      • Palantir Gotham
      • Yelp: restaurant reviews, business reviews
      • Facebook friend requests: look at your friends’ friends and recommend them as your friends
      • Trulia / Zillow (real estate sites)
      • Graph Search (Facebook)
      • Waze
      • Yahoo Pipes
      • Google search engine
      • Google Transit
      • Google Now / Apple Siri
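
The friend-request example above (look at your friends' friends and recommend them) can be sketched as a small graph traversal. This is only an illustration: the friendship data is invented, and ranking candidates by number of mutual friends is an assumption, not a description of how Facebook ranks recommendations.

```python
from collections import Counter

# Hypothetical friendship graph: each user maps to the set of their friends.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def recommend_friends(graph, user):
    """Suggest friends-of-friends, ordered by mutual-friend count (an assumed ranking)."""
    current = graph.get(user, set())
    counts = Counter()
    for friend in current:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and people they already know.
            if candidate != user and candidate not in current:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda c: (-counts[c], c))


suggestions = recommend_friends(friends, "ann")
print(suggestions)
```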

  16. How to do data integration?

  17. “Low” Effort Approaches
      Use database’s “Join”! (e.g., SQLite)
      The first table is the join of the other two on id:

      id   name     state      id   name        id   state
      111  Smith    GA         111  Smith       111  GA
      222  Johnson  NY         222  Johnson     222  NY
      333  Obama    CA         333  Obama       333  CA

      Google Refine
      http://code.google.com/p/google-refine/ (video #3)
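
The join on this slide can be reproduced with SQLite through Python's built-in `sqlite3` module. This is a minimal sketch using an in-memory database; the table names `people` and `locations` are assumptions, since the slide does not name its tables.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table names; the slide shows only the columns and rows.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE locations (id INTEGER PRIMARY KEY, state TEXT)")

conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Obama")],
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?)",
    [(111, "GA"), (222, "NY"), (333, "CA")],
)

# Inner join on the shared id column gives one unified (id, name, state) view.
joined = conn.execute(
    """
    SELECT p.id, p.name, l.state
    FROM people AS p
    JOIN locations AS l ON p.id = l.id
    ORDER BY p.id
    """
).fetchall()
conn.close()

for row in joined:
    print(row)
```

An inner join drops ids that appear in only one table; a `LEFT JOIN` would keep every row of `people` and fill the missing state with NULL.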

  18. Crowd-sourcing Approaches: Freebase
      http://wiki.freebase.com/wiki/What_is_Freebase%3F

  19. Freebase (a graph of entities)
      “…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)

  20. So what?
      What can you do with Freebase?
      (Hint: Google acquired it in 2010)

  21. http://www.google.com/insidesearch/features/search/knowledge.html

  22. Given a graph of entities, like Freebase, what other cool things can you do?

  23. https://www.facebook.com/about/graphsearch

  24. Facebook’s Graph Search
      Integrate your friends’ info with yours

  25. Feldspar
      Finding Information by Association.
      CHI 2008
      Polo Chau, Brad Myers, Andrew Faulring
      YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
      Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf

  26. Summary for data integration
      Opportunities
      • enable new services (Siri, PadMapper)
      • enable new ways to discover info
      • improve existing services
      • reduce redundancy
      • new ways to interact with data
      • promote knowledge transfer (e.g., between companies)
