Data Cleaning & Integration
Duen Horng (Polo) Chau
Georgia Tech
CSE6242 / CX4242
Aug 28, 2014
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Last Time
Big data analytics building blocks: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination
Data collection & simple data storage
• Why SQLite?
• Simplicity: nothing to install/maintain, database in a single file
• Popular: cross-platform, cross-device
• SQL basics (create table, join, create index, etc.)
Data Cleaning How dirty is real data?
Data Cleaners
Watch videos
• Open Refine (previously Google Refine)
• Data Wrangler (research at Stanford)
Write down
• Examples of data dirtiness
• Tool’s features demo-ed (or that you like)
Will collectively summarize similarities and differences afterwards
Open Refine: http://openrefine.org
Data Wrangler: http://vis.stanford.edu/wrangler/
How dirty is real data? Examples
• duplicates
• empty rows
• abbreviations (different kinds)
• difference in scales / inconsistency in description / sometimes include units
• typos
• missing values
• trailing spaces
• incomplete cells
• synonyms of the same thing
• skewed distribution (outliers)
• bad formatting / not in relational format (in a format not expected)
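A few of the problems listed above (trailing spaces, empty rows, duplicates, missing values) can be handled with very little code. The sketch below is only an illustration under simple assumptions: rows are tuples of strings, a blank cell counts as missing, and the helper name `clean_rows` and the sample data are made up for this example. Typos, synonyms, and unit mismatches need more than this.

```python
def clean_rows(rows):
    """Strip whitespace, drop empty rows, and drop exact duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        # Trim surrounding spaces; represent blank or absent cells as None.
        norm = tuple((cell or "").strip() or None for cell in row)
        if all(cell is None for cell in norm):
            continue  # empty row
        if norm in seen:
            continue  # duplicate once whitespace is normalized
        seen.add(norm)
        cleaned.append(norm)
    return cleaned


# Hypothetical sample data
raw = [
    ("Smith ", "GA"),
    ("Smith", "GA"),    # duplicate once the trailing space is removed
    ("", ""),           # empty row
    ("Johnson", None),  # missing value: kept, but made explicit
]
result = clean_rows(raw)
print(result)
```

Rows with a missing value are kept rather than dropped, since whether to delete or fill them depends on the analysis.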
How are the tools similar or different?
G = Google Refine, W = Data Wrangler
• [G + W] can track changes (can undo, redo, roll back)
• [G] aggregation of similar-spelling items
• [W] can import through copy and paste
• [G] can import data through URL
• [W] generate code/scripts
• [G + W] can do value transformations (e.g., log)
• [G] can do clustering
• [W] can build graphs/charts
• [W] can learn from your actions
• [G + W] do sorting
• [G + W] your data is “safe” (desktop app)
• [W] calculated fields (similar to Excel)
• [G] overview of data values (e.g., histogram/distribution plot)
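The "aggregation of similar-spelling items" feature can be approximated with a key-collision approach: reduce each value to a normalized key and group values whose keys match. The sketch below is loosely modeled on that idea and is not Google Refine's actual code; the function names `fingerprint` and `cluster` and the sample values are assumptions made for illustration.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalize a string into a key that ignores case, punctuation, and word order."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group values that share a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Hypothetical sample data
names = ["Georgia Tech", "georgia tech ", "Tech, Georgia", "MIT"]
clusters = cluster(names)
print(clusters)
```

This kind of key only catches differences in case, spacing, punctuation, and word order. Real typos would need a distance-based method instead.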
The videos only show some of the tools’ features. Try them out.
Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/
Data Integration
Course Overview Collection Cleaning Integration Analysis Visualization Presentation Dissemination
What is Data Integration ? Why is it Important?
Data Integration
Combining data from different sources to provide the user with a unified view
As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges
How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)
Examples of businesses based on data integration
Mashup
More Examples?
• [FREE] Mint: account app, integrates multiple accounts (credit card, bank, etc.), can parse receipts
• Google News
• Crime mapping
• Feedly
• apps that check gas prices, coupons
• Zillow / Trulia / Redfin
• IMDb (movie database)
• Coin: combines multiple credit cards
• eBay
More Examples?
• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend requests: look at your friends’ friends and recommend them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri
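The friends-of-friends idea in the list above can be sketched in a few lines. This is a toy illustration, not how Facebook actually ranks suggestions: the friendship graph, the names in it, and the helper `recommend_friends` are all hypothetical, and candidates are ranked simply by the number of mutual friends.

```python
from collections import Counter


def recommend_friends(graph, user):
    """Rank people who are not yet friends of `user` by number of mutual friends."""
    friends = graph.get(user, set())
    counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stable output.
    return sorted(counts, key=lambda person: (-counts[person], person))


# Hypothetical undirected friendship graph
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = recommend_friends(graph, "ann")
print(suggestions)
```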
How to do data integration?
“Low” Effort Approaches
Use database’s “Join” (e.g., SQLite)

Two source tables:
id   name           id   state
111  Smith          111  GA
222  Johnson        222  NY
333  Obama          333  CA

Joined on id:
id   name     state
111  Smith    GA
222  Johnson  NY
333  Obama    CA

Google Refine http://code.google.com/p/google-refine/ (video #3)
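The join on this slide can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch using an in-memory database; the table names `names` and `states` are assumptions, since the slide does not name the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE names  (id INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE states (id INTEGER PRIMARY KEY, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (333, 'CA');
""")

# Inner join: keep only ids that appear in both tables.
rows = conn.execute(
    "SELECT n.id, n.name, s.state "
    "FROM names AS n JOIN states AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()
conn.close()

for row in rows:
    print(row)
```

An inner join silently drops ids missing from either table. If unmatched rows should be kept, a `LEFT JOIN` is the usual alternative.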
Crowd-sourcing Approaches: Freebase
http://wiki.freebase.com/wiki/What_is_Freebase%3F
Freebase (a graph of entities)
“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)
So what? What can you do with Freebase? (Hint: Google acquired it in 2010)
http://www.google.com/insidesearch/features/search/knowledge.html
Given a graph of entities, like Freebase, what other cool things can you do?
https://www.facebook.com/about/graphsearch
Facebook’s Graph Search
Integrate your friends’ info with yours
Feldspar
Finding Information by Association. CHI 2008
Polo Chau, Brad Myers, Andrew Faulring
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E
Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
Summary for data integration
Opportunities
• enable new services (Siri, PadMapper)
• enable new ways to discover info
• improve existing services
• reduce redundancy
• new ways to interact with data
• promote knowledge transfer (e.g., between companies)