data cleaning
play

Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and - PowerPoint PPT Presentation

Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech Data Cleaning How dirty is real data? How dirty is real data? Examples Jan 19, 2016 January 19, 16 1/19/16


  1. Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

  2. Data Cleaning How dirty is real data?

  3. How dirty is real data? Examples • Jan 19, 2016 • January 19, 16 • 1/19/16 • 2006-01-19 • 19/1/16 3 http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg

  4. How dirty is real data? Discuss with you neighbors (group of 2-3) 60 seconds Comes up with 5+ kinds of “data dirtiness” 4

  5. How dirty is real data? • Missing or corrupted (NaN, null) • Numbers stored as string (“1232”) • Different units • Spelling/typos • Different string encodings • Outliers (due to data recording) • geocoding, timezone offsets (missing +, -) • Duplicate data • Fake data (malicious) • Sql injection • Different software version generating slightly different formats • Cap locks • Semi-colons • Structure (json objects) • Invisible characters • Different delimiters • Indentation 5

  6. Importance of Data Cleaning

  7. “80%” Time Spent on Data Preparation Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says [Forbes] http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time- consuming-least-enjoyable-data-science-task-survey-says/#73bf5b137f75 13

  8. Data Janitor

  9. Writing “Clean Code” • Be careful with trailing whitespaces • Indent code ( spaces vs tabs ) following coding practices in your team/company https://google.github.io/styleguide/javaguide.html#s4.2-block-indentation …there’s no way I'm going to be with someone who uses spaces over tabs… http://www.businessinsider.com/tabs-vs-spaces-from-silicon-valley-2016-5 Trailing whitespace is evil. Don't commit evil into your repo. http://codeimpossible.com/2012/04/02/trailing-whitespace-is-evil-don-t-commit-evil-into-your-repo/ 17

  10. Both available free for GT students on http://safaribooksonline.com/ 18

  11. Data Cleaners Watch videos • Data Wrangler (research at Stanford) • Open Refine (previously Google Refine ) Write down • Examples of data dirtiness • Tool’s features demo-ed (or that you like) Will collectively summarize similarities and differences afterwards Open Refine : http://openrefine.org Data Wrangler : http://vis.stanford.edu/wrangler/ 19

  12. What can Open Refine and Wrangler do? • [w,o] undo, redo • [o,w] history of data • [o] transform data (e.g., take log) • [w] data editing/highlighting/interaction may be easier • [o] clustering • [w] transpose/pivot • [w] fill in missing data • [w] suggestions + preview O = Open Refine W = Data wrangler 22

  13. ! The videos only show some of the tools’ features. Try them out. Open Refine : http://openrefine.org Data Wrangler : http://vis.stanford.edu/wrangler/ 37

Recommend


More recommend