The bottom line
We are the data science people, but the world needs to know about it.
Wrangling vs Analytics
• Wrangling: data processing that allows meaningful analysis to begin (extraction, integration, cleaning, querying, etc.; basically the SIGMOD/PODS CFP)
• Wrangling requires more of the effort (usually 50-80%)
This is what we do
• But the world sees the end result
• The 80-20 rule: 20% of effort gets 80% of PR
• But we need to be better at it
• Some ammunition...
Data analysts’ favorite tools
[Bar chart: share of respondents (0% to 70%) per tool; tools are grouped as languages, data platforms, and analytics tools]
Tools, in the order they appear on the chart: SQL, Excel, Python, R, MySQL, Python: numpy, scipy, scikit-learn, ggplot, Microsoft SQL Server, Tableau, JavaScript, Matplotlib (Python), Java, PostgreSQL, Oracle, D3, Homegrown analysis tools, Hive, Spark, Cloudera, Visual Basic/VBA, MongoDB, Apache Hadoop, SAS, C++, PowerPivot, Scala, SQLite, C, Pig, Amazon RedShift, Weka, Hbase, Amazon Elastic MapReduce (EMR), Perl, SPSS, Teradata
Future data analysts’ favorite tools
The world needs to know
• ... but it’s much more fun doing research than talking to the “real world”
• Still, we are not a small community, and we have people with different skills
• One example: we convinced our funders (EPSRC) that data management is an essential part of “big data”
• The more people get the message, the healthier our field is