Data Science: Opportunities and Risks
Patrick Valduriez
Data versus Information
• Data
  • Elementary definition of a fact
    • E.g. temperature, exam grade, account balance, message, photo, transaction, etc.
  • Can be complex
    • E.g. a satellite image
  • Can also be very simple and, taken in isolation, not very useful
    • But becomes useful when integrated with other data
• Information
  • Obtained by interpreting and analyzing data to make sense of it in a given context
  • Can be very useful to understand the world
    • E.g. climate evolution, ranking of a student, etc.
Data and Algorithm
"Content without method leads to fantasy, method without content to empty sophistry."
Johann Wolfgang von Goethe (Maxims and Reflections, 1892)
• The better the datasets, the better the machine learning algorithms
• Milestones
  • 1997: IBM Deep Blue defeats world chess champion Garry Kasparov
    • NegaScout planning algorithm (1983)
    • Dataset of 700,000 chess games (1991)
  • 2016: Google AlphaGo defeats Go master Lee Sedol (4-1)
    • Monte Carlo method-based algorithm (from the 1940s) and neural network
    • Dataset of 30 million Go moves
The Continuum of Understanding
[Figure: continuum with the labels Computer and Human]
• The more the data, the better the understanding
  • If we (humans) do a good job
Outline
1. Data science
2. The good, the bad and the ugly
3. Technologies for data science
4. HPC & big data analysis
5. Opportunities and risks
Data Science
Data Science: definition
• Data science
  • The science of making sense of data
  • The use of data management, statistics and machine learning, visualization and human-computer interactions to collect, clean, integrate, process, analyze and visualize big data
  • Ultimate goal: create data products and data services
• Data scientist
  • Strong skills in statistics, data analysis and machine learning
  • AND strong knowledge of the business domain, to interpret the analysis results and draw meaningful conclusions
• Hard to find data scientists!
• New training programs all over the world
• Should we all be teaching “Intro to Data Science” instead of “Intro to Databases”? (ACM SIGMOD panel 2014)
Big Data: what is it?
• A buzzword!
  • With different meanings depending on your perspective
    • E.g. 10 terabytes is big for an OLTP system, but small for a web search engine
• A definition (Wikipedia)
  • Consists of data sets that grow so large that they become awkward to work with using on-hand data management tools
• But size is only one dimension of the problem
• How big is big?
  • Moving target: terabyte (10^12 bytes), petabyte (10^15 bytes), exabyte (10^18 bytes), zettabyte (10^21 bytes)
• Landmarks in DBMS products
  • 1980: Teradata database machine
  • 2010: Oracle Exadata database machine
Why Big Data Today?
• Overwhelming amounts of data
  • Exponential growth, generated by all kinds of programs, networks and devices
    • E.g. Web 2.0 (social networks, etc.), mobile devices, computer simulations, satellites, radiotelescopes, sensors, etc.
• Increasing storage capacity
  • Storage capacity has doubled every 3 years since 1980 with prices steadily going down
    • 1 Gigabyte (HDD): $400K in 1980, $10K in 1990, $1K in 1995, $10 in 2000, $0.02 in 2015
• Very useful in a digital world!
  • Massive data => high-value information and knowledge
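As a rough worked check of the price figures above, the sketch below computes how quickly the cost of one gigabyte halves between two of the listed years. It assumes a smooth exponential decline between the two points, which is an idealization: the slide gives only point values. The names `PRICE_PER_GB` and `halving_time_years` are illustrative.

```python
import math

# Price of 1 GB of HDD storage in US dollars, as listed on the slide.
PRICE_PER_GB = {1980: 400_000, 1990: 10_000, 1995: 1_000, 2000: 10, 2015: 0.02}


def halving_time_years(year_a, year_b, prices=PRICE_PER_GB):
    """Years needed for the price to halve between year_a and year_b.

    Assumes the price declines exponentially between the two points.
    """
    ratio = prices[year_a] / prices[year_b]      # overall price reduction factor
    halvings = math.log2(ratio)                  # number of times the price halved
    return (year_b - year_a) / halvings


if __name__ == "__main__":
    print(f"1980-2015: price halves every {halving_time_years(1980, 2015):.2f} years")
```

Under this assumption, the 1980 and 2015 figures imply that the price per gigabyte halved roughly every year and a half.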
Big Data Dimensions: the V’s
• Volume
  • Refers to massive amounts of data
  • Makes it hard to store and manage
• Velocity
  • Continuous data streams are being produced
  • Makes it hard to process online
• Variety
  • Different data formats, different semantics, uncertain data, multiscale data, etc.
  • Makes it hard to integrate
• Other V's
  • Validity: is the data correct and accurate?
  • Veracity: are the results meaningful?
  • Volatility: how long do you need to store this data?
Big Data Analytics (BDA)
• Objective: find useful information and discover knowledge in data
  • Typical uses: forecasting, decision making, research, science, …
  • Techniques: data analysis, data mining, machine learning, …
• Why is this hard?
  • Low information density (unlike in corporate data)
    • Like searching for needles in a haystack
  • External data from various sources
    • Hard to verify and assess, hard to integrate
  • Different structures
    • Unstructured text, semi-structured document, key/value, table, array, graph, stream, time series, etc.
    • Hard to integrate
  • Simple machine learning models don't work
    • See next: "When big data goes bad" stories
Some BDA Killer Apps
• Social network analysis
  • Modeling, simulation, visualization of large-scale networks
• Online fraud detection across massive databases
  • Applicable in many domains (e-commerce, banking, telephony, etc.)
• National security
  • Signal intelligence, cyber analytics
• Real-time processing and analysis of raw data from high-throughput scientific instruments
  • E.g. to detect changing external conditions
• Health care/medical science
  • Drug design, personalized medicine
Example: data-intensive science
[Figure: diagram with the labels Observation, Experimentation, Processing, Data, Integration, Collaboration, Information, Analysis, Knowledge, Search]
Example: data-intensive science
The problem
“Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
The Office of Science Data Management Challenge (USA DoE 2004)
• In bioinformatics, the time to deal with data can be well above 50% (IBC annual review 2017)
Data Science: the good, the bad and the ugly
The good: Higgs Boson @ CERN
• LHC (Large Hadron Collider)
  • Instrument to study the properties of fundamental particles in physics
  • Produces 15 petabytes / year
  • Made available through the LHC Computing Grid to several computing centers, e.g. CC-IN2P3, Lyon
  • Up to 200,000 simultaneous analyses
• Higgs boson discovery
  • 2012: CERN announces that it has discovered a particle that is probably the Higgs boson predicted by the Standard Model of particle physics
  • 2014: CERN confirms the discovery
The good: Google Sponsored Search Links
• Google AdWords and AdSense programs
  • Revenue around $50 billion/year from marketing
  • The advertiser sets her maximum cost-per-click bid (max. CPC bid), the most she is willing to pay for a click on her ad
• Sponsored search uses an auction
  • A pure competition for marketers trying to win access to consumers, i.e. a competition for models of consumers – their likelihood of responding to the ad – and of determining the right bid for the item
• There are around 30 billion search requests a month, perhaps a trillion events of history between search providers
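The auction idea can be illustrated with a minimal sketch. It assumes a simplified single-slot, second-price rule: the highest max. CPC bid wins and pays just above the runner-up's bid, never more than its own bid. Real sponsored-search auctions also weigh ad quality and serve several ad slots, which this sketch ignores. The advertiser names, the function `run_auction` and the `increment` parameter are hypothetical.

```python
from decimal import Decimal


def run_auction(max_cpc_bids, increment=Decimal("0.01")):
    """Pick the winner of a single ad slot under a simplified second-price rule.

    max_cpc_bids maps an advertiser name to her max. CPC bid.
    Returns (winner, price_per_click).
    """
    if len(max_cpc_bids) < 2:
        raise ValueError("need at least two bidders")
    # Rank advertisers from highest to lowest max. CPC bid.
    ranked = sorted(max_cpc_bids.items(), key=lambda item: item[1], reverse=True)
    (winner, top_bid), (_, second_bid) = ranked[0], ranked[1]
    # The winner pays slightly more than the runner-up, capped at her own bid.
    price = min(top_bid, second_bid + increment)
    return winner, price


if __name__ == "__main__":
    bids = {
        "advertiser_a": Decimal("2.00"),
        "advertiser_b": Decimal("1.50"),
        "advertiser_c": Decimal("0.75"),
    }
    print(run_auction(bids))
```

Because the price depends on the competing bids rather than on the winner's own bid, each advertiser's main problem is estimating what a click is worth to her, which is the "competition for models of consumers" mentioned above.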
When big data goes bad
The Bad
The Bad
• Excerpts:
What had happened was that two automated programs, one run by seller "bordeebook" and one by seller "profnath," were engaged in an iterative and incremental bidding war. Once a day profnath would raise their price to x times bordeebook's listed price. Several hours later, bordeebook would increase their price to y times profnath's latest amount.
Problem: oversimplified models, but reality is complex!
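The feedback loop described in the excerpt can be sketched as a small simulation. The excerpt does not give the multipliers x and y, so the values used here, as well as the starting prices and the number of days, are illustrative assumptions. The point is only that whenever the product x * y exceeds 1, both prices grow geometrically, even though each pricing rule looks harmless on its own.

```python
def simulate_price_war(profnath_price, bordeebook_price, x, y, days):
    """Simulate two automated repricing rules reacting to each other.

    Each day profnath sets its price to x times bordeebook's listed price,
    then bordeebook sets its price to y times profnath's new price.
    Returns a list of (profnath_price, bordeebook_price) pairs, one per day.
    """
    history = []
    for _ in range(days):
        profnath_price = x * bordeebook_price
        bordeebook_price = y * profnath_price
        history.append((profnath_price, bordeebook_price))
    return history


if __name__ == "__main__":
    # Hypothetical values: x slightly undercuts, y marks up; x * y > 1.
    history = simulate_price_war(90.0, 100.0, x=0.99, y=1.25, days=30)
    for day, (p, b) in enumerate(history, start=1):
        if day % 10 == 0:
            print(f"day {day}: profnath={p:,.2f} bordeebook={b:,.2f}")
```

With these assumed multipliers, bordeebook's price is multiplied by x * y = 1.2375 every day, so a sanity bound on the price or a check against other sellers would be needed to stop the escalation.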
The Bad (for Me)
Problem: how do I get it fixed?
The Ugly
The Ugly
• Excerpts:
Solid Gold Bomb, the company that made the shirt, wasn't necessarily aware that it was even selling it. Solid Gold Bomb's business isn't in artfully designing T-shirts. Instead, it writes code that takes libraries of words that slot into popular phrases (such as "Keep Calm and Carry On," which enjoyed a brief mimetic popularity online) to make derivations that get dropped onto a template of a T-shirt and automatically get posted as an Amazon item for sale. Their mistake was overlooking a single word in a list of 4,000 or so others.
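The generation scheme described in the excerpt, together with the kind of safeguard whose absence caused the problem, can be sketched as follows. The template, the word lists and the `BLOCKLIST` are hypothetical placeholders chosen for illustration; they are not Solid Gold Bomb's actual data or code.

```python
from itertools import product

# Hypothetical phrase template and word libraries.
TEMPLATE = "Keep Calm and {verb} {obj}"
VERBS = ["Carry", "Dance", "Hit"]
OBJECTS = ["On", "Them"]

# Hypothetical list of words that must never appear on a product.
BLOCKLIST = {"hit"}


def generate_slogans(verbs, objects, blocklist=frozenset()):
    """Slot every verb/object pair into the template, skipping blocked words."""
    banned = {word.lower() for word in blocklist}
    slogans = []
    for verb, obj in product(verbs, objects):
        if verb.lower() in banned or obj.lower() in banned:
            continue  # never publish a combination containing a blocked word
        slogans.append(TEMPLATE.format(verb=verb, obj=obj))
    return slogans


if __name__ == "__main__":
    print(len(generate_slogans(VERBS, OBJECTS)), "slogans without filtering")
    print(len(generate_slogans(VERBS, OBJECTS, BLOCKLIST)), "slogans with filtering")
```

A blocklist only catches words someone thought to list, so with a library of several thousand words a human review of the generated output would still be a sensible extra step.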