New Computing In 2019 and Beyond - Opportunities, Challenges, and Threats Fromm Institute Fall 2019 - Lecture 3 Bebo White - bebo.white@gmail.com 1
calendar 2
how big is a billion? • how do we describe it? • 10^9 = 1,000,000,000 (to a scientist)? • is it really a big number? • how do we imagine/visualize it in order to make it real? 3
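One way to make a billion feel real is to convert it into time. The sketch below is illustrative arithmetic only; the 365.25-day year and the variable names are assumptions added here, not part of the lecture.

```python
# How long do a million and a billion seconds last? (illustrative arithmetic)
MILLION = 10**6
BILLION = 10**9

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

# A million seconds expressed in days, a billion seconds expressed in years.
million_seconds_in_days = MILLION / SECONDS_PER_DAY
billion_seconds_in_years = BILLION / (SECONDS_PER_DAY * DAYS_PER_YEAR)

print(f"a million seconds is about {million_seconds_in_days:.1f} days")
print(f"a billion seconds is about {billion_seconds_in_years:.1f} years")
```

A million seconds is under two weeks, while a billion seconds is longer than three decades, which is one way to see that 10^9 really is a big number.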
what can be said about data? (1/2) • a cosmic view(?) • a fundamental component of the universe - the quantum no-hiding theorem • nothing disappears from the Internet • perhaps our most important asset • the new oil, a new currency • is a billion pieces of data a lot? do you have/own a billion pieces of data? how would you count the data you own? how do you manage/use the data you own? 5
what can be said about data? (2/2) • we • generate it • collect it • depend on it • share it • analyze it • plan with it • protect it • (maybe) sell it • etc., etc. 6
a datum 7
two data 8
relationships between data 9
more data means more complexity 10
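One possible reading of "more data means more complexity", offered here as an assumption rather than the lecturer's own argument, is that the number of potential pairwise relationships grows much faster than the number of data points. A minimal sketch (the function name is hypothetical):

```python
def pairwise_relationships(n: int) -> int:
    """Number of distinct pairs that can be formed from n data points."""
    return n * (n - 1) // 2

# One datum has no relationships, two data have one, and the count
# then grows roughly with the square of the number of data points.
for n in (1, 2, 10, 1_000, 10**9):
    print(f"{n:>13,} data points -> {pairwise_relationships(n):,} possible pairs")
```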
patterns emerge Patterns yield information and insight 11
slac depends on data patterns Linac Coherent Light Source (LCLS) 12
“When the number of factors coming into play in a phenomenological complex is too large, [the] scientific method in most cases fails.” - Albert Einstein in Out of My Later Years 14
data extremes at lcls • one LCLS experiment generates (on average) 2.5 million images per day • the LCLS data team manages 10 petabytes of data - 3 times more than the total data library for Netflix 15
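To put the image rate in everyday terms, the daily figure quoted on the slide can be converted to a per-second rate. This is a small sketch that uses only the slide's number; the variable names are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60
IMAGES_PER_DAY = 2_500_000  # average figure quoted on the slide

# Average number of images produced each second, around the clock.
images_per_second = IMAGES_PER_DAY / SECONDS_PER_DAY

print(f"about {images_per_second:.1f} images per second, day and night")
```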
what’s a petabyte (PB)? • 10^15 bytes = 1 quadrillion bytes • it is estimated that the human brain has the storage capacity of 2.5 PB • 223,101 DVDs • is that a lot of data? (Big Data)? • how can it be managed? 17
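The DVD count depends on how the units are defined. The sketch below assumes a single-layer DVD of 4.7 GB (an assumption, not stated on the slide) and compares decimal units (10^15 bytes) with binary units (2^50 bytes). Under these assumptions the binary calculation appears to reproduce the slide's 223,101 figure.

```python
# Decimal (SI) versus binary definitions of the same unit names.
DECIMAL_PB = 10**15          # 1 petabyte in SI units
BINARY_PB = 2**50            # 1 pebibyte, often loosely called a petabyte

DVD_GB = 4.7                 # assumption: single-layer DVD capacity
DVD_DECIMAL_BYTES = DVD_GB * 10**9
DVD_BINARY_BYTES = DVD_GB * 2**30

# How many DVDs hold one "petabyte" under each convention?
dvds_decimal = DECIMAL_PB / DVD_DECIMAL_BYTES
dvds_binary = BINARY_PB / DVD_BINARY_BYTES

print(f"decimal units: about {dvds_decimal:,.0f} DVDs")
print(f"binary units:  about {dvds_binary:,.0f} DVDs")
```

Either way, the answer is a stack of more than two hundred thousand discs.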
the data deluge… • from the beginning of recorded time until 2003, mankind generated 5 exabytes of data • in 2011, the same amount was generated every two days; in 2013, every 10 minutes • such numbers become almost meaningless 18
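The figures above can be turned into rates to see how quickly the pace changed. This is a rough sketch that takes the slide's numbers at face value and assumes decimal exabytes (10^18 bytes); the dictionary labels are illustrative.

```python
EXABYTE = 10**18                 # assumption: decimal exabyte
BATCH_BYTES = 5 * EXABYTE        # the 5 EB figure quoted on the slide

# Time needed to generate one 5 EB batch, in seconds, per the slide.
period_seconds = {
    "2011": 2 * 24 * 60 * 60,    # every two days
    "2013": 10 * 60,             # every 10 minutes
}

# Average generation rate in bytes per second for each year.
rate_bytes_per_second = {
    year: BATCH_BYTES / seconds for year, seconds in period_seconds.items()
}

# How much faster data was generated in 2013 than in 2011.
speedup = rate_bytes_per_second["2013"] / rate_bytes_per_second["2011"]

for year, rate in rate_bytes_per_second.items():
    print(f"{year}: about {rate:.2e} bytes per second")
print(f"2013 is roughly {speedup:.0f} times faster than 2011")
```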
where is this data coming from? (1/2) • EVERYWHERE! • any communication over a network involves transfer of data that is meaningful to someone or something • every e-mail, every tweet, every transaction, every social media interaction, etc. etc. • sensors - IOT 20
where is this data coming from? (2/2) 21
consider the new forms of data • that maybe did not exist 20+ years ago • Internet data, derived from social media and other online interactions (including data gathered by connected people and devices) • tracking data, monitoring the movement of people and objects • satellite and aerial imagery • etc., etc. • much of the value of ‘new forms of data’ lies in the potential for it to be analyzed in near real-time 22
and this doesn’t include science, business, etc. etc. 23
how is this data being used (consumed)? • the “poster children”/“large data generators” for datasets are: • personal/consumer use • scientific use • finance/business use • government use • etc, etc. • now, we are the experiments creating these datasets • Facebook knows what food and music we like and how we are likely to vote • advertisers use cookies and intelligent algorithms to create personalization • Amazon even claims to know what we want to (or will) buy next 24
characteristics of this data eco-system - the 4 v’s (1/2) • volume • size of datasets or aggregated datasets • velocity • data rate, pipeline, bandwidth 25
characteristics of this data eco-system - the 4 v’s (2/2) • variety • any type of data both structured and unstructured (?) or meaningful and meaningless (?) • veracity • trust, source/provenance • e.g., in Facebook what does “like” really mean? are emojis interpretable data? 26
“big data” - a possible definition - just volume? • refers to datasets whose size is beyond the ability of • single storage devices • typical database software tools to capture, store, manage, and analyze (McKinsey Global Institute) • this definition is not based upon data size (which will increase) • it can vary by sector/usage • usually unstructured • this is not a new issue 27
beyond capability • 1956 • 5 MB storage • LCLS would require over 1 trillion of these per month • 1960s • 10 MB storage 28
= 200,000 x 29
data storage is not really a problem • E. coli has a storage density of ~1.125 exabytes/cm^3 • at that density, all the world’s current storage needs for a year could fit in a 1 m^3 cube of DNA • DNA can be sequenced (read), synthesized (written to), and accurately copied • DNA is stable; genome sequencing of DNA 500,000 years old 32
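The density claim can be scaled up from a cubic centimeter to a cubic meter. The sketch below uses only the slide's density figure and assumes decimal units (1 exabyte = 10^18 bytes); it makes no claim about the world's actual storage needs.

```python
EXABYTE = 10**18                        # assumption: decimal exabyte
ZETTABYTE = 10**21

# Slide figure: ~1.125 exabytes per cubic centimeter, kept as an integer.
BYTES_PER_CM3 = 1125 * 10**15
CM3_PER_M3 = 100**3                     # 1 m = 100 cm, so 1 m^3 = 10^6 cm^3

# Capacity of a one-cubic-meter cube at the same density.
bytes_per_m3 = BYTES_PER_CM3 * CM3_PER_M3

print(f"per cm^3: {BYTES_PER_CM3 / EXABYTE} exabytes")
print(f"per m^3:  {bytes_per_m3 // ZETTABYTE:,} zettabytes")
```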
what is data science? • the addition of meaning to multivariate arrays of data • creative visualization of complex datasets • the collection of insights from dataset analytics (knowledge?) • the ability to substantiate decisions based on datasets 33
a popular introduction to data science • Moneyball (Michael Lewis), 2003 • detailed a strategy used by the Oakland A’s to use data to make pragmatic decisions that went against the traditional wisdom of baseball teams • the A’s were able to outcompete their rivals on a shoestring budget • what happens when you mix lots of data and smart people 34
data science components • domain/subject matter experts • data engineering/information architecture • statistics • visualization • advanced computing 35
one of the fun parts of data science is visualization 38
visualization in >3 dimensions is a challenge • our brains are “wired” for a 3D world • multivariate data (>3 variables) is typically richer, more informative, and more interesting • historical efforts • can new technologies help? 44
Minard mixed data science, statistics, and art 45
visualization is fun • it can show relationships • it really isn’t analysis • does it support decision-making? • does it support prediction? 47
data science and data analytics are often used interchangeably • data science isn’t concerned with answering specific queries, instead parsing through massive datasets in sometimes unstructured ways to expose insights • data analytics works better when it is focussed, having questions in mind that need answers based on existing data • data science produces broader insights that concentrate on which questions should be asked • data analytics emphasizes discovering answers to questions being asked 48
crossover - data science/data analytics and ai - “sentiment analysis” • goal - gauging mood on social network data • huge data streams coming in very fast • social sites operate 24/7 • timeliness - not subject to time lags • too much and too subjective for human analysis • useful to marketers, IT, customers, law enforcement/security agencies, political influencers, etc. 49
remember volume and velocity? 50
difficult comment analysis (1/2) • false negatives - “crying” and “crap” (negative) vs. “crying with joy” and “holy crap!” (positive) • relative sentiment - “I bought a Honda Accord” - great for Honda, bad for Toyota • compound sentiment - “I love the phone but hate the network” • conditional sentiment - “If someone doesn’t call me back, I’m never doing business with them again!” 51
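The difficulties above show up quickly in a naive word-counting approach. The sketch below is a deliberately simplistic scorer with a tiny, hypothetical word list; it is not how any particular product works, only an illustration of why keyword matching misreads these comments.

```python
import re

# Hypothetical toy lexicon: each word adds or subtracts one point.
LEXICON = {"crying": -1, "crap": -1, "hate": -1, "love": 1, "joy": 1}

def naive_score(text: str) -> int:
    """Sum the lexicon values of the words in a comment (0 means neutral)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

comments = [
    "holy crap!",                              # positive in context
    "crying with joy",                         # positive in context
    "I love the phone but hate the network",   # compound sentiment
    "I bought a Honda Accord",                 # relative sentiment
]
for comment in comments:
    print(f"{naive_score(comment):+d}  {comment}")
```

With this word list, "holy crap!" scores as negative, "crying with joy" cancels out to neutral, and the compound comment also nets to zero, hiding the fact that it contains one strong opinion about the phone and another about the network.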
difficult comment analysis (2/2) • scoring sentiment - “I like it” vs. “I really like it” vs. “I love it” • sentiment modifiers - “I bought an iPhone today :-)” “Gotta love the telephone company ;-<” • sentiments specific to a country, culture, etc. 52
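Graded scoring and modifiers can be sketched by giving words different weights and letting intensifiers and emoticons adjust the total. All weights below are hypothetical values chosen for illustration, and the additive rule is an assumption, not an established method.

```python
# Hypothetical weights, chosen only to illustrate graded sentiment.
BASE = {"like": 1.0, "love": 2.0}
INTENSIFIERS = {"really": 1.5}
EMOTICONS = {":-)": 1.0, ";-<": -1.0}

def graded_score(text: str) -> float:
    """Score a comment with word weights, intensifiers, and emoticons."""
    score = 0.0
    boost = 1.0
    for token in text.lower().split():
        if token in EMOTICONS:
            score += EMOTICONS[token]
            continue
        word = token.strip(".,!?\"'")
        if word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]   # applies to the next sentiment word
        elif word in BASE:
            score += boost * BASE[word]
            boost = 1.0
    return score

for comment in ("I like it", "I really like it", "I love it",
                "I bought an iPhone today :-)",
                "Gotta love the telephone company ;-<"):
    print(f"{graded_score(comment):+.1f}  {comment}")
```

The three "like/love" comments come out in the expected order, and the smiley turns a neutral statement positive. The sarcastic comment still scores positive, because simply adding the emoticon's value does not capture that it reverses the meaning of "love".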
remember the course goals? • in particular • to help you to: • appreciate why some of these new computing technologies are unique, revolutionary, and disruptive • have the vocabulary and understanding to evaluate stories that you read/hear • participate knowingly with friends, relatives, colleagues in discussions on these topics 55
analyzing significant correlations between social media measures and sales 60
watson claims to be able to do this 61
sentiment analysis can work in the opposite direction - a threat? • results of analysis can feed into social media • IOT + AI become participants in social networks in almost realtime • how would these actions influence privacy, security, veracity of data? 62
comparisons between data science and ai (1/2) • meaning • DS is about curating large datasets for analytics and visualization • AI is implementing this data in a machine • skills • DS is about statistical technique design and development • AI is about algorithm technique design and development 64