MAD Skills: New Analysis Practices for Big Data

Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, Caleb Welton
Presented by Ritwika Ghosh, November 10, 2015

mad (adj.): an adjective used to enhance a noun.
1. dude, you got skills.
2. dude, you got mad skills.
- UrbanDictionary.com

"If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what's getting ubiquitous and cheap? Data. And what is complementary to data? Analysis."
- Prof. Hal Varian, UC Berkeley, Chief Economist at Google

A bit of history
∙ The Enterprise Data Warehouse (EDW) is queried by Business Intelligence (BI) software.
∙ A carefully constructed EDW was key.
∙ "Mission critical, expensive resource, used for serving data-intensive reports targeted at executive decision makers".

What has changed
∙ Super cheap storage.
∙ Massive-scale data sources in an enterprise have grown remarkably: everything is data.
∙ Grassroots move to collect and leverage data in multiple organizational units: rise of the data-driven culture espoused by Google, Wired, etc.
∙ Sophisticated data analysis leads to cost savings and even direct revenue.

MAD skills
∙ New requirements: MAD skills.
  ∙ M: Magnetic (attract data and analysts)
  ∙ A: Agile (rapid iteration)
  ∙ D: Deep (sophisticated analytics in Big Data)
∙ Analysts with MAD skills need to be complemented by MAD approaches to design and infrastructure.

This paper
∙ MAD analytics for Fox Interactive Media, using Greenplum.
∙ Data-parallel statistical algorithms for modeling and comparing the densities of distributions.
∙ Critical database system features that enable agile design and flexible algorithm development.
∙ Challenging data warehousing orthodoxy: "Model Less, Iterate More".

Fox Audience Network
∙ Serves ads across several Fox online publishers (a huge ad network).
∙ Greenplum Database system on 42 nodes:
  ∙ 40 Sun X4500s for query processing,
  ∙ 2 dual-core Opteron master nodes (one for failover).
∙ Big and growing:
  ∙ 200 TB of mirrored data; fact table of 1.5 trillion rows (2009).
  ∙ 5 TB growth per day.
∙ Variety of data: ad logs, CRM, user data.
∙ Diverse user set.
∙ Extensive use of R and Hadoop.

Fox Audience Network: Contd.
Diverse user base
∙ Different needs, a variety of reporting and statistical tools, command-line access: a dynamic query ecosystem.
Dealing with ad-hoc questions
∙ Question: How many female WWF enthusiasts under the age of 30 visited the Toyota community over the last four days and saw a medium rectangle?
∙ Problem: No set of pre-defined aggregates can possibly cover every question combining various variables.

Magnetic: Attracting users and methods
Central design principle: get data into the warehouse ASAP.
∙ Analysts > DBAs: they like all data, they tolerate dirty data, they attract data, they produce data.
∙ Sandboxing allows analysts to feed datasets directly from the main warehouse.
∙ Encourage novel data sources.
∙ Business > application.

Agile: Analytics to adjust, react and learn from business
Case Study: Audience Forecasting
∙ 3 million users log in to IMDb.
∙ 2 million shared enough personal information to be able to attach 1 out of 2k attributes of behavior.
∙ 3 billion ads serving as tracking devices.
∙ Number of decisions: 1.2 × 10^16
Business cycle: acquire this data, sub-sample strategically, determine scaling, change practices to suit; rinse and repeat.

Deep: learning from data
∙ Infinite cycles of drill-down and roll-up: no single number is the answer.
∙ Anomaly detection, longitudinal variance, distribution functions.
∙ Statistical modeling: curves and models, as opposed to points!

MAD Modeling
Intelligently staging the cleaning and integration of data
∙ Staging schema: raw fact tables/logs.
∙ Production Data Warehouse schema: aggregates for reporting tools and casual users.

Data Parallel statistics
∙ Abstraction levels: Scalar → Vector → Function → Functional.
∙ A hierarchy of mathematical concepts in SQL (MapReduce as well).
∙ Encapsulated as stored procedures and UDFs.
∙ Need to be able to use statistical vocabulary.

Vectors and Matrices
Let A and B be two matrices of identical dimensions.
Matrix addition:
    SELECT A.row_number, A.vector + B.vector
    FROM A, B
    WHERE A.row_number = B.row_number;
Multiplication of a matrix and a vector, Av:
    SELECT 1, array_accum(row_number, vector * v)
    FROM A;
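To make the row-oriented layout concrete, here is a minimal Python sketch of the same two operations. It assumes each matrix is stored as a mapping from `row_number` to a row vector, that the two tables are inner-joined on `row_number`, and that `vector * v` in the second query denotes a dot product. The function names `matrix_add` and `matrix_vector` are illustrative and do not come from the slides.

```python
def matrix_add(a, b):
    """Add two matrices stored as {row_number: row_vector}.

    Mirrors the join on A.row_number = B.row_number: only rows
    present in both inputs appear in the result.
    """
    return {
        r: [x + y for x, y in zip(a[r], b[r])]
        for r in a
        if r in b
    }


def matrix_vector(a, v):
    """Compute Av, one dot product per stored row, ordered by row_number."""
    return [sum(x * y for x, y in zip(a[r], v)) for r in sorted(a)]


A = {1: [1, 2], 2: [3, 4]}
B = {1: [10, 20], 2: [30, 40]}
total = matrix_add(A, B)
product = matrix_vector(A, [1, 1])
```

Each row is handled independently, which is what lets a parallel database spread the work across segments.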

Vectors and Matrices: Contd.
Matrix transpose of an m × n matrix:
    SELECT S.col_number,
           array_accum(A.row_number, A.vector[S.col_number])
    FROM A, generate_series(1,3) AS S(col_number)
    GROUP BY S.col_number;
Matrix multiplication:
    SELECT A.row_number, B.column_number, SUM(A.value * B.value)
    FROM A, B
    WHERE A.column_number = B.row_number
    GROUP BY A.row_number, B.column_number;
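The matrix multiplication query uses only standard SQL, so it can be tried outside Greenplum. The sketch below runs it in an in-memory SQLite database through Python's `sqlite3` module. It assumes each matrix is stored sparsely as `(row_number, column_number, value)` triples; the helper name `sparse_matmul`, the table definitions, and the added `ORDER BY` (for deterministic output) are illustrative additions, not part of the slides.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, column, value) triples."""
    conn = sqlite3.connect(":memory:")
    for name in ("A", "B"):
        conn.execute(
            f"CREATE TABLE {name} "
            "(row_number INTEGER, column_number INTEGER, value REAL)"
        )
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)
    # Join on the shared inner dimension, then sum the products
    # for every (row of A, column of B) pair.
    rows = conn.execute(
        """
        SELECT A.row_number, B.column_number, SUM(A.value * B.value)
        FROM A, B
        WHERE A.column_number = B.row_number
        GROUP BY A.row_number, B.column_number
        ORDER BY A.row_number, B.column_number
        """
    ).fetchall()
    conn.close()
    return rows


# [[1, 2], [3, 4]] times [[5, 6], [7, 8]]
A = [(1, 1, 1), (1, 2, 2), (2, 1, 3), (2, 2, 4)]
B = [(1, 1, 5), (1, 2, 6), (2, 1, 7), (2, 2, 8)]
result = sparse_matmul(A, B)
```

Entries that are zero can simply be left out of the tables, since they contribute nothing to the sums.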

Example: tf-idf and Cosine similarity
Document similarity: fraud detection
∙ Create triples of (document, term, count).
∙ Create marginals along document and term using GROUP BY queries.
∙ Expand each triple with a tf-idf score.
∙ Obtain the cosine similarity of two document vectors x, y: $\theta = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2}$
Let A have one row per document vector (with * on vectors denoting the dot product):
    SELECT a1.row_id AS document_i,
           a2.row_id AS document_j,
           (a1.row_v * a2.row_v) /
             sqrt((a1.row_v * a1.row_v) * (a2.row_v * a2.row_v)) AS theta
    FROM a AS a1, a AS a2
    WHERE a1.row_id > a2.row_id;
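The pipeline on this slide can be sketched in a few lines of Python. The slide does not say which tf-idf variant is used, so this sketch assumes one common choice: tf is the term count divided by the document's total count, and idf is log(N / df). The function names and the sample triples are hypothetical.

```python
import math
from collections import defaultdict


def tfidf_vectors(triples):
    """Turn (document, term, count) triples into sparse tf-idf vectors."""
    doc_totals = defaultdict(int)  # marginal along document
    term_docs = defaultdict(set)   # marginal along term
    for doc, term, count in triples:
        doc_totals[doc] += count
        term_docs[term].add(doc)
    n_docs = len(doc_totals)
    vectors = defaultdict(dict)
    for doc, term, count in triples:
        tf = count / doc_totals[doc]                    # assumed tf variant
        idf = math.log(n_docs / len(term_docs[term]))   # assumed idf variant
        vectors[doc][term] = tf * idf
    return dict(vectors)


def cosine_similarity(x, y):
    """x . y / (||x||_2 ||y||_2) for sparse vectors stored as dicts."""
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)


triples = [
    ("d1", "fraud", 2), ("d1", "claim", 2),
    ("d2", "fraud", 1), ("d2", "audit", 1),
    ("d3", "audit", 3),
]
vectors = tfidf_vectors(triples)
similarity = cosine_similarity(vectors["d1"], vectors["d2"])
```

Documents with no terms in common score 0, and a document compared with itself scores 1.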

Matrix based analytical methods: Ordinary Least Squares
Large dense matrices: distance matrix D, covariance matrices.
∙ OLS: modeling seasonal trends.
∙ Statistical estimate of $\beta^*$ best satisfying $Y = X\beta$.
∙ $X$ is an $n \times k$ matrix, $Y = \{o_1, \ldots, o_n\}$, $\beta^* = (X'X)^{-1} X'y$.
∙ Coefficient of determination, with $b = X'y$:
  $SSR = b'\beta^* - \frac{1}{n}\left(\sum y_i\right)^2$
  $TSS = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$
  $R^2 = \frac{SSR}{TSS}$

Routine to compute OLS
    CREATE VIEW ols AS
    SELECT pseudo_inverse(A) * b AS beta_star,
           (transpose(b) * (pseudo_inverse(A) * b) - sum_y2/n)  -- SSR
           / (sum_yy - sum_y2/n)                                -- TSS
           AS r_squared
    FROM (
      SELECT sum(transpose(d.vector) * d.vector) AS A,
             sum(d.vector * y) AS b,
             sum(y)^2 AS sum_y2,
             sum(y^2) AS sum_yy,
             count(*) AS n
      FROM design d
    ) ols_aggs;
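As a sanity check on the aggregates used by this routine, the following Python sketch makes one pass over the data to accumulate A = X'X, b = X'y, the sum of y and the sum of y squared, then derives the coefficients and R squared from them. It assumes A is invertible and solves the system by Gaussian elimination instead of using a pseudo-inverse, and it assumes the design includes an intercept column. The names `solve` and `ols` are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; a pseudo-inverse would be needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(design, ys):
    """Return (beta_star, r_squared) from single-pass aggregates."""
    k, n = len(design[0]), len(ys)
    a = [[0.0] * k for _ in range(k)]  # running sum of x' x
    b = [0.0] * k                      # running sum of x * y
    sum_y = sum_yy = 0.0
    for x, y in zip(design, ys):
        for i in range(k):
            b[i] += x[i] * y
            for j in range(k):
                a[i][j] += x[i] * x[j]
        sum_y += y
        sum_yy += y * y
    beta = solve(a, b)
    ssr = sum(bi * wi for bi, wi in zip(b, beta)) - sum_y ** 2 / n
    tss = sum_yy - sum_y ** 2 / n  # zero if every y is identical
    return beta, ssr / tss


# y = 1 + 2x, with a leading column of ones as the intercept
beta_star, r_squared = ols([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7])
```

Because every aggregate is a plain sum over rows, partial sums can be computed per segment and combined afterwards, which is what makes the computation data-parallel.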

MAD DBMS
∙ Magnetic: painless and efficient data insertion.
∙ Agile: physical storage evolution that is easy and efficient.
∙ Deep: powerful, flexible programming environment.

Conclusions
∙ The database is not proprietary hardware: it is a parallel computation engine.
∙ Storage is not expensive, math is not hard.
∙ SQL is flexible and highly extensible.
Issues with Paper
∙ How are queries parallelized? If we write in R, it's not automatic.
∙ MapReduce here vs Hadoop?
∙ Ad for Greenplum :)
