
Data Analytics and Dynamic Languages
Lee E. Edlefsen, Ph.D., VP of Engineering

Overview
Revolution Confidential
• This is my perspective on the use of dynamic languages (interpreters) for data analytics (statistics)
• I am a long-time user and commercial developer of dynamic languages for data analysis, but I am also a hard-core C++ programmer

Outline
• A bit about my experience
• Good/bad things about dynamic languages for data analysis, with special focus on R
• Why statistics/data analytics requires its own language
• The importance of parallel and distributed computing for data analysis
• What I think would be the ideal language for statistics
Revolution R Enterprise

Dynamic languages I have used and/or developed
• APL: tried using it to teach econometrics in the early 1980s
• MATLAB: tried using a very early version to teach econometrics in the early 1980s
• Gauss: designed, developed, and commercialized this matrix-oriented statistical system in the mid-1980s

Dynamic languages I have used and/or developed (2)
• Axum: designed and developed this commercial technical graphics package, and wrote a C-like interpreter for doing data transformations
• S-PLUS: was VP of Development for S-PLUS (at MathSoft) in the late 1990s. S-PLUS is the commercial version of S.

Dynamic languages I have used and/or developed (3)
• ExaStat: designed and developed this C++ based data analysis system that can be both interpreted (using CINT) and compiled
    • Threading is built in for most array and matrix computations
    • Includes a framework for automatically parallelizing and distributing a broad class of statistical algorithms
• R: currently VP of Engineering at Revolution Analytics, which focuses on R

Features that make dynamic languages so nice for data analysis
• Command line for quick experimentation
• Ability to work directly with arrays, sets of arrays, and matrices
• Environment for inspection of objects
• Availability of functions for doing common operations
• Natural syntax that corresponds to the subject matter: you don't have to be a hard-core programmer to get things done

More nice features
• Reduced need to write loops
• Don't have to worry about specifying data types
• No compile/link/load time
• No worries about memory allocation, pointers, headers, linking errors, loading problems

Problems with most dynamic languages for data analysis
• Speed, especially compared with the best compiled code
• Loops tend to be especially slow
• Problems scaling with data size
• Code often requires translation into a compiled language before use in a production environment
• Insufficient range of data types
• Lack of support for parallel and distributed computing

Especially nice features of R
• Common language used by practitioners around the world
• Much of the new statistics of the past 10-15 years has been implemented in R
• New packages are easy to download and install; thousands are available
• Excellent interface to compiled languages
• Functions available for doing almost anything data-related

Particular problems with R
• Slow, especially for loops over data (about 100,000 times slower than C++ for simple loops)
• Memory hog; multiple copies of "read-only" data can be made during a simple analysis
• Not enough data types
• Encourages bad coding practices, especially the use of globals
• Not thread-safe

Why statistics needs its own language(s)
• A natural and familiar syntax is important
• It is necessary to have "data set" objects with column and row names, data descriptions, and mixed data types
• Easy, fast I/O of these objects
• Missing value handling is crucial and must be built into almost all functionality
• Proper handling of categorical data is also critical
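The requirements above (named columns of mixed types, missing values, categorical data) can be illustrated with a minimal sketch in Python. Everything here is an assumption made for illustration and does not come from the slides: the `dataset` dictionary, the use of `None` as the missing-value marker, and the helper names `mean_skip_missing` and `encode_categorical`. A statistical language would provide these behaviors built in rather than as user code.

```python
# Hypothetical "data set": named columns with mixed types; None marks a missing value.
dataset = {
    "age":    [34, None, 51, 29],
    "income": [52000.0, 61000.0, None, 48000.0],
    "region": ["west", "east", None, "west"],
}

def mean_skip_missing(values):
    """Mean of the non-missing entries, or None if every entry is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def encode_categorical(values):
    """Map category labels to integer codes; missing entries stay missing."""
    levels = sorted({v for v in values if v is not None})
    index = {level: code for code, level in enumerate(levels)}
    codes = [index[v] if v is not None else None for v in values]
    return levels, codes

age_mean = mean_skip_missing(dataset["age"])
levels, codes = encode_categorical(dataset["region"])
```

The point of the sketch is that every function touching the data has to decide what to do about missing entries, which is why the slide argues this handling belongs in the language itself.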

The importance of parallel and distributed computing in data analysis
• Our ability to collect and store data is rapidly and greatly outpacing our ability to analyze that data
• To analyze all this data we must use multiple cores and multiple hard drives
• This means we need software that distributes computations across cores and computers and puts the results back together as if the work had been done on one core
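The idea of distributing a computation and then putting the results back together can be sketched as follows, in Python. This is an illustration under stated assumptions, not the framework described in these slides: the function names are hypothetical, a thread pool stands in for multiple cores or machines, and the statistic (mean and population variance) is chosen because it can be rebuilt exactly from per-chunk counts, sums, and sums of squares.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def summarize(chunk):
    """Partial result for one chunk: (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Merge two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def chunked_mean_var(data, n_chunks=4):
    """Mean and population variance computed chunk by chunk, then combined."""
    if not data:
        raise ValueError("data must be non-empty")
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Threads are only a stand-in here; real speedups need processes or machines.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(summarize, chunks))
    n, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    mean = total / n
    return mean, total_sq / n - mean * mean

values = [float(i) for i in range(1, 101)]
mean, variance = chunked_mean_var(values, n_chunks=4)
```

Each worker needs only its own chunk, and only the small partial summaries travel back, so the same structure applies whether the chunks sit on different cores or on different disks. The sum-of-squares formula is used for brevity; it can lose precision on data with a large mean relative to its spread.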

My view of the ideal statistical language
• Based on the R syntax, but fixing the main problems
• Implemented using LLVM or something similar
• Extended range of data types
• Allow passing objects by reference
• Perhaps allow type specifications to enable increased speed of loops over data
• Have built-in support for parallel and distributed computations
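The difference that passing objects by reference makes can be shown with a small Python sketch. The two function names are hypothetical. The first mimics value semantics, where a function works on its own copy of the data (roughly how R behaves from the user's point of view, since R generally copies an object when a function modifies it). The second modifies the caller's data directly and so never holds a second copy.

```python
import copy

def center_by_value(column):
    """Value semantics: work on a copy, leaving the caller's data untouched."""
    result = copy.copy(column)          # a second copy of the data now exists
    mean = sum(result) / len(result)
    for i, x in enumerate(result):
        result[i] = x - mean
    return result

def center_in_place(column):
    """Reference semantics: modify the caller's list directly, with no copy."""
    mean = sum(column) / len(column)
    for i, x in enumerate(column):
        column[i] = x - mean

data = [1.0, 2.0, 3.0]
centered = center_by_value(data)        # data is unchanged
before = list(data)
center_in_place(data)                   # data itself is now centered
```

For a large data set the by-value version temporarily doubles the memory in use, which connects to the memory concern raised earlier about multiple copies of read-only data. The trade-off is that in-place modification makes it easier to change data the caller did not expect to change.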
