  1. Big Data Analytics

  2. What is Big Data? Characterized by
  ◮ Volume
    ◮ No specific threshold, but typically several gigabytes (10⁹), terabytes (10¹²), or petabytes (10¹⁵)
  ◮ Velocity – the data are generated quickly
    ◮ Facebook generates 600 TB of new data per day.¹
  ◮ Variety – from multiple, often heterogeneous sources
  ◮ Variability – incomplete data, inconsistency within and between data sources
  ◮ Veracity – how can you trust the data you ingest?
  A good operative definition: a data set that may not fit on a single hard disk and/or requires parallel computation to process in a reasonable amount of time. (In practice many "big data" sets measure in the gigabytes, which might actually fit on a single modern disk.)
  ¹ Pamela Vagata and Kevin Wilfong, Scaling the Facebook data warehouse to 300 PB

  3. Applications of Big Data
  ◮ Web search
  ◮ Ad serving
  ◮ Multimedia analytics (image, video)
  ◮ Collaborative filtering (e.g., "customers who viewed this also viewed")
  ◮ Customer churn (identify customers likely to switch to a competitor in order to target special offers aimed at retention)
  ◮ Health care analytics
  ◮ Any sort of analytics application where the scale requires "big data" technology for reasonable performance
  Big data processing is typically done in batch mode. A new paradigm, fast data, has recently emerged in which data are processed in real time, often in combination with some batch-mode processing. We'll focus on batch-mode big data processing here, which is also typically a component of fast data systems.

  4. Managing Big Data
  The characteristics of big data lead to two primary technical challenges:
  ◮ storage, and
  ◮ parallel processing.
  We'll explore these challenges in the context of a ubiquitous industry-standard solution: the Hadoop scalable distributed computing platform.

  5. The Hadoop Platform
  Hadoop is not a single software product, but an ecosystem of software tools.
  ◮ Core components:
    ◮ Common: utilities that support the other Hadoop modules
    ◮ Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data
    ◮ YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management
    ◮ MapReduce: a YARN-based system for parallel processing of large data sets
  ◮ Add-ons and related projects:
    ◮ Cluster/job management: Ambari, ZooKeeper
    ◮ Databases and storage formats: Cassandra, HBase, Parquet
    ◮ Streaming engines (for fast data applications): Flink, Kafka, Spark Streaming
    ◮ Languages, libraries, and compute engines: Pig, Hive, Mahout, Spark

  6. The Hadoop Ecosystem
  [Figure: diagram of the Hadoop ecosystem]

  7. Installing Hadoop
  ◮ Single computer – standalone or pseudo-distributed mode (a minimal configuration sketch follows)
  ◮ Cluster – fully distributed mode
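For the single-computer case, the pseudo-distributed setup described in the Hadoop documentation needs only two small configuration files. A minimal sketch (the localhost address, port 9000, and a replication factor of 1 are the documented single-node defaults):

```xml
<!-- etc/hadoop/core-site.xml: where clients find the file system -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: one copy of each block, since there is one node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

With these in place, the file system is formatted once with bin/hdfs namenode -format and brought up with sbin/start-dfs.sh.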

  8. HDFS Assumptions and Goals
  ◮ Hardware failures will happen. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
  ◮ Streaming Data Access – high throughput rather than interactive use. HDFS trades away a few POSIX requirements to increase data throughput.
  ◮ Large Data Sets – tens of millions of large files (gigabytes to terabytes each)
  ◮ Simple Coherency Model – write-once-read-many. After creation, files can only be appended to or truncated (see the sketch below).
  ◮ "Moving Computation is Cheaper than Moving Data"
  ◮ Portability Across Heterogeneous Hardware and Software Platforms
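To make the write-once-read-many model concrete, here is a minimal sketch using the HDFS Java client (org.apache.hadoop.fs.FileSystem). The file path and contents are made up for illustration, and append support depends on the cluster configuration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CoherencyModelDemo {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster address from core-site.xml on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/coherency-demo.txt"); // hypothetical path

    // Write once: create the file and write its initial contents.
    try (FSDataOutputStream out = fs.create(file, /* overwrite = */ true)) {
      out.write("first line\n".getBytes(StandardCharsets.UTF_8));
    }

    // No in-place rewrites after creation, but appending is allowed
    // (provided the cluster is configured to permit appends).
    try (FSDataOutputStream out = fs.append(file)) {
      out.write("appended line\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```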

  9. HDFS Architecture²
  [Figure: HDFS architecture diagram]
  ² http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

  10. MapReduce
  The processing pipeline: split the input into independent chunks, map each record to intermediate key/value pairs, then reduce the values grouped under each key (a toy illustration follows).
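Before looking at the Hadoop API, here is a toy single-process illustration of that split/map/reduce data flow in plain Java. This is not Hadoop code, just the same idea applied to in-memory data:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapReduceSketch {
  public static void main(String[] args) {
    // "Split": the input is divided into independent records.
    List<String> splits = Arrays.asList("to be or", "not to be");

    Map<String, Long> counts = splits.stream()
        // "Map": each record is turned into intermediate keys (here, words).
        .flatMap(line -> Arrays.stream(line.split("\\s+")))
        // "Shuffle" + "Reduce": values are grouped by key and aggregated.
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    System.out.println(counts); // e.g. {not=1, be=2, or=1, to=2}
  }
}
```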

  11. Example: Word Count
  The canonical MapReduce example: count how many times each word occurs in a collection of input files (shown below using the Hadoop Java API).
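This is essentially the WordCount program from the Apache Hadoop MapReduce tutorial, lightly condensed and commented:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, the job is submitted with hadoop jar, passing the input and output HDFS paths as the two arguments.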
