TIE-22306 Data-intensive Programming Dr. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming • Lecturer: Timo Aaltonen – timo.aaltonen@tut.fi • Assistants – Adnan Mushtaq – MSc Antti Luoto – MSc Antti Kallonen
Lecturer • University Lecturer • Doctoral degree in Software Engineering, TUT, 2005 • Work history – Various positions, TUT, 1995 – 2010 – Principal Researcher, System Software Engineering, Nokia Research Center, 2010 – 2012 – University Lecturer, TUT
Working on the Course • Lectures on Fridays • Weekly exercises – beginning in week 2 • Course work – announced next Friday • Communication – http://www.cs.tut.fi/~dip/ • Exam
Weekly Exercises • Linux class TC217 • At the beginning of the course: hands-on training • At the end of the course: drop-in help for problems with the course work • Enrolment is open • Not compulsory, no credit points • Two more instances will be added
Course Work • Using Hadoop tools and frameworks to solve a typical Big Data problem (in Java) • Groups of three • Hardware – Your own laptop with self-installed Hadoop – Your own laptop with VirtualBox 5.1 and an Ubuntu VM – A TUT virtual machine
Exam • Electronic exam after the course • Tests understanding rather than exact syntax • “Use pseudocode to write a MapReduce program which …” • General questions on Hadoop and related technologies
Today • Big data • Data Science • Hadoop • HDFS • Apache Flume
1: Big Data • The world is drowning in data – clickstream data is collected by web servers – NYSE generates 1 TB of trade data every day – MTC collects 5,000 attributes for each call – Smart marketers collect purchasing habits • “More data usually beats better algorithms”
Three Vs of Big Data • Volume: amount of data – transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data • Velocity: speed of data in and out – streaming data from RFID, sensors, … • Variety: range of data types and sources – structured, unstructured
Big Data • Variability – Data flows can be highly inconsistent, with periodic peaks • Complexity – Data comes from multiple sources – Linking, matching, cleansing and transforming data across systems is a complex task
Data Science • Definition: data science is the activity of extracting insights from messy data • Facebook analyzes location data – to identify global migration patterns – to find the fan bases of different sports teams • A retailer might track purchases both online and in-store to target marketing
New Challenges • Compute intensiveness – raw computing power • Challenges of data intensiveness – amount of data – complexity of data – speed at which data is changing
Data Storage Analysis • Hard drive from 1990 – stores 1,370 MB – transfer speed 4.4 MB/s • Hard drive from the 2010s – stores 1 TB – transfer speed 100 MB/s
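These figures show that capacity has grown far faster than transfer speed, so scanning a whole drive takes ever longer; a rough calculation from the numbers above:
– 1990: 1,370 MB ÷ 4.4 MB/s ≈ 310 s ≈ 5 minutes
– 2010s: 1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours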
Scalability • Grows without requiring developers to re-architect their algorithms/application • Horizontal scaling – adding more machines to the cluster • Vertical scaling – adding more resources (CPU, RAM, disk) to a single machine
Parallel Approach • Reading from multiple disks in parallel – 100 drives, each holding 1/100 of the data => 1/100 of the reading time • Problem: hardware failures – replication • Problem: most analysis tasks need to combine data in some way – MapReduce • Hadoop
2: Apache Hadoop • Hadoop is a framework of tools – libraries and methodologies • Operates on large unstructured datasets • Open source (Apache License) • Simple programming model • Scalable
Hadoop • A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license) • Core Hadoop has two main systems: – Hadoop Distributed File System: self-healing, high-bandwidth clustered storage – MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
Hadoop • Administrators – Installation – Monitor/Manage Systems – Tune Systems • End Users – Design MapReduce Applications – Import and export data – Work with various Hadoop Tools
Hadoop • Developed by Doug Cutting and Michael J. Cafarella • Based on Google’s MapReduce technology • Designed to handle large amounts of data and to be robust • Donated to the Apache Software Foundation in 2006 by Yahoo
Hadoop Design Principles • Moving computation is cheaper than moving data • Hardware will fail • Hide execution details from the user • Use streaming data access • Use a simple file-system coherency model • Hadoop is not a replacement for SQL, is not always fast and efficient, and is not meant for quick ad-hoc querying
Hadoop MapReduce • MapReduce (MR) is the original programming model for Hadoop • Collocates data with the compute node – data access is fast since it is local (data locality) • Network bandwidth is the most precious resource in the data center – MR implementations explicitly model the network topology
Hadoop MapReduce • MR operates at a high level of abstraction – the programmer thinks in terms of functions over key-value pairs (see the word-count sketch below) • MR is a shared-nothing architecture – tasks do not depend on each other – failed tasks can be rescheduled by the system • MR was introduced by Google – used for producing search indexes – applicable to many other problems too
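To make the key-value model concrete, here is the classic word count in the Hadoop Java API: the map function emits a (word, 1) pair for every word, and the reduce function sums the counts per word. This is a minimal sketch along the lines of the standard Hadoop tutorial; the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note how the shared-nothing property shows up in the code: the mapper and reducer touch only their own key-value pairs, which is what lets the framework rerun a failed task anywhere in the cluster.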
Hadoop Components • Hadoop Common – a set of components and interfaces for distributed file systems and general I/O • Hadoop Distributed File System (HDFS) • Hadoop YARN – a resource-management and job-scheduling platform • Hadoop MapReduce – a distributed programming model and execution environment
Hadoop Stack Transition
Hadoop Ecosystem • HBase – a scalable, distributed database with support for large structured tables • Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying • Pig – a high-level data-flow language and execution framework for parallel computation • Spark – a fast and general compute engine for Hadoop data, with a wide range of applications – ETL, machine learning, stream processing, and graph analytics
Flexibility: Complex Data Processing 1. Java MapReduce : most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (and C++ Pipes): lets you develop in a programming language of your choice, with slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch : a library for multi-stage MapReduce pipelines in Java (modeled after Google’s FlumeJava). 4. Pig Latin : a high-level language out of Yahoo, suitable for batch data-flow workloads. 5. Hive : a SQL interpreter out of Facebook, which also includes a metastore mapping files to their schemas and associated SerDes. 6. Oozie : a workflow engine that enables creating a workflow of jobs composed of any of the above.
3: Hadoop Distributed File System • Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System) • Based on Google’s GFS (Google File System) • HDFS provides redundant storage for massive amounts of data – using commodity hardware • Data in HDFS is distributed across all data nodes – efficient for MapReduce processing
HDFS Design • File system on commodity hardware – survives even high failure rates of its components • Supports lots of large files – file sizes of hundreds of GB or several TB • Main design principles – write once, read many times – streaming reads rather than frequent random access – high throughput is more important than low latency
HDFS Architecture • HDFS operates on top of an existing file system • Files are stored as blocks (default size 128 MB, different from file-system blocks) • File reliability is based on block-based replication – each block of a file is typically replicated across several DataNodes (default replication factor is 3) • The NameNode stores metadata, manages replication and provides access to files • No data caching (because of large datasets); data is read/streamed directly from a DataNode to the client
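A minimal sketch of how a client program sees this division of labour, using the Hadoop FileSystem Java API. It assumes a reachable cluster configured via core-site.xml on the classpath; the path /user/dip/example.txt is a hypothetical placeholder. The metadata (size, block size, replication factor) comes from the NameNode, while the file contents are streamed from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and other settings from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/dip/example.txt"); // hypothetical file

    // Metadata lookup: answered by the NameNode from its in-RAM namespace
    FileStatus status = fs.getFileStatus(path);
    System.out.printf("size=%d blockSize=%d replication=%d%n",
        status.getLen(), status.getBlockSize(), status.getReplication());

    // Data read: the bytes are streamed directly from the DataNodes
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```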
HDFS Architecture • The NameNode stores HDFS metadata – filenames, locations of blocks, file attributes – metadata is kept in RAM for fast lookups • The number of files in HDFS is limited by the amount of RAM available on the NameNode – HDFS NameNode federation can help with RAM issues: several NameNodes, each of which manages a portion of the file system namespace
HDFS Architecture • A DataNode stores file contents as blocks – different blocks of the same file are stored on different DataNodes – the same block is typically replicated across several DataNodes for redundancy – periodically sends a report of all its blocks to the NameNode – DataNodes send regular heartbeats to the NameNode
HDFS Architecture • Built-in protection against DataNode failure • If the NameNode does not receive a heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost • When a DataNode fails, block replication is actively maintained – the NameNode determines which blocks were on the lost DataNode – the NameNode finds other copies of these lost blocks and replicates them to other nodes