Hadoop Distributed File System (HDFS) - PowerPoint PPT Presentation

HDFS Overview
- A distributed file system
- Built on the architecture of the Google File System (GFS)
- Shares a similar architecture with many other common distributed storage engines, such as Amazon S3 and Microsoft Azure
- HDFS is a stand-alone storage engine and can be used in isolation from the query processing engine

HDFS Architecture
[Figure: a name node and several data nodes, each holding blocks (B)]

What is where?
Name node
- File and directory names
- Block ordering and locations
- Capacity of data nodes
- Architecture of data nodes
Data nodes
- Block data
- Name node location

Analogy to Unix FS
The logical view is similar
[Figure: a directory tree with the entries /, user, etc, hadoop, mary, and chu]

Analogy to Unix FS
The physical model is comparable
- Unix: list of inodes
- HDFS: list of block locations
[Figure: side-by-side comparison of Unix and HDFS; labels: File1, Meta data, Block 1, Block 2, Block 3, and blocks (B)]

HDFS Create
[Figure: name node, file creator, and data nodes]

HDFS Create
The creator process calls the create function, which translates to an RPC call at the name node.
[Figure: name node, file creator, and data nodes; label: Create(…)]

HDFS Create
The name node (the master) assigns three replicas for the first block:
1. The first replica is assigned to a random machine.
2. The second replica is assigned to another random machine in the same rack as the first machine.
3. The third replica is assigned to a random machine in another rack.
[Figure: name node, file creator, and data nodes 1, 2, 3; label: Create(…)]
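The three-step placement rule above can be sketched as a small standalone simulation. This is only an illustration of the rule as the slide states it, not Hadoop code: the `ReplicaPlacement` class, the `Node` type, the `place` method, and the sample cluster are all hypothetical names, and the default policy in a given Hadoop release may differ from the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Standalone simulation of the three-step placement rule described above.
// All names here are hypothetical; this is not Hadoop code.
class ReplicaPlacement {

    // A data node identified by its host name and the rack it sits in.
    static final class Node {
        final String host;
        final String rack;

        Node(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    // Returns a random node that is not yet chosen and matches the rack request:
    // inRack == true keeps nodes in `rack`, false keeps nodes outside it.
    // A null rack means "no rack constraint".
    private static Node pick(List<Node> cluster, List<Node> chosen,
                             String rack, boolean inRack, Random rnd) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            boolean rackOk = (rack == null) || (n.rack.equals(rack) == inRack);
            if (rackOk && !chosen.contains(n)) {
                candidates.add(n);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no eligible data node");
        }
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    // Chooses three nodes: any node, a second node in the same rack,
    // and a third node in a different rack.
    static List<Node> place(List<Node> cluster, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        Node first = pick(cluster, chosen, null, true, rnd);
        chosen.add(first);
        chosen.add(pick(cluster, chosen, first.rack, true, rnd));
        chosen.add(pick(cluster, chosen, first.rack, false, rnd));
        return chosen;
    }

    // A made-up cluster: two racks with two data nodes each.
    static List<Node> sampleCluster() {
        return Arrays.asList(
            new Node("dn1", "rack-a"), new Node("dn2", "rack-a"),
            new Node("dn3", "rack-b"), new Node("dn4", "rack-b"));
    }

    public static void main(String[] args) {
        for (Node n : place(sampleCluster(), new Random())) {
            System.out.println(n.host + " in " + n.rack);
        }
    }
}
```

Spreading the replicas over two racks means the loss of a single rack cannot destroy every copy, while keeping two copies in one rack limits cross-rack traffic during the write.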

HDFS Create
[Figure: name node, file creator, and data nodes 1, 2, 3; label: OutputStream]

HDFS Create
[Figure: name node, file creator, and data nodes 1, 2, 3; label: OutputStream#write]

HDFS Create
When a block is filled up, the creator contacts the name node to create the next block.
[Figure: name node, file creator, and data nodes 1, 2, 3; labels: Next block, OutputStream#write]
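As a rough illustration of the "next block" step, the sketch below cuts a file of a given size into fixed-size blocks. Each entry in the result corresponds to one block the creator would request from the name node. The `BlockSplitter` class and `blockLengths` method are hypothetical, and the 128 MB block size used in `main` is an assumption rather than something stated on the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: how a stream of bytes is cut into fixed-size blocks.
class BlockSplitter {

    // Returns the length of each block needed to hold fileSize bytes.
    // Every block is full except possibly the last one.
    static List<Long> blockLengths(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> lengths = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long len = Math.min(blockSize, remaining);
            lengths.add(len); // one block request to the name node per entry
            remaining -= len;
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed sizes for illustration: a 300 MB file and 128 MB blocks.
        List<Long> lengths = blockLengths(300 * mb, 128 * mb);
        long lastMb = lengths.get(lengths.size() - 1) / mb;
        System.out.println(lengths.size() + " blocks, last block " + lastMb + " MB");
    }
}
```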

Notes about writing to HDFS
- Data transfers of replicas are pipelined
- The data does not go through the name node
- Random writing is not supported
- Appending to a file is supported but it creates a new block

Self-writing
If the file creator is running on one of the data nodes, the first replica is always assigned to that node.
[Figure: name node and data nodes, with the file creator on one of the data nodes]

Reading from HDFS
- Reading is easier than writing
- No replication is needed
- Replication can be exploited
- Random reading is allowed

HDFS Read
The reader process calls the open function, which translates to an RPC call at the name node.
[Figure: name node, file reader, and data nodes; label: open(…)]

HDFS Read
The name node locates the first block of that file and returns the address of one of the nodes that store that block.
The name node returns an input stream for the file.
[Figure: name node, file reader, and data nodes; label: InputStream]

HDFS Read
[Figure: name node, file reader, and data nodes; label: InputStream#read(…)]

HDFS Read
When an end-of-block is reached, the name node locates the next block.
[Figure: name node, file reader, and data nodes; label: Next block]

HDFS Read
The InputStream#seek operation locates a block and positions the stream accordingly.
[Figure: name node, file reader, and data nodes; label: seek(pos)]
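What "locates a block and positions the stream" amounts to can be sketched with plain arithmetic: walk the block lengths in file order until the requested position falls inside one of them. The `BlockSeek` class and `locate` method are hypothetical names, and the block lengths in `main` are made-up numbers. The real client works from block locations reported by the name node.

```java
// Hypothetical helper: maps a byte position in a file to a block index
// and an offset inside that block.
class BlockSeek {

    // blockLengths[i] is the length in bytes of block i, in file order.
    // Returns a two-element array: {blockIndex, offsetInBlock}.
    static long[] locate(long[] blockLengths, long pos) {
        if (pos < 0) {
            throw new IllegalArgumentException("negative position");
        }
        long blockStart = 0;
        for (int i = 0; i < blockLengths.length; i++) {
            long blockEnd = blockStart + blockLengths[i];
            if (pos < blockEnd) {
                return new long[] {i, pos - blockStart};
            }
            blockStart = blockEnd;
        }
        throw new IllegalArgumentException("position is past the end of the file");
    }

    public static void main(String[] args) {
        // Made-up file: two full blocks of 128 bytes and a final block of 44 bytes.
        long[] lengths = {128, 128, 44};
        long[] where = locate(lengths, 200);
        System.out.println("block " + where[0] + ", offset " + where[1]);
    }
}
```

With these sample lengths, position 200 lands in block 1 (the second block) at offset 72, since the first block covers positions 0 through 127.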

Self-reading
1. If the block is stored locally on the reader, that replica is chosen for reading.
2. If not, a replica on another machine in the same rack is chosen.
3. Otherwise, any other replica is chosen at random.
When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads.
[Figure: name node, data nodes, and file reader; label: Open, seek]
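The preference order above (local, then same rack, then anywhere) can be sketched as a small selection function. This is a simplified model and not the Hadoop implementation: `ReadReplicaChooser`, `Replica`, and `choose` are hypothetical names, and racks are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical model of choosing which replica of a block to read from.
class ReadReplicaChooser {

    // One stored copy of a block: the host that holds it and that host's rack.
    static final class Replica {
        final String host;
        final String rack;

        Replica(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    static Replica choose(String readerHost, String readerRack,
                          List<Replica> replicas, Random rnd) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        // 1. Prefer a replica stored on the reader's own machine.
        for (Replica r : replicas) {
            if (r.host.equals(readerHost)) {
                return r;
            }
        }
        // 2. Otherwise prefer a replica on another machine in the reader's rack.
        List<Replica> sameRack = new ArrayList<>();
        for (Replica r : replicas) {
            if (r.rack.equals(readerRack)) {
                sameRack.add(r);
            }
        }
        if (!sameRack.isEmpty()) {
            return sameRack.get(rnd.nextInt(sameRack.size()));
        }
        // 3. Otherwise fall back to any replica, chosen at random.
        return replicas.get(rnd.nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        // Made-up replica locations for one block.
        List<Replica> replicas = Arrays.asList(
            new Replica("dn1", "rack-a"),
            new Replica("dn2", "rack-a"),
            new Replica("dn3", "rack-b"));
        Replica picked = choose("dn2", "rack-a", replicas, new Random());
        System.out.println("read from " + picked.host);
    }
}
```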

Notes About Reading
- The API is much richer than the simple open/seek/close API
- You can retrieve block locations
- You can choose a specific replica to read
- The same API is generalized to other file systems, including the local FS and S3
Review question: Compare random access read in local file systems to HDFS

HDFS Special Features
- Node decommission
- Load balancer
- Cheap concatenation

Node Decommission
[Figure: data nodes holding blocks (B)]

Load Balancing
[Figure: data nodes holding blocks (B)]

Load Balancing
Start the load balancer
[Figure: data nodes holding blocks (B)]

Cheap Concatenation
Concatenate File 1 + File 2 + File 3 ➔ File 4
Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
[Figure: File 1, File 2, File 3, and the name node]
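A toy model makes the point that concatenation only touches metadata. In the sketch below the name node's state is reduced to a map from file name to an ordered list of block IDs, and concatenating moves block IDs between entries without copying any block data. The `NameNodeMetadata` class and its methods are hypothetical and follow the slide's description, not the actual Hadoop data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model of name node metadata:
// each file name maps to the ordered list of its block IDs.
class NameNodeMetadata {
    private final Map<String, List<Long>> blocksByFile = new HashMap<>();

    void addFile(String name, Long... blockIds) {
        blocksByFile.put(name, new ArrayList<>(Arrays.asList(blockIds)));
    }

    // Returns the block IDs of a file, or null if the file does not exist.
    List<Long> blocksOf(String name) {
        return blocksByFile.get(name);
    }

    // Creates `target` from the blocks of `sources`, in order, and deletes the sources.
    // Only the map changes; no block data is read or written.
    void concat(String target, String... sources) {
        List<Long> merged = new ArrayList<>();
        for (String src : sources) {
            List<Long> blocks = blocksByFile.get(src);
            if (blocks == null) {
                throw new IllegalArgumentException("no such file: " + src);
            }
            merged.addAll(blocks);
        }
        for (String src : sources) {
            blocksByFile.remove(src);
        }
        blocksByFile.put(target, merged);
    }

    public static void main(String[] args) {
        NameNodeMetadata nn = new NameNodeMetadata();
        nn.addFile("file1", 1L, 2L); // made-up block IDs
        nn.addFile("file2", 3L);
        nn.addFile("file3", 4L, 5L);
        nn.concat("file4", "file1", "file2", "file3");
        System.out.println("file4 blocks: " + nn.blocksOf("file4"));
    }
}
```

The cost of the operation depends on the number of blocks listed in the metadata, not on the number of bytes stored in them.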

HDFS API
- FileSystem (implementations: LocalFileSystem, DistributedFileSystem, S3FileSystem)
- Path
- Configuration

HDFS API
Create the file system
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    Path path = new Path("…");
    // File system that owns this path
    FileSystem fs = path.getFileSystem(conf);
    // To get the local FS
    fs = FileSystem.getLocal(conf);
    // To get the default FS
    fs = FileSystem.get(conf);

HDFS API
Create a new file
    FSDataOutputStream out = fs.create(path, …);
Delete a file
    fs.delete(path, recursive);   // recursive: also delete directory contents
    fs.deleteOnExit(path);        // delete when the file system is closed
Rename a file
    fs.rename(oldPath, newPath);

HDFS API
Open a file
    FSDataInputStream in = fs.open(path, …);
Seek to a different location
    in.seek(pos);
    in.seekToNewSource(pos);      // seek to pos on a different replica

HDFS API
Concatenate
    fs.concat(target, sources);   // sources is a Path[]
Get file metadata
    fs.getFileStatus(path);
Get block locations
    fs.getFileBlockLocations(path, start, len);   // byte range: start offset and length
