Apache Flume: Getting data into Hadoop
Problem

• Getting data into HDFS is not difficult:
  % hadoop fs -put data.csv .
  – works great when the data is neatly packaged and ready to upload
• Unfortunately, a source such as a web server creates data all the time
  – How often should a batch load to HDFS happen? Daily? Hourly?
• The real need is a solution that can deal with streaming logs/data
Solution: Apache Flume

• Introduced in Cloudera's CDH3 distribution
• Versions 0.x were called "flume"; versions 1.x are "flume-ng" (Flume NG)
Overview

• Streams data (events, not files) from clients to sinks
• Clients: files, syslog, Avro, …
• Sinks: HDFS files, HBase, …
• Configurable reliability levels
  – Best effort: "fast and loose"
  – Guaranteed delivery: "deliver no matter what"
• Configurable routing / topology
Architecture

Component   Function
---------   --------
Agent       The JVM running Flume. One per machine. Runs many sources and sinks.
Client      Produces data in the form of events. Runs in a separate thread.
Sink        Receives events from a channel. Runs in a separate thread.
Channel     Connects sources to sinks (like a queue). Implements the reliability semantics.
Event       A single datum; a log record, an Avro object, etc. Normally around ~4 KB.
Events

• The unit of data in Flume is called an event
  – composed of zero or more headers and a body
• Headers are key/value pairs used for
  – making routing decisions or
  – carrying other structured information
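To make the header/body split concrete, here is a minimal sketch (not from the original slides) that builds an event carrying two hypothetical headers, host and logtype, using Flume's EventBuilder:

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
  public static void main(String[] args) {
    // Headers are plain String key/value pairs; a selector or
    // interceptor downstream can route on them.
    Map<String, String> headers = new HashMap<>();
    headers.put("host", "web01.example.com");   // hypothetical value
    headers.put("logtype", "access");           // hypothetical value

    Event e = EventBuilder.withBody(
        "GET /index.html 200", StandardCharsets.UTF_8, headers);

    System.out.println(e.getHeaders());
    System.out.println(new String(e.getBody(), StandardCharsets.UTF_8));
  }
}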
Channels

• A channel provides a buffer for in-flight events
  – after they are read from sources
  – until they can be written to sinks in the data processing pipelines
• The two (three) primary types are (see the config sketch after this list)
  – a memory-backed / non-durable channel
  – a local-filesystem-backed / durable channel
  – (a hybrid of the two)
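A minimal configuration sketch of the two options; the property names come from the Flume user guide, while the channel names and directories are made up for illustration:

# Memory channel: fast, but events are lost if the agent dies
agent.channels.c1.type=memory
agent.channels.c1.capacity=10000
agent.channels.c1.transactionCapacity=100

# File channel: durable, survives an agent restart
agent.channels.c2.type=file
agent.channels.c2.checkpointDir=/var/flume/checkpoint
agent.channels.c2.dataDirs=/var/flume/data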
Channels

• The write rate of the sink should be faster than the ingest rate from the sources
  – otherwise the channel fills up, puts fail with a ChannelException, and data might be lost
Interceptors

• An interceptor is a point in the data flow where events can be inspected and altered
• Zero or more interceptors can be chained after a source creates an event (see the sketch below)
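As an illustrative sketch (agent and source names taken from the later examples), two of Flume's built-in interceptors can be chained onto a source like this:

agent.sources.s1.interceptors=i1 i2
# Adds a "timestamp" header with the event's arrival time
agent.sources.s1.interceptors.i1.type=timestamp
# Adds a "host" header with the agent's hostname or IP
agent.sources.s1.interceptors.i2.type=host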
Channel Selectors

• Responsible for how data moves from a source to one or more channels
• Flume comes with two selectors
  – the replicating channel selector (the default) puts a copy of the event into each channel
  – the multiplexing channel selector writes to different channels depending on headers
• Combined with interceptors, this forms the foundation for routing (see the sketch below)
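A sketch of a multiplexing selector routing on a header; the header name logtype and the channel names are hypothetical:

agent.sources.s1.channels=c1 c2
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=logtype
agent.sources.s1.selector.mapping.access=c1
agent.sources.s1.selector.mapping.error=c2
agent.sources.s1.selector.default=c1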
Sinks

• Flume supports a set of sinks
  – HDFS, ElasticSearch, Solr, HBase, IRC, MongoDB, Cassandra, RabbitMQ, Redis, …
• The HDFS sink continuously
  – opens a file in HDFS,
  – streams data into it,
  – at some point closes that file, and
  – starts a new one

agent.sinks.k1.type=hdfs
agent.sinks.k1.hdfs.path=/path/in/hdfs
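The snippet above can be extended with the HDFS sink's roll settings, which control when the current file is closed and a new one started; the values below are illustrative, not recommendations:

agent.sinks.k1.type=hdfs
agent.sinks.k1.hdfs.path=/path/in/hdfs
# Roll (close the current file, open a new one) every 10 minutes,
# or when the file reaches ~128 MB, whichever comes first
agent.sinks.k1.hdfs.rollInterval=600
agent.sinks.k1.hdfs.rollSize=134217728
agent.sinks.k1.hdfs.rollCount=0
# Write the raw event bodies instead of the default SequenceFile
agent.sinks.k1.hdfs.fileType=DataStream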
Sources

• A Flume source consumes events delivered to it by an external source
  – like a web server (see the sketch below)
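For example, a sketch using the built-in exec source to tail a web-server log; the log path is hypothetical:

# exec source: run a command and turn each output line into an event
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/httpd/access_log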
Tiered Collection

• Send events from agents to another tier of agents to aggregate
• Use an Avro sink (really just a client) to send events to an Avro source (really just a server) on another machine (see the sketch below)
• Failover supported
• Load balancing (soon)
• Transactions guarantee handoff
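A sketch of the two ends of such a tier, with a hypothetical collector host name and port; the first fragment configures the sending agent's Avro sink, the second the receiving agent's Avro source:

# Tier-1 agent: forward events to a collector over Avro RPC
agent.sinks.k1.type=avro
agent.sinks.k1.hostname=collector01.example.com
agent.sinks.k1.port=4141

# Tier-2 (collector) agent: accept Avro RPC from the first tier
collector.sources.s1.type=avro
collector.sources.s1.bind=0.0.0.0
collector.sources.s1.port=4141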
Tiered Collection Handoff

1. Agent 1: Tx begin
2. Agent 1: Channel take event
3. Agent 1: Sink send
4. Agent 2: Tx begin
5. Agent 2: Channel put
6. Agent 2: Tx commit, respond OK
7. Agent 1: Tx commit (or rollback)
Tiered Data Collection

[diagram: tiered data collection topology]
Apache Flume

• A source writes events to one or more channels
• A channel is the holding area as events are passed from a source to a sink
• A sink receives events from one channel only
• An agent can have many channels
  (see the sketch below)
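A small sketch of these rules in configuration form (all names hypothetical): one source fanning out to two channels, each drained by exactly one sink:

agent.sources=s1
agent.channels=c1 c2
agent.sinks=k1 k2

agent.sources.s1.channels=c1 c2   # a source may feed many channels
agent.sinks.k1.channel=c1         # but each sink drains exactly one
agent.sinks.k2.channel=c2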
Flume Configuration File

• A simple Java property file of key/value pairs
• Several agents can be configured in a single file
  – agents are identified by an agent identifier (called a name)
• Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>
• Each source, channel, and sink has a unique name within the context of its agent
  – e.g. the prefix for a channel named access is
    agent.channels.access
• Each item has a type
  – e.g. the in-memory channel's type is memory
    agent.channels.access.type=memory
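Putting the two conventions together for the access channel from the example above, a minimal sketch:

# Declare the channel by name, then configure it under its prefix
agent.channels=access
agent.channels.access.type=memory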
Hello, World!

agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type = spooldir
agent.sources.s1.spoolDir = /etc/spool
…
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://localhost:9001/user/hduser/log-data
…
Hello, World!

• The config has one agent (called agent) with
  – a source named s1
  – a channel named c1
  – a sink named k1
• The s1 source's type is spooldir
  – files appearing in /etc/spool are ingested
• The type of the sink named k1 is hdfs
  – it writes data files to log-data
Command Line Usage

% flume-ng help
Usage: /usr/local/flume/apache-flume-1.6.0-bin/bin/flume-ng <command> [options]...

commands:
  help          display this help text
  agent         run a Flume agent
  avro-client   run an avro Flume client
  version       show Flume version info
  …
Command Line Usage

• The agent command requires two parameters:
  – a configuration file to use and
  – the agent name
• Example (a fuller sketch follows below):
  % flume-ng agent -n agent -f myConf.conf …
• Test:
  % cp log-data/* /etc/spool
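A fuller invocation sketch; the --conf directory (where flume-env.sh and log4j.properties live) is an assumed path, and the -D flag simply turns on console logging for testing:

% flume-ng agent \
    --conf /usr/local/flume/conf \
    -f myConf.conf \
    -n agent \
    -Dflume.root.logger=INFO,console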
Source Code

public class MySource implements PollableSource {
  public Status process() throws EventDeliveryException {
    // Do something to create an Event.
    // (EventBuilder.withBody(...) returns the Event directly.)
    Event e = EventBuilder.withBody(…);
    // A channel instance is injected by Flume.
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      channel.put(e);
      tx.commit();
    } catch (ChannelException ex) {
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}
Sink Code

// Flume 1.x sinks implement org.apache.flume.Sink
public class MySink implements Sink {
  public Status process() throws EventDeliveryException {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event e = channel.take();
      if (e != null) {
        // … write the event out …
        tx.commit();
      } else {
        // Nothing to take: commit the empty transaction and back off.
        tx.commit();
        return Status.BACKOFF;
      }
    } catch (ChannelException ex) {
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}