Big Data Products and Practices. Venkatesh Vinayakarao (Vv), venkateshv@cmi.ac.in, http://vvtesh.co.in
Cloud Platforms Cloud Services: • SaaS • PaaS • IaaS 2
Cloud Platforms Ref: https://maelfabien.github.io/bigdata/gcps_1/#what-is-gcp 3
Storage Service (Amazon S3 Example) 4
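A minimal sketch of programmatic access to S3 using the boto3 SDK; the bucket and object names are hypothetical, and AWS credentials are assumed to be configured locally.

import boto3

s3 = boto3.client("s3")

# Upload a local file into a bucket (the bucket must already exist).
s3.upload_file("report.csv", "my-demo-bucket", "reports/report.csv")

# Download the object back to a local file.
s3.download_file("my-demo-bucket", "reports/report.csv", "copy.csv")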
Compute Services (Google Cloud Example) 5
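As a sketch of driving compute resources programmatically, the google-cloud-compute Python client can list the VM instances in a zone; the project and zone names here are placeholders.

from google.cloud import compute_v1

# List Compute Engine VM instances in one zone of a (hypothetical) project.
client = compute_v1.InstancesClient()
for instance in client.list(project="my-project", zone="us-central1-a"):
    print(instance.name, instance.status)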
Network Services (Azure Example) • Azure Traffic Manager is a DNS-based traffic load balancer that distributes traffic optimally to services across global Azure regions, while providing high availability. • Traffic Manager directs client requests to the most appropriate service endpoint. 6
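Because Traffic Manager works at the DNS level, a client simply resolves the profile's DNS name and receives whichever endpoint Traffic Manager selects for it; a sketch with a hypothetical profile name:

import socket

# Resolving the Traffic Manager DNS name returns the IP of the
# endpoint chosen for this client (e.g., the nearest healthy region).
endpoint_ip = socket.gethostbyname("myapp.trafficmanager.net")
print(endpoint_ip)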
Building Great Apps/Services • We need products that make certain features easy to implement: • Visualization • Crawling/Search • Log Aggregation • Graph DB • Synchronization 7
Tableau 8
Crawling with Nutch Solr Integration Image Src: https://suyashaoc.wordpress.com/2016/12/04/nutch-2-3-1-hbase-0-98-8-hadoop-2-5-2-solr-4-1-web-crawling-and-indexing/ 9
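Once Nutch has pushed crawled pages into Solr, the index can be searched over HTTP; a sketch using the requests package, where the core name "nutch" and the field names are assumptions about the local setup.

import requests

# Query the Solr core that Nutch indexes into.
resp = requests.get(
    "http://localhost:8983/solr/nutch/select",
    params={"q": "content:hadoop", "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("title"), doc.get("url"))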
Log Files are an Important Source of Big Data 10
Log4j 11
Flume Flume Config Files 12
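A minimal single-agent Flume config sketch (the agent name a1 is illustrative): a netcat source feeding a logger sink through a memory channel.

# Name the components of agent a1.
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for lines of text on a local TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Sink: write events to the log.
a1.sinks.k1.type = logger

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Wire the source and sink to the channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1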
Sqoop Designed for efficiently transferring bulk data between Hadoop (unstructured) and RDBMS (structured); Sqoop2 is its next-generation successor. An example import is sketched below. 13
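A sketch of a typical import, in the same CLI style as the other tools in this deck; the connection string, credentials, and table name are hypothetical.

> bin/sqoop import --connect jdbc:mysql://localhost/corp --username dbuser --table employees --target-dir /data/employees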
GraphDB – Neo4j An ACID-compliant graph database management system 14
Neo4j • A leading graph database, with native graph storage and processing. • Open Source • NoSQL • ACID compliant Neo4j Sandbox: https://sandbox.neo4j.com/ Neo4j Desktop: https://neo4j.com/download 15
Data Model • create (p:Person {name:'Venkatesh'})-[:Teaches]->(c:Course {name:'BigData'}) 16
Query Language • Cypher Query Language • Similar to SQL • Optimized for graphs • Used by Neo4j, SAP HANA Graph, Redis Graph, etc. 17
CQL • create (p:Person {name:'Venkatesh'})-[:Teaches]->(c:Course {name:'BigData'}) • Don't forget the single quotes. 18
CQL • Match (n) return n 19
CQL • match(p:Person {name:'Venkatesh'}) set p.surname='Vinayakarao' return p 20
CQL • Create (p:Person {name:'Raj'})-[:StudentOf]->(o:Org {name:'CMI'}) • Match (n) return n 21
CQL • create (p:Person {name:'Venkatesh'})-[:FacultyAt]->(o:Org {name:'CMI'}) • Match (n) return n 22
CQL • MATCH (p:Person {name:'Venkatesh'})-[r:FacultyAt]->() • DELETE r • MATCH (p:Person) where ID(p)=4 • DELETE p • MATCH (o:Org) where ID(o)=5 • DELETE o • MATCH (a:Person),(b:Org) • WHERE a.name = 'Venkatesh' AND b.name = 'CMI' • CREATE (a)-[:FacultyAt]->(b) 23
CQL create (p:Person {name:'Isha'}) MATCH (a:Person),(b:Course) WHERE a.name = 'Isha' and b.name = 'BigData' CREATE (a)-[:StudentOf]->(b) MATCH (a:Person)-[o:StudentOf]->(b:Course) where a.name = 'Isha' DELETE o MATCH (a:Person),(b:Org) WHERE a.name = 'Isha' and b.name = 'CMI' CREATE (a)-[:StudentOf]->(b) MATCH (a:Person),(b:Course) WHERE a.name = 'Isha' and b.name = 'BigData' CREATE (a)-[:EnrolledIn]->(b) 24
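The same Cypher statements can be issued from application code; a sketch using the official neo4j Python driver, where the URI, credentials, and data are illustrative.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Parameterized Cypher avoids the quoting pitfalls noted above.
    session.run(
        "CREATE (p:Person {name:$name})-[:Teaches]->(c:Course {name:$course})",
        name="Venkatesh", course="BigData",
    )
    for record in session.run("MATCH (n) RETURN n"):
        print(record["n"])

driver.close()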
Apache ZooKeeper A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. A ZooKeeper ensemble serves clients. It is simple to store data using ZooKeeper; data is stored hierarchically:
$ create /zk_test my_data
$ set /zk_test junk
$ get /zk_test
junk
$ delete /zk_test
25
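The same znode operations are available from client libraries; a sketch using the kazoo Python client against a local ensemble, mirroring the CLI session above.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.create("/zk_test", b"my_data")   # create a znode with initial data
zk.set("/zk_test", b"junk")         # overwrite its data
data, stat = zk.get("/zk_test")     # read data and znode metadata
print(data)                         # b'junk'
zk.delete("/zk_test")

zk.stop()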
Stream Processing • Process data as they arrive. 26
Stream Processing with Storm One of these nodes is the master node. In MR parlance, "Nimbus" is the "job tracker" and the "Supervisor" process is our "task tracker". 27
Apache Kafka • Uses Publish-Subscribe Mechanism 28
Kafka – Tutorial (Single Node)
• Create a topic
> bin/kafka-topics.sh --create --topic test --bootstrap-server localhost:9092
• List all topics
> bin/kafka-topics.sh --list --bootstrap-server localhost:9092
test
• Send messages
> bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
This is a message
This is another message
• Receive messages (subscribed to a topic)
> bin/kafka-console-consumer.sh --topic test --from-beginning --bootstrap-server localhost:9092
This is a message
This is another message
29
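The console tools map directly onto client APIs; a sketch using the kafka-python package against the same local broker and topic.

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test", b"This is a message")
producer.send("test", b"This is another message")
producer.flush()

consumer = KafkaConsumer(
    "test",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # like --from-beginning
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for msg in consumer:
    print(msg.value.decode())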
Kafka – Multi-node • A topic is a stream of records. • For each topic, the Kafka cluster maintains a partitioned log. • Records in the partitions are each assigned a sequential ID number called the offset. 30
Kafka Brokers
• For Kafka, a single broker is just a cluster of size one.
• We can set up multiple brokers. The broker.id property is the unique and permanent name of each node in the cluster.
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
• Now we can create topics with a replication factor:
> bin/kafka-topics.sh --create --replication-factor 3 --partitions 1 --topic my-replicated-topic --bootstrap-server localhost:9092
> bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Topic: my-replicated-topic PartitionCount: 1 ReplicationFactor: 3
Partition: 0 Leader: 2 Replicas: 1,2,0
31
Streams API 32
Apache Kinesis • Amazon Kinesis Data Streams is a managed service that scales elastically for real-time processing of streaming big data. "Netflix uses Amazon Kinesis to monitor the communications between all of its applications so it can detect and fix issues quickly, ensuring high service uptime and availability to its customers." – Amazon (https://aws.amazon.com/kinesis/). 33
Amazon Kinesis capabilities • Video Streams • Data Streams • Firehose • Analytics https://aws.amazon.com/kinesis/ 34
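A sketch of writing to a Kinesis data stream with boto3; the stream name is hypothetical and the stream is assumed to already exist.

import json
import boto3

kinesis = boto3.client("kinesis")

# The partition key determines which shard receives the record.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u42", "action": "play"}).encode("utf-8"),
    PartitionKey="u42",
)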
Apache Spark (A Unified Library) In Spark, use DataFrames as tables. https://spark.apache.org/ 35
Resilient Distributed Datasets (RDDs) Input data (data.txt) is loaded into an RDD; transformations (map, filter, …) produce new RDDs; actions (reduce, count, …) compute results. 36
Spark Examples A distributed dataset can be used in parallel:
val distFile = sc.textFile("data.txt")
distFile.map(s => s.length).reduce((a, b) => a + b)
Map/reduce: passing functions through Spark. 37
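The same computation as a PySpark sketch, summing the line lengths of a text file (the path is a placeholder):

from pyspark import SparkContext

sc = SparkContext("local", "LineLengths")

dist_file = sc.textFile("data.txt")  # transformation: lazily builds the RDD
total = dist_file.map(lambda s: len(s)).reduce(lambda a, b: a + b)  # action: runs the job
print(total)

sc.stop()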
Thank You 38