NoSQL: HBase and Neo4j
Academic Year 2018/19
Fabiana Rossi
Master's Degree in Computer Engineering (Laurea Magistrale in Ingegneria Informatica), 2nd year
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack
• High-level Interfaces
• Support / Integration
• Data Processing
• Data Storage
• Resource Management

Fabiana Rossi - SABD 2018/19

Column-family data model
• Strongly aggregate-oriented
  – Lots of aggregates
  – Each aggregate has a key
• Similar to a key/value store, but the value can have multiple attributes (columns)
• Data model: a two-level map structure:
  – A set of <row-key, aggregate> pairs
  – Each aggregate is a group of <column-key, value> pairs
  – Column: a set of data values of a particular type
• The structure of the aggregate is visible
• Columns can be organized in families
  – Data usually accessed together

Suitable use cases for column-family stores
• Queries that involve only a few columns
• Aggregation queries against vast amounts of data
  – E.g., average age of all of your users
• Column-wise compression
• Well-suited for OLAP-like workloads (e.g., data warehouses), which typically involve highly complex queries over all data (possibly petabytes)
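As a minimal sketch of the two-level map structure described above, the Java class below stores each row as an aggregate of <column-key, value> pairs, with column keys written as "family:qualifier". The class and method names (ColumnFamilyStore, put, get, getFamily) are illustrative assumptions and do not belong to any real column-family store.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative two-level map: row key -> (column key -> value).
// Column keys use the "family:qualifier" convention; all names are hypothetical.
class ColumnFamilyStore {
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    // Store one value under (rowKey, family:qualifier), creating the aggregate if needed.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    // Return a single value, or null if the row or the column is absent.
    String get(String rowKey, String family, String qualifier) {
        SortedMap<String, String> aggregate = rows.get(rowKey);
        return aggregate == null ? null : aggregate.get(family + ":" + qualifier);
    }

    // Return all columns of one family for a row (data usually accessed together).
    SortedMap<String, String> getFamily(String rowKey, String family) {
        SortedMap<String, String> aggregate = rows.get(rowKey);
        if (aggregate == null) {
            return new TreeMap<>();
        }
        // ';' is the character right after ':', so the range covers every "family:*" key.
        return new TreeMap<>(aggregate.subMap(family + ":", family + ";"));
    }
}
```

For example, after store.put("cutting", "info", "state", "IT"), the call store.get("cutting", "info", "state") returns "IT", and a lookup on a missing row returns null.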

HBase
• Apache HBase:
  – Open-source implementation providing Bigtable-like capabilities on top of Hadoop and HDFS
  – CP system (in the CAP space)
• Data model
  – HBase is based on Google's Bigtable model
  – A table stores rows, sorted in lexicographic order of their row keys
  – A row consists of a set of columns
  – Columns are grouped in column families
  – A table defines its column families a priori (but not the columns within the families)

Row key   Column key    Timestamp       Cell value
cutting   info:state    1273516197868   IT
parser    role:Hadoop   1273616297466   g91m
(info and role are column families)

HBase: Auto-sharding
Region:
• The basic unit of scalability and load balancing
• Similar to the tablet in Bigtable
• A contiguous range of rows stored together
• Each region is served by exactly one region server
• Regions are dynamically split by the system when they become too large
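The idea that a region is a contiguous range of sorted row keys, served by exactly one region server, can be sketched with a sorted map from region start key to server name. This is only an illustration of the lookup logic under simplifying assumptions: RegionDirectory, split and serverFor are hypothetical names, and real HBase decides on its own when and where to split a region.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: maps each region's start key to the server hosting it.
// A region covers [startKey, nextStartKey); the first region starts at "".
class RegionDirectory {
    private final TreeMap<String, String> regionStartToServer = new TreeMap<>();

    RegionDirectory(String initialServer) {
        // A new table starts with one region covering all row keys.
        regionStartToServer.put("", initialServer);
    }

    // Split the region containing splitKey; rows >= splitKey move to newServer.
    void split(String splitKey, String newServer) {
        regionStartToServer.put(splitKey, newServer);
    }

    // Locate the single server responsible for a row key.
    String serverFor(String rowKey) {
        // floorEntry finds the greatest start key <= rowKey, i.e. the enclosing region.
        Map.Entry<String, String> region = regionStartToServer.floorEntry(rowKey);
        return region.getValue();
    }

    int regionCount() {
        return regionStartToServer.size();
    }
}
```

Because the start keys are kept sorted, finding the region for a row key is a logarithmic-time lookup, and a split only adds one new boundary.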

HBase: Architecture
Three major components:
• The client library
• One master server
  – The master is responsible for assigning regions to region servers and uses Apache ZooKeeper to facilitate that task
• Many region servers
  – Manage the persistence of data
  – Region servers can be added or removed while the system is up and running to accommodate changing workloads

[Figure: HBase architecture]

[Figure: Regions]

[Figure: HBase HMaster]

[Figure: ZooKeeper: the Coordinator]

[Figure: HBase First Read or Write]

[Figure: HBase Write Steps]

[Figure: HBase HFile]

HBase: Versioning
• Cells may exist in multiple versions, and different columns may have been written at different times. By default, the API provides a coherent view of all columns, automatically picking the most current value of each cell.

HBase: Strengths
• The column-oriented architecture allows for huge, wide, sparse tables, as storing NULLs is free
• Highly scalable due to the flexible schema and row-level atomicity
• Since a row is served by exactly one server, HBase is strongly consistent, and its multi-versioning can help you avoid edit conflicts
• The storage format is ideal for reading adjacent key/value pairs
• Table scans run in linear time, while row key lookups and mutations run in logarithmic time
• Bigtable has been in use for a variety of different use cases, from batch-oriented processing to real-time data serving
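The versioning behavior described above can be illustrated with a small sketch: a cell keeps several timestamped values, and a default read returns the most current one. VersionedCell and its methods are hypothetical names chosen for this example; the sketch does not reproduce how HBase stores or limits versions.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: one cell holding several timestamped versions.
class VersionedCell {
    // Newest timestamp first, so the first entry is the most current value.
    private final NavigableMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Default read: the most current value, or null if the cell is empty.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Read as of a given time: newest version with timestamp <= asOf, or null.
    String asOf(long asOf) {
        // With the reversed ordering, ceilingEntry walks towards older timestamps.
        Map.Entry<Long, String> entry = versions.ceilingEntry(asOf);
        return entry == null ? null : entry.getValue();
    }
}
```

Keeping older versions around is what lets a reader ask for the value as of an earlier timestamp instead of only the latest write.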

Hands-on HBase (Docker image)

HBase with Docker
• We use a lightweight container with a standalone HBase:

    $ docker pull harisekhon/hbase:1.4

• We can now create an instance of HBase; since we are interested in using it from our local machine, we need to forward several HBase ports and update the hosts file:

    $ docker run -ti --name=hbase-docker -h hbase-docker \
        -p 2181:2181 -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 \
        -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 \
        harisekhon/hbase:1.4

    # append the following line to /etc/hosts
    127.0.0.1 hbase-docker

HBase Client
• We interact with HBase through its Java APIs
• Using Maven, include the hbase-client dependency:

    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>1.4.2</version>
    </dependency>

HBase Client

    public Connection getConnection() throws ... {
        Configuration conf = HBaseConfiguration.create();
        // Tell the client where ZooKeeper and the HBase master are
        conf.set("hbase.zookeeper.quorum", ZOOKEEPER_HOST);
        conf.set("hbase.zookeeper.property.clientPort", ZOOKEEPER_PORT);
        conf.set("hbase.master", HBASE_MASTER);

        /* Check configuration */
        HBaseAdmin.checkHBaseAvailable(conf);

        Connection connection = ConnectionFactory.createConnection(conf);
        return connection;
    }

This is only an excerpt, check the HBaseClient.java file

HBase Client: Create Table

    public void createTable(String table, String... columnFamilies) {
        Admin admin = ...
        HTableDescriptor tableDescriptor = ... table ...
        for (String columnFamily : columnFamilies) {
            // addFamily expects a column family descriptor, not a plain String
            tableDescriptor.addFamily(new HColumnDescriptor(columnFamily));
        }
        admin.createTable(tableDescriptor);
    }

This is only an excerpt, check the HBaseClient.java file

HBase Client: Drop Table

    public void dropTable(String table) {
        Admin admin = ...
        TableName tableName = ... table ...
        // To delete a table or change its settings,
        // you need to first disable the table
        admin.disableTable(tableName);
        admin.deleteTable(tableName);
    }

This is only an excerpt, check the HBaseClient.java file

HBase Client: Put Data

    public void put(String table, String rowKey, String columnFamily,
                    String column, String value) {
        Table hTable = getConnection().getTable( ... table ... );
        // b(...) is a helper defined in HBaseClient.java (presumably a String-to-bytes conversion)
        Put p = new Put(b(rowKey));
        p.addColumn(b(columnFamily), b(column), b(value));
        // Save the Put instance to the table
        hTable.put(p);
        hTable.close();
    }

This is only an excerpt, check the HBaseClient.java file

HBase Client: Get Data

    public String get(String table, String rowKey, String columnFamily,
                      String column) {
        Table hTable = getConnection().getTable( ... table ... );
        Get g = new Get(b(rowKey));
        g.addColumn(b(columnFamily), b(column));
        Result result = hTable.get(g);
        // getValue needs the column family and the column qualifier
        byte[] value = result.getValue(b(columnFamily), b(column));
        hTable.close();
        return Bytes.toString(value);
    }

This is only an excerpt, check the HBaseClient.java file

HBase Client: Delete Data

    public void delete(String table, String rowKey) {
        Table hTable = getConnection().getTable( ... table ... );
        Delete delete = new Delete(b(rowKey));
        // Delete the data
        hTable.delete(delete);
        // Close the table object
        hTable.close();
    }

This is only an excerpt, check the HBaseClient.java file

Graph data model
• Uses graph structures
  – Nodes are the entities and have a set of attributes
  – Edges are the relationships between the entities
    • E.g.: an author writes a book
  – Edges can be directed or undirected
  – Nodes and edges also have individual properties consisting of key-value pairs
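As a rough sketch of the graph data model above, the following Java class represents nodes and directed edges, both carrying key-value properties. It is a plain in-memory illustration with hypothetical names (PropertyGraph, addNode, addEdge, neighbors), not the data structure or API of Neo4j or any other graph database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative property graph: nodes and directed edges both carry key-value properties.
class PropertyGraph {
    static class Node {
        final String id;
        final Map<String, Object> properties = new HashMap<>();
        Node(String id) { this.id = id; }
    }

    static class Edge {
        final Node from;
        final Node to;
        final String type;
        final Map<String, Object> properties = new HashMap<>();
        Edge(Node from, Node to, String type) {
            this.from = from;
            this.to = to;
            this.type = type;
        }
    }

    private final Map<String, Node> nodes = new HashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    // Return the node with this id, creating it on first use.
    Node addNode(String id) {
        return nodes.computeIfAbsent(id, Node::new);
    }

    // Add a directed, typed edge between two nodes (created if missing).
    Edge addEdge(String fromId, String toId, String type) {
        Edge edge = new Edge(addNode(fromId), addNode(toId), type);
        edges.add(edge);
        return edge;
    }

    // Follow outgoing edges of a given type from one node.
    List<Node> neighbors(String fromId, String type) {
        List<Node> result = new ArrayList<>();
        for (Edge edge : edges) {
            if (edge.from.id.equals(fromId) && edge.type.equals(type)) {
                result.add(edge.to);
            }
        }
        return result;
    }
}
```

The "an author writes a book" example becomes addEdge("author", "book", "WRITES"); because the edge is directed, neighbors("author", "WRITES") returns the book node while neighbors("book", "WRITES") returns an empty list.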

Graph data model
• Powerful data model
  – Unlike other types of NoSQL stores, it is centered on relationships
  – Focus on visual representation of information (more human-friendly than other NoSQL stores)
  – Other types of NoSQL stores handle interconnected data poorly
• Cons:
  – Sharding: data partitioning is difficult
  – Horizontal scalability
    • When related nodes are stored on different servers, a traversal must cross multiple servers, which hurts performance
  – Requires rewiring your brain

Suitable use cases for graph databases
• Good for applications where you need to model entities and relationships between them
  – Social networking applications
  – Pattern recognition
  – Dependency analysis
  – Recommendation systems
  – Solving path finding problems raised in navigation systems
  – …
• Good for applications in which the focus is on querying for relationships between entities and analyzing relationships
  – Computing relationships and querying related entities is simpler and faster than in an RDBMS
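To make the path finding use case concrete, here is a minimal breadth-first search over an adjacency list, which returns a shortest path by number of hops. This is a generic textbook technique shown under simplifying assumptions (unweighted, directed edges held in memory); PathFinder and shortestPath are hypothetical names, and a graph database would normally expose such traversals through its own query language.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class PathFinder {
    // Breadth-first search; returns the shortest path by hop count,
    // or an empty list if the target cannot be reached from the source.
    static List<String> shortestPath(Map<String, List<String>> adjacency,
                                     String source, String target) {
        Map<String, String> parent = new HashMap<>(); // visited node -> predecessor
        Deque<String> queue = new ArrayDeque<>();
        parent.put(source, null);
        queue.add(source);

        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                // Walk the predecessor chain back to the source.
                LinkedList<String> path = new LinkedList<>();
                for (String node = target; node != null; node = parent.get(node)) {
                    path.addFirst(node);
                }
                return path;
            }
            for (String next : adjacency.getOrDefault(current, Collections.emptyList())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Each node is visited at most once, so the search cost grows with the part of the graph that is actually reachable rather than with the total size of the data set.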
