Macroarea di Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica NoSQL: HBase and Neo4j A.A. 2018/19 Fabiana Rossi Laurea Magistrale in Ingegneria Informatica - II anno The reference Big Data stack High-level Interfaces Support / Integration Data Processing Data Storage Resource Management Fabiana Rossi - SABD 2018/19 1
Column-family data model • Strongly aggregate-oriented – Lots of aggregates – Each aggregate has a key • Similar to a key/value store, but the value can have multiple attributes ( columns ) • Data model: a two-level map structure: – A set of <row-key, aggregate> pairs – Each aggregate is a group of pairs <column-key, value> – Column: a set of data values of a particular type • Structure of the aggregate visible • Columns can be organized in families – Data usually accessed together Fabiana Rossi - SABD 2018/19 2 Suitable use cases for column-family stores • Queries that involve only a few columns • Aggregation queries against vast amounts of data - E.g., average age of all of your users • Column-wise compression • Well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve highly complex queries over all data (possibly petabytes) Fabiana Rossi - SABD 2018/19 3
HBase • Apache HBase: – open-source implementation providing Bigtable-like capabilities on top of Hadoop and HDFS – CP system (in the CAP space) • Data Model – HBase is based on Google's Bigtable model – A table store rows, sorted in alphanumerical order – A row consists of a set of columns – Columns are grouped in column families – A table defines a priori its column families (but not the columns within the families) Row key Column key Timestamp Cell value cutting info:state 1273516197868 IT parser role:Hadoop 1273616297466 g91m ( info and role are column families) Fabiana Rossi - SABD 2018/19 4 HBase: Auto-sharding Region: • the basic unit of scalability and load balancing • similar to the tablet in Bigtable • a contiguous range of rows stored together • each region is served by exactly one region server • they are dynamically split by the system when they become too large Fabiana Rossi - SABD 2018/19 5
HBase: Architecture Three major components: • the client library • one master server – The master is responsible for assigning regions to region servers and uses Apache ZooKeeper to facilitate that task • many region servers – manage the persistence of data – region servers can be added or removed while the system is up and running to accommodate changing workloads Fabiana Rossi - SABD 2018/19 6 HBase: Architecture Fabiana Rossi - SABD 2018/19 7
Regions Fabiana Rossi - SABD 2018/19 8 HBase HMaster Fabiana Rossi - SABD 2018/19 9
ZooKeeper: the Coordinator Fabiana Rossi - SABD 2018/19 10 HBase First Read or Write Fabiana Rossi - SABD 2018/19 11
HBase Write Steps Fabiana Rossi - SABD 2018/19 12 HBase HFile Fabiana Rossi - SABD 2018/19 13
HBase: Versioning • Cells may exist in multiple versions, and different columns have been written at different times. By default, the API provides a coherent view of all columns wherein it automatically picks the most current value of each cell. Fabiana Rossi - SABD 2018/19 14 HBase: Strengths • The column-oriented architecture allows for huge, wide, sparse tables as storing NULLs is free. • Highly scalable due to the flexible schema and row- level atomicity • Since a row is served by exactly one server, HBase is strongly consistent, and using its multi-versioning can help you to avoid edit conflicts • The storage format is ideal for reading adjacent key/value pairs • Table scans run in linear time and row key lookups or mutations are performed in logarithmic order • Bigtable has been in use for a variety of different use cases from batch-oriented processing to real-time data- serving Fabiana Rossi - SABD 2018/19 15
Hands-on HBase (Docker image) Fabiana Rossi - SABD 2018/19 HBase with Dockers • We use a lightweight container with a standalone HBase $ docker pull harisekhon/hbase:1.4 • We can now create an instance of HBase; since we are interesting to use it from our local machine, we need to forward several HBase ports and update the hosts file; $ docker run -ti --name=hbase-docker -h hbase-docker -p 2181:2181 -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 harisekhon/hbase:1.4 # append the following line to /etc/hosts 127.0.0.1 hbase-docker Fabiana Rossi - SABD 2018/19 17
HBase Client • We interact with HBase through its Java APIs • Using Maven, include the hbase-client dependency: <dependency> <groupId>org.apache.hbase</groupId> <artifactId>hbase-client</artifactId> <version>1.4.2</version> </dependency> Fabiana Rossi - SABD 2018/19 18 HBase Client public Connection getConnection() throws ... { Configuration conf = HBaseConfiguration.create(); conf.set("hbase.zookeeper.quorum", ZOOKEEPER_HOST); conf.set("hbase.zookeeper.property.clientPort", ZOOKEEPER_PORT); conf.set("hbase.master", HBASE_MASTER); /* Check configuration */ HBaseAdmin.checkHBaseAvailable(conf); Connection connection = connectionFactory.createConnection(conf); return connection; } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 19
HBase Client: Create Table public void createTable(String table, String... columnFamilies) { Admin admin = ... HTableDescriptor tableDescriptor = ... table ... for (String columnFamily : columnFamilies) { tableDescriptor.addFamily(columnFamily); } admin.createTable(tableDescriptor); } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 20 HBase Client: Drop Table public void dropTable(String table) { Admin admin = ... TableName tableName = ... table ... // To delete a table or change its settings, // you need to first disable the table admin.disableTable(tableName); admin.deleteTable(tableName); } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 21
HBase Client: Put Data public void put(String table, String rowKey, String columnFamily, String column, String value) { Table hTable = getConnection().getTable( ... table ... ); Put p = new Put(b(rowKey)); p.addColumn(b(columnFamily), b(column), b(value)); // Saving the put Instance to the HTable hTable.put(p); hTable.close(); } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 22 HBase Client: Get Data public String get(String table, String rowKey, String columnFamily, String column) { Table hTable = getConnection().getTable( ... table ... ); Get g = new Get(b(rowKey)); g.addColumn(b(columnFamily), b(column)); Result result = hTable.get(g); return Bytes.toString(result.getValue()); } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 23
HBase Client: Delete Data public void delete(String table, String rowKey) { Table hTable = getConnection().getTable( ... table ... ); Delete delete = new Delete(b(rowKey)); // deleting the data hTable.delete(delete); // closing the HTable object hTable.close(); } This is only an excerpt, check the HBaseClient.java file Fabiana Rossi - SABD 2018/19 24 Graph data model • Uses graph structures – Nodes are the entities and have a set of attributes – Edges are the relationships between the entities • E.g.: an author writes a book – Edges can be directed or undirected – Nodes and edges also have individual properties consisting of key-value pairs Fabiana Rossi - SABD 2018/19 25
Graph data model • Powerful data model – Differently from other types of NoSQL stores, it concerns itself with relationships – Focus on visual representation of information (more human- friendly than other NoSQL stores) – Other types of NoSQL stores are poor for interconnected data • Cons: – Sharding: data partitioning is difficult – Horizontal scalability • When related nodes are stored on different servers, traversing multiple servers is not performance-efficient – Requires rewiring your brain Fabiana Rossi - SABD 2018/19 26 Suitable use cases for graph databases • Good for applications where you need to model entities and relationships between them – Social networking applications – Pattern recognition – Dependency analysis – Recommendation systems – Solving path finding problems raised in navigation systems – … • Good for applications in which the focus is on querying for relationships between entities and analyzing relationships – Computing relationships and querying related entities is simpler and faster than in RDBMS Fabiana Rossi - SABD 2018/19 27
Recommend
More recommend