Università degli Studi di Roma "Tor Vergata"
Dipartimento di Ingegneria Civile e Ingegneria Informatica

NoSQL Databases
Course: Systems and Architectures for Big Data (Sistemi e Architetture per Big Data, SABD), academic year 2016/17
Valeria Cardellini

The reference Big Data stack
[Figure: stack layers]
• High-level Interfaces
• Support / Integration
• Data Processing
• Data Storage
• Resource Management

Valeria Cardellini - SABD 2016/17
Traditional RDBMSs
• RDBMSs: the traditional technology for storing structured data in web and business applications
• SQL is good
  – Rich language and toolset
  – Easy to use and integrate
  – Many vendors
• They promise ACID guarantees

ACID properties
• Atomicity
  – Either all statements in a transaction are executed, or the whole transaction is aborted without affecting the database ("all or nothing" principle)
• Consistency
  – A database is in a consistent state before and after a transaction
• Isolation
  – Transactions cannot see uncommitted changes in the database (i.e., the results of incomplete transactions are not visible to other transactions)
• Durability
  – Changes are written to disk before the database commits a transaction, so that committed data cannot be lost through a power failure
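The "all or nothing" principle of atomicity can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the accounts table, the account names and the transfer helper are assumptions made for illustration and do not come from the slides.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` inside a single transaction.

    Returns True if the transaction committed, False if it was rolled back.
    """
    try:
        # Used as a context manager, the connection commits when the block
        # succeeds and rolls back when an exception escapes from it.
        with conn:
            # Credit first, so that a failing debit leaves work to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # performed above has been rolled back together with the debit.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT NOT NULL PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # neither update is applied
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to bob succeeds but the debit would make alice's balance negative, so the whole transaction is rolled back and neither balance changes.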
RDBMS constraints
• Domain constraints
  – Restrict the domain of each attribute, i.e., the set of possible values for the attribute
• Entity integrity constraint
  – No primary key value can be null
• Referential integrity constraint
  – To maintain consistency among the tuples in two relations: every value of one attribute of a relation should exist as a value of another attribute in another relation
• Foreign key
  – To cross-reference between multiple relations: it is a key in a relation that matches the primary key of another relation

Pros and cons of RDBMS
Pros
• Well-defined consistency model
• ACID guarantees
• Relational integrity maintained through entity and referential integrity constraints
• Well suited for OLTP apps
  – OLTP: online transaction processing
• Sound theoretical foundation
• Stable and standardized DBMSs available
• Well understood
Cons
• Performance as major constraint, scaling is difficult
• Limited support for complex data structures
• Complete knowledge of DB structure required to create ad hoc queries
• Commercial DBMSs are expensive
• Some DBMSs have limits on field size
• Data integration from multiple RDBMSs can be cumbersome
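The domain, entity integrity and referential integrity constraints listed in the RDBMS constraints slide can be tried out with a minimal sketch based on Python's built-in sqlite3 module. The hotel and room tables, their columns and the rejected helper are hypothetical and chosen only for illustration. Two SQLite-specific assumptions are marked in the comments: foreign keys are enforced only after enabling them, and primary keys are declared NOT NULL explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement must be switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE hotel (
        code TEXT NOT NULL PRIMARY KEY,  -- entity integrity: key cannot be null
        name TEXT NOT NULL
    );
    CREATE TABLE room (
        number     TEXT NOT NULL PRIMARY KEY,
        hotel_code TEXT NOT NULL REFERENCES hotel(code),       -- foreign key
        beds       INTEGER NOT NULL CHECK (beds BETWEEN 1 AND 4)  -- domain
    );
    INSERT INTO hotel VALUES ('ace', 'Ace Hotel');
    """
)


def rejected(sql):
    """Return True if the DBMS refuses the statement with an integrity error."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


domain_violation = rejected("INSERT INTO room VALUES ('101', 'ace', 9)")
entity_violation = rejected("INSERT INTO hotel VALUES (NULL, 'Nameless')")
referential_violation = rejected("INSERT INTO room VALUES ('102', 'nowhere', 2)")
valid_insert = not rejected("INSERT INTO room VALUES ('103', 'ace', 2)")
print(domain_violation, entity_violation, referential_violation, valid_insert)
```

Each of the first three statements violates exactly one kind of constraint, while the last one satisfies all of them and is accepted.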
RDBMS challenges
• Web-based applications caused spikes
  – Internet-scale data size
  – High read-write rates
  – Frequent schema changes
• Let's scale RDBMSs
  – RDBMSs were not designed to be distributed
• Possible solutions:
  – Replication
  – Sharding

Replication
• Master/slave architecture
• Scales read operations
• Write operations? They do not scale, because every write still goes through the master
Sharding
• Horizontal partitioning of data across many separate servers
• Scales read and write operations
• Cannot execute transactions across shards (partitions)
• Consistent hashing is one form of sharding
  - Hash both data and nodes with the same hash function into the same ID space

Scaling RDBMSs is expensive and inefficient
[Figure. Source: Couchbase technical report]
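Consistent hashing, mentioned in the Sharding slide, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any particular data store: MD5 is used only as a convenient, stable hash function, each node gets a single position on the ring (no virtual nodes), and the class name, node names and keys are hypothetical.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Hash a data key or a node name into the same ID space."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


class ConsistentHashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted node positions on the ring
        self._owners = {}     # position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        if pos in self._owners:
            raise ValueError(f"position collision for node {node!r}")
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        if not self._positions:
            raise LookupError("the ring has no nodes")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owners[self._positions[idx % len(self._positions)]]


keys = [f"user:{i}" for i in range(1000)]
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

ring.remove_node("node-d")
restored = {k: ring.node_for(k) for k in keys}
print(f"{len(moved)} of {len(keys)} keys moved when node-d joined")
```

The point of the design is that adding a node relocates only the keys that fall into the new node's arc of the ring; all other keys keep their owner, and removing the node hands its keys back to the successor.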
NoSQL data stores
• NoSQL = Not Only SQL
  – SQL-style querying is not the crucial objective
• Main features of NoSQL data stores
  – Avoid unneeded complexity
  – Support flexible schema
  – Scale horizontally
  – Provide scalability and high availability by storing and replicating data in distributed systems, often across datacenters
  – Useful when working with Big Data whose nature does not require a relational model
    • Traditional join operations cannot be used
  – Do not typically support ACID properties, but rather BASE
    • Compromising reliability for better performance

ACID vs BASE
• Two design philosophies at opposite ends of the consistency-availability spectrum
  - Keep in mind the CAP theorem: pick two of Consistency, Availability and Partition tolerance
• ACID: the traditional approach to address the consistency issue in RDBMSs
  – A pessimistic approach: prevent conflicts from occurring
    • Usually implemented with write locks managed by the system
  – But ACID does not scale well when handling petabytes of data (remember latency!)
ACID vs BASE (2)
• BASE stands for Basically Available, Soft state, Eventual consistency
  – An optimistic approach
    • Lets conflicts occur, but detects them and takes action to sort them out
    • Approaches:
      • conditional updates: test the value just before updating
      • save both updates: record that they are in conflict and then merge them
  – Basically Available: the system is available most of the time, although some subsystem may be temporarily unavailable
  – Soft state: data is not durable, in the sense that its persistence is in the hands of the user, who must take care of refreshing it
  – Eventually consistent: the system eventually converges to a consistent state
• Usually adopted in NoSQL databases

Consistency
• Biggest change from a centralized relational database to a cluster-oriented NoSQL
• RDBMS: strong consistency
  – Traditional RDBMSs are CA systems
• NoSQL systems: mostly eventual consistency
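The conditional-update approach listed under BASE (test the value just before updating) can be sketched with a version number attached to each value. The VersionedStore class, its method names and the two clients are hypothetical; a real data store would perform the check and the write atomically on the server side, which this single-threaded sketch only imitates.

```python
class VersionedStore:
    """In-memory store that accepts a write only if the caller saw the latest version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); version 0 means the key does not exist yet."""
        return self._data.get(key, (0, None))

    def conditional_write(self, key, expected_version, value):
        """Write `value` only if the stored version still equals `expected_version`."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # conflict detected: someone else updated the key first
        self._data[key] = (current_version + 1, value)
        return True


store = VersionedStore()
store.conditional_write("room", 0, "free")

# Two clients read the same version, then both try to update.
version_seen_by_a, _ = store.read("room")
version_seen_by_b, _ = store.read("room")

b_ok = store.conditional_write("room", version_seen_by_b, "booked by B")
a_ok = store.conditional_write("room", version_seen_by_a, "booked by A")

final_version, final_value = store.read("room")
print(b_ok, a_ok, final_version, final_value)
```

No lock is held between the read and the write: the conflict is allowed to happen and is detected at write time, after which the losing client can re-read the value and decide whether to retry.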
Consistency: an example
• Ann is trying to book a room at the Ace Hotel in New York on a node of a booking system located in London
• Pathin is trying to do the same on a node located in Mumbai
• The booking system uses a replicated database with the master located in Mumbai and the slave in London
• Only one room is available
• The network link between the two servers breaks
[Figure: Ann connected to the London node, Pathin connected to the Mumbai node]

Consistency: an example
• CA system: neither user can book any hotel room
  – No tolerance to network partitions
• CP system:
  – Pathin can make the reservation
  – Ann can see the inconsistent room information but cannot book the room
• AP system: both nodes accept the hotel reservation
  – Overbooking!
• Remember that the tolerance to this situation depends on the application type
  – Blog, financial exchange, shopping cart, …
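The three outcomes of the hotel booking example can be reproduced with a toy simulation. This is a deliberately simplified model, assuming one master (Mumbai) and one slave (London) that each hold a local copy of the room count; the Replica class, the book function and the policy strings are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    is_master: bool
    rooms_free: int = 1
    bookings: list = field(default_factory=list)


def book(replica, guest, partitioned, policy):
    """Try to book a room on `replica`; return True if the booking is accepted."""
    if partitioned:
        if policy == "CA":
            return False  # no tolerance to partitions: the system stops serving
        if policy == "CP" and not replica.is_master:
            return False  # only the master may write, to preserve consistency
        # policy == "AP": every node keeps accepting writes on its local copy
    if replica.rooms_free < 1:
        return False
    replica.rooms_free -= 1
    replica.bookings.append(guest)
    return True


def run(policy):
    """Replay the scenario: the link is broken and both guests try to book."""
    mumbai = Replica("Mumbai", is_master=True)
    london = Replica("London", is_master=False)
    pathin_ok = book(mumbai, "Pathin", partitioned=True, policy=policy)
    ann_ok = book(london, "Ann", partitioned=True, policy=policy)
    return pathin_ok, ann_ok


outcomes = {policy: run(policy) for policy in ("CA", "CP", "AP")}
for policy, (pathin_ok, ann_ok) in outcomes.items():
    print(policy, "Pathin:", pathin_ok, "Ann:", ann_ok)
```

Under the AP policy both bookings are accepted although only one room exists, which is the overbooking case; whether that is acceptable depends on the application.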
Pessimistic vs. optimistic approach
• Concurrency involves a fundamental tradeoff between:
  - Safety (avoiding errors such as update conflicts) and
  - Liveness (responding quickly to clients)
• Pessimistic approaches often:
  - Severely degrade the responsiveness of a system
  - Lead to deadlocks, which are hard to prevent and debug

NoSQL cost and performance
[Figure. Source: Couchbase technical report]
Pros and cons of NoSQL
Pros
• Easy to scale-out
• Higher performance for massive data scale
• Allows sharing of data across multiple servers
• Most solutions are either open-source or cheaper
• HA and fault tolerance provided by data replication
• Supports complex data structures and objects
• No fixed schema, supports unstructured data
• Very fast retrieval of data, suitable for real-time apps
Cons
• Do not provide ACID guarantees, less suitable for OLTP apps
• No fixed schema, no common data storage model
• Limited support for aggregation (sum, avg, count, group by)
• Performance for complex join is poor
• No well-defined approach for DB design (different solutions have different data models)
• Lack of consistent model can lead to solution lock-in

Barriers to NoSQL
• Main barriers to NoSQL adoption
  – No full ACID transaction support
  – Lack of standardized interfaces
  – Huge investments already made in existing RDBMSs
• A commercial example
  – AWS launched two NoSQL services (SimpleDB in 2007 and later DynamoDB in 2012) and one RDBMS service (RDS in 2009)
NoSQL data models
• A number of largely diverse data stores not based on the relational data model

NoSQL data models
• A data model is a set of constructs for representing the information
  – Relational model: tables, columns and rows
• Storage model: how the DBMS stores and manipulates the data internally
• A data model is usually independent of the storage model
• Data models for NoSQL systems:
  – Aggregate-oriented models: key-value, document, and column-family
  – Graph-based models
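To make the three aggregate-oriented models more concrete, the sketch below represents one and the same aggregate (a hotel booking) in each of them, using plain Python dictionaries as stand-ins for the data stores. The booking fields, the key "booking:1", the column family names and the find_by_city helper are assumptions made for illustration only.

```python
import json

booking = {"guest": "Ann", "hotel": "Ace Hotel", "city": "New York", "nights": 2}

# Key-value: the store sees only an opaque value addressed by a key;
# interpreting the bytes is entirely up to the application.
key_value_store = {"booking:1": json.dumps(booking).encode("utf-8")}

# Document: the aggregate is stored as a structured document,
# so the store can look inside it and query by field.
document_store = {"booking:1": booking}

# Column-family: row key -> column family -> column -> value,
# with related columns grouped into the same family.
column_family_store = {
    "booking:1": {
        "guest": {"name": "Ann"},
        "stay": {"hotel": "Ace Hotel", "city": "New York", "nights": 2},
    }
}


def find_by_city(store, city):
    """Field-based lookup, which is possible only when the structure is visible."""
    return [key for key, doc in store.items() if doc.get("city") == city]


restored = json.loads(key_value_store["booking:1"])  # the application decodes the blob
matches = find_by_city(document_store, "New York")
stay_columns = column_family_store["booking:1"]["stay"]
print(restored == booking, matches, stay_columns)
```

In all three cases the whole booking is retrieved through a single key, which is what makes these models aggregate-oriented; they differ in how much of the aggregate's internal structure is visible to the store.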