Outline Background & Motivation System Overview System Design - - PowerPoint PPT Presentation


R-Store: A Scalable Distributed System for Supporting Real-time Analytics. Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi @ICDE 2014. Presented by: Xiao Meng, CS848, University of Waterloo.


slide-1
SLIDE 1

R-Store: A Scalable Distributed System

for Supporting Real-time Analytics

Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi @ICDE 2014

Presented by: Xiao Meng CS848, University of Waterloo

slide-2
SLIDE 2

Outline

  • Background & Motivation
  • System Overview
  • System Design
  • RTOLAP in R-Store
  • Evaluation
  • Conclusion
  • Q & A
slide-3
SLIDE 3

Background & Motivation

  • Situation for large-scale data processing
    – Systems are classified into 2 categories: OLTP and OLAP
    – Data is periodically transported to OLAP through ETL
  • Demand
    – Time-critical decision making (RTOLAP)
      • The freshness of OLAP results
      • Fully RTOLAP entails executing queries directly on OLTP data
    – OLAP & OLTP processed by one integrated system

slide-4
SLIDE 4

Background & Motivation

  • Problems with a simple combination
    – Resource contention
      • OLTP queries blocked by OLAP
    – Inconsistency
      • A long-running OLAP query may access the same data sets several times; updates by OLTP in between could lead to incorrect OLAP results
  • Solution – R-Store
    – Resource contention
      • Computation resource isolation
    – Inconsistency
      • Multi-versioning storage system
slide-5
SLIDE 5

System Overview – A glimpse of R-Store

  • OLAP queries read data from a multi-versioning storage system, based on the timestamp of query submission
    – Modified HBase as storage
    – MapReduce job for query execution
  • Periodically materialize real-time data into a data cube
    – A full HBase scan on every query is time-consuming
      • Entire table is scanned & shuffled during MR
    – Streaming MapReduce to maintain the data cube

slide-6
SLIDE 6

System Overview – R-Store Architecture

  • OLTP submitted to KV Store
  • OLAP query processed by MapReduce
    – Scan on HBase
  • Refresh data cube through streaming MapReduce
  • MetaStore to generate query timestamp T_Q & metadata

slide-7
SLIDE 7

System Design – A Glimpse of HBase

slide-8
SLIDE 8

System Design – Storage Design based on HBase

  • Extend Scan with 2 variants
    – FullScan for querying the data cube
    – IncrementalScan for querying real-time data
  • Infinite versions of data to maintain query consistency
    – Compaction to remove stale versions
    – Global compaction
      • Immediately following a data cube refresh
    – Local compaction
      • Compact old versions not accessed by any scan process

slide-9
SLIDE 9

System Design – IncrementalScan in detail

  • Target: Find out changes since the last data cube materialization
  • Method
    – Take 2 timestamps as input, T_DC & T_Q, and return the values with the largest timestamp before T_DC & T_Q
  • Implementations
    – Naïve: Access memstore & storefile in parallel
    – Adaptive: Maintain the keys modified since the last materialization, first scan the memstore, then scan or randomly access keys based on cost
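
The IncrementalScan method above can be sketched in a few lines of Python. This is only an illustration, not R-Store's HBase implementation: the in-memory `store` layout, the names `latest_before` and `incremental_scan`, and the choice to treat "before" as "at or before" are all assumptions made for the example.

```python
from bisect import bisect_right

def latest_before(versions, ts):
    """Return the value with the largest timestamp <= ts, or None.

    `versions` is a list of (timestamp, value) pairs sorted by timestamp
    (hypothetical layout, chosen for this sketch).
    """
    idx = bisect_right([t for t, _ in versions], ts)
    return versions[idx - 1][1] if idx else None

def incremental_scan(store, t_dc, t_q):
    """Yield (key, value_at_t_dc, value_at_t_q) for keys modified in (t_dc, t_q]."""
    for key in sorted(store):
        versions = store[key]
        # Only keys touched since the last cube materialization matter.
        if any(t_dc < t <= t_q for t, _ in versions):
            yield key, latest_before(versions, t_dc), latest_before(versions, t_q)

store = {
    "a": [(1, 10), (5, 12)],
    "b": [(2, 7)],
    "c": [(6, 3), (9, 4)],
}
changes = list(incremental_scan(store, t_dc=4, t_q=8))
```

Key "b" is skipped because it has not changed since `t_dc`, and the version of "c" written at timestamp 9 is invisible because it is newer than `t_q`.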

slide-10
SLIDE 10

System Design – Compaction in detail

  • Global compaction
    – Similar to HBase's default, retain only one version of each key
    – Triggered by completion of a data cube refresh
  • Local compaction
    – Compacted data is stored in a different file so that running scan processes are not blocked
    – Files can be removed when not accessed by any scan
    – Triggered when #tuple/#key exceeds a threshold
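
A rough Python sketch of the two compaction modes follows. It assumes a hypothetical in-memory layout (each key maps to a timestamp-sorted list of `(timestamp, value)` pairs) and models "not accessed by any scan" as a list of pinned scan timestamps; neither detail comes from the slides.

```python
def global_compact(store):
    """Keep only the newest version of every key."""
    return {k: [max(v, key=lambda tv: tv[0])] for k, v in store.items() if v}

def needs_local_compaction(store, threshold):
    """Trigger check: average number of tuples per key exceeds the threshold."""
    n_tuples = sum(len(v) for v in store.values())
    return len(store) > 0 and n_tuples / len(store) > threshold

def local_compact(store, pinned_timestamps):
    """Drop versions that no running scan (pinned timestamp) can still read."""
    compacted = {}
    for key, versions in store.items():
        if not versions:
            continue
        keep = {len(versions) - 1}  # the newest version always survives
        for ts in pinned_timestamps:
            visible = sum(1 for t, _ in versions if t <= ts)
            if visible:
                keep.add(visible - 1)  # version this scan would read
        # Result goes to a new dict, mirroring "stored in a different file".
        compacted[key] = [versions[i] for i in sorted(keep)]
    return compacted

store = {"a": [(1, 10), (5, 12), (7, 13), (9, 14)], "b": [(2, 7)]}
```

With one scan pinned at timestamp 6, local compaction keeps `(5, 12)` and `(9, 14)` for key "a" and drops the two versions nobody can read.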

slide-11
SLIDE 11

System Design – Data cube

  • Define a data cube for "Customer Profiles"
  • Dimensions: age, income, buys

slide-12
SLIDE 12

System Design – Data cube maintenance

  • Re-computation
    – First run
    – FullScan on one region, generate a KV pair for each cuboid in the mapper, aggregate & output in the reducer
  • Incremental Update
    – Subsequent runs
    – Propagation step to compute the changes & update step to update the cube
    – Streaming system updates the cube internally & periodically materializes it into storage
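
To make the re-computation step concrete, here is a minimal Python sketch of a mapper that emits one KV pair per cuboid and a reducer that aggregates them. It reuses the age/income/buys dimensions from the "Customer Profiles" example; the function names, the `"*"` placeholder for aggregated-away dimensions, and the use of COUNT as the aggregate are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age", "income", "buys")

def cuboid_keys(row, dimensions=DIMENSIONS):
    """Yield one (cuboid, cell) key per cuboid; dropped dimensions become '*'."""
    for r in range(len(dimensions) + 1):
        for kept in combinations(dimensions, r):
            cell = tuple(row[d] if d in kept else "*" for d in dimensions)
            yield kept, cell

def map_row(row, measure):
    """Mapper: emit the measure once for every cuboid the row belongs to."""
    for key in cuboid_keys(row):
        yield key, measure

def reduce_sum(pairs):
    """Reducer: add up the measures that share the same cube cell."""
    cube = defaultdict(int)
    for key, value in pairs:
        cube[key] += value
    return dict(cube)

rows = [
    {"age": "young", "income": "high", "buys": "yes"},
    {"age": "young", "income": "low", "buys": "no"},
]
pairs = [pair for row in rows for pair in map_row(row, 1)]
cube = reduce_sum(pairs)
```

With three dimensions each row contributes to 2^3 = 8 cuboids, which is why a full re-computation is only done on the first run.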

slide-13
SLIDE 13

System Design – HStreaming for cube maintenance

  • Each mapper is responsible for processing updates within a key range
    – Maintain KVs locally, cache hot keys in memory
    – For updates, emit 2 KV pairs for each cuboid (+, -)
  • Reducers cache the output KVs of the mappers, invoke reduce every W_r, and refresh the cube every W_cube
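
The "+" and "-" pairs can be illustrated with a small Python sketch. It assumes a SUM aggregate over a hypothetical `amount` column and the age/income/buys dimensions used earlier in the deck; the function names are made up for the example and are not HStreaming APIs.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age", "income", "buys")

def cells(row):
    """Yield the cell of every cuboid that the row contributes to."""
    for r in range(len(DIMENSIONS) + 1):
        for kept in combinations(DIMENSIONS, r):
            yield tuple(row[d] if d in kept else "*" for d in DIMENSIONS)

def map_update(old_row, new_row):
    """Mapper: retract the old value ('-') and add the new one ('+') per cuboid."""
    for cell in cells(old_row):
        yield cell, ("-", old_row["amount"])
    for cell in cells(new_row):
        yield cell, ("+", new_row["amount"])

def reduce_deltas(pairs):
    """Reducer: fold the signed values into one net change per cell."""
    delta = defaultdict(int)
    for cell, (sign, value) in pairs:
        delta[cell] += value if sign == "+" else -value
    return {cell: d for cell, d in delta.items() if d != 0}

old = {"age": "young", "income": "low", "buys": "no", "amount": 5}
new = {"age": "young", "income": "high", "buys": "no", "amount": 8}
deltas = reduce_deltas(map_update(old, new))
```

Cells that the update leaves (income = low) receive a negative delta, cells it enters (income = high) a positive one, and cells it stays in only see the net difference.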

slide-14
SLIDE 14

System Design – Data Flow of R-Store

  • 1. Updates arrive at HBase-R
  • 2. Updates are streamed to an HStreaming mapper
  • 3. Reducer periodically materializes its local data cube to HBase-R & notifies the MetaStore
slide-15
SLIDE 15

RTOLAP in R-Store – Query Processing

  • Map
    – Tag the values with 'Q', '+', '-'
  • Reduce
    – Do the calculation based on the aggregation function & the three values
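
As a hedged illustration of the reduce step, the sketch below combines the three tagged values for a SUM aggregate. The interpretation of the tags is an assumption, not something the slide states: 'Q' is taken to be the materialized cube value, '+' the real-time value visible at query time, and '-' the value already counted at the last materialization.

```python
def reduce_cell(tagged_values):
    """Combine tagged SUM contributions for a single cube cell.

    Assumed meaning of the tags (not confirmed by the slides):
      'Q' -> value read from the materialized data cube
      '+' -> real-time value visible at the query timestamp
      '-' -> value already included in the cube at materialization time
    """
    total = 0
    for tag, value in tagged_values:
        if tag in ("Q", "+"):
            total += value
        elif tag == "-":
            total -= value
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return total

fresh_sum = reduce_cell([("Q", 100), ("+", 8), ("-", 5)])
```

Other aggregation functions would need their own combination rule; SUM is used here only because it is the simplest to reason about.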

slide-16
SLIDE 16

Evaluation

  • Cluster of 144 nodes
    – Intel X3430 2.4 GHz processor
    – 8 GB of memory
    – 2x500 GB SATA disks
    – Gigabit Ethernet
  • TPC-H data
slide-17
SLIDE 17

Evaluation - Performance of Maintaining Data cube

  • HStreaming with 10 nodes has higher throughput than 40 HBase-R nodes
  • 1.6 billion keys, 1% updated: the update algorithm is fast enough that latency equals the HBase-R input speed
slide-18
SLIDE 18

Evaluation - Performance of RT querying

  • When updates fall in a small key range, less data is scanned in HBase-R and less data is processed

slide-19
SLIDE 19

Evaluation - Performance of OLTP

slide-20
SLIDE 20

Related Work

  • Database

– C-Store(VLDB 05)

  • Main-memory database

– HyPer(ICDE 11), HYRISE(VLDB 10)

  • Druid(SIGMOD 14)
slide-21
SLIDE 21

Conclusion

  • Multi-version concurrency control to support RTOLAP
  • Data cube to reduce storage requirement & improve performance
  • Streaming system to refresh data cube
  • Available at https://github.com/lifeng5042/RStore
slide-22
SLIDE 22

Q & A

slide-23
SLIDE 23

Backup – OLAP Cube

  • A multi-dimensional generalization of a two- or three-dimensional spreadsheet. Hypercube for datasets with more than three dimensions.
  • Dimensions: product, time, cities…
  • Cells: each cell of the cube holds a number that represents some measure of the business, e.g. sales, profits…
  • Slicer: the dimension held constant for all cells so that multi-dimensional information can be shown in the 2D physical space of a spreadsheet.

slide-24
SLIDE 24

Backup – OLAP Cube

  • Data cube can be viewed as a lattice of cuboids
  • The bottom-most cuboid is the base cuboid
  • The top-most cuboid (apex) contains only one cell
  • How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?

$$T = \prod_{i=1}^{n} (L_i + 1)$$
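
A common way to count cuboids is to multiply (L_i + 1) over all n dimensions, where L_i is the number of hierarchy levels of dimension i and the extra 1 stands for the virtual top level "all". The short Python sketch below evaluates that product; the function name `total_cuboids` is made up for the example.

```python
from math import prod

def total_cuboids(levels):
    """Number of cuboids for dimensions with the given level counts.

    `levels[i]` is L_i, the number of hierarchy levels of dimension i,
    not counting the virtual top level 'all'.
    """
    return prod(l + 1 for l in levels)

# Three flat dimensions (one level each), e.g. age, income, buys.
flat = total_cuboids([1, 1, 1])

# Ten dimensions with four levels each.
deep = total_cuboids([4] * 10)
```

Three flat dimensions give 2^3 = 8 cuboids, while ten dimensions with four levels each already give 5^10 = 9,765,625, which is why the cube is maintained incrementally rather than recomputed.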