R-Store: A Scalable Distributed System
for Supporting Real-time Analytics
Feng Li, M. Tamer Özsu, Gang Chen, Beng Chin Ooi @ICDE 2014
Presented by: Xiao Meng CS848, University of Waterloo
Outline
– Background & Motivation
– System Overview
– System Design
Situation for large scale data processing
Systems classified into 2 categories: OLTP, OLAP
Data periodically transported to OLAP through ETL
Demand
Time critical decision making (RTOLAP)
OLAP & OLTP processed by one integrated system
Problem on simple combination
Resource contention
Inconsistency
updates by OLTP could lead to incorrect OLAP results
Solution – R-Store
Resource contention
Inconsistency
OLAP queries data based on timestamp of query submission from multi-versioning storage system
– Modified HBase as storage
– MapReduce job for query execution
Periodically materialize real-time data into data cube
– A full HBase scan for every query is time-consuming
– Streaming MapReduce to maintain data cube
MapReduce – Scan on HBase
streaming MapReduce
timestamp $T_Q$ & metadata
Extend Scan to 2 versions
– FullScan for querying data cube
– IncrementalScan for querying real-time data
Infinite versions of data to maintain query consistency
– Compaction to remove stale versions
– Global compaction
Immediately following data cube refresh
– Local compaction
Compact old versions not accessed by any scan process
Target: Find out changes since last data cube materialization
Method
– Take 2 timestamps as input, $T_{DC}$ & $T_Q$; return the values with the largest timestamp before $T_{DC}$ and before $T_Q$
Implementations
– Naïve: Access memstore & storefile in parallel
– Adaptive: Maintain the keys modified since last materialization; scan memstore first, then either scan or randomly access those keys, based on cost
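The IncrementalScan method can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the modified HBase code: the store is a hypothetical in-memory dict, `t_dc` is assumed to be the timestamp of the last data cube materialization, `t_q` the query submission timestamp, and "before" is read as strictly smaller. The names `latest_before` and `incremental_scan` are made up for this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a multi-version store:
# key -> list of (timestamp, value), sorted by timestamp ascending.
VersionedStore = Dict[str, List[Tuple[int, str]]]


def latest_before(versions: List[Tuple[int, str]], ts: int) -> Optional[str]:
    """Return the value with the largest timestamp strictly before ts."""
    best = None
    for version_ts, value in versions:
        if version_ts < ts:
            best = value
        else:
            break  # versions are sorted, nothing later can qualify
    return best


def incremental_scan(
    store: VersionedStore, t_dc: int, t_q: int
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """For every key written in [t_dc, t_q), return the pair
    (value visible at t_dc, value visible at t_q)."""
    changes = {}
    for key, versions in store.items():
        # Assumption: a key counts as changed if any version falls in the window.
        touched = any(t_dc <= ts < t_q for ts, _ in versions)
        if touched:
            changes[key] = (
                latest_before(versions, t_dc),
                latest_before(versions, t_q),
            )
    return changes
```

A key that was first inserted after the last materialization comes back with `None` as its old value, so the caller can tell inserts from updates.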
Global compaction
– Similar to HBase's default, retain only one version of each key
– Triggered by data cube's refresh completion
Local compaction
– Compacted data is stored in a different file so that running scan processes are not blocked
– Files can be removed when not accessed by any scan
– Triggered when #tuple/#key exceeds threshold
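The two compaction modes can be illustrated with a small Python sketch. It is a simplification under assumptions that are not in the slides: the store is a hypothetical in-memory dict, each running scan is represented only by its timestamp, and a scan is assumed to need just the latest version before that timestamp. All function names are invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical store: key -> list of (timestamp, value), sorted ascending.
VersionedStore = Dict[str, List[Tuple[int, str]]]


def needs_local_compaction(store: VersionedStore, threshold: float) -> bool:
    """Trigger check: #tuple / #key exceeds the threshold."""
    tuples = sum(len(versions) for versions in store.values())
    keys = len(store)
    return keys > 0 and tuples / keys > threshold


def global_compaction(store: VersionedStore) -> VersionedStore:
    """Retain only the newest version of each key."""
    return {key: versions[-1:] for key, versions in store.items() if versions}


def local_compaction(
    store: VersionedStore, active_scan_ts: Iterable[int]
) -> VersionedStore:
    """Drop versions that no running scan can read.

    Returns a new dict instead of modifying the input, loosely mirroring
    the idea of writing compacted data to a different file.
    """
    scan_ts = list(active_scan_ts)
    compacted = {}
    for key, versions in store.items():
        keep = set()
        if versions:
            keep.add(len(versions) - 1)  # newest version is always kept
        for ts in scan_ts:
            # Assumption: a scan at ts reads the latest version before ts.
            visible = [i for i, (v_ts, _) in enumerate(versions) if v_ts < ts]
            if visible:
                keep.add(visible[-1])
        compacted[key] = [versions[i] for i in sorted(keep)]
    return compacted
```

With no running scans, `local_compaction` degenerates to the same result as `global_compaction` in this sketch; the difference only shows while scans are still holding on to older versions.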
Define a data cube for
Dimensions: age, income, buys
– First run: FullScan on one region, generate a KV pair for each cuboid in mapper, aggregate &
– Subsequent runs: Propagation step to compute changes & update step to update cube
– Streaming system updates the cube internally & periodically materializes it into storage
– Maintain KVs locally, cache hot keys in memory
– For updates, emit 2 KV pairs for each cuboid (+, -)
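The per-cuboid KV pairs and the (+, -) update scheme can be sketched as follows. This is a minimal Python illustration, not the streaming MapReduce code: it assumes a COUNT aggregate, treats every subset of the three dimensions as one cuboid, and uses invented helper names (`cuboid_keys`, `map_insert`, `map_update`, `apply_deltas`).

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("age", "income", "buys")


def cuboid_keys(row):
    """One key per cuboid: every subset of the dimensions,
    including the empty subset (the apex cuboid)."""
    keys = []
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            keys.append(tuple((d, row[d]) for d in dims))
    return keys


def map_insert(row, sign=1):
    """Mapper side: emit one (key, delta) pair per cuboid for a tuple."""
    return [(key, sign) for key in cuboid_keys(row)]


def map_update(old_row, new_row):
    """An update emits 2 pairs per cuboid: '-' for the old value, '+' for the new."""
    return map_insert(old_row, -1) + map_insert(new_row, +1)


def apply_deltas(cube, pairs):
    """Update step: fold the deltas into the cube (COUNT is assumed here)."""
    for key, delta in pairs:
        cube[key] += delta
        if cube[key] == 0:
            del cube[key]  # drop cells that became empty


cube = defaultdict(int)
old = {"age": "young", "income": "high", "buys": "no"}
new = {"age": "young", "income": "high", "buys": "yes"}
apply_deltas(cube, map_insert(old))       # first run
apply_deltas(cube, map_update(old, new))  # later update
```

Cells that do not involve the changed dimension receive a -1 and a +1 that cancel out, so only cells containing `buys` actually change.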
𝑠 , refresh
cube every 𝑋
𝑑𝑣𝑐𝑓
aggregation function & three values
– Intel X3430 2.4 GHz processor
– 8 GB of memory
– 2x500 GB SATA disks
– gigabit Ethernet
throughput than 40 HBase-R nodes
less data in HBase-R, less data to process
– C-Store (VLDB 05)
– HyPer (ICDE 11), HYRISE (VLDB 10)
the business, e.g. sales, profits…
information can be shown in a 2D physical space of a spreadsheet.
$T = \prod_{i=1}^{n} (L_i + 1)$