
Summarizing A 3 Way Relational Data Stream

Baptiste Csernel, 3rd year PhD Student
Fabrice Clérot, Supervisor, FT R&D
Georges Hébrail, Supervisor, ENST

Plan
- Problem Presentation
  - Context
  - Problem Statement
- Useful Tools
  - CluStream
  - Bloom Filters
- Method Presentation
  - Entity Summary
  - Relation Summary
  - Storage Management
- Work in Progress and Perspectives

Problem Presentation
- Motivation
- Context
- Problem Statement
- Goal

Motivations
- Data stream processing is an ever-growing concern.
- For both DSMS and stream mining applications, summaries are a necessity.
- Most information is, by nature, relational.

Context
- Data stream summaries generate a lot of interest.
- Join evaluation, over static tables as well as over data streams, is a popular subject too.
- Single-stream mining and single-table mining are the norm.
- Relational stream mining is not a very active research area.

Problem Statement
- Entity stream E of elements E_i: (K_e, t, e_1, e_2, ..., e_p)_i
- Entity stream F of elements F_j: (K_f, t, f_1, f_2, ..., f_q)_j
- Relation stream R of elements R_l: (K_e, K_f, t, r_1, r_2, ..., r_d)_l
- Additional constraints:
  - All streams are insert-only.
  - All attributes are numerical.
  - The speed of R is far lower than the speeds of E and F.
  - References are not broken.

Goal
- Summarizing three data streams sharing a relational link with one another.
- Building separate summaries for each entity stream and for the relation stream.
- Summarizing the information contained in the relational links between the streams.

Useful Tools
- CluStream
  - Cluster Feature Vector (CFV)
  - Snapshot System
- Bloom Filters

Cluster Feature Vector (CFV) (BIRCH, Zhang 1996) (Aggarwal 2003)
- Structure: $(n, CF_1(t), CF_2(t), CF_1(a_1), CF_2(a_1), \ldots, CF_1(a_d), CF_2(a_d))$
- With:
  - $CF_1(a_k) = \sum_{i=1}^{n} a_{k,i}$
  - $CF_2(a_k) = \sum_{i=1}^{n} (a_{k,i})^2$
- Remark: time has the same role as any other variable.

Snapshot System
- The state of the system is saved at regular time intervals.
- The data structure is chosen so as to allow arithmetic operations between snapshots.
- The times at which snapshots are taken are chosen according to the user's needs.
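The CFV structure and the arithmetic between snapshots described above can be illustrated with a minimal Python sketch. The class name `CFV` and its methods (`empty`, `add_point`, `copy`, `mean`) are hypothetical names chosen for this illustration, not taken from the slides; only the components (n, CF1, CF2) and the fact that time is treated like any other variable come from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CFV:
    """Cluster feature vector over (t, a1, ..., ad); time is dimension 0."""
    n: int
    cf1: List[float]  # CF1: per-dimension sums
    cf2: List[float]  # CF2: per-dimension sums of squares

    @classmethod
    def empty(cls, dims: int) -> "CFV":
        return cls(0, [0.0] * dims, [0.0] * dims)

    def add_point(self, x: Sequence[float]) -> None:
        """Absorb one element (t, a1, ..., ad)."""
        self.n += 1
        for k, v in enumerate(x):
            self.cf1[k] += v
            self.cf2[k] += v * v

    def copy(self) -> "CFV":
        return CFV(self.n, list(self.cf1), list(self.cf2))

    def __add__(self, other: "CFV") -> "CFV":
        """Merge two clusters: every component simply adds up."""
        return CFV(self.n + other.n,
                   [a + b for a, b in zip(self.cf1, other.cf1)],
                   [a + b for a, b in zip(self.cf2, other.cf2)])

    def __sub__(self, other: "CFV") -> "CFV":
        """Later snapshot minus earlier snapshot: what arrived in between."""
        return CFV(self.n - other.n,
                   [a - b for a, b in zip(self.cf1, other.cf1)],
                   [a - b for a, b in zip(self.cf2, other.cf2)])

    def mean(self) -> List[float]:
        """Per-dimension mean; only defined when n > 0."""
        return [s / self.n for s in self.cf1]


# Hypothetical usage: elements are (t, a1) pairs.
cfv = CFV.empty(2)
cfv.add_point((1.0, 2.0))
early = cfv.copy()            # snapshot taken after the first element
cfv.add_point((2.0, 4.0))
cfv.add_point((3.0, 6.0))
window = cfv - early          # summary of the elements seen since the snapshot
print(window.n, window.mean())
```

Because every component is a plain sum, subtracting an earlier snapshot from a later one yields the summary of exactly the elements that arrived in between, which is what makes the snapshot arithmetic possible.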

Snapshot System: distribution example in 2^o

Order o   Snapshots    Step
0         69 67 65     2^1
1         70 66 62     2^2
2         68 60 52     2^3
3         56 40 24     2^4
4         48 16        2^5
5         64 32        2^6

CluStream: Data Stream Clustering Algorithm (Aggarwal 2003)
- Algorithm based on three principles:
  - Dividing processing into two parts, an on-line part and an off-line part.
  - Creating and maintaining a large population of micro clusters.
  - Storing the state of those micro clusters with a snapshot system.
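The snapshot distribution example can be explored with a short Python sketch. It rests on assumptions read off the example rather than stated in the slides: the current time is 70, at most three snapshots are kept per order, and a snapshot taken at time t belongs to the order o given by the largest power 2^o that divides t. The function names are hypothetical. Under these assumptions the sketch reproduces the rows for orders 0 to 4; time 64 then falls into order 6, whereas the example lists it with order 5, so the slides may use a slightly different rule for the highest order.

```python
from typing import Dict, List


def snapshot_order(t: int) -> int:
    """Largest o such that 2**o divides t (assumes t >= 1)."""
    order = 0
    while t % 2 == 0:
        t //= 2
        order += 1
    return order


def retained_snapshots(now: int, per_order: int = 3) -> Dict[int, List[int]]:
    """Keep the `per_order` most recent snapshot times of each order."""
    kept: Dict[int, List[int]] = {}
    for t in range(now, 0, -1):  # newest first
        bucket = kept.setdefault(snapshot_order(t), [])
        if len(bucket) < per_order:
            bucket.append(t)
    return kept


kept = retained_snapshots(70)
for order in sorted(kept):
    print(order, kept[order])
```

Recent history is stored at a fine granularity and older history at an increasingly coarse one, so the number of stored snapshots grows only logarithmically with the length of the stream.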

CluStream (1/4) (on-line part)
- Initialization
  - Off-line initialization of the micro clusters.
- For each element
  - Locate the closest micro cluster.
  - Admission test.
  - If admitted, update the CFV.
  - Otherwise, create a new micro cluster and remove an outdated one.

[Figure: population of micro clusters, Micro Cluster 1 to Micro Cluster N, each holding (CFV, ID list).]

CluStream (2/4) (on-line part)
- Micro cluster removal
  - Remove an old micro cluster (criterion based on the arrival date of its last elements).
  - If none is available, fuse the two closest micro clusters (and update the ID list of the absorbing micro cluster).
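The on-line steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the CluStream reference implementation: the admission test (distance to the centroid at most `factor` times the RMS radius, with a floor `min_radius` for single-element clusters), the staleness threshold `horizon`, and all names are hypothetical choices, and at least two micro clusters are assumed to exist.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MicroCluster:
    n: int
    ls: List[float]        # per-attribute sums (CF1)
    ss: List[float]        # per-attribute sums of squares (CF2)
    last_seen: float       # arrival time of the latest element
    ids: List[int] = field(default_factory=list)

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def radius(self) -> float:
        # RMS deviation of the members from the centroid.
        var = sum(q / self.n - (s / self.n) ** 2 for s, q in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def absorb(self, x: Sequence[float], t: float) -> None:
        self.n += 1
        for k, v in enumerate(x):
            self.ls[k] += v
            self.ss[k] += v * v
        self.last_seen = max(self.last_seen, t)

    def merge(self, other: "MicroCluster") -> None:
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.last_seen = max(self.last_seen, other.last_seen)
        self.ids.extend(other.ids)  # the absorbing cluster keeps both ID lists


def process(clusters: List[MicroCluster], x: Sequence[float], t: float, next_id: int,
            factor: float = 2.0, min_radius: float = 1.0, horizon: float = 100.0) -> int:
    """Handle one element; returns the next unused micro cluster id."""
    closest = min(clusters, key=lambda c: math.dist(c.centroid(), x))
    limit = factor * max(closest.radius(), min_radius)  # assumed admission test
    if math.dist(closest.centroid(), x) <= limit:
        closest.absorb(x, t)
        return next_id
    # Not admitted: make room, then seed a new micro cluster with x.
    oldest = min(range(len(clusters)), key=lambda i: clusters[i].last_seen)
    if t - clusters[oldest].last_seen > horizon:
        del clusters[oldest]                    # remove an outdated micro cluster
    else:
        pairs = [(math.dist(a.centroid(), b.centroid()), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)                    # fuse the two closest micro clusters
        clusters[i].merge(clusters[j])
        del clusters[j]
    clusters.append(MicroCluster(1, list(x), [v * v for v in x], t, [next_id]))
    return next_id + 1


clusters = [MicroCluster(1, [0.0, 0.0], [0.0, 0.0], 0.0, [0]),
            MicroCluster(1, [10.0, 10.0], [100.0, 100.0], 0.0, [1])]
next_id = process(clusters, (0.5, 0.5), 1.0, 2)        # admitted by cluster 0
next_id = process(clusters, (50.0, 50.0), 2.0, next_id)  # rejected: fuse, then create
print(len(clusters), clusters[0].ids, clusters[1].ids)
```

In the usage example the second element is too far from every micro cluster and no cluster is stale, so the two existing clusters are fused and a new one is created, which keeps the population size constant.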

CluStream (3/4) (on-line part)
- Storage
  - Snapshot system with a distribution in 2^o.
  - Each snapshot contains:
    - The CFV of each micro cluster.
    - The ID list of each micro cluster.

CluStream (4/4) (off-line part)
- Use the snapshots to rebuild the part of the stream to be analyzed (as a set of micro clusters).
- Apply a classic clustering algorithm to the resulting set of micro clusters.
- The resulting clusters represent the final clustering of the stream.

Bloom Filters (Bloom 1970) (1/2)
- Idea: a Bloom filter can remember whether or not it has previously seen an element, for any number of elements.
- It supports two operations:
  - Learn a new element.
  - Test whether an element has been previously learned.

Bloom Filters (Bloom 1970) (2/2)
- Structure:
  - A Bloom filter is a simple binary word B of b bits.
  - At initialization, all the bits are set to 0.
- Learn a new element E:
  - Hash E to a b-bit word W_E.
  - Set to 1 in B all the bits that are at 1 in W_E.
- Test a new element N:
  - Hash N to a b-bit word W_N.
  - If all the bits at 1 in W_N are at 1 in B, then, with high probability, N was previously learned.
  - Otherwise, N was never learned before.
- Remark: Bloom filters are additive (the filter of the union of two sets is the bitwise OR of their filters).
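The learn, test, and additivity operations above can be sketched in a few lines of Python. The class and method names, the default sizes, and the use of salted SHA-256 digests to pick the bit positions are assumptions made for this illustration; the slides do not specify a hash function.

```python
import hashlib


class BloomFilter:
    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self.bits = bits      # b: length of the binary word B
        self.hashes = hashes  # number of bits set per element (assumed value)
        self.word = 0         # Python int used as a bit vector, all bits at 0

    def _mask(self, key: object) -> int:
        """Hash the key to a b-bit word with up to `hashes` bits at 1."""
        mask = 0
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            mask |= 1 << (int.from_bytes(digest[:8], "big") % self.bits)
        return mask

    def learn(self, key: object) -> None:
        self.word |= self._mask(key)

    def test(self, key: object) -> bool:
        """True means 'probably learned'; False means 'never learned'."""
        mask = self._mask(key)
        return (self.word & mask) == mask

    def union(self, other: "BloomFilter") -> "BloomFilter":
        """Additivity: the filter of a union is the bitwise OR of the filters."""
        if (self.bits, self.hashes) != (other.bits, other.hashes):
            raise ValueError("filters must share the same parameters")
        merged = BloomFilter(self.bits, self.hashes)
        merged.word = self.word | other.word
        return merged


left, right = BloomFilter(), BloomFilter()
left.learn("K1")
right.learn("K2")
both = left.union(right)
print(both.test("K1"), both.test("K2"), BloomFilter().test("K1"))
```

A learned key always tests positive, so the only possible error is a false positive, whose probability grows as more bits of B are set.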

Method Presentation
- System Overview
- Entity Summary
- Relation Summary
- Storage System

System Overview
- Entity stream E -> Entity Summary Structure:
  - N_e micro clusters
  - N_e Bloom filters
- Entity stream F -> Entity Summary Structure:
  - N_f micro clusters
  - N_f Bloom filters
- Relation stream R -> Relation Summary Structure:
  - CFV cross table of size N_e x N_f

Entity Summary
- Upon the arrival of each new element E_i: (K_e, t, e_1, e_2, ..., e_p)_i:
  - Find the closest micro cluster.
  - Test for admission.
  - If admitted:
    - Update the micro cluster's CFV information.
    - Learn K_e with the Bloom filter attached to the micro cluster.
  - If not admitted:
    - Create a new micro cluster with E_i as its seed.
    - Make room for it by fusing the two closest micro clusters (this implies adding their two Bloom filters as well).

Relation Summary
- Upon the arrival of each new element R_l: (K_e, K_f, t, r_1, r_2, ..., r_d)_l:
  - Check all the Bloom filters for E to locate the one containing K_e. Mark its associated micro cluster C_i.
  - Check all the Bloom filters for F to locate the one containing K_f. Mark its associated micro cluster C_j.
  - If the couple (i, j) is unique, add the element R_l to the CFV of indices (i, j) in the CFV cross table.
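The relation summary update can be sketched in Python as below. Everything beyond the steps listed above is an assumption made for illustration: Bloom filters are represented as plain integers used as bit vectors, `bit_mask`, `unique_match`, and `RelationSummary` are hypothetical names, and an element whose couple (i, j) is not unique is simply skipped, since the slides do not say what happens in that case.

```python
import hashlib
from typing import List, Optional, Sequence


def bit_mask(key: int, bits: int = 256, hashes: int = 3) -> int:
    """Bits that a key sets in a Bloom filter (hash choice is an assumption)."""
    mask = 0
    for i in range(hashes):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % bits)
    return mask


def unique_match(filters: Sequence[int], key: int) -> Optional[int]:
    """Index of the only filter reporting the key, or None if zero or several do."""
    mask = bit_mask(key)
    hits = [i for i, word in enumerate(filters) if (word & mask) == mask]
    return hits[0] if len(hits) == 1 else None


class RelationSummary:
    def __init__(self, n_e: int, n_f: int, dims: int) -> None:
        # One CFV cell per (i, j): [count, sums, sums of squares].
        self.table: List[List[list]] = [
            [[0, [0.0] * dims, [0.0] * dims] for _ in range(n_f)]
            for _ in range(n_e)
        ]

    def add(self, e_filters: Sequence[int], f_filters: Sequence[int],
            k_e: int, k_f: int, values: Sequence[float]) -> bool:
        """Add R_l = (K_e, K_f, t, r1, ..., rd); `values` is (t, r1, ..., rd)."""
        i = unique_match(e_filters, k_e)
        j = unique_match(f_filters, k_f)
        if i is None or j is None:
            return False  # assumed policy: skip ambiguous or unknown keys
        cell = self.table[i][j]
        cell[0] += 1
        for k, v in enumerate(values):
            cell[1][k] += v
            cell[2][k] += v * v
        return True


# Hypothetical state: key 1 was learned by E's filter 0, key 7 by F's filter 1.
e_filters = [bit_mask(1), 0]
f_filters = [0, bit_mask(7)]
summary = RelationSummary(n_e=2, n_f=2, dims=2)
added = summary.add(e_filters, f_filters, 1, 7, (5.0, 3.0))
skipped = summary.add([0, 0], f_filters, 1, 7, (6.0, 1.0))
print(added, skipped, summary.table[0][1])
```

Storing only Bloom filters next to the micro clusters lets the relation stream be routed to a cell of the cross table without keeping the full list of entity keys.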

Storage Management
- The storage system used is the same as the one described in CluStream.
- All three streams are considered to share the same system clock.
- The information saved in each snapshot is:
  - For each entity: the CFV and ID list of each micro cluster.
  - For the relation: the whole CFV matrix.

Work in Progress
- A prototype of the algorithm already exists.
- Algorithm testing:
  - Exploring suitable real datasets:
    - Telecommunication (services/usage/client)
    - Peer-to-peer (documents/requests/users)
    - Airline companies (flight/reservations/passengers)
  - Constructing an artificial dataset:
    - What kind of distribution should be used (Zipf?)
    - What kind of clusters, and how should they evolve?
  - Finding appropriate evaluation criteria and an evaluation scheme.

Conclusions and Perspectives
- This work is still in progress, although a working prototype exists.
- Perspectives include:
  - Extensive evaluation with real and artificial data.
  - Studying the summary querying mechanisms.
  - Extending the method to more complex data schemas (star first, then any relational type).
  - Adapting the method to deal with deletions in the processed streams.
