Miscellaneous Topics in Databases


  1. Miscellaneous Topics in Databases: PARALLEL DBMS

  2. WHY PARALLEL ACCESS TO DATA?
     - At 10 MB/s, scanning 1 terabyte takes about 1.2 days; with 1,000-way parallelism (10 MB/s per node), the same scan takes about 1.5 minutes.
     - Parallelism: divide a big problem into many smaller ones to be solved in parallel.

     PARALLEL DBMS: INTRO
     - Parallelism is natural to DBMS processing.
     - Pipeline parallelism: many machines, each doing one step in a multi-step process.
     - Partition parallelism: many machines doing the same thing to different pieces of data (inputs partitioned N ways, outputs merged).
     - Both are natural in DBMS!

     SOME || TERMINOLOGY
     - Speed-up (throughput, Xact/sec vs. degree of parallelism): more resources mean proportionally less time for a given amount of data.
     - Scale-up (response time, sec./Xact vs. degree of parallelism): if resources are increased in proportion to the increase in data size, time is constant.
     - In the ideal case both curves are linear; realistic curves fall short. Why is realistic <> ideal?
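The scan-time arithmetic on this slide can be checked with a short sketch (the 10 MB/s rate and 1,000-way parallelism are the slide's numbers):

```python
# Scan-time arithmetic from the slide: 1 TB at 10 MB/s, then 1,000-way parallel.
TB = 10**12          # bytes
MB = 10**6           # bytes
rate = 10 * MB       # 10 MB/s per node

sequential_s = TB / rate              # one node scans the whole terabyte
parallel_s = TB / (1000 * rate)       # 1,000 nodes, each scanning 1/1000th

print(sequential_s / 86400)   # ~1.16 days
print(parallel_s / 60)        # ~1.67 minutes
```

The computed values (about 1.2 days sequential, under two minutes parallel) match the slide's rounded figures.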

  3. INTRODUCTION
     - Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
     - Recent desktop computers feature multiple processors, and this trend is projected to accelerate.
     - Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
     - Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

     (Figure: Google data centers around the world, as of 2008.)

     PARALLELISM IN DATABASES
     - Data can be partitioned across multiple disks for parallel I/O.
     - Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data is partitioned, each processor works independently on its own partition, and results are merged when done.
     - Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
     - Thus, databases naturally lend themselves to parallelism.
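The partition-then-merge pattern above can be sketched in a few lines; this is an illustrative toy (a parallel sum over round-robin partitions), not a DBMS implementation:

```python
# Partition parallelism sketch: split the data, run the same operation on each
# partition in a separate worker process, then merge the partial results.
from multiprocessing import Pool

def partial_sum(partition):
    # Each worker does the same thing to a different piece of the data.
    return sum(partition)

def parallel_sum(data, n_workers=4):
    # Round-robin partitioning: worker i gets elements i, i+n, i+2n, ...
    partitions = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, partitions)
    return sum(partials)          # merge step

if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))   # 499500, same as a sequential sum
```

The merge step is trivial here; for a sort or join it would be a multiway merge or a union of partial outputs, as the slide notes.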

  4. PARTITIONING
     - Horizontal partitioning (sharding): putting different rows into different tables.
       Ex: customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest.
     - Vertical partitioning: creating tables with fewer columns and using additional tables to store the remaining columns; it partitions columns even when the schema is already normalized. Also called "row splitting" (the row is split by its columns).
       Ex: split (slow to find) dynamic data from (fast to find) static data, in a table where the dynamic data is not used as often as the static.

     COMPARISON OF PARTITIONING TECHNIQUES
     - Evaluate how well partitioning techniques support the following types of data access:
       1. Scanning the entire relation.
       2. Locating a tuple associatively (point queries), e.g., r.A = 25.
       3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

     HANDLING SKEW USING HISTOGRAMS
     - A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion; assume a uniform distribution within each range of the histogram.
     - The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
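The histogram-based construction can be sketched as follows. This is an illustrative sketch under the slide's uniformity-within-bucket assumption; the function name and the example histogram are made up:

```python
# Sketch: derive a balanced range-partitioning vector from a histogram,
# assuming a uniform distribution of values within each bucket.
def partition_vector(bucket_edges, bucket_counts, n_parts):
    """bucket_edges[i]..bucket_edges[i+1] holds bucket_counts[i] tuples.
    Returns n_parts - 1 split points so each partition gets ~equal counts."""
    total = sum(bucket_counts)
    target = total / n_parts
    splits, acc, need = [], 0.0, target
    for i, count in enumerate(bucket_counts):
        if count == 0:
            continue
        lo, hi = bucket_edges[i], bucket_edges[i + 1]
        while acc + count >= need and len(splits) < n_parts - 1:
            # Interpolate inside the bucket (uniformity assumption).
            frac = (need - acc) / count
            splits.append(lo + frac * (hi - lo))
            need += target
        acc += count
    return splits

# Skewed histogram: 80% of the tuples fall in [0, 10).
print(partition_vector([0, 10, 20, 30], [80, 10, 10], 4))
# [3.125, 6.25, 9.375] -- three of the four partitions land in the hot range
```

Without the histogram, naive equi-width splits at 7.5/15/22.5 would give the first partition most of the data; the histogram-derived vector rebalances it.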

  5. INTERQUERY PARALLELISM
     - Queries/transactions execute in parallel with one another (concurrent processing).
     - Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
     - Easiest form of parallelism to support.

     INTRAQUERY PARALLELISM
     - Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
     - Two complementary forms of intraquery parallelism:
       - Intraoperation parallelism: parallelize the execution of each individual operation in the query (each CPU runs on a subset of the tuples).
       - Interoperation parallelism: execute the different operations in a query expression in parallel (each CPU runs a subset of the operations on the data).

     PARALLEL JOIN
     - The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
     - Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
     - In a final step, the results from each processor are collected together to produce the final result.
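One common way to split the pairs, sketched below, is a partitioned hash join: hash both relations on the join key so matching tuples land in the same partition, join each partition locally, then concatenate. The loop over partitions runs sequentially here for simplicity; in a real system each iteration would run on its own processor:

```python
# Sketch (illustrative): a partitioned hash join on equal keys.
def partitioned_hash_join(r, s, n_parts=3):
    """r and s are lists of (key, payload) tuples; join on key equality."""
    r_parts = [[] for _ in range(n_parts)]
    s_parts = [[] for _ in range(n_parts)]
    for t in r:
        r_parts[hash(t[0]) % n_parts].append(t)   # same hash on both sides,
    for t in s:
        s_parts[hash(t[0]) % n_parts].append(t)   # so matches co-locate

    output = []
    for rp, sp in zip(r_parts, s_parts):          # one iteration = one processor
        table = {}
        for key, payload in rp:                   # build on the r-partition
            table.setdefault(key, []).append(payload)
        for key, payload in sp:                   # probe with the s-partition
            for r_payload in table.get(key, []):
                output.append((key, r_payload, payload))
    return output                                 # final collection step

emp = [(1, "ann"), (2, "bob"), (3, "carl")]
dept = [(1, "sales"), (3, "hr"), (4, "it")]
print(sorted(partitioned_hash_join(emp, dept)))
# [(1, 'ann', 'sales'), (3, 'carl', 'hr')]
```

Because tuples with equal keys always hash to the same partition, no cross-partition pairs need testing, which is what makes the local joins independent.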

  6. QUERY OPTIMIZATION
     - Query optimization in parallel databases is more complex than in sequential databases.
     - Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
     - When scheduling an execution tree in a parallel system, we must decide:
       - how to parallelize each operation, and how many processors to use for it;
       - which operations to pipeline, which to execute independently in parallel, and which to execute sequentially.
     - Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.

     DEDUCTIVE DATABASES: OVERVIEW
     - Declarative language: a language to specify rules.
     - Inference engine (deduction machine): can deduce new facts by interpreting the rules.
     - Related to logic programming:
       - the Prolog language (Prolog => Programming in logic);
       - uses backward chaining to evaluate (top-down application of the rules).
     - Consists of:
       - Facts: similar to a relation specification, without the necessity of including attribute names.
       - Rules: similar to relational views (virtual relations that are not stored).

  7. PROLOG/DATALOG NOTATION
     - Facts are provided as predicates. A predicate has a name and a fixed number of arguments.
     - Convention: constants are numeric or character strings; variables start with upper-case letters.
     - E.g., SUPERVISE(Supervisor, Supervisee) states that Supervisor SUPERVISE(s) Supervisee.
     - Rule: is of the form head :- body, where :- is read as "if".
       - E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
       - E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
     - Query: a predicate symbol followed by some variable arguments, posing the question to answer.
       - E.g., SUPERIOR(james,Y)?
       - E.g., SUBORDINATE(james,X)?
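The two rules and the queries can be evaluated mechanically; here is a minimal sketch over a hypothetical SUPERVISE relation (the employee names are invented for illustration):

```python
# Hypothetical facts: SUPERVISE(Supervisor, Supervisee).
supervise = {("james", "joyce"), ("james", "franz"), ("joyce", "anna")}

# SUPERIOR(X,Y) :- SUPERVISE(X,Y)   -- the head holds if the body holds.
superior = {(x, y) for (x, y) in supervise}

# SUBORDINATE(Y,X) :- SUPERVISE(X,Y)   -- same body, arguments swapped.
subordinate = {(y, x) for (x, y) in supervise}

# Query SUPERIOR(james, Y)?  Bind X to the constant james, collect every Y.
print(sorted(y for (x, y) in superior if x == "james"))   # ['franz', 'joyce']

# Query SUBORDINATE(james, X)?  No one supervises james, so the answer is empty.
print(sorted(x for (y, x) in subordinate if y == "james"))   # []
```

A Prolog engine would answer the same queries by backward chaining from the goal; the set comprehensions here apply the rules forward instead, which is closer to how Datalog is evaluated over stored relations.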

  8. PROLOG NOTATION
     (Figure: supervisory tree.)

     PROVING A NEW FACT
     (Figure: derivation example.)

  9. DATA MINING: DEFINITION
     Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
     Example pattern (Census Bureau data): if (relationship = husband), then (gender = male), which holds 99.6% of the time.

     DEFINITION (CONT.)
     - Valid: the patterns hold in general.
     - Novel: we did not know the pattern beforehand.
     - Useful: we can devise actions from the patterns.
     - Understandable: we can interpret and comprehend the patterns.
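The 99.6% figure is the fraction of matching records for which the pattern holds. Measuring that fraction for an if-then pattern is a one-liner; the records below are made up for illustration (the real census data is not reproduced here):

```python
# Sketch: how often does "if relationship = husband then gender = male" hold?
records = [
    {"relationship": "husband", "gender": "male"},
    {"relationship": "husband", "gender": "male"},
    {"relationship": "wife",    "gender": "female"},
    {"relationship": "husband", "gender": "female"},   # a rare counterexample
]

matching = [r for r in records if r["relationship"] == "husband"]
holds = sum(1 for r in matching if r["gender"] == "male")
print(holds / len(matching))   # 2/3 on this toy data; 0.996 on the census data
```

A pattern like this one is valid (it holds in general) but arguably neither novel nor useful, which is why the definition lists all four criteria.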

  10. WHY USE DATA MINING TODAY?
      - Human analysis skills are inadequate: the volume and dimensionality of the data, and the high data growth rate.
      - Availability of: data, storage, computational power, off-the-shelf software, and expertise.

      THE KNOWLEDGE DISCOVERY PROCESS
      Steps: identify the business problem -> data mining -> action -> evaluation and measurement -> deployment and integration into business processes.

      PREPROCESSING AND MINING
      (Figure: pipeline from Original Data, through Data Integration and Selection to Target Data, through Preprocessing to Preprocessed Data, through Model Construction to Patterns, and through Interpretation to Knowledge.)
