Why Is This Important? Many ways to model a given scenario in a database Schema Refinement and How do we find the best one? We will discuss objective criteria for evaluating Normal Forms database design quality Formally define desired properties Chapter 19 Algorithms for determining if a database has these properties Algorithms for fixing problems 1 2 The Evils of Redundancy Functional Dependencies (FDs) A functional dependency X Y holds over relation R Redundancy is at the root of several problems associated with relational schemas: if, for every allowable instance r of R: Redundant storage t1 r, t2 r, X (t1) = X (t2) implies Y (t1) = Y (t2) Insert, delete, update anomalies I.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.) Integrity constraints can be used to identify schemas An FD is a statement about all allowable relations. with such problems and to suggest refinements. Must be identified based on semantics of application. Main refinement technique: decomposition Given some allowable instance r1 of R, we can check if it Replacing ABCD with, say, AB and BCD, or ACD and ABD. violates some FD f, but we cannot tell if f holds over R. Decomposition should be used judiciously: K is a candidate key for R means that K R Is there reason to decompose a relation? However, K R does not require K to be minimal. What problems (if any) does the decomposition cause? 3 4 Wages R W 8 10 Example: Constraints on Entity Set Example (Contd.) 5 7 Hourly_Emps2 S N L R H Are the two smaller Consider a relation obtained from Hourly_Emps: 123-22-3666 Attishoo 48 8 40 tables better? Hourly_Emps (ssn, name, lot, rating, hrly_wages, 231-31-5368 Smiley 22 8 30 Problems in single “wide” hrs_worked) table due to R W: 131-24-3650 Smethurst 35 5 30 Notation: We will denote this relation schema by Update anomaly: Can we 434-26-3751 Guldu 35 5 32 listing the attributes: SNLRWH change W in just the first 612-67-4134 Madayan 35 8 40 tuple of SNLRWH? This is really the set of attributes {S,N,L,R,W,H}. Insertion anomaly: What S N L R W H if we want to insert an Sometimes, we will refer to all attributes of a relation by employee and don’t know 123-22-3666 Attishoo 48 8 10 40 using the relation name. (e.g., Hourly_Emps for SNLRWH) the hourly wage for his rating? 231-31-5368 Smiley 22 8 10 30 Some FDs on Hourly_Emps: Deletion anomaly: If we 131-24-3650 Smethurst 35 5 7 30 delete all employees with ssn is the key: S SNLRWH rating 5, we lose the 434-26-3751 Guldu 35 5 7 32 rating determines hrly_wages: R W information about the wage for rating 5. 612-67-4134 Madayan 35 8 10 40 5 6
Reasoning About FDs Reasoning About FDs (Contd.) Given some FDs, we can infer additional FDs: Additional rules (that follow from the AA): ssn did, did lot implies ssn lot Union: If X Y and X Z, then X YZ Decomposition: If X YZ, then X Y and X Z An FD f is implied by a set of FDs F if f holds whenever all FDs in F hold. Example: Contracts(cid, sid, jid, did, pid, qty, value) and: F + = closure of F; is the set of all FDs that are implied by F. C is the key: C CSJDPQV Project purchases each part using single contract: JP C Armstrong’s Axioms (X, Y, Z are sets of attributes): Dept purchases at most one part from a supplier: SD P Reflexivity: If X Y, then Y X. JP C, C CSJDPQV imply JP CSJDPQV Augmentation: If X Y, then XZ YZ for any Z. Transitivity: If X Y and Y Z, then X Z. SD P implies SDJ JP These are sound (generate only FDs in F + ) and complete SDJ JP, JP CSJDPQV imply SDJ CSJDPQV (generate all FDs in F + ) inference rules for FDs. 7 8 Reasoning About FDs (Contd.) So, What Do We Do Now With FDs? Computing the closure of a set of FDs can be expensive. Essential for identifying problems in a database Size of closure is exponential in # attributes design Typically, we just want to check if a given FD X Y is in Provide a way for “fixing” the problem the closure of a set of FDs F. An efficient algorithm: Compute attribute closure of X (denoted X + ) wrt F: Key concept: normal forms • Set of all attributes A such that X A is in F+ A relation that is in a certain normal form has certain • There is a linear time algorithm to compute this. Check if Y is in X + desirable properties Does F = {A B, B C, CD E} imply A E? I.e, is A E in the closure F + ? Equivalently, is E in A+? 9 10 Normal Forms Boyce-Codd Normal Form (BCNF) Reln R with FDs F is in BCNF if, for all X A in F + Returning to the issue of schema refinement, the first A X (called a trivial FD), or question to ask is whether any refinement is needed. X is a superkey for R. If a relation is in a certain normal form (BCNF, 3NF etc.), In other words, R is in BCNF if the only non-trivial FDs it is known that certain kinds of problems are avoided or that hold over R are key constraints. minimized. R is free of any redundancy caused by FDs alone. Helps deciding whether decomposing the relation will help. • No field of any tuple can be inferred (using only FDs) from the values in the other fields in the relation instance Role of FDs in detecting redundancy: • For X A, consider two tuples with the same X Y A Consider a relation R with three attributes, ABC. X value. x y1 a • No FDs hold: There is no redundancy here. • They should have the same A value. Redundancy? • Given A B: Several tuples could have the same A value, and if so, • No. Since R is in BCNF, X is a superkey and hence x y2 ? they all have the same B value. the “two” tuples must be identical. 11 12
Problems Prevented By BCNF Third Normal Form (3NF) If BCNF is violated by (non-trivial) FD X A, one of the Reln R with FDs F is in 3NF if, for all X A in F + following holds: A X (called a trivial FD), or X is a subset of some key K. X is a superkey for R, or • We store (X, A) pairs redundantly. A is part of some key for R. • E.g., Reserves(S, B, D, C) with SBD as only key and FD S C Minimality of a key is crucial in third condition above. • Credit card number of a sailor stored for each reservation If R is in BCNF, is it automatically in 3NF? What about the X is not a proper subset of any key. other direction? • Redundant storage of (X, A) pairs as above • And there is a chain of FDs K X A, which means that we cannot If R is in 3NF, some redundancy is possible. associate an X value with a K value unless we also associate an A value 3NF is a compromise, used when BCNF is not achievable (e.g., with an X value. no ``good’’ decomposition, or performance considerations). • E.g., Hourly_Emps(S, N, L, R, W, H) with S as only key and FD R W Lossless-join, dependency-preserving decomposition of R into a • Have chain S R W, hence cannot record the fact that employee S has rating R without knowing the hourly wage for that rating collection of 3NF relations is always possible. (covered soon) 13 14 What Does 3NF Achieve? Footnote About Other Normal Forms Prevents same problems as BCNF, except for FDs where 1NF: every field contains only atomic values, i.e., no A is part of some key lists or sets Consider FD X A where X is no superkey, but A is part of some 2NF: 1NF, and all attributes that are not part of any key E.g., Reserves(S, B, D, C) with only key SBD and FDs S C and candidate key are functionally dependent on the C S is in 3NF whole of every candidate key • Notice: same example as before, but adding C S made it 3NF 3NF implies 2NF • Why? Since C S and SBD is a key, CBD is also a key. Hence for S C, C is part of a key 4NF: prevents redundancy from multi-valued • Redundancy problem: for each reservation of sailor S, same (S, C) pair is stored. dependencies (see book) BCNF did not suffer from this redundancy problem. 5NF: addresses redundancy based on join So, why do we need 3NF? Let’s look at decompositions dependencies, which generalize multi-valued first. dependencies (see book) 15 16 Example Decomposition Decomposition of a Relation Schema Decompositions should be used only when needed. Suppose relation R contains attributes A1,..., An. A Let SNLRWH have FDs S SNLRWH and R W decomposition of R replaces R by two or more Second FD causes violation of 3NF relations such that: • W values repeatedly associated with R values. Each new relation schema contains a subset of the Easiest fix: create a relation RW to store these associations attributes of R (and no attributes that do not appear in R), and remove W from the main schema: and • I.e., we decompose SNLRWH into SNLRH and RW Every attribute of R appears as an attribute of at least one Each SNLRWH tuple will now be projected into two of the new relations. tuples, SNLRH and RW, each stored in the Intuition: decomposing R means we will store corresponding relation instances of the relation schemes produced by the Are there any potential problems with this approach? decomposition, instead of instances of R. 17 18
Recommend
More recommend