Querying Relational Data: Algebra Gerome Miklau UMass Amherst CMPSCI 645 – Database Systems Jan 31, 2008 Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom
Next lectures • Today – Relational model, relational algebra • Next Tuesday – SQL • Homework 1 will be on these topics
The Relational Model •The relational data model (Codd, 1970): – Data independence: details of physical storage are hidden from users – High-level declarative query language • say what you want, not how to compute it. • mathematical foundation – A theory of normalization guides the design of relations
Relational Database: Definitions • Relational database: a set of relations • Relation: made up of 2 parts: – Instance : a table , with rows and columns. – Schema : specifies name of relation, plus name and type/domain of each column. Students( sid : string, name : string, login : string, age : integer, gpa : real). Restriction: all attributes are of atomic type, no nested tables
Relational instances: tables Arity (number of attributes) is 5 column, attribute, field Students row, tuple Attribute value A relation is a set of tuples: no tuple can occur more than once – Real systems may allow duplicates for efficiency or other reasons – we’ll come back to this.
Relational Query Languages • Query languages: Allow manipulation and retrieval of data from a database. • Query Languages != programming languages! – QLs not expected to be “Turing complete”. – QLs not intended to be used for complex calculations. – QLs support easy, efficient access to large data sets.
Preliminaries Query Q: R 1 ..R n → R’ • A query is applied to one or more relation instances • The result of a query is a relation instance. • Input and output schema: – Schema of input relations for a query are fixed – The schema for the result of a given query is also fixed: determined by definition of query language constructs.
What is an “Algebra” • Mathematical system consisting of: – Operands --- variables or values from which new values can be constructed. – Operators --- symbols denoting procedures that construct new values from given values.
What is the Relational Algebra? • An algebra whose operands are relations or variables that represent relations. • Operators are designed to do the most common things that we need to do with relations in a database. – The result is an algebra that can be used as a query language for relations.
Relational Algebra • Operates on relations, i.e. sets – Later: we discuss how to extend this to bags • Five operators: – Union: ∪ – Difference: - – Selection: σ – Projection: Π – Cartesian Product: × • Derived or auxiliary operators: – Intersection, complement – Joins (natural,equi-join, theta join) – Renaming: ρ – Division: /
Example Database STUDENT COURSE Takes sid name cid name sem sid cid 1 Jill 645 DB F05 1 645 2 Bo 683 AI S05 1 683 3 Maya 635 Arch F05 3 635 PROFESSOR Teaches fid name fid cid 1 Diao 1 645 2 Saul 2 683 8 Weems 8 635
1. Union and 2. Difference R 1 R 2 sid name sid name 1 Jill 1 Jill 2 Bo 4 Bob 3 Maya R 1 – R 2 sid name R 1 ∪ R 2 sid name 2 Bo 1 Jill 3 Maya 2 Bo 3 Maya 4 Bob
What about Intersection ? • It is a derived operator • R 1 ∩ R 2 = R 1 – (R 1 – R 2 ) • Also expressed as a join (will see later) R 1 R 2 R 1 – R 2
3. Selection • Returns all tuples which satisfy a condition Course • Notation: σ c (R) cid name sem • Examples 645 DB F05 683 AI S05 σ CID > 600 (Course) 635 Arch F05 σ name = “AI” (Course) • The condition c can be =, <, ≤ , >, ≥ , <>
4. Projection • Eliminates columns, then removes duplicates • Notation: Π A1,…,An (R) • Example: project cid and name Π cid, name (Course) Output schema: Answer(cid, name) Course Answer cid name sem cid name Π 645 DB F05 645 DB 683 AI S05 683 AI 645 DB S05
5. Cartesian Product • Each tuple in R 1 with each tuple in R 2 • Notation: R 1 × R 2 • Very rare in practice; mainly used to express joins Also called “Cross Product”
Cartesian Product Student × Takes Student Takes sid name sid cid sid cid sid name 1 Jill 1 645 1 645 1 Jill 1 Jill 1 683 2 Bo 1 683 1 Jill 3 635 3 635 2 Bo 1 645 2 Bo 1 683 2 Bo 3 635 17
Renaming • Changes the schema , not the instance • Notation: ρ B1,…,Bn (R) • Example: ρ courseID, cname, term (Course) Course cname term cid name sem courseID ρ 645 DB F05 645 DB F05 683 AI S05 683 AI S05 645 DB S05 645 DB S05
Natural Join • Notation: R 1 R 2 • Meaning: R 1 R 2 = Π A ( σ C (R 1 × R 2 )) • Where: – The selection σ C checks equality of all common attributes – The projection eliminates the duplicate common attributes
Natural join example Student Student Takes sid name 1 Jill sid name cid 2 Bo 3 Maya 1 Jill 645 1 Jill 683 Takes 3 Maya 635 sid cid 1 645 1 683 3 635 20
Example Database STUDENT COURSE Takes sid name cid name sem sid cid 1 Jill 645 DB F05 1 645 2 Bo 683 AI S05 1 683 3 Maya 635 Arch F05 3 635 PROFESSOR Teaches fid name fid cid 1 Diao 1 645 2 Saul 2 683 8 Weems 8 635
Natural join questions • Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R S ? – R(A,B,C,D,E) • Given R(A, B, C), S(D, E), what is R S ? – Cartesian Product • Given R(A, B), S(A, B), what is R S ? – Intersection
Theta Join • A join that involves a predicate • R1 θ R2 = σ θ (R1 × R2) • Here θ can be any condition: =, <, ≠ , ≤ , >, ≥ Example: Student sid<sid Takes
Equi-join • A theta join where θ is an equality • R 1 A=B R 2 = σ A=B (R 1 × R 2 ) • Very useful join in practice • Example: Student sid=sid Takes
Division Not supported as a primitive operator, but useful for expressing queries like: Find students who have enrolled in all systems courses . Let A have 2 fields, x and y ; B have only field y : A/B = { (x) | ∀ (y) ∈ B, ∃ (x,y) ∈ A } i.e., A/B contains all x tuples (students) such that for every y tuple (course) in B , there is an xy tuple in A . Or : If the set of y values (courses) associated with an x value (student) in A contains all y values in B , the x value is in A/B . In general, x and y can be any lists of fields; y is the list of fields in B , and x y is the list of fields of A .
Division examples sno pno pno pno pno s1 p1 p2 p2 p1 s1 p2 p4 p2 B1 s1 p3 p4 B2 s1 p4 B3 s2 p1 sno s2 p2 s1 sno s3 p2 s2 sno s1 s4 p2 s3 s1 s4 s4 s4 p4 A/B1 A/B2 A/B3 A
Expressing A/B Using Basic Operators Division is not essential op; just a useful shorthand. (Also true of joins, but joins are so common that systems implement joins specially.) Idea : For A/B , compute all x values that are not `disqualified’ by some y value in B . x value is disqualified if by attaching y value from B , we obtain an xy tuple that is not in A . Disqualified x values: all disqualified tuples A/B:
Combining operators: complex expressions Π name , sid ( σ name = ”DB” ( Students (Takes Course))) Π name , sid σ name = ”DB” Students Takes Course
Query Optimization Is Based on Algebraic Equivalences • Relational algebra has laws of commutativity, associativity, etc. that imply certain expressions are equivalent . • They may be different in cost of evaluation! Definition: Query Equivalence σ c ∧ d (R) ≡ σ c ( σ d (R) ) cascading selection Two queries Q and Q’ are equivalent if: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T) join associativity for all databases D, Q(D) = Q’(D) σ c (R ⋈ S) ≡ σ c (R) ⋈ S pushing selections • Query optimization finds the most efficient representation to evaluate (or one that’s not bad)
Operations on Bags A bag = a set with repeated elements Relational Engines work on bags, not sets ! All operations need to be defined carefully on bags • {a,b,b,c} ∪ {a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f} • {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b} • σ C (R): preserves the number of occurrences • Π A (R): no duplicate elimination • Cartesian product, join: no duplicate elimination
Beware: Bag Laws != Set Laws • Some, but not all algebraic laws that hold for sets also hold for bags. • Example: the commutative law for union ( R ∪ S = S ∪ R ) does hold for bags. – Since addition is commutative, adding the number of times x appears in R and S doesn’t depend on the order of R and S .
Example of the Difference • Set union is idempotent , meaning that S ∪ S = S . • However, for bags, if x appears n times in S , then it appears 2 n times in S ∪ S . • Thus S ∪ S != S in general.
Relational calculus •What is a “calculus”? – The term "calculus" means a system of computation – The relational calculus is a system of computing with relations 33
Recommend
More recommend