Welcome to CMPSCI 645 Database Design & Implementation Instructor: Gerome Miklau
Overview of Databases Gerome Miklau CMPSCI 645 – Database Design & Implementation UMass Amherst Jan 29, 2008 Some slide content courtesy of Zack Ives, Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom
Today • Student information form • Overview of databases • Course topics • Course requirements
Databases & DBMS’s • A database is a large, integrated collection of data. • A database management system (DBMS) is a collection of software designed to store and manage databases, allowing: – Define the kind of data stored – Querying/updating interface – Reliable storage & recovery of 100s of GB – Control access to data from many concurrent users
Can filesystems do it? No • Schema for files is limited • No query language for data in files • Files can store large amounts of data, but – no efficient access to items within file – no recovery from failure • Concurrent access not safe
Evolution • Early DBMS’s (1960’s), evolved from file systems. • Data with many small items & many queries or modifications: – Airline reservations – Banking
Early DB systems Data model The data model includes basic assumptions about what’s an “item” of data, how to represent it and interpret it. • Tree-based hierarchical data model • Graph-based network data model • Encouraged users to think about data the way it was stored. • No high level query language
The Relational Model •The relational data model (Codd, 1970): – Data independence: details of physical storage are hidden from users – High-level declarative query language • say what you want, not how to compute it. • mathematical foundation – A theory of normalization guides the design of relations Side-note: Turing Awards in Databases 1973: Bachman, networked data model 1981: Codd, relational model 1998: Jim Gray, transaction processing
DBMS Benefit #1: Generality and Declarativity • The programmer or user does not need to know details like indices, sort orders, machine speeds, disk speeds, concurrent users, etc. • Instead, the programmer/user programs with a logical model in mind • The DBMS “makes it happen” based on an understanding of relative costs of different methods
Benefit #2: Efficiency and Scale • Efficient storage of hundreds of GBs of data • Efficient access to data • Rapid processing of transactions
Benefit #3: Management of Concurrency and Reliability • Simultaneous transactions handled safely. • Recovery of system data after system failure. • More formally: the ACID properties – Atomicity - all or nothing – Consistency - sensible state not violated – Isolation - separated from effects – Durability - once completed, never lost
How Does One Build a Database? • Start with a conceptual model • Design & implement schema • Write applications using DBMS and other tools – Many ways of doing this (DBMS, API writers, library authors, web server, etc.) – Common applications include PHP/JSP/servlet- driven web sites • The DBMS takes care of query optimization and execution
Conceptual Design fid PROFESSOR name Teaches STUDENT COURSE Takes cid name semester sid name
Designing a Schema (Set of Relations) STUDENT COURSE Takes sid name cid name sem sid cid 1 Jill 645 DB F05 1 645 2 Bo 683 AI S05 1 683 3 Maya 635 Arch F05 3 635 PROFESSOR Teaches • Convert to tables + fid name fid cid constraints 1 Diao 1 645 • Then need to do 2 Saul 2 683 “physical” design: the layout on disk, indices, 8 Weems 8 635 etc.
Queries • Find all courses that “Mary” takes SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid • What happens behind the scene ? – Query processor figures out how to answer the query efficiently.
Queries, behind the scene Declarative SQL query Query execution plan: sname SELECT C.name FROM Students S, Takes T, Courses C WHERE S.name=“Mary” and S.sid = T.sid and T.cid = C.cid cid=cid sid=sid name=“Mary” Courses Takes Students The optimizer chooses the best execution plan for a query
An Issue: 80% of the World’s Data is Not in a DB! Examples: – Scientific data (large images, complex programs that analyze the data) – Personal data – WWW and email (some of it is stored in something resembling a DBMS) Data management is expanding to tackle these problems
DBMSs in the Real World A huge industry for 20% of the world’s data! • Big, mature relational databases – IBM DB2, Oracle, Microsoft SQL Server – Adding advanced features, including “native XML” support • “Middleware” above these systems – SAP, Siebel, PeopleSoft, dozens of special-purpose apps • Integration and warehousing systems – BEA AquaLogic, DB2 Information Integrator • Current trends: – Web services; XML everywhere – Smarter, self-tuning systems – Distributed databases, column-stores
Database Research • One of the broadest, most exciting areas in CS! • A microcosm of CS in general • languages, operating systems, concurrent programming, data structures, algorithms, theory, distributed systems, statistical techniques. • Theory and systems well-integrated.
Recent Trends in Databases • XML – Relational databases with XML support – Middleware between XML and relational databases – Large-scale XML message systems • Main memory database systems • Peer data management • Stream data management • Model management, provenance • Security and privacy • Modeling uncertainty, probabilistic databases
What is the Field of Databases ? • To an applied researcher (SIGMOD/VLDB/ICDE) – Query optimization – Query processing (yet-another join algorithm) – Transaction processing, recovery (but most stuff is already done) – Novel applications: data mining, high-dimensional search • To a theoretical researcher (PODS/ICDT/LICS) – Focus on the query languages – Query language = logic = complexity classes
Course topics • Fundamentals : relational design, query languages. • Theory : expressiveness of query languages, static analysis, complexity. • Database internals : storage, indexing, query processing, query optimization, transaction management. • XML and semi-structured data models. • Security: access control, privacy. • Advanced topics : incomplete/probabilistic DBs, parallel and distributed DBs.
Prerequisites • Official: undergrad course in DB or OS • Also: – Elementary complexity theory
Grading • Homework: 30% • Paper reviews & participation: 15% • Midterm: 30% • Project: 25%
Homework: 30% • ~ 4 assignments throughout the course – written problem sets – practical experience with SQL, XQuery
Paper Reviews & Participation: 15% • Approximately 5 classic papers will be assigned • Short written reviews are due before the day of class. Email to: – cs645-reviews@cs.umass.edu First paper review: Read thru 1.4 of Codd’s paper Due Wed Feb 5th
Project: 25% • General theme: apply database principles to a new problem • Suggested topics will be discussed next Tuesday • Groups of 2 preferred. 3 possible. • Project work will include: – Reading some of the research literature – Implementation – Written report – In-class presentation • Periodic consultation with the instructor
Midterm Exam (30%) • Midterm scheduled for Apr 17th at 7pm • (No Final!)
Textbook Database Management Systems Ramakrishnan and Gehrke
Other useful resources • Database systems: the complete book (Ullman, Widom and Garcia-Molina) • Readings in Database Systems (Stonebraker and Hellerstein) • Foundations of Databases (Abiteboul, Hull, Vianu) • Data on the Web (Abiteboul, Buneman, Suciu) • Parallel and Distributed DBMS (Ozsu and Valduriez) • Transaction Processing (Gray and Reuter) • Data and Knowledge based Systems (volumes I, II) (Ullman) • Proceedings of SIGMOD, VLDB, PODS conferences.
Communication • Instructor – Office hours: by appointment – Email: miklau at cs dot umass dot edu • Check the course webpage often • You should have been added to the mailing list. 31
Questions about the course? 32
Recommend
More recommend