DATABASES FOR DATA SCIENTIST Asking and Logical and Requirement Conceptual Answering Physical Engineering Modeling Questions Modeling (Analysis) Conceptual Logical Design Relational Algebra, “Book of Duty” Design (ER) (schema, table names, SQL data types), Physical Design (indices, memory layout, optimizations) https://cs.brown.edu/courses/csci1951-a/res/static/lec02_databases.pdf 19
Entity-Relationship (ER) Model 20
Entity-Relationship (ER) Model ENTITY Attribute RELATIONSHIP 21
Entity-Relationship (ER) Model TWEET PERSON Mentions Timestamp Text Name ID Author CANDIDATE Party Name ID 22
Entity-Relationship (ER) Model TWEET Timestamp Mentions 01/26/2019 03:14:15 SenGillibrand Text Thinkin bout @SenGillibrand. Is Author it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 #maga JoeSchmo #blessed 23
Entity-Relationship (ER) Model Multivalued Attribute TWEET Timestamp Mentions 01/26/2019 03:14:15 SenGillibrand Text Thinkin bout @SenGillibrand. Is Author it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 #maga JoeSchmo #blessed 24
Entity-Relationship (ER) Model Multivalued Attribute TWEET Timestamp Mentions 01/26/2019 03:14:15 DNC, SenGillibrand Text @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 25
Entity-Relationship (ER) Model Composite Attribute Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 26
Clicker Question! 27
Entity-Relationship (ER) Model Clicker Question! Which representation is better? TWEET TWEET Day Time 26 Timestamp Min. Month 14 01 Hour 01/26/2019 03:14:15 Year 03 2019 (a) Composite (b) Normal 28
Entity-Relationship (ER) Model Clicker Question! Which representation is better? TWEET TWEET Day Time 26 Timestamp Min. Month 14 01 Hour 01/26/2019 03:14:15 Year 03 2019 (a) Composite (b) Normal 29
Entity-Relationship (ER) Model ❌❌❌ Find all tweets sent between 2am and 4am that mention democratic primary candidates TWEET Timestamp Mentions 01/26/2019 03:14:15 DNC, SenGillibrand Text @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 30
Entity-Relationship (ER) Model Find all tweets sent between 2am and 4am that ✔✔✔ mention democratic primary candidates Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 31
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 32
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 33
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 34
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 35
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 36
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 37
Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 38
Clicker Question! 39
Entity-Relationship (ER) Model Clicker Question! Is it a good idea to use author +timestamp as a key? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed (a) Yes! (b) No! 40
Entity-Relationship (ER) Model Clicker Question! Is it a good idea to use author +timestamp as a key? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed (a) Yes! (b) No! 41
Entity-Relationship (ER) Model Key Attribute…? TWEET Day Key Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 42
Entity-Relationship (ER) Model Key Attribute…? 83948547823 Key Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 43
Entity-Relationship (ER) Model Relationships PERSON TWEET ID Mentions ID Name Timestamp Text Author CANDIDATE Party ID Name 44
Entity-Relationship (ER) Model Relationships PERSON TWEET ID Mentions ID Name Timestamp Text Author CANDIDATE Party ID Name 45
Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Mentions Text CANDIDATE Party ID Name 46
Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Text Mentioned_in CANDIDATE Party ID Name 47
Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Mentions Text CANDIDATE Party ID Name 48
Entity-Relationship (ER) Model Relationships ID Timestamp Direct/ Indirect? TWEET Author Sentiment Mentions Text CANDIDATE Party ID Name 49
Entity-Relationship (ER) Model ID Name Author_of Timestamp TWEET PERSON ID CANDIDATE Text Mentions Party Name Direct/ ID Indirect? Sentiment 50
Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) Mentions Party Name Direct/ ID Indirect? Sentiment 51
Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) a tweet can have any Mentions Party number of mentions, or Name no mentions Direct/ ID Indirect? Sentiment 52
Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) each mention corresponds Mentions Party one and only one Name candidate Direct/ ID Indirect? Sentiment 53
Design Decisions • You can talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 54
Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 55
Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 56
Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 57
Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 58
Before we proceed… Burning Questions? 59
DATABASES FOR DATA SCIENTIST Asking and Logical and Requirement Conceptual Answering Physical Engineering Modeling Questions Modeling (Analysis) Conceptual Logical Design Relational Algebra, “Book of Duty” Design (ER) (schema, table names, SQL data types), Physical Design (indices, memory layout, optimizations) https://cs.brown.edu/courses/csci1951-a/res/static/lec02_databases.pdf 60
Relational Model 61
Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 62
Relational Model Relation TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 63
Relational Model Relation Name TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 64
Relational Model Attribute TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Domain: D = dom(Timestamp) = Valid time strings = ##/##/#### ##:## 65
Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Tuple 66
Relational Model Relation Schema (R) TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 67
Relational Model Relation Schema (R) TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Relation State r(R) 68
Relational Model Intension TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Extension 69
Find all the tweets by authors named Diane. Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 70
Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 71
Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ∅ TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 72
Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ✔ ∅ TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 73
Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ✔ ∅ “Closed world assumption” TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 74
Find all the tweets which weight less than 45lbs Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 75
Find all the tweets which weight less than 45lbs SELECT * FROM TWEET WHERE Weight < 45 Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 76
Find all the tweets which weight less than 45lbs SELECT * FROM TWEET WHERE Weight < 45 Relational Model ❌ ???? TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 77
SQL 78
SQL • Data Definition Language (DDL): Defining data types and Relation Schemas (intensions!) • Data Manipulation and Query Language (DML): • Populating/updating data bases (extensions!) • Querying data bases 79
Creating and Manipulating Tables 80
Creating and Manipulating Tables ID Name Author_of Timestamp TWEET PERSON ID CANDIDATE Text Mentions Party Name Direct/ ID Indirect? Sentiment 81
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 82
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 83
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 84
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 85
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 86
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 87
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 88
Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 89
Creating Tables 90
Creating Tables TWEET: <ID, Time, Text> 91
Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text ??? ); 92
Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text ??? ); CHAR(n), VARCHAR(n), CLOB(size) ?? 93
Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text VARCHAR(140) ); 94
Creating Tables • CHAR(n): faster -> can use static memory allocation; no length checks in TWEET: <ID, Time, Text> operations, so less overhead • VARCHAR(n): uses less space on create table TWEET ( average ID INT, Time TIMESTAMP, Text VARCHAR(140) ); 95
Creating Tables PERSON: <Handle, Name> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000) ); 96
Creating Tables PERSON: <Handle, Name, ProfilePic> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic ??? ); 97
Creating Tables PERSON: <Handle, Name, ProfilePic> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ); 98
Creating Tables PERSON: <Handle, Name, ProfilePic, ProfilePage> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ProfilePage ??? ); 99
Creating Tables PERSON: <Handle, Name, ProfilePic, ProfilePage> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ProfilePage CLOB(20MB) ); 100
Recommend
More recommend