data representation and intro sql
play

Data Representation and Intro SQL January 28, 2020 Data Science - PowerPoint PPT Presentation

Data Representation and Intro SQL January 28, 2020 Data Science CSCI 1951A Brown University Instructor: Ellie Pavlick HTAs: Josh Levin, Diane Mutako, Sol Zitter 1 But first! Waitlistwe are working our way through Top Hat; Have


  1. DATABASES FOR DATA SCIENTIST Asking and Logical and Requirement Conceptual Answering Physical Engineering Modeling Questions Modeling (Analysis) Conceptual Logical Design Relational Algebra, “Book of Duty” Design (ER) (schema, table names, SQL data types), Physical Design (indices, memory layout, optimizations) https://cs.brown.edu/courses/csci1951-a/res/static/lec02_databases.pdf 19

  2. Entity-Relationship (ER) Model 20

  3. Entity-Relationship (ER) Model ENTITY Attribute RELATIONSHIP 21

  4. Entity-Relationship (ER) Model TWEET PERSON Mentions Timestamp Text Name ID Author CANDIDATE Party Name ID 22

  5. Entity-Relationship (ER) Model TWEET Timestamp Mentions 01/26/2019 03:14:15 SenGillibrand Text Thinkin bout @SenGillibrand. Is Author it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 #maga JoeSchmo #blessed 23

  6. Entity-Relationship (ER) Model Multivalued Attribute TWEET Timestamp Mentions 01/26/2019 03:14:15 SenGillibrand Text Thinkin bout @SenGillibrand. Is Author it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 #maga JoeSchmo #blessed 24

  7. Entity-Relationship (ER) Model Multivalued Attribute TWEET Timestamp Mentions 01/26/2019 03:14:15 DNC, SenGillibrand Text @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 25

  8. Entity-Relationship (ER) Model Composite Attribute Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 26

  9. Clicker Question! 27

  10. Entity-Relationship (ER) Model Clicker Question! Which representation is better? TWEET TWEET Day Time 26 Timestamp Min. Month 14 01 Hour 01/26/2019 03:14:15 Year 03 2019 (a) Composite (b) Normal 28

  11. Entity-Relationship (ER) Model Clicker Question! Which representation is better? TWEET TWEET Day Time 26 Timestamp Min. Month 14 01 Hour 01/26/2019 03:14:15 Year 03 2019 (a) Composite (b) Normal 29

  12. Entity-Relationship (ER) Model ❌❌❌ Find all tweets sent between 2am and 4am that mention democratic primary candidates TWEET Timestamp Mentions 01/26/2019 03:14:15 DNC, SenGillibrand Text @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 30

  13. Entity-Relationship (ER) Model Find all tweets sent between 2am and 4am that ✔✔✔ mention democratic primary candidates Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 31

  14. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 32

  15. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 33

  16. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 34

  17. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 35

  18. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 36

  19. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 37

  20. Entity-Relationship (ER) Model Key Attribute: Designated attribute that uniquely identifies the entry …? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 38

  21. Clicker Question! 39

  22. Entity-Relationship (ER) Model Clicker Question! Is it a good idea to use author +timestamp as a key? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed (a) Yes! (b) No! 40

  23. Entity-Relationship (ER) Model Clicker Question! Is it a good idea to use author +timestamp as a key? Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed (a) Yes! (b) No! 41

  24. Entity-Relationship (ER) Model Key Attribute…? TWEET Day Key Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 42

  25. Entity-Relationship (ER) Model Key Attribute…? 83948547823 Key Day TWEET Time 26 Mentions Min. Month DNC, SenGillibrand 14 01 Hour Text Year 03 @DNC Thinkin bout Author @SenGillibrand. Is it Kristen? 2019 Kirsten? Keirsten? idk so confused help #gillibrand2020 JoeSchmo #maga #blessed 43

  26. Entity-Relationship (ER) Model Relationships PERSON TWEET ID Mentions ID Name Timestamp Text Author CANDIDATE Party ID Name 44

  27. Entity-Relationship (ER) Model Relationships PERSON TWEET ID Mentions ID Name Timestamp Text Author CANDIDATE Party ID Name 45

  28. Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Mentions Text CANDIDATE Party ID Name 46

  29. Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Text Mentioned_in CANDIDATE Party ID Name 47

  30. Entity-Relationship (ER) Model Relationships ID Timestamp TWEET Author Mentions Text CANDIDATE Party ID Name 48

  31. Entity-Relationship (ER) Model Relationships ID Timestamp Direct/ Indirect? TWEET Author Sentiment Mentions Text CANDIDATE Party ID Name 49

  32. Entity-Relationship (ER) Model ID Name Author_of Timestamp TWEET PERSON ID CANDIDATE Text Mentions Party Name Direct/ ID Indirect? Sentiment 50

  33. Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) Mentions Party Name Direct/ ID Indirect? Sentiment 51

  34. Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) a tweet can have any Mentions Party number of mentions, or Name no mentions Direct/ ID Indirect? Sentiment 52

  35. Entity-Relationship (ER) Model cardinalities ID (min, max) Name Author_of Timestamp TWEET PERSON ID (0,N) CANDIDATE Text (1,1) each mention corresponds Mentions Party one and only one Name candidate Direct/ ID Indirect? Sentiment 53

  36. Design Decisions • You can talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 54

  37. Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 55

  38. Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 56

  39. Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 57

  40. Design Decisions • You can’t talk about things that don’t exist! Make sure the representation supports the analysis you want to do. • Should this concept be an entity? Attribute? Relation? • As with most things, there is no good answer • Draft, refine, document, iterate… 58

  41. Before we proceed… Burning Questions? 59

  42. DATABASES FOR DATA SCIENTIST Asking and Logical and Requirement Conceptual Answering Physical Engineering Modeling Questions Modeling (Analysis) Conceptual Logical Design Relational Algebra, “Book of Duty” Design (ER) (schema, table names, SQL data types), Physical Design (indices, memory layout, optimizations) https://cs.brown.edu/courses/csci1951-a/res/static/lec02_databases.pdf 60

  43. Relational Model 61

  44. Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 62

  45. Relational Model Relation TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 63

  46. Relational Model Relation Name TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 64

  47. Relational Model Attribute TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Domain: D = dom(Timestamp) = Valid time strings = ##/##/#### ##:## 65

  48. Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Tuple 66

  49. Relational Model Relation Schema (R) TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 67

  50. Relational Model Relation Schema (R) TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Relation State r(R) 68

  51. Relational Model Intension TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL Extension 69

  52. Find all the tweets by authors named Diane. Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 70

  53. Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 71

  54. Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ∅ TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 72

  55. Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ✔ ∅ TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 73

  56. Find all the tweets by authors named Diane. SELECT * FROM TWEET WHERE Name is “Diane” Relational Model ✔ ∅ “Closed world assumption” TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 74

  57. Find all the tweets which weight less than 45lbs Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 75

  58. Find all the tweets which weight less than 45lbs SELECT * FROM TWEET WHERE Weight < 45 Relational Model TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 76

  59. Find all the tweets which weight less than 45lbs SELECT * FROM TWEET WHERE Weight < 45 Relational Model ❌ ???? TWEET ID Timestamp Author Text Mentions 389472 1/1/19 12:34 Bob hey NULL 123794 1/1/19 12:32 Maria lol {Bob} 596208 1/2/19 1:04 Yu :-D NULL 77

  60. SQL 78

  61. SQL • Data Definition Language (DDL): Defining data types and Relation Schemas (intensions!) • Data Manipulation and Query Language (DML): • Populating/updating data bases (extensions!) • Querying data bases 79

  62. Creating and Manipulating Tables 80

  63. Creating and Manipulating Tables ID Name Author_of Timestamp TWEET PERSON ID CANDIDATE Text Mentions Party Name Direct/ ID Indirect? Sentiment 81

  64. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 82

  65. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 83

  66. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 84

  67. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 85

  68. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 86

  69. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 87

  70. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 88

  71. Data Types • Numeric: INT, FLOAT, REAL, DOUBLE • Character Strings: CHAR(n), VARCHAR(n), CLOB(size) • CLOB(2MB) for large objects e.g. documents/web pages • Bit Strings: BIT(n), BIT VARYING(n), BLOB • BLOB(20MB) e.g. for images • Boolean • Dates: DATE, TIME, TIMESTAMP, TIME WITH TIME ZONE 89

  72. Creating Tables 90

  73. Creating Tables TWEET: <ID, Time, Text> 91

  74. Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text ??? ); 92

  75. Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text ??? ); CHAR(n), VARCHAR(n), CLOB(size) ?? 93

  76. Creating Tables TWEET: <ID, Time, Text> create table TWEET ( ID INT, Time TIMESTAMP, Text VARCHAR(140) ); 94

  77. Creating Tables • CHAR(n): faster -> can use static memory allocation; no length checks in TWEET: <ID, Time, Text> operations, so less overhead • VARCHAR(n): uses less space on create table TWEET ( average ID INT, Time TIMESTAMP, Text VARCHAR(140) ); 95

  78. Creating Tables PERSON: <Handle, Name> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000) ); 96

  79. Creating Tables PERSON: <Handle, Name, ProfilePic> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic ??? ); 97

  80. Creating Tables PERSON: <Handle, Name, ProfilePic> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ); 98

  81. Creating Tables PERSON: <Handle, Name, ProfilePic, ProfilePage> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ProfilePage ??? ); 99

  82. Creating Tables PERSON: <Handle, Name, ProfilePic, ProfilePage> create table PERSON ( Handle VARCHAR(100), Name VARCHAR(1000), ProfilePic BLOB(20MB), ProfilePage CLOB(20MB) ); 100

Recommend


More recommend