Introduction to databases
STREAMLINED DATA INGESTION WITH PANDAS
Amany Mahfouz, Instructor
Relational Databases
- Data about entities is organized into tables
- Each row or record is an instance of an entity
- Each column has information about an attribute
- Tables can be linked to each other via unique keys
- Support more data, multiple simultaneous users, and data quality controls
- Data types are specified for each column
- SQL (Structured Query Language) to interact with databases
Common Relational Databases
- SQLite databases are computer files
Connecting to Databases
Two-step process:
1. Create a way to connect to the database
2. Query the database
Creating a Database Engine
- sqlalchemy's create_engine() makes an engine to handle database connections
  - Needs string URL of database to connect to
  - SQLite URL format: sqlite:///filename.db
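A minimal sketch of the engine step, assuming SQLAlchemy 1.4 or later is installed. It swaps the file URL for `sqlite://`, which asks SQLite for a temporary in-memory database, so nothing needs to exist on disk; the `SELECT 1` check is only there to show the engine can open a connection.

```python
from sqlalchemy import create_engine, text

# "sqlite://" with no file path requests an in-memory database
# (an assumption for this sketch; the slides use sqlite:///data.db)
engine = create_engine("sqlite://")

# The engine connects lazily, so open a connection to confirm the URL works
with engine.connect() as conn:
    answer = conn.execute(text("SELECT 1")).scalar()

print(engine.dialect.name)  # which database backend the engine talks to
print(answer)
```

For a database file, the same call would take the `sqlite:///filename.db` form described above.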
Querying Databases
- pd.read_sql(query, engine) to load in data from a database
- Arguments
  - query: String containing SQL query to run or table to load
  - engine: Connection/database engine object
SQL Review: SELECT
- Used to query data from a database
- Basic syntax:
    SELECT [column_names] FROM [table_name];
- To get all data in a table:
    SELECT * FROM [table_name];
- Code style: keywords in ALL CAPS, semicolon (;) to end a statement
Getting Data from a Database
    # Load pandas and sqlalchemy's create_engine
    import pandas as pd
    from sqlalchemy import create_engine

    # Create database engine to manage connections
    engine = create_engine("sqlite:///data.db")

    # Load entire weather table by table name
    weather = pd.read_sql("weather", engine)
    # Create database engine to manage connections
    engine = create_engine("sqlite:///data.db")

    # Load entire weather table with SQL
    weather = pd.read_sql("SELECT * FROM weather", engine)

    print(weather.head())

           station                         name  latitude  ...  prcp  snow  tavg  tmax  tmin
    0  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          52    42
    1  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    39
    2  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    42
    3  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          51    40
    4  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.75   0.0          61    50

    [5 rows x 13 columns]
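The two loading styles shown so far (by table name and by SQL query) can be tried without a `data.db` file. The sketch below assumes a recent pandas paired with a compatible SQLAlchemy, uses an in-memory SQLite database, and fills a `weather` table with three made-up rows; the column subset and values are illustrative only.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for data.db (assumption for this sketch)
engine = create_engine("sqlite://")

# Toy weather rows; the values are invented for illustration
toy = pd.DataFrame({
    "date": ["12/01/2017", "12/02/2017", "12/03/2017"],
    "tmax": [52, 48, 48],
    "tmin": [42, 39, 42],
})
toy.to_sql("weather", engine, index=False)

# Load the whole table by name, then with an equivalent SQL query
by_name = pd.read_sql("weather", engine)
by_query = pd.read_sql("SELECT * FROM weather", engine)

print(by_name.shape)
print(by_query.head())
```

Both calls return every row and column of the table; the query form is the one that can later be narrowed with column lists and conditions.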
Let's practice!
Refining imports with SQL queries
SELECTing Columns
    SELECT [column_names] FROM [table_name];
Example:
    SELECT date, tavg
      FROM weather;
WHERE Clauses
- Use a WHERE clause to selectively import records
    SELECT [column_names]
      FROM [table_name]
     WHERE [condition];
Filtering by Numbers
- Compare numbers with mathematical operators
  - =
  - > and >=
  - < and <=
  - <> (not equal to)
- Example:
    SELECT *
      FROM weather
     WHERE tmax > 32;
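The numeric operators above can be checked on a toy table. This sketch assumes pandas is installed and swaps the SQLAlchemy engine for a plain `sqlite3` in-memory connection, which `pd.read_sql` also accepts; the four rows are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; a sqlite3 connection stands in for the engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("01/01/2018", 19), ("01/06/2018", 13),
     ("01/12/2018", 61), ("01/20/2018", 50)],  # invented values
)

# Keep only days whose high temperature is above 32
above_freezing = pd.read_sql("SELECT * FROM weather WHERE tmax > 32;", conn)

# <> keeps every row whose value differs from the one given
not_61 = pd.read_sql("SELECT * FROM weather WHERE tmax <> 61;", conn)

print(above_freezing)
print(len(not_61))
```

Filtering in the query means only the matching rows are ever loaded into the data frame.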
Filtering Text
- Match exact strings with the = sign and the text to match
- String matching is case-sensitive
- Example:
    /* Get records about incidents in Brooklyn */
    SELECT *
      FROM hpd311calls
     WHERE borough = 'BROOKLYN';
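To see the case sensitivity in practice, here is a small sketch on an in-memory SQLite table (SQLite compares text with `=` byte for byte by default). The table holds only two of the real columns, and the rows, including the mixed-case `Brooklyn`, are invented for the demonstration.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("BROOKLYN", "PLUMBING"),
     ("Brooklyn", "HEAT/HOT WATER"),  # same borough, different capitalization
     ("BRONX", "PLUMBING")],
)

# Each query matches only the rows spelled exactly like its string literal
upper = pd.read_sql(
    "SELECT * FROM hpd311calls WHERE borough = 'BROOKLYN';", conn)
mixed = pd.read_sql(
    "SELECT * FROM hpd311calls WHERE borough = 'Brooklyn';", conn)

print(upper)
print(mixed)
```

Checking the distinct spellings in a column before filtering helps avoid silently missing records.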
SQL and pandas
    # Load libraries
    import pandas as pd
    from sqlalchemy import create_engine

    # Create database engine
    engine = create_engine("sqlite:///data.db")

    # Write query to get records from Brooklyn
    query = """SELECT *
                 FROM hpd311calls
                WHERE borough = 'BROOKLYN';"""

    # Query the database
    brooklyn_calls = pd.read_sql(query, engine)

    # Confirm that only Brooklyn records were loaded
    print(brooklyn_calls.borough.unique())

    ['BROOKLYN']
Combining Conditions: AND
- WHERE clauses with AND return records that meet all conditions
    # Write query to get records about plumbing in the Bronx
    and_query = """SELECT *
                     FROM hpd311calls
                    WHERE borough = 'BRONX'
                      AND complaint_type = 'PLUMBING';"""

    # Get calls about plumbing issues in the Bronx
    bx_plumbing_calls = pd.read_sql(and_query, engine)

    # Check record count
    print(bx_plumbing_calls.shape)

    (2016, 8)
Combining Conditions: OR
- WHERE clauses with OR return records that meet at least one condition
    # Write query to get records about water leaks or plumbing
    or_query = """SELECT *
                    FROM hpd311calls
                   WHERE complaint_type = 'WATER LEAK'
                      OR complaint_type = 'PLUMBING';"""

    # Get calls that are about plumbing or water leaks
    leaks_or_plumbing = pd.read_sql(or_query, engine)

    # Check record count
    print(leaks_or_plumbing.shape)

    (10684, 8)
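A side-by-side sketch of AND and OR on the same toy data may make the difference concrete. It assumes pandas is installed, uses an in-memory `sqlite3` connection instead of the engine, and the five rows are made up, so the counts differ from the real database.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("BRONX", "PLUMBING"),
     ("BRONX", "WATER LEAK"),
     ("BRONX", "HEAT/HOT WATER"),
     ("BROOKLYN", "PLUMBING"),
     ("QUEENS", "HEAT/HOT WATER")],  # invented rows
)

# AND: a row must satisfy both conditions
and_calls = pd.read_sql(
    """SELECT *
         FROM hpd311calls
        WHERE borough = 'BRONX'
          AND complaint_type = 'PLUMBING';""", conn)

# OR: a row needs to satisfy only one of the conditions
or_calls = pd.read_sql(
    """SELECT *
         FROM hpd311calls
        WHERE complaint_type = 'WATER LEAK'
           OR complaint_type = 'PLUMBING';""", conn)

print(and_calls.shape)
print(or_calls.shape)
```

AND narrows the result to a single row here, while OR widens it to three.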
Let's practice!
More complex SQL queries
Getting DISTINCT Values
- Get unique values for one or more columns with SELECT DISTINCT
- Syntax:
    SELECT DISTINCT [column_names] FROM [table_name];
- Remove duplicate records:
    SELECT DISTINCT * FROM [table_name];

    /* Get unique street addresses and boroughs */
    SELECT DISTINCT incident_address,
           borough
      FROM hpd311calls;
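The sketch below illustrates both uses of DISTINCT on a toy table with one exact duplicate row. It assumes pandas is installed and uses an in-memory `sqlite3` connection; the addresses and rows are invented.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hpd311calls "
    "(incident_address TEXT, borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?, ?)",
    [("10 MAIN ST", "BRONX", "PLUMBING"),
     ("10 MAIN ST", "BRONX", "PLUMBING"),        # exact duplicate record
     ("10 MAIN ST", "BRONX", "HEAT/HOT WATER"),  # same address, new complaint
     ("5 ELM AVE", "QUEENS", "PLUMBING")],
)

# Unique combinations of the two listed columns
addresses = pd.read_sql(
    "SELECT DISTINCT incident_address, borough FROM hpd311calls;", conn)

# Unique whole records: only the exact duplicate is dropped
deduped = pd.read_sql("SELECT DISTINCT * FROM hpd311calls;", conn)

print(len(addresses))
print(len(deduped))
```

With several columns listed, DISTINCT applies to the combination of values, not to each column separately.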
Aggregate Functions
- Query a database directly for descriptive statistics
- Aggregate functions
  - SUM
  - AVG
  - MAX
  - MIN
  - COUNT
Aggregate Functions
- SUM, AVG, MAX, MIN
  - Each takes a single column name
    SELECT AVG(tmax) FROM weather;
- COUNT
  - Get number of rows that meet query conditions
    SELECT COUNT(*) FROM [table_name];
  - Get number of unique values in a column
    SELECT COUNT(DISTINCT [column_names]) FROM [table_name];
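As a rough illustration, the aggregate functions can all be run in one query and loaded as a one-row data frame. The sketch assumes pandas is installed and uses an in-memory `sqlite3` connection; the station labels and temperatures are made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (station TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("A", 19), ("A", 61), ("B", 50), ("A", 13)],  # invented values
)

# One summary row: each aggregate collapses the whole table to a single value
stats = pd.read_sql(
    """SELECT SUM(tmax),
              AVG(tmax),
              MAX(tmax),
              MIN(tmax),
              COUNT(*),
              COUNT(DISTINCT station)
         FROM weather;""", conn)

print(stats)
```

The database does the arithmetic, so only the summary row travels into pandas rather than the full table.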
GROUP BY
- Aggregate functions calculate a single summary statistic by default
- Summarize data by categories with GROUP BY statements
- Remember to also select the column you're grouping by!

    /* Get counts of plumbing calls by borough */
    SELECT borough,
           COUNT(*)
      FROM hpd311calls
     WHERE complaint_type = 'PLUMBING'
     GROUP BY borough;
Counting by Groups
    # Create database engine
    engine = create_engine("sqlite:///data.db")

    # Write query to get plumbing call counts by borough
    query = """SELECT borough, COUNT(*)
                 FROM hpd311calls
                WHERE complaint_type = 'PLUMBING'
                GROUP BY borough;"""

    # Query database and create data frame
    plumbing_call_counts = pd.read_sql(query, engine)
Counting by Groups
    print(plumbing_call_counts)

             borough  COUNT(*)
    0          BRONX      2016
    1       BROOKLYN      2702
    2      MANHATTAN      1413
    3         QUEENS       808
    4  STATEN ISLAND       178
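A runnable sketch of the same grouped count on toy data, assuming pandas is installed and using an in-memory `sqlite3` connection in place of the engine. The rows are invented, so the counts are not those of the real database.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("BRONX", "PLUMBING"),
     ("BRONX", "PLUMBING"),
     ("BRONX", "HEAT/HOT WATER"),
     ("BROOKLYN", "PLUMBING"),
     ("QUEENS", "WATER LEAK")],  # invented rows
)

# WHERE filters rows first, then GROUP BY counts what is left per borough
counts = pd.read_sql(
    """SELECT borough, COUNT(*)
         FROM hpd311calls
        WHERE complaint_type = 'PLUMBING'
        GROUP BY borough;""", conn)

print(counts)
```

Queens has no plumbing calls in this toy table, so it does not appear in the result at all: groups with no matching rows are not reported as zero.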
Let's practice!
Loading multiple tables with joins
Keys
- Database records have unique identifiers, or keys
Joining Tables
    SELECT *
      FROM hpd311calls
           JOIN weather
           ON hpd311calls.created_date = weather.date;
- Use dot notation (table.column) when working with multiple tables
- Default join only returns records whose key values appear in both tables
- Make sure join keys are the same data type or nothing will match
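The default join behavior can be sketched with two small tables. This example assumes pandas is installed, uses an in-memory `sqlite3` connection, and stores both join keys as text in the same format; the tables keep only a couple of columns each and the rows are made up.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("01/01/2018", "HEAT/HOT WATER"),
     ("01/02/2018", "PLUMBING"),
     ("01/09/2018", "HEAT/HOT WATER")],  # no weather row for this date
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("01/01/2018", 19), ("01/02/2018", 26), ("01/03/2018", 30)],
)

# Default (inner) join: keep only dates present in both tables
joined = pd.read_sql(
    """SELECT *
         FROM hpd311calls
              JOIN weather
              ON hpd311calls.created_date = weather.date;""", conn)

print(joined)
```

The call from 01/09/2018 and the weather record for 01/03/2018 both drop out because neither key has a partner in the other table.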
Joining and Filtering
    /* Get only heat/hot water calls and join in weather data */
    SELECT *
      FROM hpd311calls
           JOIN weather
           ON hpd311calls.created_date = weather.date
     WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER';
Joining and Aggregating
    /* Get call counts by borough */
    SELECT hpd311calls.borough,
           COUNT(*)
      FROM hpd311calls
     GROUP BY hpd311calls.borough;
Joining and Aggregating
    /* Get call counts by borough
       and join in population and housing counts */
    SELECT hpd311calls.borough,
           COUNT(*),
           boro_census.total_population,
           boro_census.housing_units
      FROM hpd311calls
           /* Assumes boro_census has a matching borough column */
           JOIN boro_census
           ON hpd311calls.borough = boro_census.borough
     GROUP BY hpd311calls.borough;
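A sketch of counting calls per borough while pulling in census columns, run from pandas. It assumes `boro_census` can be joined on a `borough` column (the join key is an assumption, not confirmed by the slides), uses an in-memory `sqlite3` connection, and fills both tables with made-up numbers.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
# Hypothetical layout: one row per borough, keyed by a borough column
conn.execute(
    "CREATE TABLE boro_census "
    "(borough TEXT, total_population INTEGER, housing_units INTEGER)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("BRONX", "PLUMBING"), ("BRONX", "HEAT/HOT WATER"), ("QUEENS", "PLUMBING")],
)
conn.executemany(
    "INSERT INTO boro_census VALUES (?, ?, ?)",
    [("BRONX", 1000, 400), ("QUEENS", 2000, 800)],  # invented figures
)

# Join each call to its borough's census row, then count calls per borough
query = """SELECT hpd311calls.borough,
                  COUNT(*),
                  boro_census.total_population,
                  boro_census.housing_units
             FROM hpd311calls
                  JOIN boro_census
                  ON hpd311calls.borough = boro_census.borough
            GROUP BY hpd311calls.borough;"""
call_counts = pd.read_sql(query, conn)

print(call_counts)
```

Because each borough has a single census row, the population and housing values are constant within a group and can sit next to the count. SQLite accepts such ungrouped columns; stricter databases may require listing them in GROUP BY as well.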