Introduction to spreadsheets
STREAMLINED DATA INGESTION WITH PANDAS
Amany Mahfouz, Instructor
Spreadsheets
- Also known as Excel files
- Data stored in tabular form, with cells arranged in rows and columns
- Unlike flat files, can have formatting and formulas
- Multiple spreadsheets can exist in a workbook
Loading Spreadsheets
- Spreadsheets have their own loading function in pandas: read_excel()
Loading Spreadsheets

    import pandas as pd

    # Read the Excel file
    survey_data = pd.read_excel("fcc_survey.xlsx")

    # View the first 5 lines of data
    print(survey_data.head())

        Age  AttendedBootcamp  ...              SchoolMajor  StudentDebtOwe
    0  28.0               0.0  ...                      NaN           20000
    1  22.0               0.0  ...                      NaN             NaN
    2  19.0               0.0  ...                      NaN             NaN
    3  26.0               0.0  ...  Cinematography And Film            7000
    4  20.0               0.0  ...                      NaN             NaN

    [5 rows x 98 columns]
Loading Select Columns and Rows
- read_excel() has many keyword arguments in common with read_csv()
- nrows: limit number of rows to load
- skiprows: specify number of rows or row numbers to skip
- usecols: choose columns by name, positional number, or letter (e.g. "A:P")
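The three keyword arguments listed above can be tried on a small throwaway workbook. This is a minimal sketch, not part of the course material: the toy data, the file name toy_survey.xlsx, and the variable names are hypothetical, and writing .xlsx files assumes the openpyxl package is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for a real survey workbook
toy = pd.DataFrame({
    "Age": [28, 22, 19, 26],
    "Income": [32000, 15000, 48000, 43000],
    "Major": ["Art", "Math", "Film", "Law"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_survey.xlsx")
    toy.to_excel(path, index=False)  # assumes openpyxl is available

    # nrows: load only the first two data rows
    first_two = pd.read_excel(path, nrows=2)

    # usecols: pick columns by name ...
    by_name = pd.read_excel(path, usecols=["Age", "Major"])

    # ... or by Excel column letters
    by_letter = pd.read_excel(path, usecols="A:B")

    # skiprows: a list skips specific rows (0 is the header row,
    # so 1 is the first data row)
    skipped = pd.read_excel(path, skiprows=[1])

print(first_two.shape)
print(list(by_name.columns))
print(list(by_letter.columns))
print(skipped["Age"].tolist())
```

Passing an integer to skiprows instead of a list skips that many rows from the top of the sheet, which is the form used when a workbook starts with metadata lines above the real header.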
Loading Select Columns and Rows

    # Read columns W-AB and AR of file, skipping metadata header
    survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                                skiprows=2,
                                usecols="W:AB, AR")

    # View data
    print(survey_data.head())

       CommuteTime            CountryCitizen  ...  EmploymentFieldOther    EmploymentStatus   Income
    0         35.0  United States of America  ...                   NaN  Employed for wages  32000.0
    1         90.0  United States of America  ...                   NaN  Employed for wages  15000.0
    2         45.0  United States of America  ...                   NaN  Employed for wages  48000.0
    3         45.0  United States of America  ...                   NaN  Employed for wages  43000.0
    4         10.0  United States of America  ...                   NaN  Employed for wages   6000.0

    [5 rows x 7 columns]
Let's practice!
Getting data from multiple worksheets
Selecting Sheets to Load
- read_excel() loads the first sheet in an Excel file by default
- Use the sheet_name keyword argument to load other sheets
- Specify spreadsheets by name and/or (zero-indexed) position number
- Pass a list of names/numbers to load more than one sheet at a time
- Any arguments passed to read_excel() apply to all sheets read
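The points above about lists of sheets and shared arguments can be checked on a two-sheet toy workbook. This is a sketch under stated assumptions: the workbook, its contents, and the variable names are hypothetical, and writing the file assumes openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_workbook.xlsx")

    # Build a hypothetical workbook with two sheets named after years
    with pd.ExcelWriter(path) as writer:
        pd.DataFrame({"Age": [28, 22]}).to_excel(
            writer, sheet_name="2016", index=False)
        pd.DataFrame({"Age": [19, 26, 20]}).to_excel(
            writer, sheet_name="2017", index=False)

    # No sheet_name: only the first sheet is loaded
    default_sheet = pd.read_excel(path)

    # A list may mix positions and names; the result is a dictionary
    # keyed by whatever was passed in, and nrows applies to every sheet
    mixed = pd.read_excel(path, sheet_name=[0, "2017"], nrows=1)

print(default_sheet["Age"].tolist())
for key, frame in mixed.items():
    print(repr(key), len(frame))
```

Because the dictionary keys mirror the values passed to sheet_name, the first sheet is retrieved here with mixed[0] rather than mixed["2016"].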
Loading Select Sheets

    # Get the second sheet by position index
    survey_data_sheet2 = pd.read_excel('fcc_survey.xlsx',
                                       sheet_name=1)

    # Get the second sheet by name
    survey_data_2017 = pd.read_excel('fcc_survey.xlsx',
                                     sheet_name='2017')

    print(survey_data_sheet2.equals(survey_data_2017))

    True
Loading All Sheets
- Passing sheet_name=None to read_excel() reads all sheets in a workbook

    survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=None)

    print(type(survey_responses))

    <class 'collections.OrderedDict'>

    for key, value in survey_responses.items():
        print(key, type(value))

    2016 <class 'pandas.core.frame.DataFrame'>
    2017 <class 'pandas.core.frame.DataFrame'>
Putting It All Together

    # Create empty list to hold all loaded sheets
    frames = []

    # Iterate through data frames in dictionary
    for sheet_name, frame in survey_responses.items():
        # Add a column so we know which year data is from
        frame["Year"] = sheet_name

        # Collect each data frame
        # (DataFrame.append was removed in pandas 2.0)
        frames.append(frame)

    # Combine the collected data frames into all_responses
    all_responses = pd.concat(frames)

    # View years in data
    print(all_responses.Year.unique())

    ['2016' '2017']
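As an alternative sketch of the same combining step, pd.concat() accepts a dictionary of data frames directly and can turn the keys into a column, which removes the explicit loop. The dictionary below is a hypothetical stand-in for the result of reading a workbook with sheet_name=None, so no Excel file is needed to run it.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by sheet_name=None
survey_responses = {
    "2016": pd.DataFrame({"Age": [28, 22]}),
    "2017": pd.DataFrame({"Age": [19, 26, 20]}),
}

all_responses = (
    # Dictionary keys become the outer index level, named "Year"
    pd.concat(survey_responses, names=["Year"])
    # Move the "Year" level out of the index into a regular column
    .reset_index(level="Year")
    # Replace the leftover per-sheet row labels with a clean 0..n-1 index
    .reset_index(drop=True)
)

# View years in data
print(all_responses.Year.unique())
```

Unlike the loop version, this does not add a Year column to the individual frames in the dictionary; only the combined data frame carries it.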
Let's practice!
Modifying imports: true/false data
Boolean Data
- True/False data
pandas and Booleans

    bootcamp_data = pd.read_excel("fcc_survey_booleans.xlsx")
    print(bootcamp_data.dtypes)

    ID.x                      object
    AttendedBootcamp         float64
    AttendedBootCampYesNo     object
    AttendedBootcampTF       float64
    BootcampLoan             float64
    LoanYesNo                 object
    LoanTF                   float64
    dtype: object
pandas and Booleans

    # Count True values
    print(bootcamp_data.sum())

    AttendedBootcamp      38
    AttendedBootcampTF    38
    BootcampLoan          14
    LoanTF                14
    dtype: object

    # Count NAs
    print(bootcamp_data.isna().sum())

    ID.x                       0
    AttendedBootcamp           0
    AttendedBootCampYesNo      0
    AttendedBootcampTF         0
    BootcampLoan             964
    LoanYesNo                964
    LoanTF                   964
    dtype: int64
    # Load data, casting True/False columns as Boolean
    bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                              dtype={"AttendedBootcamp": bool,
                                     "AttendedBootCampYesNo": bool,
                                     "AttendedBootcampTF": bool,
                                     "BootcampLoan": bool,
                                     "LoanYesNo": bool,
                                     "LoanTF": bool})
    print(bool_data.dtypes)

    ID.x                     object
    AttendedBootcamp           bool
    AttendedBootCampYesNo      bool
    AttendedBootcampTF         bool
    BootcampLoan               bool
    LoanYesNo                  bool
    LoanTF                     bool
    dtype: object
    # Count True values
    print(bool_data.sum())

    AttendedBootcamp           38
    AttendedBootCampYesNo    1000
    AttendedBootcampTF         38
    BootcampLoan              978
    LoanYesNo                1000
    LoanTF                    978
    dtype: object

    # Count NA values
    print(bool_data.isna().sum())

    ID.x                     0
    AttendedBootcamp         0
    AttendedBootCampYesNo    0
    AttendedBootcampTF       0
    BootcampLoan             0
    LoanYesNo                0
    LoanTF                   0
    dtype: int64
pandas and Booleans
- pandas loads True/False columns as float data by default
- Specify a column should be bool with read_excel()'s dtype argument
- Boolean columns can only have True and False values
- NA/missing values in Boolean columns are changed to True
- pandas automatically recognizes some values as True/False in Boolean columns
- Unrecognized values in a Boolean column are also changed to True
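The last few points above follow from how values are cast to bool. The sketch below illustrates that casting rule on small hand-made Series with astype(bool); the data is hypothetical, and treating this as equivalent to what read_excel() does with dtype=bool is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# 0/1 codes with a missing value, as a float column
coded = pd.Series([1.0, 0.0, np.nan])

# NaN is not zero, so it is cast to True
coded_as_bool = coded.astype(bool)
print(coded_as_bool.tolist())   # [True, False, True]

# Text labels with a missing value, stored as generic Python objects
labels = pd.Series(["Yes", "No", np.nan], dtype=object)

# Any non-empty string is truthy, so "No" also becomes True
labels_as_bool = labels.astype(bool)
print(labels_as_bool.tolist())  # [True, True, True]
```

This is consistent with the counts of 1000 True values reported for the Yes/No columns once they are loaded as bool: without custom true and false values, every label and every blank ends up True.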
Setting Custom True/False Values
- Use read_excel()'s true_values argument to set custom True values
- Use false_values to set custom False values
- Each takes a list of values to treat as True/False, respectively
- Custom True/False values are only applied to columns set as Boolean
Setting Custom True/False Values

    # Load data with Boolean dtypes and custom T/F values
    bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                              dtype={"AttendedBootcamp": bool,
                                     "AttendedBootCampYesNo": bool,
                                     "AttendedBootcampTF": bool,
                                     "BootcampLoan": bool,
                                     "LoanYesNo": bool,
                                     "LoanTF": bool},
                              true_values=["Yes"],
                              false_values=["No"])
Setting Custom True/False Values

    print(bool_data.sum())

    AttendedBootcamp          38
    AttendedBootCampYesNo     38
    AttendedBootcampTF        38
    BootcampLoan             978
    LoanYesNo                978
    LoanTF                   978
    dtype: object