Introduction to spreadsheets


  1. Introduction to spreadsheets
     Streamlined Data Ingestion with pandas
     Amany Mahfouz, Instructor

  2. Spreadsheets
     - Also known as Excel files
     - Data stored in tabular form, with cells arranged in rows and columns
     - Unlike flat files, can have formatting and formulas
     - Multiple spreadsheets can exist in a workbook

  3. Loading Spreadsheets
     - Spreadsheets have their own loading function in pandas: read_excel()

  4. Loading Spreadsheets

         import pandas as pd

         # Read the Excel file
         survey_data = pd.read_excel("fcc_survey.xlsx")

         # View the first 5 rows of data
         print(survey_data.head())

             Age  AttendedBootcamp  ...              SchoolMajor  StudentDebtOwe
         0  28.0               0.0  ...                      NaN           20000
         1  22.0               0.0  ...                      NaN             NaN
         2  19.0               0.0  ...                      NaN             NaN
         3  26.0               0.0  ...  Cinematography And Film            7000
         4  20.0               0.0  ...                      NaN             NaN

         [5 rows x 98 columns]

  5. Loading Select Columns and Rows

  6. Loading Select Columns and Rows
     - read_excel() has many keyword arguments in common with read_csv()
       - nrows: limit number of rows to load
       - skiprows: specify number of rows or row numbers to skip
       - usecols: choose columns by name, positional number, or letter (e.g. "A:P")
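
The keyword arguments listed above can be tried without a workbook on hand. The following is a minimal sketch that uses read_csv() on an in-memory string, relying on the slide's point that nrows, skiprows and usecols are shared with read_excel(). The data and column names are invented for illustration. Letter ranges such as "A:P" are a spreadsheet convention and are not shown here.

```python
import io
import pandas as pd

# Hypothetical data: two metadata lines, then a header row and four records.
raw = io.StringIO(
    "Survey export\n"
    "Generated for illustration\n"
    "Age,Income,Country\n"
    "28,32000,US\n"
    "22,15000,US\n"
    "19,48000,CA\n"
    "26,43000,US\n"
)

subset = pd.read_csv(
    raw,
    skiprows=2,                 # skip the two metadata lines above the header
    nrows=3,                    # load only the first three records
    usecols=["Age", "Income"],  # keep these columns, selected by name
)
print(subset)
```

Because skiprows is applied before the header is read, the third line of the input becomes the header row, and nrows then counts data rows only.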

  7. Loading Select Columns and Rows

  8. Loading Select Columns and Rows

         # Read columns W-AB and AR of file, skipping metadata header
         survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                                     skiprows=2,
                                     usecols="W:AB, AR")

         # View data
         print(survey_data.head())

            CommuteTime            CountryCitizen  ...  EmploymentFieldOther    EmploymentStatus   Income
         0         35.0  United States of America  ...                   NaN  Employed for wages  32000.0
         1         90.0  United States of America  ...                   NaN  Employed for wages  15000.0
         2         45.0  United States of America  ...                   NaN  Employed for wages  48000.0
         3         45.0  United States of America  ...                   NaN  Employed for wages  43000.0
         4         10.0  United States of America  ...                   NaN  Employed for wages   6000.0

         [5 rows x 7 columns]

  9. Let's practice!

  10. Getting data from multiple worksheets
      Amany Mahfouz, Instructor

  11. Selecting Sheets to Load
      - read_excel() loads the first sheet in an Excel file by default
      - Use the sheet_name keyword argument to load other sheets
      - Specify spreadsheets by name and/or (zero-indexed) position number
      - Pass a list of names/numbers to load more than one sheet at a time
      - Any arguments passed to read_excel() apply to all sheets read

  12. Selecting Sheets to Load

  13. Loading Select Sheets

          # Get the second sheet by position index
          survey_data_sheet2 = pd.read_excel('fcc_survey.xlsx',
                                             sheet_name=1)

          # Get the second sheet by name
          survey_data_2017 = pd.read_excel('fcc_survey.xlsx',
                                           sheet_name='2017')

          print(survey_data_sheet2.equals(survey_data_2017))

          True
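
Passing a list to sheet_name, and the rule that other arguments apply to every sheet read, can be sketched as follows. This is an illustrative example only: the two-sheet workbook is hypothetical and is built in memory so that no file is needed, and it assumes the openpyxl engine is installed.

```python
import io
import pandas as pd

# Hypothetical two-sheet workbook built in memory (assumes openpyxl is installed).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Age": [28, 22, 19]}).to_excel(writer, sheet_name="2016", index=False)
    pd.DataFrame({"Age": [26, 20, 31]}).to_excel(writer, sheet_name="2017", index=False)
buffer.seek(0)

# Mix a sheet name and a zero-indexed position in one list.
# nrows=2 is applied to every sheet that is read.
sheets = pd.read_excel(buffer, sheet_name=["2016", 1], nrows=2, engine="openpyxl")

# The result is a dictionary keyed by the names/positions that were requested.
for key, frame in sheets.items():
    print(repr(key), frame.shape)
```

When a list is passed, the result is a dictionary of data frames rather than a single data frame, so each sheet is retrieved by the key used to request it.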

  14. Loading All Sheets
      - Passing sheet_name=None to read_excel() reads all sheets in a workbook

          survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=None)
          print(type(survey_responses))

          <class 'collections.OrderedDict'>

          for key, value in survey_responses.items():
              print(key, type(value))

          2016 <class 'pandas.core.frame.DataFrame'>
          2017 <class 'pandas.core.frame.DataFrame'>

  15. Putting It All Together

          # Collect the data frames from all loaded sheets
          frames = []

          # Iterate through data frames in dictionary
          for sheet_name, frame in survey_responses.items():
              # Add a column so we know which year data is from
              frame["Year"] = sheet_name
              frames.append(frame)

          # Combine the frames; pd.concat replaces DataFrame.append,
          # which was removed in pandas 2.0
          all_responses = pd.concat(frames)

          # View years in data
          print(all_responses.Year.unique())

          ['2016' '2017']

  16. Let's practice!

  17. Modifying imports: true/false data
      Amany Mahfouz, Instructor

  18. Boolean Data
      - True/False data

  19. Boolean Data

  20. Boolean Data

  21. Boolean Data

  22. Boolean Data

  23. Boolean Data

  24. Boolean Data

  25. pandas and Booleans

          bootcamp_data = pd.read_excel("fcc_survey_booleans.xlsx")
          print(bootcamp_data.dtypes)

          ID.x                      object
          AttendedBootcamp         float64
          AttendedBootCampYesNo     object
          AttendedBootcampTF       float64
          BootcampLoan             float64
          LoanYesNo                 object
          LoanTF                   float64
          dtype: object

  26. pandas and Booleans

          # Count True values
          print(bootcamp_data.sum())

          AttendedBootcamp      38
          AttendedBootcampTF    38
          BootcampLoan          14
          LoanTF                14
          dtype: object

          # Count NAs
          print(bootcamp_data.isna().sum())

          ID.x                       0
          AttendedBootcamp           0
          AttendedBootCampYesNo      0
          AttendedBootcampTF         0
          BootcampLoan             964
          LoanYesNo                964
          LoanTF                   964
          dtype: int64

  27.

          # Load data, casting True/False columns as Boolean
          bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                                    dtype={"AttendedBootcamp": bool,
                                           "AttendedBootCampYesNo": bool,
                                           "AttendedBootcampTF": bool,
                                           "BootcampLoan": bool,
                                           "LoanYesNo": bool,
                                           "LoanTF": bool})
          print(bool_data.dtypes)

          ID.x                     object
          AttendedBootcamp           bool
          AttendedBootCampYesNo      bool
          AttendedBootcampTF         bool
          BootcampLoan               bool
          LoanYesNo                  bool
          LoanTF                     bool
          dtype: object

  28.

          # Count True values
          print(bool_data.sum())

          AttendedBootcamp           38
          AttendedBootCampYesNo    1000
          AttendedBootcampTF         38
          BootcampLoan              978
          LoanYesNo                1000
          LoanTF                    978
          dtype: object

          # Count NA values
          print(bool_data.isna().sum())

          ID.x                     0
          AttendedBootcamp         0
          AttendedBootCampYesNo    0
          AttendedBootcampTF       0
          BootcampLoan             0
          LoanYesNo                0
          LoanTF                   0
          dtype: int64

  29. pandas and Booleans
      - pandas loads True/False columns as float data by default
      - Specify a column should be bool with read_excel()'s dtype argument
      - Boolean columns can only have True and False values
      - NA/missing values in Boolean columns are changed to True
      - pandas automatically recognizes some values as True/False in Boolean columns
      - Unrecognized values in a Boolean column are also changed to True
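
The effect of casting missing and unrecognized values to True can be reproduced on a small scale. The sketch below uses Series.astype(bool) on hypothetical columns, on the assumption that it mirrors the cast applied when dtype is set to bool. It is an illustration of the behavior described in the bullets, not a description of how read_excel() works internally. The same arithmetic appears in the survey output: 14 True values plus 964 missing values gives the 978 reported for BootcampLoan after the cast.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: a 1.0/0.0 column with one missing value,
# and a Yes/No text column stored as Python objects.
loan = pd.Series([1.0, 0.0, np.nan, 0.0])
loan_yes_no = pd.Series(["Yes", "No", "No", "No"], dtype=object)

# As floats, sum() counts the 1.0 values and skips the missing one.
print(loan.sum(), loan.isna().sum())

# After casting to bool, NaN is truthy, so the missing value becomes True
# and no longer shows up as NA.
loan_bool = loan.astype(bool)
print(loan_bool.sum(), loan_bool.isna().sum())

# Every non-empty string is truthy, so "No" also becomes True.
print(loan_yes_no.astype(bool).sum())
```

Checking isna().sum() before casting is one way to see how many True values in the result would come from missing data rather than from real responses.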

  30. Setting Custom True/False Values
      - Use read_excel()'s true_values argument to set custom True values
      - Use false_values to set custom False values
      - Each takes a list of values to treat as True/False, respectively
      - Custom True/False values are only applied to columns set as Boolean

  31. Setting Custom True/False Values

          # Load data with Boolean dtypes and custom T/F values
          bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                                    dtype={"AttendedBootcamp": bool,
                                           "AttendedBootCampYesNo": bool,
                                           "AttendedBootcampTF": bool,
                                           "BootcampLoan": bool,
                                           "LoanYesNo": bool,
                                           "LoanTF": bool},
                                    true_values=["Yes"],
                                    false_values=["No"])

  32. Setting Custom True/False Values

          print(bool_data.sum())

          AttendedBootcamp          38
          AttendedBootCampYesNo     38
          AttendedBootcampTF        38
          BootcampLoan             978
          LoanYesNo                978
          LoanTF                   978
          dtype: object
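
For a version that runs without the survey workbook, the same true_values and false_values arguments can be tried with read_csv() on an in-memory string. This is a minimal sketch with invented data. It uses read_csv() as a stand-in because it accepts the same two arguments, and the handling of missing values may differ between the two readers.

```python
import io
import pandas as pd

# Hypothetical data with a Yes/No column.
raw = io.StringIO(
    "ID,LoanYesNo\n"
    "a,Yes\n"
    "b,No\n"
    "c,No\n"
)

bool_data = pd.read_csv(
    raw,
    dtype={"LoanYesNo": bool},  # mark the column as Boolean
    true_values=["Yes"],        # strings to read as True
    false_values=["No"],        # strings to read as False
)
print(bool_data.dtypes)
print(bool_data["LoanYesNo"].sum())
```

With the custom values in place, "No" is read as False instead of being cast to True, so the sum counts only the "Yes" rows.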
