Selecting data in pandas
Python for R Users
Daniel Chen, Instructor

Manually create DataFrame

    import pandas as pd

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]},
        index=['x', 'y', 'z'])
    print(df)

       A  B  C
    x  1  4  7
    y  2  5  8
    z  3  6  9

    df = pd.DataFrame({
        'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]},
        index=['x', 'y', 'z'])
    df

       A  B  C
    x  1  4  7
    y  2  5  8
    z  3  6  9

    df['A']

    x    1
    y    2
    z    3
    Name: A, dtype: int64

    df.A

    x    1
    y    2
    z    3
    Name: A, dtype: int64

    df[['A', 'B']]

       A  B
    x  1  4
    y  2  5
    z  3  6

Subsetting rows

- Row label (loc) vs. row index (iloc)
- Python starts counting from 0

Subsetting rows: .iloc

    df

       A  B  C
    x  1  4  7
    y  2  5  8
    z  3  6  9

    df.iloc[0]

    A    1
    B    4
    C    7
    Name: x, dtype: int64

    df.iloc[0, :]

    A    1
    B    4
    C    7
    Name: x, dtype: int64

    df.iloc[[0, 1], :]

       A  B  C
    x  1  4  7
    y  2  5  8

Subsetting rows: .loc

    df

       A  B  C
    x  1  4  7
    y  2  5  8
    z  3  6  9

    df.loc['x']

    A    1
    B    4
    C    7
    Name: x, dtype: int64

    df.loc[['x', 'y']]

       A  B  C
    x  1  4  7
    y  2  5  8

    df

       A  B  C
    x  1  4  7
    y  2  5  8
    z  3  6  9

    df.loc['x', 'A']

    1

    df.loc[['x', 'y'], ['A', 'B']]

       A  B
    x  1  4
    y  2  5
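One difference between the two indexers that the slides do not show is how they treat slices. The sketch below reuses the same example DataFrame; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['x', 'y', 'z'],
)

# Label slices with .loc include BOTH endpoints ('y' and 'B' are kept).
by_label = df.loc['x':'y', 'A':'B']

# Position slices with .iloc exclude the stop position, like Python lists.
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position.equals(by_label))  # True
```

For R users, .iloc[0:2] is closest to df[1:2, ] in R: the start shifts down by one and the stop is exclusive.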

Conditional subsetting

    df[df.A == 3]

       A  B  C
    z  3  6  9

    df[(df.A == 3) | (df.B == 4)]

       A  B  C
    x  1  4  7
    z  3  6  9
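The slide uses | for "or". A minimal sketch of the companion operators, & for "and" and ~ for "not", on the same example DataFrame (the result names both and not_three are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['x', 'y', 'z'],
)

# & is element-wise AND; each condition needs its own parentheses
# because & and | bind more tightly than comparison operators.
both = df[(df.A >= 2) & (df.C < 9)]

# ~ negates a boolean mask.
not_three = df[~(df.A == 3)]

print(both)       # only row y
print(not_three)  # rows x and y
```

Python's own `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbol operators are used here.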

Attributes

    df.shape

    (3, 3)

    df.shape()

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-17-0e566b70f572> in <module>()
    ----> 1 df.shape()

    TypeError: 'tuple' object is not callable
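The error above comes from calling an attribute as if it were a method. A small sketch of the distinction, using the same example DataFrame; callable() is just a convenient way to show the difference.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
    index=['x', 'y', 'z'],
)

# Attributes are stored values: access them without parentheses.
n_rows, n_cols = df.shape

# Methods are functions attached to the object: call them with parentheses.
first_rows = df.head(2)

print(n_rows, n_cols)       # 3 3
print(callable(df.shape))   # False: it is a plain tuple
print(callable(df.head))    # True: it is a method
```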

Let's practice!

Data types

R

    df <- data.frame(
        'A' = c(1, 2, 3),
        'B' = c(4, 5, 6)
    )
    df

      A B
    1 1 4
    2 2 5
    3 3 6

    class(df)

    "data.frame"

Python

    import pandas as pd
    df = pd.DataFrame(
        {'A': [1, 2, 3],
         'B': [4, 5, 6]})
    df

       A  B
    0  1  4
    1  2  5
    2  3  6

    type(df)

    pandas.core.frame.DataFrame

R

    str(df)

    'data.frame': 3 obs. of 2 variables:
     $ A: num 1 2 3
     $ B: num 4 5 6

Python

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 2 columns):
    A    3 non-null int64
    B    3 non-null int64
    dtypes: int64(2)
    memory usage: 128.0 bytes

R

    df$A <- as.character(df$A)
    str(df)

    'data.frame': 3 obs. of 2 variables:
     $ A: chr "1" "2" "3"
     $ B: num 4 5 6

Python

    df['A'] = df['A'].astype(str)
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 2 columns):
    A    3 non-null object
    B    3 non-null int64
    dtypes: int64(1), object(1)
    memory usage: 128.0+ bytes

String objects

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 2 columns):
    A    3 non-null object
    B    3 non-null int64
    dtypes: int64(1), object(1)
    memory usage: 128.0+ bytes

- A dtype of "object" usually means the column holds strings, although an object column can hold any Python objects
- Access built-in string methods with the str accessor

String accessor

    df = pd.DataFrame({'name': ['Daniel ', ' Eric', ' Julia ']})
    df

          name
    0  Daniel 
    1     Eric
    2   Julia 

    df['name_strip'] = df['name'].str.strip()
    df

          name name_strip
    0  Daniel      Daniel
    1     Eric       Eric
    2   Julia       Julia
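The str accessor exposes many more vectorized string methods than strip. A short sketch with a few of them on the same names; the new column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Daniel ', ' Eric', ' Julia ']})

# Strip the surrounding whitespace first, then chain further string methods.
clean = df['name'].str.strip()

df['name_upper'] = clean.str.upper()       # upper-case each value
df['name_len'] = clean.str.len()           # number of characters
df['has_an'] = clean.str.contains('an')    # substring test per row

print(df[['name_upper', 'name_len', 'has_an']])
```

Each method returns a new Series aligned with the original index, so the results can be assigned straight back as columns.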

Category

    df = pd.DataFrame({'name': ['Daniel', 'Eric', 'Julia'],
                       'gender': ['Male', 'Male', 'Female']})
    df.dtypes

    gender    object
    name      object
    dtype: object

    df['gender_cat'] = df['gender'].astype('category')
    df.dtypes

    gender          object
    name            object
    gender_cat    category
    dtype: object

Category accessor

    df['gender_cat'].cat.categories

    Index(['Female', 'Male'], dtype='object')

    df.gender_cat.cat.codes

    0    1
    1    1
    2    0
    dtype: int8
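Categories and codes play the same roles as the levels and integer codes of an R factor, except that pandas codes start at 0 rather than 1. A sketch, assuming the same example data, that maps the codes back to their labels (decoded is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Daniel', 'Eric', 'Julia'],
                   'gender': ['Male', 'Male', 'Female']})
gender_cat = df['gender'].astype('category')

categories = gender_cat.cat.categories   # the distinct labels, sorted
codes = gender_cat.cat.codes             # 0-based position of each row's label

# Indexing the categories by the codes recovers the original values.
decoded = categories[codes.to_numpy()]

print(list(categories))  # ['Female', 'Male']
print(list(codes))       # [1, 1, 0]
print(list(decoded))     # ['Male', 'Male', 'Female']
```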

Datetime

    df = pd.DataFrame({'name': ['Rosaline Franklin', 'William Gosset'],
                       'born': ['1920-07-25', '1876-06-13']})
    df['born_dt'] = pd.to_datetime(df['born'])
    df

             born               name    born_dt
    0  1920-07-25  Rosaline Franklin 1920-07-25
    1  1876-06-13     William Gosset 1876-06-13

    df.dtypes

    born               object
    name               object
    born_dt    datetime64[ns]
    dtype: object

Datetime accessor

    df['born_dt'].dt.day

    0    25
    1    13
    Name: born_dt, dtype: int64

    df['born_dt'].dt.month

    0    7
    1    6
    Name: born_dt, dtype: int64

    df['born_dt'].dt.year

    0    1920
    1    1876
    Name: born_dt, dtype: int64
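The components returned by the dt accessor are ordinary Series, so they can feed further calculations. A sketch under the same example data; the born_decade and born_ym columns are hypothetical additions, and the explicit format argument is optional for ISO dates like these.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Rosaline Franklin', 'William Gosset'],
                   'born': ['1920-07-25', '1876-06-13']})

# An explicit format documents the expected layout of the strings.
df['born_dt'] = pd.to_datetime(df['born'], format='%Y-%m-%d')

# Integer arithmetic on the extracted year gives the decade of birth.
df['born_decade'] = df['born_dt'].dt.year // 10 * 10

# strftime turns the datetimes back into strings in a chosen layout.
df['born_ym'] = df['born_dt'].dt.strftime('%Y/%m')

print(df[['name', 'born_decade', 'born_ym']])
```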

More Pandas

Missing data

- NaN missing values come from numpy
- np.NaN, np.NAN, and np.nan all refer to the same value, the counterpart of R's NA (NumPy 2.0 removed the np.NaN and np.NAN aliases, so prefer np.nan)
- Check for missing values with pd.isnull
- Check for non-missing values with pd.notnull
- pd.isnull is an alias for pd.isna
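A minimal sketch of the checks listed above. The Series values is made-up example data; the point is that NaN never compares equal to itself, which is why the dedicated functions are needed.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN is not equal to anything, including itself, so == cannot find it.
nan_equals_itself = (np.nan == np.nan)

# pd.isnull / pd.notnull test element-wise instead.
is_missing = pd.isnull(values)
n_present = int(pd.notnull(values).sum())

print(nan_equals_itself)     # False
print(is_missing.tolist())   # [False, True, False]
print(n_present)             # 2
```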

Working with missing data

    df

               name  treatment_a  treatment_b
    0    John Smith          NaN            2
    1      Jane Doe         16.0           11
    2  Mary Johnson          3.0            1

    a_mean = df['treatment_a'].mean()
    a_mean

    9.5

Fillna

    df['a_fill'] = df['treatment_a'].fillna(a_mean)
    df

               name  treatment_a  treatment_b  a_fill
    0    John Smith          NaN            2     9.5
    1      Jane Doe         16.0           11    16.0
    2  Mary Johnson          3.0            1     3.0
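Two related points may help R users here: pandas skips NaN by default when aggregating (R's mean returns NA unless na.rm = TRUE), and dropping rows is an alternative to filling. A sketch that rebuilds the treatment data shown above; complete is an illustrative name.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16.0, 3.0],
    'treatment_b': [2, 11, 1],
})

# mean() ignores NaN by default; skipna=False mimics R's default behaviour.
mean_skip = df['treatment_a'].mean()
mean_keep = df['treatment_a'].mean(skipna=False)

# dropna removes the rows where the listed columns are missing.
complete = df.dropna(subset=['treatment_a'])

print(mean_skip)               # 9.5
print(math.isnan(mean_keep))   # True
print(list(complete['name']))  # ['Jane Doe', 'Mary Johnson']
```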

More Pandas

- Applying custom functions
- Groupby operations
- Tidying data

Apply your own functions

- Built-in functions
- Custom functions
- apply method
- Pass in an axis

R

    df = data.frame('a' = c(1, 2, 3),
                    'b' = c(4, 5, 6))
    apply(df, 2, mean)

    a b
    2 5

    apply(df, 1, mean)

    2.5 3.5 4.5

Python

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': [4, 5, 6]})
    df.apply(np.mean, axis=0)

    A    2.0
    B    5.0
    dtype: float64

    df.apply(np.mean, axis=1)

    0    2.5
    1    3.5
    2    4.5
    dtype: float64
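The slides apply the built-in np.mean; a custom function works the same way. The sketch below uses a hypothetical helper, value_range, which receives one column (axis=0) or one row (axis=1) at a time as a Series.

```python
import pandas as pd

def value_range(values):
    """Return the spread (max minus min) of a Series."""
    return values.max() - values.min()

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6]})

# axis=0 passes each column to the function (like apply(df, 2, f) in R).
col_ranges = df.apply(value_range, axis=0)

# axis=1 passes each row to the function (like apply(df, 1, f) in R).
row_ranges = df.apply(value_range, axis=1)

print(col_ranges.tolist())  # [2, 2]
print(row_ranges.tolist())  # [3, 3, 3]
```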

Tidy

- Reshaping and tidying our data
- Hadley Wickham, Tidy Data Paper
- Each row is an observation
- Each column is a variable
- Each type of observational unit forms a table

Tidy Data Paper: http://vita.had.co.nz/papers/tidy-data.pdf

Tidy melt

    df

               name  treatment_a  treatment_b
    0    John Smith          NaN            2
    1      Jane Doe         16.0           11
    2  Mary Johnson          3.0            1

    df_melt = pd.melt(df, id_vars='name')
    df_melt

               name     variable  value
    0    John Smith  treatment_a    NaN
    1      Jane Doe  treatment_a   16.0
    2  Mary Johnson  treatment_a    3.0
    3    John Smith  treatment_b    2.0
    ...
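By default melt names the new columns variable and value. A sketch, assuming the same treatment data, that uses the var_name and value_name parameters to choose more descriptive names (treatment and result are arbitrary choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16.0, 3.0],
    'treatment_b': [2, 11, 1],
})

# id_vars stay as they are; every other column is stacked into
# one label column (var_name) and one data column (value_name).
df_long = pd.melt(df, id_vars='name',
                  var_name='treatment', value_name='result')

print(df_long.shape)          # (6, 3): 3 people x 2 treatments
print(list(df_long.columns))  # ['name', 'treatment', 'result']
```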

Tidy pivot_table

    df_melt_pivot = pd.pivot_table(df_melt,
                                   index='name',
                                   columns='variable',
                                   values='value')
    df_melt_pivot

    variable      treatment_a  treatment_b
    name
    Jane Doe             16.0         11.0
    John Smith            NaN          2.0
    Mary Johnson          3.0          1.0

Reset index

    df_melt_pivot.reset_index()

    variable          name  treatment_a  treatment_b
    0             Jane Doe         16.0         11.0
    1           John Smith          NaN          2.0
    2         Mary Johnson          3.0          1.0
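After reset_index the label "variable" still appears above the row numbers because it is the name of the column index. A sketch of the full round trip that also clears that leftover name; wide is an illustrative variable name. Note that pivot_table averages duplicate entries by default, which has no effect here because each name and treatment pair occurs once.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16.0, 3.0],
    'treatment_b': [2, 11, 1],
})
df_melt = pd.melt(df, id_vars='name')

# Long -> wide, then move 'name' from the index back to a regular column.
wide = pd.pivot_table(df_melt, index='name',
                      columns='variable', values='value').reset_index()

# Remove the leftover 'variable' label from the column index.
wide.columns.name = None

print(wide)
```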

Groupby

- groupby: split-apply-combine
- Split data into separate partitions
- Apply a function on each partition
- Combine the results

Performing a groupby

               name     variable  value
    0    John Smith  treatment_a    NaN
    1      Jane Doe  treatment_a   16.0
    2  Mary Johnson  treatment_a    3.0
    3    John Smith  treatment_b    2.0
    4      Jane Doe  treatment_b   11.0
    5  Mary Johnson  treatment_b    1.0

    df_melt.groupby('name')['value'].mean()

    name
    Jane Doe        13.5
    John Smith       2.0
    Mary Johnson     2.0
    Name: value, dtype: float64
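John Smith's mean of 2.0 comes from a single non-missing value. A sketch, rebuilding the melted data above, that uses agg to report the count next to the mean so this is visible (summary and by_treatment are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'],
    'treatment_a': [np.nan, 16.0, 3.0],
    'treatment_b': [2, 11, 1],
})
df_melt = pd.melt(df, id_vars='name')

# agg with a list applies several functions to each group at once;
# both 'mean' and 'count' ignore NaN.
summary = df_melt.groupby('name')['value'].agg(['mean', 'count'])

# Grouping by a different column works the same way.
by_treatment = df_melt.groupby('variable')['value'].mean()

print(summary)
print(by_treatment)
```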
