Pandas? Like the bear? No!
pandas is a library we can use for reading and analyzing any bit of Tabular data. We'll work out of the newly released 1.0 version
import pandas as pd
We do this by creating what are called DataFrames. These can come from a variety of forms. Most commonly:
read_csvread_excelread_featherread_htmlread_jsonread_pickleread_sqlread_table
We'll focus on the read_csv function. We'll use the Titanic dataset for today's tutorial
!git clone https://github.com/pcsanwald/kaggle-titanic.git
Let's use the Pathlib library to read our data
from fastai.basic_data import pathlib
path = pathlib.Path('kaggle-titanic')
And look at what we grabbed
path.ls()
We can see that we have a train.csv and test.csv file. We'll work out of the train file today. Let's make our DataFrame
df = pd.read_csv(path/'train.csv')
And now for some common functionalities:
df.head(n=4)
df.tail(n=3)
There are two different ways we can select rows, loc and iloc. Each work a little differently
loc is used to get rows (or columns) with a particular label from an index. IE:
df.loc[:3]
Here we grabbed rows 0 through 3
df.iloc[:3]
Whereas here we grabbed the first three rows
df['survived'].head()
To select multiple columns, pass in a double array of your names
df[['sex', 'age', 'survived']].head()
And to pass as an index, first do the number of rows followed by the column index
df.iloc[:,0]
df.iloc[:,0:3]
df.loc[df['sex'] == 'female'].head()
Or even a series of values:
df.loc[(df['sex'] == 'female') & (df['survived'] == 0)].head()
Now that we have the basic selects done, let's go into some more complex ideas
df['pclass'].iloc[:5].plot(kind='bar', title='pclass')
df['fare'].plot(title='fare')
We can also plot multiple columns:
df[['survived', 'pclass']][:20].plot(kind='bar')
df1 = df[['sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'name']].iloc[:50]
df2 = df[df.columns[:6]].iloc[:50]
Merge
merge allows you to do standard database operations on a DataFrame or Series. Here is the doc line, let's break it down:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
leftandrightshould be twoDataFrameorSeriesobjects.onis a column or index name to join on and must be found in both dataframes.left_indexandright_indexuses the respective input's index to use as a key insteadhow: Eitherleft,right,outer, orinner.
df1.head()
df2.head()
Let's practice merging on name
merge = pd.merge(df1, df2, on='name')
merge.head()
df_top = df.iloc[:5]
df_bottom = df.iloc[5:10]
len(df_top), len(df_bottom)
df_top.append(df_bottom)
df_top.drop(0)
df_top.drop('survived', axis=1)
df_top.head()
df_top.groupby(by=None, axis=0, level=None, as_index: bool=True, sort: bool=True, group_keys: bool=True, squeeze: bool=False, observed: bool=False)
Parameters:
by: mapping, function, string or some iterableaxis: default is 0level: If the axis is heirarchical, group by levels or a particular levelgroup_keys: Add group keys to index to identify pieces
For an example, let's group by survived:
surv = df.groupby('survived')
If we call first, we'll see the first entry in each group:
surv.first()
To grab the group, we can call get_group and pass in either one of our classes:
surv.get_group(1)
We can also group by multiple columns. Let's combine both survived and sex
surs = df.groupby(['survived', 'sex'])
surs.first()
Now we can further analyze and split our data based on whatever queries we want to use!
type(df['survived'])
df['survived'].head()
We can use a function or a dictionary:
df['survived'].map({0:'dead', 1:'survived'}).head()
df['survived'].map('I am a {}'.format).head()
Apply let's you pass in a function to apply to every value in a series. It takes in a function, convert_dtype, and some argumetns
Let's make a basic one that returns if someone survived:
def lived(num):
if num == 0:
return "Died"
else:
return "Lived"
new = df['survived'].apply(lived)
new.head()
We can also use a lambda:
new = df['age'].apply(lambda age: age + 5)
print(f'{df["age"].head()}\n{new.head()}')