Pandas? Like the bear? No!
pandas
is a library we can use for reading and analyzing any bit of Tabular data. We'll work out of the newly released 1.0 version
import pandas as pd
We do this by creating what are called DataFrames
. These can come from a variety of forms. Most commonly:
read_csv
read_excel
read_feather
read_html
read_json
read_pickle
read_sql
read_table
We'll focus on the read_csv
function. We'll use the Titanic dataset for today's tutorial
!git clone https://github.com/pcsanwald/kaggle-titanic.git
Let's use the Pathlib
library to read our data
from fastai.basic_data import pathlib
path = pathlib.Path('kaggle-titanic')
And look at what we grabbed
path.ls()
We can see that we have a train.csv
and test.csv
file. We'll work out of the train
file today. Let's make our DataFrame
df = pd.read_csv(path/'train.csv')
And now for some common functionalities:
df.head(n=4)
df.tail(n=3)
There are two different ways we can select rows, loc
and iloc
. Each work a little differently
loc
is used to get rows (or columns) with a particular label from an index. IE:
df.loc[:3]
Here we grabbed rows 0 through 3
df.iloc[:3]
Whereas here we grabbed the first three rows
df['survived'].head()
To select multiple columns, pass in a double array of your names
df[['sex', 'age', 'survived']].head()
And to pass as an index, first do the number of rows followed by the column index
df.iloc[:,0]
df.iloc[:,0:3]
df.loc[df['sex'] == 'female'].head()
Or even a series of values:
df.loc[(df['sex'] == 'female') & (df['survived'] == 0)].head()
Now that we have the basic selects done, let's go into some more complex ideas
df['pclass'].iloc[:5].plot(kind='bar', title='pclass')
df['fare'].plot(title='fare')
We can also plot multiple columns:
df[['survived', 'pclass']][:20].plot(kind='bar')
df1 = df[['sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'name']].iloc[:50]
df2 = df[df.columns[:6]].iloc[:50]
Merge
merge
allows you to do standard database operations on a DataFrame
or Series
. Here is the doc line, let's break it down:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
left
andright
should be twoDataFrame
orSeries
objects.on
is a column or index name to join on and must be found in both dataframes.left_index
andright_index
uses the respective input's index to use as a key insteadhow
: Eitherleft
,right
,outer
, orinner
.
df1.head()
df2.head()
Let's practice merging on name
merge = pd.merge(df1, df2, on='name')
merge.head()
df_top = df.iloc[:5]
df_bottom = df.iloc[5:10]
len(df_top), len(df_bottom)
df_top.append(df_bottom)
df_top.drop(0)
df_top.drop('survived', axis=1)
df_top.head()
df_top.groupby(by=None, axis=0, level=None, as_index: bool=True, sort: bool=True, group_keys: bool=True, squeeze: bool=False, observed: bool=False)
Parameters:
by
: mapping, function, string or some iterableaxis
: default is 0level
: If the axis is heirarchical, group by levels or a particular levelgroup_keys
: Add group keys to index to identify pieces
For an example, let's group by survived
:
surv = df.groupby('survived')
If we call first
, we'll see the first entry in each group:
surv.first()
To grab the group, we can call get_group
and pass in either one of our classes:
surv.get_group(1)
We can also group by multiple columns. Let's combine both survived
and sex
surs = df.groupby(['survived', 'sex'])
surs.first()
Now we can further analyze and split our data based on whatever queries we want to use!
type(df['survived'])
df['survived'].head()
We can use a function or a dictionary:
df['survived'].map({0:'dead', 1:'survived'}).head()
df['survived'].map('I am a {}'.format).head()
Apply let's you pass in a function to apply to every value in a series. It takes in a function, convert_dtype
, and some argumetns
Let's make a basic one that returns if someone survived:
def lived(num):
if num == 0:
return "Died"
else:
return "Lived"
new = df['survived'].apply(lived)
new.head()
We can also use a lambda:
new = df['age'].apply(lambda age: age + 5)
print(f'{df["age"].head()}\n{new.head()}')