Lesson Video:

Intro to Tabular and Pandas

Before we begin doing tabular modeling, let's learn about the Pandas library

Pandas? Like the bear? No!

pandas is a library we can use for reading and analyzing any bit of Tabular data. We'll work out of the newly released 1.0 version

import pandas as pd

We do this by creating what are called DataFrames. These can come from a variety of forms. Most commonly:

read_csv
read_excel
read_feather
read_html
read_json
read_pickle
read_sql
read_table

We'll focus on the read_csv function. We'll use the Titanic dataset for today's tutorial

Titanic

!git clone https://github.com/pcsanwald/kaggle-titanic.git

Cloning into 'kaggle-titanic'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 19 (delta 0), reused 19 (delta 0), pack-reused 0
Unpacking objects: 100% (19/19), done.

Let's use the Pathlib library to read our data

from fastai.basic_data import pathlib

path = pathlib.Path('kaggle-titanic')

And look at what we grabbed

path.ls()

[PosixPath('kaggle-titanic/train.csv'),
 PosixPath('kaggle-titanic/doc'),
 PosixPath('kaggle-titanic/.git'),
 PosixPath('kaggle-titanic/test.csv'),
 PosixPath('kaggle-titanic/.gitignore'),
 PosixPath('kaggle-titanic/README.md'),
 PosixPath('kaggle-titanic/test'),
 PosixPath('kaggle-titanic/project.clj'),
 PosixPath('kaggle-titanic/src')]

We can see that we have a train.csv and test.csv file. We'll work out of the train file today. Let's make our DataFrame

df = pd.read_csv(path/'train.csv')

And now for some common functionalities:

Pandas Functionalities:

Head and Tail

.head() and .tail() will show the first and last few rows of a DataFrame. You can pass in n rows to look at (the default is 5)

df.head(n=4)

df.tail(n=3)

Selecting

We can select a variety of ways in pandas: row, value, even by a column:

Row:

There are two different ways we can select rows, loc and iloc. Each work a little differently

`loc`

loc is used to get rows (or columns) with a particular label from an index. IE:

df.loc[:3]

Here we grabbed rows 0 through 3

`iloc`

iloc is used when we want to get rows (or columns) from a position on an index

df.iloc[:3]

Whereas here we grabbed the first three rows

Column

We can pass in either a string, an index, or multiple columns to select:

df['survived'].head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

To select multiple columns, pass in a double array of your names

df[['sex', 'age', 'survived']].head()

And to pass as an index, first do the number of rows followed by the column index

df.iloc[:,0]

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

df.iloc[:,0:3]

Value

We can select based on a value a few different ways, most involving a boolean argument:

df.loc[df['sex'] == 'female'].head()

Or even a series of values:

df.loc[(df['sex'] == 'female') & (df['survived'] == 0)].head()

Now that we have the basic selects done, let's go into some more complex ideas

Plotting

We can plot out any data we want from our DataFrames, like so:

df['pclass'].iloc[:5].plot(kind='bar', title='pclass')

<matplotlib.axes._subplots.AxesSubplot at 0x7fc8906b2518>

df['fare'].plot(title='fare')

<matplotlib.axes._subplots.AxesSubplot at 0x7fc890642518>

We can also plot multiple columns:

df[['survived', 'pclass']][:20].plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7fc890129828>

Dealing with multiple `DataFrames`

Sometimes, we have multiple dataframes of data. How do we combine them?

df1 = df[['sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'name']].iloc[:50]
df2 = df[df.columns[:6]].iloc[:50]

Merge

merge allows you to do standard database operations on a DataFrame or Series. Here is the doc line, let's break it down:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)

left and right should be two DataFrame or Series objects.
on is a column or index name to join on and must be found in both dataframes.
left_index and right_index uses the respective input's index to use as a key instead
how: Either left, right, outer, or inner.

df1.head()

df2.head()

Let's practice merging on name

merge = pd.merge(df1, df2, on='name')

merge.head()

Appending

We can tag one dataframe onto another:

df_top = df.iloc[:5]
df_bottom = df.iloc[5:10]

len(df_top), len(df_bottom)

(5, 5)

df_top.append(df_bottom)

Dropping

You can drop a column or row, and by default pandas will return the dropped table. If you choose to do inplace, it will directly modify the dataframe. You can also pass in an axis parameter. By default it is 1, which will drop a row. 0 drops a column

df_top.drop(0)

df_top.drop('survived', axis=1)

df_top.head()

GroupBy

GroupBy can be used to split our data into groups if they fall into some criteria.

df_top.groupby(by=None, axis=0, level=None, as_index: bool=True, sort: bool=True, group_keys: bool=True, squeeze: bool=False, observed: bool=False)

Parameters:

by: mapping, function, string or some iterable
axis: default is 0
level: If the axis is heirarchical, group by levels or a particular level
group_keys: Add group keys to index to identify pieces

For an example, let's group by survived:

surv = df.groupby('survived')

If we call first, we'll see the first entry in each group:

surv.first()

To grab the group, we can call get_group and pass in either one of our classes:

surv.get_group(1)

We can also group by multiple columns. Let's combine both survived and sex

surs = df.groupby(['survived', 'sex'])

surs.first()

Now we can further analyze and split our data based on whatever queries we want to use!

Map

map can be used to map values from two different series (column) that share a same column. Basically we can repeat our adjust values:

type(df['survived'])

pandas.core.series.Series

df['survived'].head()

0    0
1    1
2    1
3    1
4    0
Name: survived, dtype: int64

We can use a function or a dictionary:

df['survived'].map({0:'dead', 1:'survived'}).head()

0        dead
1    survived
2    survived
3    survived
4        dead
Name: survived, dtype: object

df['survived'].map('I am a {}'.format).head()

0    I am a 0
1    I am a 1
2    I am a 1
3    I am a 1
4    I am a 0
Name: survived, dtype: object

.apply

Apply let's you pass in a function to apply to every value in a series. It takes in a function, convert_dtype, and some argumetns

Let's make a basic one that returns if someone survived:

def lived(num):
  if num == 0:
    return "Died"
  else:
    return "Lived"

new = df['survived'].apply(lived)

new.head()

0     Died
1    Lived
2    Lived
3    Lived
4     Died
Name: survived, dtype: object

We can also use a lambda:

new = df['age'].apply(lambda age: age + 5)

print(f'{df["age"].head()}\n{new.head()}')

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: age, dtype: float64
0    27.0
1    43.0
2    31.0
3    40.0
4    40.0
Name: age, dtype: float64

	survived	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
8	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
14	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
18	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S
24	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S
38	3	Vander Planke, Miss. Augusta Maria	female	18.0	2	0	345764	18.0000	NaN	S
40	3	Ahlin, Mrs. Johan (Johanna Persdotter Larsson)	female	40.0	1	0	7546	9.4750	NaN	S

	sibsp	ticket	fare	cabin	embarked	name
0	1	A/5 21171	7.2500	NaN	S	Braund, Mr. Owen Harris
1	1	PC 17599	71.2833	C85	C	Cumings, Mrs. John Bradley (Florence Briggs Th...
2	0	STON/O2. 3101282	7.9250	NaN	S	Heikkinen, Miss. Laina
3	1	113803	53.1000	C123	S	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	0	373450	8.0500	NaN	S	Allen, Mr. William Henry

	sibsp_x	ticket	fare	cabin	embarked	name	survived	pclass	sex	age	sibsp_y
0	1	A/5 21171	7.2500	NaN	S	Braund, Mr. Owen Harris	0	3	male	22.0	1
1	1	PC 17599	71.2833	C85	C	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	1	female	38.0	1
2	0	STON/O2. 3101282	7.9250	NaN	S	Heikkinen, Miss. Laina	1	3	female	26.0	0
3	1	113803	53.1000	C123	S	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	1	female	35.0	1
4	0	373450	8.0500	NaN	S	Allen, Mr. William Henry	0	3	male	35.0	0

	survived	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

	survived	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.45	NaN	S
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

Lesson 1 - Introduction to Pandas