Introduction to Pandas and Seaborn (Pokemon Dataset Part 1)

Posted on Tue 17 July 2018 in posts

In this post, we'll take a look at the Pokemon dataset available on Kaggle and have some fun practicing data analysis and visualization with Pandas and Seaborn. We won't do any advanced analysis, just simple tasks such as counting, filtering, and cleaning the data, plus some basic plotting techniques.

Note: The Kaggle dataset only contains Pokemon up to gen 6; the gen 7 data can be taken from here

Let's start by importing the packages we need.

In [1]:

import pandas as pd   
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('bmh')
%matplotlib inline
plt.rcParams['figure.dpi'] = 100

Now we can load the dataset. Let's name the dataframe pokedata and look at its first and last 5 rows to get a general idea of the data.

In [2]:

pokedata = pd.read_csv("../input/Pokemon_all.csv")

In [3]:

pokedata.head()

Out[3]:

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
3	3	VenusaurMega Venusaur	Grass	Poison	625	80	100	123	122	120	80	1	False
4	4	Charmander	Fire	NaN	309	39	52	43	60	50	65	1	False

In [4]:

pokedata.tail()

Out[4]:

	#	Name	Type 1	Type 2	Total	HP	Attack	Defense	Sp. Atk	Sp. Def	Speed	Generation	Legendary
881	803	Poipole	poison	NaN	420	67	73	67	73	67	73	7	False
882	804	Naganadel	poison	dragon	540	73	73	73	127	73	121	7	False
883	805	Stakataka	rock	steel	570	61	131	211	53	101	13	7	False
884	806	Blachepalon	fire	ghost	570	53	127	53	151	79	107	7	False
885	807	Zeraora	electric	NaN	600	88	112	75	102	80	143	7	True

Cleaning the dataset

If we look carefully at the 10 rows shown above, we can see some problems in the dataset.

Some Pokemon have NaN (null) values in the Type 2 column
Some Pokemon have multiple forms, and those forms are included as separate rows (for example, Venusaur and Mega Venusaur both have number 3)
In gen 7, the type names don't start with a capital letter as they do in every earlier generation, so Pandas will count them as different types

We need to do some cleaning before the dataset is ready to use.
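These three problems can also be checked programmatically rather than by eye. The sketch below is only an illustration on a tiny hand-made sample that copies a few rows from the output above; `sample` and the three result names are made up for this example, and the same calls should work on the full pokedata dataframe.

```python
import pandas as pd

# Tiny hand-made sample copied from a few rows shown above (not the full dataset).
sample = pd.DataFrame({
    "#": [3, 3, 4, 803],
    "Name": ["Venusaur", "VenusaurMega Venusaur", "Charmander", "Poipole"],
    "Type 1": ["Grass", "Grass", "Fire", "poison"],
    "Type 2": ["Poison", "Poison", None, None],
})

# 1. Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# 2. Rows that repeat an earlier Pokedex number (alternate forms).
repeated_numbers = sample[sample["#"].duplicated(keep="first")]

# 3. Rows whose primary type is written entirely in lowercase.
lowercase_types = sample[sample["Type 1"].str.islower()]

print(missing_per_column)
print(repeated_numbers["Name"].tolist())
print(lowercase_types["Name"].tolist())
```

Each check maps to one of the cleaning steps that follow: fixing the capitalization, dropping the repeated numbers, and filling the missing secondary types.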

I prefer the column names in capital letters, so I'm going to change them. It's just my preference though; you can leave them as they are if you want.

In [5]:

pokedata.columns = pokedata.columns.str.upper()

Now let's capitalize the first letter of each Pokemon type, so that the gen 7 types match the earlier generations.

In [6]:

pokedata['TYPE 1'] = pokedata['TYPE 1'].str.capitalize()
pokedata['TYPE 2'] = pokedata['TYPE 2'].str.capitalize()
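To see why the capitalization matters, here is a small sketch on a made-up Series (`types` is a hypothetical name, not part of the dataset). It shows that `value_counts()` treats "Poison" and "poison" as two different labels until they are normalized, and that `.str.capitalize()` leaves missing values untouched.

```python
import pandas as pd

# Made-up values: the same type written two ways, plus one missing value.
types = pd.Series(["Poison", "poison", "Fire", None])

# Before cleaning, the two spellings are counted separately.
before = types.value_counts()

# capitalize() uppercases the first letter and lowercases the rest;
# missing values stay missing.
capitalized = types.str.capitalize()
after = capitalized.value_counts()

print(before)
print(after)
```

Because missing values pass through unchanged, the order of the cleaning steps doesn't matter here: the null Type 2 entries can still be filled afterwards.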

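Let's remove the duplicate Pokemon. Alternate forms share a number with their base form, so we keep only the first row for each number.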

In [7]:

pokedata.drop_duplicates('#', keep='first', inplace=True)
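As a rough illustration of what this call does, the sketch below applies it to three rows copied from the output above (`forms` and `deduplicated` are names made up for the example). It assumes, as in those rows, that the base form is listed before its alternate forms.

```python
import pandas as pd

# Three rows copied from the head() output: Venusaur appears twice under number 3.
forms = pd.DataFrame({
    "#": [3, 3, 4],
    "NAME": ["Venusaur", "VenusaurMega Venusaur", "Charmander"],
})

# Keep only the first row for each Pokedex number.
deduplicated = forms.drop_duplicates(subset="#", keep="first")

print(deduplicated)
```

The surviving rows keep their original index labels, which is why the index of the cleaned dataframe has gaps where alternate forms used to be.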

Some Pokemon don't have a secondary type, so they have NaN (null) values in the Type 2 column. Let's fill those null values with the string None.

In [8]:

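# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
pokedata['TYPE 2'] = pokedata['TYPE 2'].fillna('None')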

Now let's take a look at the first and last 5 rows of the dataset one more time.

In [9]:

pokedata.head()

Out[9]:

	#	NAME	TYPE 1	TYPE 2	TOTAL	HP	ATTACK	DEFENSE	SP. ATK	SP. DEF	SPEED	GENERATION	LEGENDARY
0	1	Bulbasaur	Grass	Poison	318	45	49	49	65	65	45	1	False
1	2	Ivysaur	Grass	Poison	405	60	62	63	80	80	60	1	False
2	3	Venusaur	Grass	Poison	525	80	82	83	100	100	80	1	False
4	4	Charmander	Fire	None	309	39	52	43	60	50	65	1	False
5	5	Charmeleon	Fire	None	405	58	64	58	80	65	80	1	False

In [10]:

pokedata.tail()

Out[10]:

	#	NAME	TYPE 1	TYPE 2	TOTAL	HP	ATTACK	DEFENSE	SP. ATK	SP. DEF	SPEED	GENERATION	LEGENDARY
881	803	Poipole	Poison	None	420	67	73	67	73	67	73	7	False
882	804	Naganadel	Poison	Dragon	540	73	73	73	127	73	121	7	False
883	805	Stakataka	Rock	Steel	570	61	131	211	53	101	13	7	False
884	806	Blachepalon	Fire	Ghost	570	53	127	53	151	79	107	7	False
885	807	Zeraora	Electric	None	600	88	112	75	102	80	143	7	True

The data is much cleaner and now it's ready to use. Now we can do some analysis and visualization.

Pokemon count in each generation

First, let's verify how many Pokemon there are in this dataset.

In [11]:

pokedata['#'].count()

Out[11]:

807

The number of Pokemon matches the data from Bulbapedia. Now let's see how the Pokemon are distributed across generations.

In [12]:

# factorplot was renamed to catplot (and size to height) in seaborn 0.9
sns.catplot(
    x='GENERATION',
    data=pokedata,
    height=5,
    aspect=1.2,
    kind='count'
).set_axis_labels('Generation', '# of Pokemon')

plt.show()

The number of Pokemon doesn't seem to follow any noticeable trend, except that before gen 7, odd-numbered generations always have more Pokemon than even-numbered ones.
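A count plot is essentially a picture of `value_counts()`, so the exact numbers behind the bars can be read off without a chart. The sketch below uses a made-up `generation` Series purely for illustration; on the real data the same call would be applied to `pokedata['GENERATION']`.

```python
import pandas as pd

# Made-up generation labels, standing in for pokedata['GENERATION'].
generation = pd.Series([1, 1, 1, 2, 2, 3], name="GENERATION")

# Count the rows per generation and order the result by generation number
# instead of by frequency.
per_generation = generation.value_counts().sort_index()

print(per_generation)
```

Sorting by index keeps the generations in chronological order, which makes comparisons between neighbouring generations easier to check.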

Legendary Pokemon count

From Bulbapedia: Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon, generally featured prominently in the legends and myths of the Pokémon world.

In this post, we'll simplify the categorization and count the mythical Pokemon and the ultra beasts as legendary Pokemon. First, let's take a look at how rare they are, and then we can visualize the split between legendary and non-legendary Pokemon.

In [13]:

pokedata['LEGENDARY'].value_counts()

Out[13]:

False    749
True      58
Name: LEGENDARY, dtype: int64

In [14]:

fig = plt.figure(figsize=(7,7))

colours = ["aqua", "orange"]
pokeLeg = pokedata[pokedata['LEGENDARY']==True]
pokeNon = pokedata[pokedata['LEGENDARY']==False]

legDist = [pokeLeg['NAME'].count(),pokeNon['NAME'].count()]
legPie = plt.pie(legDist,
                 labels= ['Legendary', 'Non Legendary'], 
                 autopct ='%1.1f%%', 
                 shadow = True,
                 colors=colours,
                 startangle = 45,
                 explode=(0, 0.1))

So only 7.2% of the 807 Pokemon are legendary. Now let's see how they are distributed across generations.
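The percentage on the pie chart can be reproduced directly from the counts. This is a minimal sketch that hard-codes the two numbers from the `value_counts()` output above; `legendary_counts` and `share` are names invented for the example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
legendary_counts = pd.Series({"Non Legendary": 749, "Legendary": 58})

# Convert the counts to percentages of the total.
share = legendary_counts / legendary_counts.sum() * 100

print(share.round(1))
```

On the dataframe itself, `pokedata['LEGENDARY'].value_counts(normalize=True)` should give the same proportions as fractions without building the Series by hand.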

In [15]:

colours = ["aqua", "orange"]
# factorplot was renamed to catplot (and size to height) in seaborn 0.9
g = sns.catplot(
    x='GENERATION',
    data=pokedata,
    kind='count',
    hue='LEGENDARY',
    palette=colours,
    height=5,
    aspect=1.5,
    legend=False,
    ).set_axis_labels('Generation', '# of Pokemon')

g.ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.1),  shadow=True, ncol=2, labels=['NON LEGENDARY','LEGENDARY'])
plt.show()

I initially thought that the number of legendary Pokemon would always correlate with the number of Pokemon in that gen, but it looks like that isn't the case. There doesn't seem to be any noticeable trend either.
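One way to put numbers on this comparison is to compute, for each generation, the total count, the legendary count, and the legendary share. The sketch below does that on a made-up `toy` dataframe (the values are invented, not taken from the dataset) and relies on the fact that summing a boolean column counts its True values.

```python
import pandas as pd

# Invented rows, standing in for the GENERATION and LEGENDARY columns.
toy = pd.DataFrame({
    "GENERATION": [1, 1, 1, 2, 2],
    "LEGENDARY": [False, False, True, False, False],
})

# count = all Pokemon, sum = legendary ones (True counts as 1), mean = legendary share.
summary = toy.groupby("GENERATION")["LEGENDARY"].agg(
    total="count", legendary="sum", share="mean"
)

print(summary)
```

If the legendary count really followed the generation size, the `share` column would be roughly constant from one generation to the next.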

Pokemon Type Distribution

There are 18 Pokemon types in total as of generation 7. Some Pokemon have only one type, while others have a secondary type. For example, Charmander is a Fire type, while Bulbasaur is both a Grass type and a Poison type.

First, let's take a look at all 18 types.

In [16]:

pokedata['TYPE 1'].unique()

Out[16]:

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

Now let's see which primary and secondary types are the most common.

In [17]:

fig = plt.figure(figsize=(15,15))

fig.add_subplot(211)
pokedata['TYPE 1'].value_counts().plot(kind='pie', 
                                       autopct='%1.1f%%',
                                       pctdistance=0.9)

fig.add_subplot(212)
pokedata['TYPE 2'].value_counts().plot(kind='pie', 
                                       autopct='%1.1f%%',
                                       pctdistance=0.9)

plt.show()

We can already see which types are the most and least common, but a pie chart is not ideal when there are too many slices, so let's use a bar plot instead.

In [18]:

# factorplot was renamed to catplot (and size to height) in seaborn 0.9
sns.catplot(
    y='TYPE 1',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 1'].value_counts().index,
    height=4,
    aspect=1.5,
    color='green'
).set_axis_labels('# of Pokemon', 'Type 1')

sns.catplot(
    y='TYPE 2',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 2'].value_counts().index,
    height=4,
    aspect=1.5,
    color='purple'
).set_axis_labels('# of Pokemon', 'Type 2');

A lot of information can be derived from the charts above. Some of the interesting points are:

Almost half of all Pokemon don't have a secondary type.
While Flying is the most common secondary type, it is the least common primary type. It kind of makes sense if you think about it: when you see Moltres, the first thing that comes to mind is Fire rather than Flying. And when you see Dragonite, you'll always identify it as a Dragon-type creature rather than a Flying-type one.
Water, Normal, and Grass being the most common primary types is to be expected, but I didn't expect Psychic-type Pokemon to be that common.
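Observations like these can be double-checked with a couple of one-liners. The sketch below runs on an invented four-row `toy` dataframe, so the numbers it prints say nothing about the real dataset; it only shows the shape of the calculation.

```python
import pandas as pd

# Invented rows, using the same 'None' placeholder as the cleaned dataset.
toy = pd.DataFrame({
    "TYPE 1": ["Fire", "Dragon", "Normal", "Water"],
    "TYPE 2": ["Flying", "Flying", "Flying", "None"],
})

# Share of Pokemon without a secondary type (mean of a boolean column).
no_secondary_share = (toy["TYPE 2"] == "None").mean()

# Primary and secondary counts side by side, one row per type.
comparison = pd.concat(
    [toy["TYPE 1"].value_counts(), toy["TYPE 2"].value_counts()],
    axis=1,
    keys=["primary", "secondary"],
).fillna(0).astype(int)

print(no_secondary_share)
print(comparison)
```

Putting both counts in one table makes it easy to spot types, such as Flying, that are far more common in one column than in the other.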

Pokemon type combinations

We've already seen which types are the most and least common. It will also be interesting to see all the type combinations. Note that we won't include Pokemon that don't have a secondary type.

In [19]:

plt.subplots(figsize=(10, 10))

sns.heatmap(
    pokedata[pokedata['TYPE 2']!='None'].groupby(['TYPE 1', 'TYPE 2']).size().unstack(),
    linewidths=1,
    annot=True,
    cmap="Blues"
)

plt.xticks(rotation=35)
plt.show()
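The table passed to the heatmap is built by a `groupby(...).size().unstack()` chain, which may be easier to follow on a small example. The sketch below uses an invented `toy` dataframe; `combos` and `filled` are hypothetical names for the two variants.

```python
import pandas as pd

# Invented rows, using the same 'None' placeholder as the cleaned dataset.
toy = pd.DataFrame({
    "TYPE 1": ["Grass", "Grass", "Poison", "Fire"],
    "TYPE 2": ["Poison", "Poison", "Dragon", "None"],
})

# Keep only dual-type Pokemon, then count each (TYPE 1, TYPE 2) pair.
pair_counts = toy[toy["TYPE 2"] != "None"].groupby(["TYPE 1", "TYPE 2"]).size()

# unstack() turns TYPE 2 into columns; pairs that never occur become NaN.
combos = pair_counts.unstack()

# With fill_value=0 the missing pairs are shown as explicit zeros instead.
filled = pair_counts.unstack(fill_value=0)

print(combos)
print(filled)
```

Leaving the NaN values in place, as the code above does, means combinations that don't exist are drawn as empty cells rather than as zeros.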