Introduction to Pandas and Seaborn (Pokemon Dataset Part 1)

Posted on Tue 17 July 2018 in posts

In this post, we'll look at the Pokemon dataset available on Kaggle and have some fun practicing data analysis and visualization with Pandas and Seaborn. We won't do any advanced analysis, just simple tasks such as counting, filtering, and cleaning the data, plus some basic plotting techniques.

Note: The dataset only contains Pokemon up to gen 6; the gen 7 dataset can be taken from here

Let's start by importing the packages we need.

In [1]:
import pandas as pd   
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('bmh')
%matplotlib inline
plt.rcParams['figure.dpi'] = 100

Now we can load the dataset. Let's name the dataframe pokedata and look at the first and last 5 rows to get a general sense of the data.

In [2]:
pokedata = pd.read_csv("../input/Pokemon_all.csv")
In [3]:
pokedata.head()                    
Out[3]:
   #  Name                   Type 1  Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1  Bulbasaur              Grass   Poison  318    45  49      49       65       65       45     1           False
1  2  Ivysaur                Grass   Poison  405    60  62      63       80       80       60     1           False
2  3  Venusaur               Grass   Poison  525    80  82      83       100      100      80     1           False
3  3  VenusaurMega Venusaur  Grass   Poison  625    80  100     123      122      120      80     1           False
4  4  Charmander             Fire    NaN     309    39  52      43       60       50       65     1           False
In [4]:
pokedata.tail()
Out[4]:
     #    Name         Type 1    Type 2  Total  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
881  803  Poipole      poison    NaN     420    67  73      67       73       67       73     7           False
882  804  Naganadel    poison    dragon  540    73  73      73       127      73       121    7           False
883  805  Stakataka    rock      steel   570    61  131     211      53       101      13     7           False
884  806  Blachepalon  fire      ghost   570    53  127     53       151      79       107    7           False
885  807  Zeraora      electric  NaN     600    88  112     75       102      80       143    7           True

Cleaning the dataset

If we look carefully at the 10 rows shown above, we can already spot some problems in the dataset.

  • Some Pokemon have NaN values (null values) in the column Type 2
  • Some Pokemon have multiple forms and those forms are included in this dataset
  • In gen 7, the Pokemon types don't start with a capital letter as they do in all earlier gens, so Pandas will treat them as different types (for example, 'poison' and 'Poison')

We need to do some cleaning in the dataset before it is ready to use.
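Before fixing anything, it can help to turn the three observations above into quick checks. This is only a sketch on a hypothetical five-row sample (the `sample` dataframe below is made up to mimic the rows shown earlier, it is not the real file); the same calls should work on pokedata.

```python
import pandas as pd

# Hypothetical sample that mimics the problems seen in head() and tail()
sample = pd.DataFrame({
    '#': [3, 3, 4, 803, 804],
    'Name': ['Venusaur', 'VenusaurMega Venusaur', 'Charmander',
             'Poipole', 'Naganadel'],
    'Type 1': ['Grass', 'Grass', 'Fire', 'poison', 'poison'],
    'Type 2': ['Poison', 'Poison', None, None, 'dragon'],
})

# 1. Missing values per column
null_counts = sample.isnull().sum()

# 2. Rows whose number repeats an earlier row (alternate forms)
repeated_forms = sample['#'].duplicated().sum()

# 3. Primary types that start with a lowercase letter
starts_lower = sample['Type 1'].str[0].str.islower()
lowercase_types = sample.loc[starts_lower, 'Type 1'].unique()

print(null_counts['Type 2'], repeated_forms, list(lowercase_types))
```

Each check maps to one bullet in the list, so rerunning them after the cleaning steps is an easy way to confirm that nothing was missed.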

I prefer the column names in capital letters, so I'm going to change them. It's just my preference though; you can leave them as they are if you want.

In [5]:
pokedata.columns = pokedata.columns.str.upper()

Now let's capitalize only the first letter of each Pokemon type.

In [6]:
pokedata['TYPE 1'] = pokedata['TYPE 1'].str.capitalize()
pokedata['TYPE 2'] = pokedata['TYPE 2'].str.capitalize()
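To see why this step matters, here is a small sketch on a hypothetical series of type names (not taken from the real file). Without normalizing the case, Pandas counts 'Poison' and 'poison' as two different types. Missing values pass through str.capitalize() unchanged, so the nulls in Type 2 are still there afterwards.

```python
import pandas as pd

# Hypothetical mix of capitalized (older gens) and lowercase (gen 7) types
types = pd.Series(['Poison', 'poison', 'Rock', 'rock', 'Fire'])

before = types.nunique()                   # case variants counted separately
after = types.str.capitalize().nunique()   # variants merged after the fix

# Missing values stay missing after capitalizing
missing_kept = pd.Series(['dragon', None]).str.capitalize()

print(before, after)
print(missing_kept.isna().tolist())
```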

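Let's remove the duplicate Pokemon. Alternate forms share the same number (#) as their base form, so we keep only the first row for each number.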

In [7]:
pokedata.drop_duplicates('#', keep='first', inplace=True)
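A minimal sketch of what keep='first' does, using a hypothetical handful of rows (the `forms` dataframe is made up for illustration). It assumes the base form is listed before its alternate forms, which is the case for Venusaur in the rows shown earlier.

```python
import pandas as pd

# Hypothetical rows: base forms followed by their alternate forms
forms = pd.DataFrame({
    '#': [3, 3, 4, 6, 6, 6],
    'NAME': ['Venusaur', 'VenusaurMega Venusaur', 'Charmander',
             'Charizard', 'CharizardMega Charizard X',
             'CharizardMega Charizard Y'],
})

# Keep the first row seen for each number, drop the later ones
base_forms = forms.drop_duplicates('#', keep='first')

print(base_forms['NAME'].tolist())
print(base_forms.index.tolist())
```

The original index labels are kept, so gaps appear where rows were dropped. Calling reset_index(drop=True) afterwards would renumber the rows if you prefer a continuous index.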

Some Pokemon don't have a secondary type, so they have NaN (null values) in the Type 2 column. Let's fill in those null values with the string 'None'.

In [8]:
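# Assign the result back; calling fillna(inplace=True) on a selected column may not update the dataframe in newer versions of Pandas
pokedata['TYPE 2'] = pokedata['TYPE 2'].fillna('None')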
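One practical reason to use a placeholder string instead of leaving the nulls alone: value_counts() skips NaN by default, so Pokemon without a secondary type would silently vanish from any count or chart built on that column. A small sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical secondary types, two of them missing
type2 = pd.Series(['Poison', None, 'Flying', None, 'Flying'])

with_nan = type2.value_counts()                   # missing rows are skipped
with_label = type2.fillna('None').value_counts()  # missing rows are counted

print(with_nan.sum(), with_label.sum())
print(with_label['None'])
```

Passing dropna=False to value_counts() is another way to keep the nulls visible, but a readable label tends to be nicer on plot axes.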

Now let's take a look at the first and last 5 rows of the dataset one more time.

In [9]:
pokedata.head()
Out[9]:
   #  NAME        TYPE 1  TYPE 2  TOTAL  HP  ATTACK  DEFENSE  SP. ATK  SP. DEF  SPEED  GENERATION  LEGENDARY
0  1  Bulbasaur   Grass   Poison  318    45  49      49       65       65       45     1           False
1  2  Ivysaur     Grass   Poison  405    60  62      63       80       80       60     1           False
2  3  Venusaur    Grass   Poison  525    80  82      83       100      100      80     1           False
4  4  Charmander  Fire    None    309    39  52      43       60       50       65     1           False
5  5  Charmeleon  Fire    None    405    58  64      58       80       65       80     1           False
In [10]:
pokedata.tail()
Out[10]:
     #    NAME         TYPE 1    TYPE 2  TOTAL  HP  ATTACK  DEFENSE  SP. ATK  SP. DEF  SPEED  GENERATION  LEGENDARY
881  803  Poipole      Poison    None    420    67  73      67       73       67       73     7           False
882  804  Naganadel    Poison    Dragon  540    73  73      73       127      73       121    7           False
883  805  Stakataka    Rock      Steel   570    61  131     211      53       101      13     7           False
884  806  Blachepalon  Fire      Ghost   570    53  127     53       151      79       107    7           False
885  807  Zeraora      Electric  None    600    88  112     75       102      80       143    7           True

The data is much cleaner and now it's ready to use. Now we can do some analysis and visualization.

Pokemon count in each generation

First, let's verify how many Pokemon there are in this dataset.

In [11]:
pokedata['#'].count()
Out[11]:
807

The number of Pokemon matches the data from Bulbapedia. Now let's see how the Pokemon are distributed across the gens.

In [12]:
# catplot is the current name for factorplot, and height replaces size
sns.catplot(
    x='GENERATION',
    data=pokedata,
    height=5,
    aspect=1.2,
    kind='count'
).set_axis_labels('Generation', '# of Pokemon')

plt.show()

The number of Pokemon doesn't seem to follow any noticeable trend, except that, up to gen 7, odd-numbered generations always have more Pokemon than even-numbered ones.
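A count plot shows the shape of the distribution, but sometimes you want the exact numbers behind the bars. Here is a sketch on a hypothetical GENERATION column (the values are made up, not the real counts); on the real data the same call would be applied to pokedata['GENERATION'].

```python
import pandas as pd

# Hypothetical generation labels for six Pokemon
sample = pd.DataFrame({'GENERATION': [1, 1, 1, 2, 3, 3]})

# Count rows per generation, then order by generation rather than by count
per_gen = sample['GENERATION'].value_counts().sort_index()

print(per_gen)
```

value_counts() sorts by frequency by default, so the sort_index() call is what puts the generations back in order.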

Legendary Pokemon count

In this post, we'll simplify the categorization and count mythical Pokemon and Ultra Beasts as legendary Pokemon. First, let's take a look at how rare they are, and then we can visualize the split between legendary and non-legendary Pokemon.

In [13]:
pokedata['LEGENDARY'].value_counts()
Out[13]:
False    749
True      58
Name: LEGENDARY, dtype: int64
In [14]:
fig = plt.figure(figsize=(7,7))

colours = ["aqua", "orange"]
pokeLeg = pokedata[pokedata['LEGENDARY']==True]
pokeNon = pokedata[pokedata['LEGENDARY']==False]

legDist = [pokeLeg['NAME'].count(),pokeNon['NAME'].count()]
legPie = plt.pie(legDist,
                 labels= ['Legendary', 'Non Legendary'], 
                 autopct ='%1.1f%%', 
                 shadow = True,
                 colors=colours,
                 startangle = 45,
                 explode=(0, 0.1))

So only 7.2% of the 807 Pokemon are legendary. Now let's see how they are distributed across the gens.
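The percentage on the pie chart can also be computed directly. Because LEGENDARY is a boolean column, its mean is the fraction of True values. The sketch below rebuilds a stand-in column from the counts printed by value_counts() above (58 True, 749 False) instead of loading the file.

```python
import pandas as pd

# Stand-in for pokedata['LEGENDARY'], built from the counts shown above
legendary = pd.Series([True] * 58 + [False] * 749)

# The mean of a boolean column is the share of True values
share = legendary.mean() * 100

print(f"{share:.1f}%")  # prints 7.2%
```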

In [15]:
colours = ["aqua", "orange"]
# catplot is the current name for factorplot, and height replaces size
g = sns.catplot(
    x='GENERATION',
    data=pokedata,
    kind='count',
    hue='LEGENDARY',
    palette=colours,
    height=5,
    aspect=1.5,
    legend=False,
    ).set_axis_labels('Generation', '# of Pokemon')

g.ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.1),  shadow=True, ncol=2, labels=['NON LEGENDARY','LEGENDARY'])
plt.show()

I initially thought that the number of legendary Pokemon would always correlate with the number of Pokemon in that gen, but it looks like that isn't the case. There doesn't seem to be any noticeable trend either.
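One way to check that impression with numbers is to put the total and the legendary count side by side for each gen. This is a sketch on a hypothetical eight-row sample (the values are invented for illustration), using a grouped aggregation.

```python
import pandas as pd

# Hypothetical sample: generation and legendary flag for eight Pokemon
sample = pd.DataFrame({
    'GENERATION': [1, 1, 1, 1, 2, 2, 3, 3],
    'LEGENDARY': [False, False, False, True, False, False, True, True],
})

# Count all rows and sum the True flags within each generation
summary = sample.groupby('GENERATION')['LEGENDARY'].agg(
    total='count',
    legendary='sum',
)
summary['share'] = summary['legendary'] / summary['total']

print(summary)
```

If the share column varies a lot from gen to gen, the legendary count is not simply following the size of the gen.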

Pokemon Type Distribution

There are 18 Pokemon types in total as of generation 7. Some Pokemon have only one type, while others have a secondary type. For example, Charmander is a Fire type, while Bulbasaur is both a Grass type and a Poison type.

First, let's take a look at all 18 types.

In [16]:
pokedata['TYPE 1'].unique()
Out[16]:
array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

Now let's see what the most common primary and secondary types are.

In [17]:
fig = plt.figure(figsize=(15,15))

fig.add_subplot(211)
pokedata['TYPE 1'].value_counts().plot(kind='pie', 
                                       autopct='%1.1f%%',
                                       pctdistance=0.9)

fig.add_subplot(212)
pokedata['TYPE 2'].value_counts().plot(kind='pie', 
                                       autopct='%1.1f%%',
                                       pctdistance=0.9)

plt.show()

We can already see which types are the most and least common, but a pie chart is not ideal when there are too many slices, so let's use a bar plot instead.

In [18]:
# catplot is the current name for factorplot, and height replaces size
sns.catplot(
    y='TYPE 1',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 1'].value_counts().index,
    height=4,
    aspect=1.5,
    color='green'
).set_axis_labels('# of Pokemon', 'Type 1')

sns.catplot(
    y='TYPE 2',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 2'].value_counts().index,
    height=4,
    aspect=1.5,
    color='purple'
).set_axis_labels('# of Pokemon', 'Type 2');

A lot of information can be derived from the charts above. Some of the interesting points are:

  • Almost half of all Pokemon don't have a secondary type.
  • While Flying is the most common secondary type, it is the least common primary type. It kind of makes sense if you think about it: when you see Moltres, the first thing that comes to mind is Fire rather than Flying. And when you see Dragonite, you'll always identify it as a Dragon-type creature rather than a Flying-type one.
  • It's no surprise that Water, Normal, and Grass are the most common primary types, but I didn't expect Psychic-type Pokemon to be that common.
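Comparisons like the Flying one are easier to make when the primary and secondary counts sit in the same table. Here is a sketch on a hypothetical five-row sample (the type assignments are invented for illustration); Pandas aligns the two value_counts() results on the type name.

```python
import pandas as pd

# Hypothetical sample where Flying is mostly a secondary type
sample = pd.DataFrame({
    'TYPE 1': ['Fire', 'Dragon', 'Normal', 'Water', 'Flying'],
    'TYPE 2': ['Flying', 'Flying', 'Flying', 'None', 'None'],
})

# Align both counts on the type name; a type missing on one side becomes 0
counts = pd.DataFrame({
    'primary': sample['TYPE 1'].value_counts(),
    'secondary': sample['TYPE 2'].value_counts(),
}).fillna(0).astype(int)

print(counts)
```

Sorting this table by either column, for example with sort_values('secondary', ascending=False), gives the same ranking as the corresponding bar plot.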

Pokemon type combinations

We've already seen the most and least common types of Pokemon. It will also be interesting to see all the type combinations. Note that we will not include Pokemon that don't have a secondary type.

In [19]:
plt.subplots(figsize=(10, 10))

sns.heatmap(
    pokedata[pokedata['TYPE 2']!='None'].groupby(['TYPE 1', 'TYPE 2']).size().unstack(),
    linewidths=1,
    annot=True,
    cmap="Blues"
)

plt.xticks(rotation=35)
plt.show()
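The table passed to the heatmap is built by a groupby followed by unstack(), which is easier to follow on a tiny example. This sketch uses a hypothetical five-row sample (made up for illustration): the filter drops Pokemon without a secondary type, size() counts each (primary, secondary) pair, and unstack() turns the secondary type into columns.

```python
import pandas as pd

# Hypothetical sample with two dual-type combinations
sample = pd.DataFrame({
    'TYPE 1': ['Grass', 'Grass', 'Fire', 'Fire', 'Water'],
    'TYPE 2': ['Poison', 'Poison', 'Flying', 'None', 'None'],
})

# Keep only Pokemon that actually have a secondary type
dual = sample[sample['TYPE 2'] != 'None']

# Rows are primary types, columns are secondary types, cells are counts
combos = dual.groupby(['TYPE 1', 'TYPE 2']).size().unstack()

# Same table, with 0 instead of NaN for combinations that never occur
combos_filled = dual.groupby(['TYPE 1', 'TYPE 2']).size().unstack(fill_value=0)

print(combos)
print(combos_filled)
```

Combinations that never occur come out as NaN in the first version, which is why some cells in the heatmap are left blank. Using fill_value=0 would draw those cells with a 0 instead; which one reads better is a matter of taste.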