Introduction to Pandas and Seaborn (Pokemon Dataset Part 1)
Posted on Tue 17 July 2018 in posts
In this post, we'll look at the Pokemon dataset available on Kaggle and have fun practicing data analysis and visualization with Pandas and Seaborn. We won't do any advanced analysis, just simple steps such as counting, filtering, and cleaning the data, plus some basic plotting techniques.
Note: The dataset only contains the Pokemon up to gen 6; the gen 7 dataset can be taken from here.
Let's start by importing all the important packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('bmh')
%matplotlib inline
plt.rcParams['figure.dpi'] = 100
Now we can load the dataset. Let's name the dataframe pokedata and look at the first and last 5 rows to get a general sense of the data.
pokedata = pd.read_csv("../input/Pokemon_all.csv")
pokedata.head()
pokedata.tail()
Cleaning the dataset
If we look carefully at the 10 rows shown above, we can see some problems in the dataset.
- Some Pokemon have NaN values (null values) in the column Type 2
- Some Pokemon have multiple forms and those forms are included in this dataset
- In gen 7, the Pokemon types don't start with a capital letter as in all the generations before it, so Pandas will count them as different types
We need to do some cleaning before the dataset is ready to use.
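Each of these problems can also be checked programmatically rather than by eye. The sketch below uses a small hypothetical frame (`toy`, with hand-written rows and only the columns needed here; the column names are assumed to match the dataset before any renaming), not the real CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    '#': [3, 3, 4, 722],
    'Name': ['Venusaur', 'VenusaurMega Venusaur', 'Charmander', 'Rowlet'],
    'Type 1': ['Grass', 'Grass', 'Fire', 'grass'],
    'Type 2': ['Poison', 'Poison', np.nan, 'flying'],
})

# 1. Rows with a missing secondary type
missing_type2 = toy['Type 2'].isna().sum()

# 2. Rows that repeat an earlier Pokedex number (alternate forms)
extra_forms = toy.duplicated('#').sum()

# 3. Distinct spellings of the primary type ('Grass' and 'grass' count separately)
type1_variants = toy['Type 1'].nunique()

print(missing_type2, extra_forms, type1_variants)
```

Running the same three checks on the real dataframe before and after cleaning is a quick way to confirm that each step did what it was supposed to do.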
I prefer the column names to be in capital letters, so I'm going to change them. It's just my preference though; you can leave them as they are if you want.
pokedata.columns = pokedata.columns.str.upper()
Now let's capitalize only the first letter of the Pokemon type
pokedata['TYPE 1'] = pokedata['TYPE 1'].str.capitalize()
pokedata['TYPE 2'] = pokedata['TYPE 2'].str.capitalize()
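To see why the capitalization matters, here is a minimal sketch on a hypothetical Series of type names (the values are made up for illustration and not taken from the dataset):

```python
import pandas as pd

# Hypothetical type labels with inconsistent capitalization and one missing value
types = pd.Series(['Grass', 'grass', 'FIRE', None])

# Before normalizing, 'Grass' and 'grass' are counted as two different types
before = types.nunique()

# str.capitalize() upper-cases the first letter, lower-cases the rest,
# and leaves missing values missing
after = types.str.capitalize().nunique()

print(before, after)
```

Note that `str.capitalize()` also lower-cases everything after the first letter, which is what we want for single-word type names.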
Let's remove the duplicate Pokemon
pokedata.drop_duplicates('#', keep='first', inplace=True)
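A small sketch of what `keep='first'` does, on a hypothetical frame where the alternate forms share the Pokedex number of the base form. It assumes, as the cleaning step above does, that the base form is listed before its other forms:

```python
import pandas as pd

# Hypothetical rows: base forms followed by their alternate forms
toy = pd.DataFrame({
    '#': [3, 3, 6, 6, 6],
    'NAME': ['Venusaur', 'VenusaurMega Venusaur',
             'Charizard', 'CharizardMega Charizard X', 'CharizardMega Charizard Y'],
})

# Keep only the first row for each Pokedex number
deduped = toy.drop_duplicates('#', keep='first')

print(deduped['NAME'].tolist())
```

Because the result depends on row order, it is worth checking that the rows are sorted with the base form first before relying on this.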
Some Pokemon don't have a secondary type, so they have NaN (null values) in the Type 2 column. Let's fill in the null values in the Type 2 column by replacing them with None
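# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the dataframe
pokedata['TYPE 2'] = pokedata['TYPE 2'].fillna('None')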
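One reason to fill the gaps with a label is that `value_counts()` skips missing values by default, so Pokemon without a secondary type would silently disappear from the counts. A minimal sketch on a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical secondary types, two of them missing
toy = pd.DataFrame({'TYPE 2': ['Poison', np.nan, 'Flying', np.nan]})

# Missing values are ignored here
counts_before = toy['TYPE 2'].value_counts()

# After filling, 'None' is an ordinary category that gets counted
toy['TYPE 2'] = toy['TYPE 2'].fillna('None')
counts_after = toy['TYPE 2'].value_counts()

print(counts_before.sum(), counts_after.sum())
```

The filled value is the string `'None'`, not Python's `None` object, so it behaves like any other type name in counts and plots.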
Now let's take a look at the first and last 5 rows of the dataset one more time
pokedata.head()
pokedata.tail()
The data is much cleaner and now it's ready to use. Now we can do some analysis and visualization.
pokedata['#'].count()
The number of Pokemon matches the data from Bulbapedia. Now let's see how the Pokemon are distributed across generations.
# catplot is the current name for factorplot (seaborn 0.9+); height replaces size
sns.catplot(
    x='GENERATION',
    data=pokedata,
    height=5,
    aspect=1.2,
    kind='count'
).set_axis_labels('Generation', '# of Pokemon')
plt.show()
The number of Pokemon per generation doesn't show a noticeable trend, except that up to gen 7, each odd-numbered generation has more Pokemon than the even-numbered generations next to it.
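The counts behind the plot can also be printed as a table, which makes it easier to compare neighboring generations. A minimal sketch on a hypothetical `GENERATION` column (the values are made up, not the real counts):

```python
import pandas as pd

# Hypothetical generation labels, one entry per Pokemon
toy = pd.DataFrame({'GENERATION': [1, 1, 1, 2, 2, 3, 3, 3, 3]})

# Number of Pokemon per generation, ordered by generation number
per_gen = toy['GENERATION'].value_counts().sort_index()

# Change relative to the previous generation (NaN for the first one)
change = per_gen.diff()

print(per_gen)
print(change)
```

An alternating sign in `change` is the numeric version of the up-and-down pattern visible in the bar chart.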
Legendary Pokemon count
From Bulbapedia: Legendary Pokémon are a group of incredibly rare and often very powerful Pokémon, generally featured prominently in the legends and myths of the Pokémon world.
In this post, we'll simplify the categorization and count the mythical Pokemon and the ultra beasts as legendary Pokemon. First, let's take a look at how rare they are, and then we can visualize the split between legendary and non-legendary Pokemon.
pokedata['LEGENDARY'].value_counts()
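Besides the raw counts, the share of legendary Pokemon can be read directly from the column, because the mean of a boolean column is the fraction of `True` values. A minimal sketch, assuming `LEGENDARY` holds booleans and using a hypothetical 8-row frame:

```python
import pandas as pd

# Hypothetical flags: one legendary Pokemon out of eight
toy = pd.DataFrame({'LEGENDARY': [True] + [False] * 7})

# Mean of booleans = fraction of True values; scale to a percentage
legendary_share = toy['LEGENDARY'].mean() * 100

print(f"{legendary_share:.1f}% legendary")
```

`value_counts(normalize=True)` gives the same information for both categories at once.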
fig = plt.figure(figsize=(7, 7))
colours = ["aqua", "orange"]
pokeLeg = pokedata[pokedata['LEGENDARY'] == True]
pokeNon = pokedata[pokedata['LEGENDARY'] == False]
legDist = [pokeLeg['NAME'].count(), pokeNon['NAME'].count()]
legPie = plt.pie(
    legDist,
    labels=['Legendary', 'Non Legendary'],
    autopct='%1.1f%%',
    shadow=True,
    colors=colours,
    startangle=45,
    explode=(0, 0.1)
)
So only 7.2% of the 807 Pokemon are legendary. Now let's see how they are distributed in each gen.
colours = ["aqua", "orange"]
# catplot is the current name for factorplot (seaborn 0.9+); height replaces size
g = sns.catplot(
    x='GENERATION',
    data=pokedata,
    kind='count',
    hue='LEGENDARY',
    palette=colours,
    height=5,
    aspect=1.5,
    legend=False,
).set_axis_labels('Generation', '# of Pokemon')
g.ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.1), shadow=True, ncol=2, labels=['NON LEGENDARY', 'LEGENDARY'])
plt.show()
I initially thought that the number of legendary Pokemon would always correlate with the number of Pokemon in that gen, but it looks like that isn't the case. There doesn't seem to be any noticeable trend either.
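The impression from the chart can be checked with numbers by putting the legendary count next to the total count for each generation. The sketch below does this on a hypothetical frame (made-up rows, `LEGENDARY` assumed to be boolean); with only a handful of generations, the correlation is a rough indication at best:

```python
import pandas as pd

# Hypothetical data: generation and legendary flag per Pokemon
toy = pd.DataFrame({
    'GENERATION': [1, 1, 1, 2, 2, 3],
    'LEGENDARY': [False, False, True, False, False, True],
})

# Per generation: number of legendary Pokemon and total number of Pokemon
summary = toy.groupby('GENERATION')['LEGENDARY'].agg(legendary='sum', total='count')

# Pearson correlation between the two columns
corr = summary['legendary'].corr(summary['total'])

print(summary)
print(corr)
```

A value close to 1 would support the initial guess; a value near 0 would match what the chart suggests.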
Pokemon Type Distribution
There are 18 types of Pokemon in total as of generation 7. Some Pokemon have only one type, while others also have a secondary type. For example, Charmander is a Fire type, while Bulbasaur is both a Grass type and a Poison type.
First, let's take a look at all 18 types
pokedata['TYPE 1'].unique()
Now let's see which primary and secondary types are the most common.
fig = plt.figure(figsize=(15, 15))
fig.add_subplot(211)
pokedata['TYPE 1'].value_counts().plot(
    kind='pie',
    autopct='%1.1f%%',
    pctdistance=0.9
)
fig.add_subplot(212)
pokedata['TYPE 2'].value_counts().plot(
    kind='pie',
    autopct='%1.1f%%',
    pctdistance=0.9
)
plt.show()
We can already see which types are the most and least common, but a pie chart is not ideal when there are too many slices, so let's use bar plots instead.
# catplot is the current name for factorplot (seaborn 0.9+); height replaces size
sns.catplot(
    y='TYPE 1',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 1'].value_counts().index,
    height=4,
    aspect=1.5,
    color='green'
).set_axis_labels('# of Pokemon', 'Type 1')
sns.catplot(
    y='TYPE 2',
    data=pokedata,
    kind='count',
    order=pokedata['TYPE 2'].value_counts().index,
    height=4,
    aspect=1.5,
    color='purple'
).set_axis_labels('# of Pokemon', 'Type 2');
There is a lot of information that can be derived from the charts above. Some of the interesting things are:
- Almost half of all Pokemon don't have a secondary type.
- While Flying is the most common secondary type, it is the least common primary type. It kind of makes sense if you think about it: when you see Moltres, the first thing that comes to mind is Fire rather than Flying. Or when you see Dragonite, you'll always identify him as a Dragon-type creature rather than a Flying-type creature.
- Water, Normal, and Grass being the most common primary types is to be expected, but I didn't expect Psychic-type Pokemon to be that common.
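Observations like these can be double-checked without reading values off a chart. The sketch below shows the idea on a hypothetical six-row frame (hand-picked rows, with missing secondary types already filled with the string `'None'` as in the cleaning step):

```python
import pandas as pd

# Hypothetical sample of Pokemon with their primary and secondary types
toy = pd.DataFrame({
    'NAME': ['Moltres', 'Dragonite', 'Pidgey', 'Squirtle', 'Tornadus', 'Charmander'],
    'TYPE 1': ['Fire', 'Dragon', 'Normal', 'Water', 'Flying', 'Fire'],
    'TYPE 2': ['Flying', 'Flying', 'Flying', 'None', 'None', 'None'],
})

# Fraction of Pokemon without a secondary type
no_secondary_share = (toy['TYPE 2'] == 'None').mean()

# How often Flying appears as a primary vs. a secondary type
flying_primary = (toy['TYPE 1'] == 'Flying').sum()
flying_secondary = (toy['TYPE 2'] == 'Flying').sum()

print(no_secondary_share, flying_primary, flying_secondary)
```

Applied to the full dataframe, the same three expressions give the exact numbers behind the first two bullet points.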
Pokemon type combinations
We've already seen the most and least common types of Pokemon. It will also be interesting to see all the type combinations. Note that we will not include Pokemon that don't have a secondary type.
plt.subplots(figsize=(10, 10))
sns.heatmap(
    pokedata[pokedata['TYPE 2'] != 'None'].groupby(['TYPE 1', 'TYPE 2']).size().unstack(),
    linewidths=1,
    annot=True,
    cmap="Blues"
)
plt.xticks(rotation=35)
plt.show()
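The table passed to the heatmap comes from the `groupby(...).size().unstack()` chain, which may be easier to follow on a tiny example. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical sample; 'None' marks Pokemon without a secondary type
toy = pd.DataFrame({
    'NAME': ['Moltres', 'Charizard', 'Bulbasaur', 'Squirtle'],
    'TYPE 1': ['Fire', 'Fire', 'Grass', 'Water'],
    'TYPE 2': ['Flying', 'Flying', 'Poison', 'None'],
})

# Keep only dual-type Pokemon
dual = toy[toy['TYPE 2'] != 'None']

# size() counts rows per (TYPE 1, TYPE 2) pair;
# unstack() moves TYPE 2 into the columns, giving a TYPE 1 x TYPE 2 grid
combos = dual.groupby(['TYPE 1', 'TYPE 2']).size().unstack()

print(combos)
```

Combinations that never occur show up as NaN, which is why some cells in the heatmap are left blank. Passing `fill_value=0` to `unstack()` would show them as zeros instead.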