First, load the train and test CSV files from Kaggle.
import pandas as pd
_train = pd.read_csv('../input/titanic/train.csv')
_test = pd.read_csv('../input/titanic/test.csv')
train = _train.copy()
test = _test.copy()
To find out whether there are NaN values in the DataFrame, I use df.isnull().sum(), which counts the missing values in each column.
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
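The raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch of that idea. It uses a small made-up DataFrame called `toy` in place of `train`, so the values are invented for illustration and are not Titanic results.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# Number of missing values and share of rows missing, per column.
# isnull().mean() works because True counts as 1 and False as 0.
missing = pd.DataFrame({
    'count': toy.isnull().sum(),
    'percent': (toy.isnull().mean() * 100).round(1),
})
print(missing)
```

Running the same two expressions on `train` should show how large a share of each column is missing, which helps decide between dropping a column and filling it.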
The 'Cabin' column has too many NaN values (687), which could distort the prediction.
So I dropped the column from the DataFrame.
train.drop('Cabin', axis=1, inplace=True)
I also have to fill in the missing values of the 'Age' column (177) and the 'Embarked' column (2). Let's think about that later.
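One common way to fill such gaps is the median for a numeric column and the most frequent value for a categorical one. The sketch below shows that option on a made-up DataFrame `toy`. Both the data and the choice of median and mode are assumptions for illustration, not a decision made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

filled = toy.copy()
# Assumption: the median is a reasonable replacement for a missing age.
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
# Assumption: the most frequent port is a reasonable replacement for 'Embarked'.
# mode() returns a Series (there can be ties), so take its first entry.
filled['Embarked'] = filled['Embarked'].fillna(filled['Embarked'].mode()[0])
print(filled)
```

If this approach is used, the replacement values should be computed from the training data and then reused for the test data, so that both are filled the same way.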
Since I wanted to find out whether any feature is related to the survival rate, I wrote a function that plots the survival rate (in percent) for each value of a given feature.
import matplotlib.pyplot as plt
import seaborn as sns
def bar_survival(feature):
    # Count survivors and deaths for each value of the feature
    survived = train[train['Survived'] == 1][feature].value_counts().rename('Survived')
    death = train[train['Survived'] == 0][feature].value_counts().rename('Death')
    # A value that appears in only one group would be NaN here, so fill with 0
    data = pd.DataFrame([survived, death]).transpose().fillna(0)
    # Survival rate in percent for each value of the feature
    total = data['Survived'] + data['Death']
    data[feature] = (data['Survived'] / total * 100).round(0)
    # Keep only the percentage column and sort the bars by feature value
    data_per = data.drop(['Survived', 'Death'], axis=1).sort_index()
    data_per.plot(kind='bar', figsize=(5, 5))
bar_survival('Pclass')
It seems that passengers in 'Pclass' 3 have a lower survival rate than those in the other classes.
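The same percentages can also be computed without a plot. Because 'Survived' holds only 0 and 1, its mean within a group is the survival rate of that group. This is a minimal sketch on a made-up DataFrame `toy`. The numbers are invented and do not reflect the real Titanic data.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate = (toy.groupby('Pclass')['Survived'].mean() * 100).round(0)
print(rate)
```

Replacing `toy` with `train` and 'Pclass' with another column name should give the numbers behind each bar chart, which is handy for checking what the plot suggests.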
bar_survival('Sex')
According to this plot, males are much less likely to survive than females.
bar_survival('Parch')
There isn't any dramatic difference in this feature.
bar_survival('SibSp')
This plot also gives no clear insight.
bar_survival('Age')
It seems that very young children and very old people are more likely to survive. I should use this feature in my prediction.
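Since 'Age' has many distinct values, a bar per age is hard to read, and grouping ages into ranges may make the pattern clearer. The sketch below uses `pd.cut` for that. The DataFrame `toy`, the bin edges, and the labels are all hypothetical choices for illustration, not values taken from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    'Age':      [2, 4, 25, 30, 40, 45, 70, 80],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

# Hypothetical bin edges; intervals are right-closed: (0, 12], (12, 60], (60, 100]
bins = [0, 12, 60, 100]
labels = ['child', 'adult', 'senior']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# Survival rate in percent per age group
rate = (toy.groupby('AgeGroup', observed=True)['Survived'].mean() * 100).round(0)
print(rate)
```

Rows with a missing age get a missing 'AgeGroup' and are left out of the grouped result, so the missing ages would still need to be handled before using this as a model feature.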