First, load the CSV files from Kaggle.

import pandas as pd

_train = pd.read_csv('../input/titanic/train.csv')
_test = pd.read_csv('../input/titanic/test.csv')
train = _train.copy()
test = _test.copy()

To find out whether there are NaN values in the DataFrame, I use df.isnull().sum(), which counts the missing values in each column.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

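Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea, using a small made-up DataFrame (`toy`, an assumption for illustration, not the real Titanic data) so it runs on its own:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# isnull() gives a boolean frame; the mean of a boolean column
# is the fraction of rows that are missing in that column.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real data, the same expression with `train` in place of `toy` should list the columns from most to least incomplete.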
The "Cabin" column has too many NaN values (687), which could hurt the prediction.
So I dropped the column from the DataFrame.

train.drop('Cabin', axis=1, inplace=True)

I also have to fill in the missing values of the 'Age' and 'Embarked' columns with some values. Let's think about that later.
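As a placeholder until then, one common option (an assumption here, not necessarily what this post ends up using) is to fill a numeric column with its median and a categorical column with its most frequent value. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

filled = toy.copy()
# Numeric column: replace NaN with the median of the observed ages.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical column: replace NaN with the most frequent value (the mode).
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled.isnull().sum())
```

The median is less sensitive to a few very old or very young passengers than the mean, which is why it is often preferred for a skewed column such as age.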
Since I wanted to find relationships between individual features and the survival rate, I wrote a function that plots the survival rate for each value of a given feature.

import matplotlib.pyplot as plt
import seaborn as sns

def bar_survival(feature):
    # Count survivors and non-survivors for each value of the feature
    _survived = train[train['Survived']==1][feature].value_counts()
    _death = train[train['Survived']==0][feature].value_counts()
    survived = _survived.rename("Survived")
    death = _death.rename("Death")
    # A value seen in only one group becomes NaN after combining, so fill with 0
    data = pd.DataFrame([survived, death]).transpose().fillna(0)
    # Survival percentage per feature value
    Percentage = []
    for i in range(data.shape[0]):
        tmp_sum = data.iloc[i].sum()
        tmp_sur = data.iloc[i, 0]
        tmp_per = round(tmp_sur/tmp_sum*100, 0)
        Percentage.append(tmp_per)
    data.loc[:, feature] = pd.Series(Percentage, index=data.index)
    data_per = data.drop(['Survived', 'Death'], axis=1)
    data_per.plot(kind='bar', stacked=True, figsize=(5,5))

bar_survival('Pclass')

It seems that passengers with 'Pclass' 3 have a low survival rate compared to the other classes.
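The percentage of survivors for each value of a feature can also be computed with a groupby, because 'Survived' is coded as 0 or 1 and its mean is the survival rate. Here is a minimal sketch on made-up data; the helper name `survival_rate` is hypothetical and only for illustration:

```python
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

def survival_rate(df, feature):
    """Return the survival percentage for each value of `feature`."""
    # The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
    return (df.groupby(feature)["Survived"].mean() * 100).round(0)

rates = survival_rate(toy, "Pclass")
print(rates)
```

Calling `survival_rate(train, 'Pclass').plot(kind='bar')` on the real data should give a chart comparable to the one from bar_survival, with the bars ordered by feature value.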

bar_survival('Sex')

According to this plot, male passengers were much more likely to die than female passengers.

bar_survival('Parch')

There isn't any dramatic difference across the values of this feature.

bar_survival('SibSp')

This plot also gives no clear insight.

bar_survival('Age')

It seems that very young children and very old people are more likely to survive. I should use this feature in my prediction.
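Because 'Age' takes many distinct values, a bar per exact age is hard to read. One way to make the pattern clearer is to group ages into bins first. Here is a minimal sketch on made-up data; the bin edges and labels are assumptions chosen for illustration, not values from this analysis:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age":      [2, 4, 15, 25, 30, 45, 62, 70, np.nan],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Hypothetical bin edges; each bin is open on the left and closed on the right.
bins = [0, 12, 18, 40, 60, 100]
labels = ["child", "teen", "adult", "middle", "senior"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows with a missing Age get no group and are left out of the groupby.
rate_by_group = toy.groupby("AgeGroup", observed=False)["Survived"].mean() * 100
print(rate_by_group)
```

With a binned column like this, a bar chart has one bar per age group instead of one per age.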