First, load the CSV file from Kaggle.

import pandas as pd
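The loading call itself is not shown here, so the following is only a minimal sketch. The helper name `load_titanic` and the file name `train.csv` are assumptions; point it at wherever the Kaggle download was saved.

```python
import pandas as pd

def load_titanic(source):
    """Read the Titanic training data from a file path or file-like object."""
    return pd.read_csv(source)

# Hypothetical path, adjust to your own download location:
# train = load_titanic('train.csv')
```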

To find out whether there are NaN values in the DataFrame, I use df.isnull().sum(), which counts the missing values in each column.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The "Cabin" column has 687 NaN values, far more than any other column, which could distort the prediction.

So I dropped the column from the DataFrame.

train.drop('Cabin', axis=1, inplace=True)
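To make "too many NaN values" more concrete, it can help to look at the share of missing values per column rather than the raw counts, ideally before dropping anything. This is a small sketch; `missing_ratio` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

def missing_ratio(df):
    """Return the fraction of missing values per column, largest first."""
    # isnull() gives a boolean frame; the mean of booleans is the NaN share.
    return df.isnull().mean().sort_values(ascending=False)
```

Calling `missing_ratio(train)` on the freshly loaded data would list the columns from most to least incomplete, which makes it easier to decide which ones to drop and which ones to fill.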

I also have to fill in the missing values in the 'Age' and 'Embarked' columns. I'll come back to that later.
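For reference, one common way to fill these two columns is the median for 'Age' and the most frequent value for 'Embarked'. This is only a sketch of that option, not necessarily the approach chosen later; `fill_missing` is a hypothetical helper name.

```python
import pandas as pd

def fill_missing(df):
    """Return a copy with 'Age' and 'Embarked' NaNs filled (assumed strategy)."""
    out = df.copy()
    # Assumption: the median age is an acceptable stand-in for a missing age.
    out['Age'] = out['Age'].fillna(out['Age'].median())
    # Assumption: a missing port is replaced by the most frequent one.
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    return out
```

Working on a copy keeps the original DataFrame untouched, so other fill strategies can still be compared afterwards.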

Since I wanted to find any relationship between the features and the survival rate, I made a function that displays the survival rate for each value of a given feature.

import matplotlib.pyplot as plt
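The definition of bar_survival does not appear in this post, so here is a minimal sketch of what such a function might look like. It assumes the survival rate is the mean of the 'Survived' column per group; the helper `survival_rate` and the optional `data` argument are additions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

def survival_rate(feature, data):
    """Mean of 'Survived' for each distinct value of `feature`."""
    return data.groupby(feature)['Survived'].mean()

def bar_survival(feature, data=None):
    """Draw a bar chart of the survival rate per value of `feature`."""
    if data is None:
        data = train  # assumes the global DataFrame loaded earlier
    rates = survival_rate(feature, data)
    ax = rates.plot(kind='bar')
    ax.set_xlabel(feature)
    ax.set_ylabel('Survival rate')
    ax.set_title('Survival rate by ' + feature)
    return ax
```

With this sketch, `bar_survival('Pclass')` uses the global `train`, while `bar_survival('Pclass', some_df)` plots any other DataFrame that has a 'Survived' column.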

bar_survival('Pclass')

It seems that passengers with 'Pclass' 3 have a low survival rate compared to the other classes.

bar_survival('Sex')

According to this plot, male passengers were far more likely to die.

bar_survival('Parch')

There isn't any dramatic difference across the values of this feature.

bar_survival('SibSp')

This plot doesn't give much insight either.

bar_survival('Age')

It seems that very young children and very old people were more likely to survive. I should use this feature in my prediction.
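Plotting every distinct age produces many thin bars, so one option is to group ages into bands before computing the survival rate. The sketch below uses `pd.cut`; the band edges, the labels, and the name `survival_by_age_band` are all assumptions chosen for illustration.

```python
import pandas as pd

# Assumed age bands; the edges and labels are illustrative only.
AGE_BINS = [0, 12, 60, 120]
AGE_LABELS = ['child', 'adult', 'senior']

def survival_by_age_band(data):
    """Mean of 'Survived' per age band; rows with a missing Age are ignored."""
    # pd.cut uses right-closed intervals by default: (0, 12], (12, 60], (60, 120].
    bands = pd.cut(data['Age'], bins=AGE_BINS, labels=AGE_LABELS)
    return data.groupby(bands, observed=True)['Survived'].mean()
```

The result is a small Series indexed by band, which can be plotted with `.plot(kind='bar')` in the same way as the other features.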

