
Titanic Project (1) - Data Analysis

by tryotto 2020. 3. 12.

First, load the CSV file from Kaggle.

In[1]:
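
The code of this cell did not survive in the post. A minimal sketch of the loading step, assuming pandas is used; the helper name `load_titanic` and the file name `train.csv` are assumptions, and a tiny made-up in-memory CSV stands in for the real file so that the snippet runs on its own:

```python
import io

import pandas as pd


def load_titanic(source):
    """Read Titanic-style CSV data from a path or a file-like object."""
    return pd.read_csv(source)


# Made-up stand-in rows. With the real data this would be something like
# load_titanic("train.csv"), where the file name is an assumption.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
    "3,1,3,female,\n"
)
df = load_titanic(sample_csv)
print(df.shape)  # (rows, columns)
```

An empty field in the CSV, such as the last `Age` above, is read by pandas as NaN.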

To find out whether the DataFrame contains any NaN values, I use df.isnull().sum(), which counts the missing values in each column.

In[2]:
Out[2]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
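
How this count comes about can be seen on a small made-up frame (the values below are illustrative, not taken from the Kaggle data): `isnull()` marks every missing cell as True, and `sum()` adds those marks up per column.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, only to show the mechanics.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# True counts as 1 when summed, so this gives missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The mean of the same boolean mask gives the share of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

Looking at the share as well as the raw count can help when deciding whether a column is still usable.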

The "Cabin" column has too many NaN values (687), which could distort the prediction.

Thus, I dropped the column from the DataFrame.

In[3]:
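
The original code for this cell is not shown. One way to drop a column in pandas, sketched here on a made-up toy frame, is `DataFrame.drop` with the `columns` argument:

```python
import numpy as np
import pandas as pd

# Made-up toy rows standing in for the real data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Cabin": [np.nan, "C85", np.nan],
})

# drop() returns a new DataFrame; assign it back to keep the result.
toy = toy.drop(columns=["Cabin"])
print(toy.columns.tolist())
```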

Moreover, the 'Age' and 'Embarked' columns also contain NaN values (177 and 2, respectively) that I have to fill in with some values. Let's think about that later.

Since I wanted to find relationships between individual features and the survival rate, I wrote a function that displays the survival rate for each value of a given feature.

In[4]:
In[5]:
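
The function itself is missing from the post. A minimal sketch of what such a helper could look like, assuming the `Survived` column is coded 0/1 so that its mean per group is the survival rate; the name `survival_rate_by` and the toy rows are hypothetical:

```python
import pandas as pd


def survival_rate_by(df, feature):
    """Return the mean of 'Survived' (the survival rate) per value of `feature`."""
    return df.groupby(feature)["Survived"].mean()


# Made-up toy rows, not the Kaggle data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0],
})

by_class = survival_rate_by(toy, "Pclass")
by_sex = survival_rate_by(toy, "Sex")
print(by_class)
print(by_sex)
```

Calling `.plot.bar()` on the returned Series would draw a bar chart of the rates, provided matplotlib is installed.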

It seems that passengers with 'Pclass' 3 have a low survival rate compared to the other classes.

In[6]:

According to this plot, male passengers seem far more likely to die.

In[7]:

There is no dramatic difference in survival rate across the values of this feature.

In[8]:

This plot also gives no insight.

In[9]:

It seems that very young children and very old people are more likely to survive. I should use this feature in my prediction.
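
Because 'Age' is continuous, a per-value survival rate is hard to read. One way to check the pattern described above is to bin ages first and compare the rate per bin. This is only a sketch: the bin edges, the labels, and the toy rows are arbitrary assumptions, and the original plot may have been produced differently.

```python
import numpy as np
import pandas as pd

# Made-up toy rows, not the Kaggle data.
toy = pd.DataFrame({
    "Age": [2.0, 4.0, 25.0, 30.0, 45.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Arbitrary bin edges chosen for illustration: (0, 12], (12, 40], (40, 100].
bins = [0, 12, 40, 100]
labels = ["child", "adult", "older"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows with a missing Age get no group and are left out of the result.
rate_by_age = toy.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_age)
```

Binning also yields a categorical feature that can be used directly once the missing ages have been handled.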
