Kaggle Titanic 1-1. Titanic Survival Prediction, Checking Features

2020. 12. 20. 21:11 Kaggle/Kaggle competition


This is the first notebook I uploaded to Kaggle. To learn the overall flow of data analysis, I focused on applying what I had learned rather than on model accuracy.

Link: www.kaggle.com/choihanbin/titanic-survival-prediction-eda-ensemble

 

Titanic Survival Prediction(EDA, Ensemble)


 

 

This notebook has three steps for solving the problem (predicting survival):

  1. Checking Features by EDA
  2. Feature engineering
  3. Modeling

This notebook was written to understand the overall data analysis process. First, we check the goal of the analysis: to predict whether a passenger survived the sinking of the Titanic. We have to keep this goal in mind throughout the process.

1. Checking Features by EDA

1.1 Load the data set and check the features roughly.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
# Load the training and test sets
train = pd.read_csv('./titanic/train.csv')
test = pd.read_csv('./titanic/test.csv')

train.head()

# Check the train data set
train.info()
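
train.info() reports the non-null count for each column, so missing values have to be read off by comparing against the row count. The sketch below is one possible way to make that view explicit. The helper name `missing_summary` and the small `demo` frame are assumptions made for illustration; the demo rows are made up and are not real Titanic records.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows, used only to show the shape of the result
demo = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
print(missing_summary(demo))
```

With the real data, the same helper could be called as `missing_summary(train)` and `missing_summary(test)`.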

Checking Features

Before we start analyzing the Titanic problem, we have to consider each feature in the data set. Without this step the analysis will not go well, because in the feature engineering stage we will combine and delete features.

Feature description

  • PassengerId : Passenger ID
  • Survived : Survival (0 : not survived, 1 : survived)
  • Pclass : Ticket class (1 : 1st, 2 : 2nd, 3 : 3rd)
  • Name : Passenger's name
  • Sex : Passenger's sex
  • Age : Passenger's age
  • SibSp : Number of the passenger's siblings or spouses aboard the Titanic
  • Parch : Number of the passenger's parents or children aboard the Titanic
  • Ticket : Ticket number
  • Fare : Passenger fare
  • Cabin : Cabin number
  • Embarked : Port of Embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)

Now we can classify these features by description and type.

Input Feature

Categorical Feature : Name, Sex, Ticket, Cabin, Embarked

Ordinal Feature : Pclass

Numeric Feature : PassengerId, Age, SibSp, Parch, Fare

Target Feature

  • Survived
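
As a rough cross-check of the grouping above, the columns can also be split by their pandas dtype. This is only a sketch: the helper name `split_by_dtype` and the `demo` frame are assumptions for illustration, and the demo rows are made up. A dtype split is a first pass only. Pclass, for example, is stored as an integer and so lands in the numeric group even though it is treated as ordinal here.

```python
import pandas as pd


def split_by_dtype(df: pd.DataFrame, target: str = "Survived"):
    """Split the input features into numeric and non-numeric column names."""
    # Drop the target if present so that only input features are classified
    features = df.drop(columns=[target], errors="ignore")
    numeric = features.select_dtypes(include="number").columns.tolist()
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    return numeric, non_numeric


# Made-up rows, used only to show how the split behaves
demo = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Embarked": ["S", "C"],
})
numeric_cols, non_numeric_cols = split_by_dtype(demo)
print("numeric:", numeric_cols)
print("non-numeric:", non_numeric_cols)
```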

 

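# Correlation matrix (numeric columns only; recent pandas versions raise an error on text columns otherwise)
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap='RdYlGn', linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()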

Correlation Heatmap

The heatmap shows the pairwise correlations between the numeric features. Several of them stand out:

  • Fare and Pclass have a negative correlation. This is expected: a higher class is coded with a smaller number (1 is above 2 and 3), and higher-class tickets cost more.
  • Survived and Pclass have a negative correlation. We will check this relationship with plots.
  • Age and Pclass have a negative correlation. Although Pclass is clearly negatively correlated with Survived, Age shows almost no correlation with Survived in this heatmap, so we will check that later as well.
  • SibSp and Parch have a positive correlation, probably because many families were aboard the Titanic. These two features can be handled together in feature engineering.
  • Age has a negative correlation with both SibSp and Parch.
  • Other pairs also show noticeable correlations, such as Fare and Survived, and Parch and Fare.
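
The observations above are read off the heatmap by eye. As a complement, the correlations with the target can be ranked by absolute value. This is a minimal sketch: the helper name `correlation_with_target` and the `demo` frame are assumptions for illustration, and the demo values are made up, so they do not reproduce the numbers in the heatmap.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "Survived") -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    ordered by absolute value (strongest first)."""
    corr = df.corr(numeric_only=True)
    # Drop the target's correlation with itself, then sort by magnitude
    return corr[target].drop(target).sort_values(
        key=lambda s: s.abs(), ascending=False
    )


# Made-up rows, used only to show how the ranking behaves
demo = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Pclass": [3, 3, 1, 1],
    "Fare": [1.0, 2.0, 3.0, 4.0],
    "Sex": ["male", "male", "female", "female"],
})
ranked = correlation_with_target(demo)
print(ranked)
```

With the real data, `correlation_with_target(train)` would list the same values as the Survived row of the heatmap, in sorted order.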

We will examine these relationships by making some plots.
