2020. 12. 20. 21:11
This is the first notebook I posted on Kaggle. To learn the overall flow of data analysis, I focused on applying what I had learned rather than on model accuracy.
Link: www.kaggle.com/choihanbin/titanic-survival-prediction-eda-ensemble
This notebook has three steps for solving the problem (predicting survival):
- Checking Features by EDA
- Feature engineering
- Modeling
This notebook was written to understand the overall data analysis process. First, we check the goal of this analysis: to predict whether a passenger survived the sinking of the Titanic or not. We have to keep this goal in mind while working through the process.
1. Checking Features by EDA
1.1 Load the data set and check the features roughly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# Load training and test set
train = pd.read_csv('./titanic/train.csv')
test = pd.read_csv('./titanic/test.csv')
train.head()
# Check the train data set
train.info()
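Alongside train.info(), it can help to count the missing values per column. The following is a minimal sketch, not part of the original notebook: the helper name missing_summary and the toy frame are hypothetical and used only for illustration.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and ratio for every column that has missing values."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only columns with at least one missing value, most incomplete first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame for illustration only (hypothetical values, not the real Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
print(missing_summary(toy))
```

On the real data this would be called as missing_summary(train), assuming train was loaded as above.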
Checking Features
Before we start analyzing the Titanic problem, we have to consider each feature in this data set. If we skip this step, the analysis will not go well, because we will handle features by combining or deleting them in the feature engineering stage.
Feature description
- PassengerId : Passenger ID
- Survived : Survival (0 : not survived, 1 : survived)
- Pclass : Ticket class (1 : 1st, 2 : 2nd, 3 : 3rd)
- Name : Passenger's name
- Sex : Passenger's Sex
- Age : Passenger's Age
- SibSp : Number of the passenger's siblings or spouses aboard the Titanic
- Parch : Number of the passenger's parents or children aboard the Titanic
- Ticket : Ticket number
- Fare : Passenger fare
- Cabin : Cabin number
- Embarked : Port of Embarkation (C : Cherbourg, Q : Queenstown, S : Southampton)
Now we can classify these features by description and type.
Input Feature
Categorical Feature : Name, Sex, Ticket, Cabin, Embarked
Ordinal Feature : Pclass
Numeric Feature : PassengerId, Age, SibSp, Parch, Fare
Target Feature
- Survived
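As a rough cross-check of the classification above, the columns can be grouped by their stored dtype. This is only a sketch under assumptions: the helper name split_by_dtype and the toy frame are hypothetical. Note that Pclass is stored as an integer, so it lands in the numeric group even though it is treated as an ordinal feature here.

```python
import pandas as pd


def split_by_dtype(df: pd.DataFrame, target: str = "Survived"):
    """Split input feature names into numeric and non-numeric lists, excluding the target."""
    features = df.drop(columns=[target], errors="ignore")
    numeric = features.select_dtypes(include="number").columns.tolist()
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    return numeric, non_numeric


# Toy frame for illustration only (hypothetical values, not the real Titanic data)
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.28],
})
numeric_cols, non_numeric_cols = split_by_dtype(toy)
print("numeric:", numeric_cols)
print("non-numeric:", non_numeric_cols)
```

The dtype only describes how a column is stored, so the final grouping still needs the feature descriptions.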
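# Correlation matrix of the numeric columns
# (string columns such as Name or Sex are excluded, since recent pandas versions raise an error on them)
sns.heatmap(train.select_dtypes(include = 'number').corr(), annot = True, cmap = 'RdYlGn', linewidths = 0.2)
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()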
Correlation Heatmap
The heatmap shows the correlations between the numeric features. Some of them are noticeable.
- Fare and Pclass are negatively correlated. This is probably because the higher the class (1st rather than 2nd or 3rd, i.e. a smaller Pclass value), the higher the fare.
- Survived and Pclass are negatively correlated. We will check this relation by making some plots.
- Age and Pclass are negatively correlated. Although Pclass is clearly negatively correlated with Survived, Age shows almost no correlation with Survived in this heatmap, so we will also check that later.
- SibSp and Parch are positively correlated. This is probably because there were many families on the Titanic, so we can consider handling these features together.
- Age is negatively correlated with both SibSp and Parch.
- Other than that, there are some noticeable correlations, such as between Fare and Survived and between Parch and Fare.
We will look at these by making some plots.
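Before plotting, the correlations with the target can also be listed in order of strength. The following is a minimal sketch, not the author's code: the helper name correlation_with_target and the toy frame are hypothetical, and the values are made up for illustration.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "Survived") -> pd.Series:
    """Return each numeric feature's Pearson correlation with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank high as well
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame for illustration only (hypothetical values, not the real Titanic data)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Pclass": [1, 3, 1, 3],
    "SibSp": [0, 1, 1, 0],
    "Sex": ["female", "male", "female", "male"],
})
print(correlation_with_target(toy))
```

On the real data this would be called as correlation_with_target(train). A correlation coefficient only captures linear relations, so the plots are still needed.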