Kaggle Titanic 1-2. Titanic Survival Prediction, EDA Features

2020. 12. 28. 18:46


This is the first notebook I uploaded to Kaggle. To learn the overall flow of a data analysis, I focused on applying what I had learned rather than on model accuracy.

Link: www.kaggle.com/choihanbin/titanic-survival-prediction-eda-ensemble

 

 

1.2 EDA Features

1.2.1 Sex with Survived feature

# Draw two count plots for a feature: overall counts (top) and counts split by `hue` (bottom)

import matplotlib.pyplot as plt
import seaborn as sns

def count_subplots(data, feature1, hue = 'Survived', ylim = None, xlim = None):
    f, ax = plt.subplots(2, figsize = (18, 15))
    # Top: overall count per category (seaborn 0.12+ expects x as a keyword argument)
    sns.countplot(x = feature1, data = data, ax = ax[0])
    ax[0].set_title('{} Count Plot'.format(feature1), size = 20)
    ax[0].set_xlabel(feature1, size = 15)
    ax[0].set_ylabel('Count', size = 15)
    ax[0].tick_params(labelsize = 15)

    # Bottom: the same counts split by hue
    sns.countplot(x = feature1, hue = hue, data = data, ax = ax[1])
    ax[1].set_title('{} Count Plot by {}'.format(feature1, hue), size = 20)
    ax[1].set_xlabel(feature1, size = 15)
    ax[1].set_ylabel('Count', size = 15)
    ax[1].tick_params(labelsize = 15)
    if hue == 'Survived':
        # Survived is coded 0/1, so the legend labels follow that order
        ax[1].legend(['Not Survived', 'Survived'], loc = 'upper right', prop = {'size' : 15})

    # Optional axis limits for the bottom plot
    if ylim is not None:
        ax[1].set_ylim(ylim)
    if xlim is not None:
        ax[1].set_xlim(xlim)

    plt.show()

count_subplots(train, 'Sex')

 

 

We can see that the survival rate of males is lower than that of females. There were nearly twice as many men as women on the Titanic, yet about twice as many women as men survived.

So we can see that Sex is an important feature for predicting survival.
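
The count plot shows raw counts, so it can help to put a number on the survival rate as well. The following is a minimal sketch, assuming the data has a 0/1 `Survived` column as in the Kaggle training set; the helper name `survival_rate_by` and the toy rows are made up for illustration and are not the real Titanic data.

```python
import pandas as pd

def survival_rate_by(data: pd.DataFrame, feature: str, target: str = "Survived") -> pd.DataFrame:
    """Return the passenger count and survival rate for each value of `feature`."""
    summary = data.groupby(feature)[target].agg(count="count", survival_rate="mean")
    return summary.sort_values("survival_rate", ascending=False)

# Toy rows for illustration only (NOT the real Kaggle training set)
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

On the real data the same call would be `survival_rate_by(train, 'Sex')`, and it works for any categorical column such as `Pclass` or `Embarked`.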

 

1.2.2 Pclass with Survived feature

count_subplots(train, 'Pclass')

 

We can see that 3rd class passengers make up about half of all passengers on the Titanic, and most of them did not survive. On the other hand, more than half of the 1st class passengers survived.

 

This shows that Pclass is also an important feature in this exercise.

1.2.3 SibSp and Parch with Survived feature

print('SibSp Count Plot', end = '\n{}\n\n'.format('-'*100))
count_subplots(train, 'SibSp')
print('Parch Count Plot', end = '\n{}\n\n'.format('-'*100))
count_subplots(train, 'Parch')

We can see that the plot of SibSp against Survived has a shape very similar to the plot of Parch against Survived. We can guess that combining the two features gives the number of family members on board (SibSp + Parch + 1), and that this may be an important feature.

So we will combine these two features.
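
The combination described above (SibSp + Parch + 1) can be sketched as follows. The column name `FamilySize` and the helper `add_family_size` are assumptions chosen for this illustration; the toy rows are not real passengers.

```python
import pandas as pd

def add_family_size(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `data` with a family-size column added.

    Family size = siblings/spouses + parents/children + the passenger themself.
    """
    out = data.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1  # hypothetical column name
    return out

# Toy rows for illustration only
toy = pd.DataFrame({"SibSp": [0, 1, 3], "Parch": [0, 2, 1]})
with_family = add_family_size(toy)
print(with_family)
```

Working on a copy leaves the original frame unchanged, which makes it easier to rerun the notebook cells in any order.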

1.2.4 Embarked and Survived features

count_subplots(train, 'Embarked')

We can see that passengers who boarded at Cherbourg (Embarked = C) had a better chance of surviving than passengers from the other ports.

So can we consider Embarked an important feature for predicting survival?

1.2.5 Embarked and Pclass feature

# Categorical plot helper for comparing features.
# sns.factorplot was renamed to sns.catplot and later removed, so catplot is used here.

def factor_plots(data, feature1, feature2 = None, col = None, hue = None, kind = 'point', ylim = None, xlim = None):
    g = sns.catplot(x = feature1, y = feature2, col = col, hue = hue, kind = kind, data = data)
    # Resize the whole figure (all facets) after it has been drawn
    fig = plt.gcf()
    fig.set_size_inches(13, 4)

    if ylim is not None:
        plt.ylim(ylim)
    if xlim is not None:
        plt.xlim(xlim)

    plt.show()

factor_plots(train, 'Embarked',  kind = 'count', hue = 'Survived', col = 'Pclass')

Now we can see why Embarked 'C' has a higher survival rate than the others: a large share of the passengers from Cherbourg were in 1st class.
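
The class mix per port can also be read off a normalized cross-tabulation rather than a plot. This is a minimal sketch using `pd.crosstab`; the helper name `class_mix_by_port` and the toy rows are assumptions for illustration, not the real data.

```python
import pandas as pd

def class_mix_by_port(data: pd.DataFrame) -> pd.DataFrame:
    """Share of each Pclass among the passengers of each port (each row sums to 1)."""
    return pd.crosstab(data["Embarked"], data["Pclass"], normalize="index")

# Toy rows for illustration only
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "S", "Q"],
    "Pclass":   [1,   1,   3,   3,   3,   2,   1,   3],
})

mix = class_mix_by_port(toy)
print(mix)
```

With `normalize="index"` each row is divided by its own total, so ports with very different passenger counts can be compared directly.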

 

1.2.6 Pclass, Sex and Embarked with Survived feature

factor_plots(train, 'Pclass', 'Survived', hue = 'Sex',  kind = 'point', col = 'Embarked')

We can see that, regardless of the port of embarkation (Embarked), 1st and 2nd class passengers have a higher survival rate than 3rd class passengers, except for men who boarded at Queenstown (Embarked = Q). We can also see once more that females have a higher survival rate than males.

This confirms once more that Sex and Pclass are important features for predicting survival.
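
The same comparison can be summarized as a table of survival rates by Sex and Pclass. The sketch below uses `DataFrame.pivot_table`; the helper name `survival_pivot` and the toy rows are assumptions for illustration only.

```python
import pandas as pd

def survival_pivot(data: pd.DataFrame, index: str = "Sex", columns: str = "Pclass",
                   target: str = "Survived") -> pd.DataFrame:
    """Mean of the 0/1 target (the survival rate) for each index/column combination."""
    return data.pivot_table(values=target, index=index, columns=columns, aggfunc="mean")

# Toy rows for illustration only (NOT the real Kaggle training set)
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Pclass":   [1,        3,        3,        1,      1,      3,      3,      3],
    "Survived": [1,        1,        0,        1,      0,      0,      0,      0],
})

pivot = survival_pivot(toy)
print(pivot)
```

Because `Survived` is coded 0/1, its mean within a cell is exactly the survival rate of that group.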

1.2.7 Age, Sex with Survived feature

factor_plots(train, 'Survived', 'Age', hue = 'Sex', kind = 'violin')

When we checked the correlation heatmap, we saw almost no correlation between Age and Survived. However, the violin plot suggests that the youngest and oldest passengers were more likely to survive than the others: the youngest passengers have a higher survival rate than middle-aged passengers, and middle-aged passengers have a lower survival rate than the oldest. Because the relationship is not monotonic, a linear correlation coefficient cannot capture it, which is why the heatmap showed no correlation.

 

So if we engineer the Age feature, we can use it to predict survival.
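
To illustrate why a non-monotonic relationship can hide from a correlation coefficient, and how binning can expose it, here is a small sketch. The toy values are deliberately constructed to be U-shaped and are not Titanic results; the bin edges, the labels, and the `AgeGroup` column name are assumptions, not necessarily what the notebook uses later.

```python
import pandas as pd

# Toy data built to be U-shaped: youngest and oldest survive, middle ages do not
toy = pd.DataFrame({
    "Age":      [2, 4, 30, 35, 40, 45, 70, 75],
    "Survived": [1, 1, 0,  0,  0,  0,  1,  1],
})

# The Pearson correlation is close to zero despite the clear pattern
corr = toy["Age"].corr(toy["Survived"])

# Binning Age into groups (hypothetical edges) makes the pattern visible
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 12, 60, 120],
                         labels=["child", "adult", "senior"])
rate_by_group = toy.groupby("AgeGroup", observed=True)["Survived"].mean()

print(round(corr, 3))
print(rate_by_group)
```

A binned Age column can then be treated as a categorical feature, so a model is not forced to assume that survival rises or falls steadily with age.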

 

1.2.8 Fare and Pclass with Survived feature

factor_plots(train, 'Pclass', 'Fare', hue = 'Survived', kind = 'bar')

We expected that a better passenger class (1st being the best) would go with a higher Fare. So we drew a plot with Pclass on the x-axis, Fare on the y-axis, and Survived as the hue. Our guess is roughly right.

1.2.9 Age and Fare distplot

Next, we check the numerical features for skewness.

f, ax = plt.subplots(2, figsize = (18, 8))
# sns.distplot is deprecated; kdeplot draws the same density curve as distplot(hist = False)
sns.kdeplot(x = train['Age'].dropna(), label = 'Age', ax = ax[0])
sns.kdeplot(x = train['Fare'].dropna(), label = 'Fare', ax = ax[1])
plt.show()

We can see that Fare is positively skewed. As we confirmed before, Fare and Pclass both have some relationship with Survived, so Fare is worth keeping. But just as we will engineer the Age feature, the skewed Fare feature needs engineering too.

 

So we will engineer the Fare feature as well.
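
One common way to reduce positive skew is a log transform. The sketch below measures skewness before and after `np.log1p`; this is only one possible option and not necessarily the engineering step the notebook applies later. The toy fares are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy fares for illustration only: mostly small values plus one very large one
toy_fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 512.3], name="Fare")

# log1p computes log(1 + x), so a fare of 0 stays well defined
log_fare = np.log1p(toy_fare)

skew_before = toy_fare.skew()
skew_after = log_fare.skew()

print(skew_before, skew_after)
```

Binning Fare into ranges is another option that gives a similar benefit while producing a categorical feature.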
