
Train Test Split

import pandas as pd
df = pd.read_csv("train.csv")
df.sample(5)
     PassengerId  Survived  Pclass                       Name     Sex   Age  SibSp  Parch              Ticket    Fare  Cabin  Embarked
363          364         0       3            Asim, Mr. Adola    male  35.0      0      0  SOTON/O.Q. 3101310   7.050    NaN         S
508          509         0       3   Olsen, Mr. Henry Margido    male  28.0      0      0              C 4001  22.525    NaN         S
723          724         0       2    Hodges, Mr. Henry Price    male  50.0      0      0              250643  13.000    NaN         S
11            12         1       1   Bonnell, Miss. Elizabeth  female  58.0      0      0              113783  26.550   C103         S
183          184         1       2  Becker, Master. Richard F    male   1.0      2      1              230136  39.000     F4         S
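# Pairwise correlation of the numeric columns only; on pandas 2.x the text
# columns (Name, Sex, Ticket, ...) raise an error without numeric_only=True.
corr = df.corr(numeric_only=True)
corr.Survived.sort_values(ascending=False)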
Survived       1.000000
Fare           0.257307
Parch          0.081629
PassengerId   -0.005007
SibSp         -0.035322
Age           -0.077221
Pclass        -0.338481
Name: Survived, dtype: float64
y = df['Survived']
x = df.drop(['Survived', 'PassengerId'], axis=1)  # features: drop the target and the ID column

Split

train_test_split from sklearn.model_selection

  • train_test_split(X, y, test_size, shuffle…) (press Shift+Tab to see more details about the function)
    • test_size: the fraction of the full dataset to set aside as test data (for example, with test_size = 0.3, 30% of the data is assigned to x_test and y_test, and the rest to x_train and y_train)
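
The effect of test_size can be checked on a small made-up dataset. The following is a minimal sketch, not the Titanic data: the names toy_x and toy_y and the 100-row frame are assumptions introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature column and a binary target.
toy_x = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series(np.arange(100) % 2, name="target")

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.3
)

# With test_size=0.3, 30 of the 100 rows go to the test split and 70 to train.
print(len(toy_x_train), len(toy_x_test))

# Features and targets are split together, so their row indices still match.
print(toy_x_test.index.equals(toy_y_test.index))
```

Which rows land in each split changes from run to run, but the 70/30 sizes do not.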
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
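
The call above shuffles the rows randomly, so each run puts different passengers in the test set. If a repeatable split is wanted, one option is the random_state parameter; shuffle=False instead keeps the original row order. The sketch below uses made-up arrays (demo_x and demo_y are hypothetical names, and the seed 42 is an arbitrary choice), not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten samples numbered 0..9.
demo_x = np.arange(10).reshape(-1, 1)
demo_y = np.arange(10)

# Same random_state -> the same rows end up in the test split on every call.
_, test_a, _, _ = train_test_split(demo_x, demo_y, test_size=0.3, random_state=42)
_, test_b, _, _ = train_test_split(demo_x, demo_y, test_size=0.3, random_state=42)
print((test_a == test_b).all())

# shuffle=False -> no shuffling: the last 30% of the rows become the test split.
_, test_c, _, _ = train_test_split(demo_x, demo_y, test_size=0.3, shuffle=False)
print(test_c.ravel())  # [7 8 9]
```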