import pandas as pd
df = pd.read_csv("train.csv")
df.sample(5)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
363 | 364 | 0 | 3 | Asim, Mr. Adola | male | 35.0 | 0 | 0 | SOTON/O.Q. 3101310 | 7.050 | NaN | S |
508 | 509 | 0 | 3 | Olsen, Mr. Henry Margido | male | 28.0 | 0 | 0 | C 4001 | 22.525 | NaN | S |
723 | 724 | 0 | 2 | Hodges, Mr. Henry Price | male | 50.0 | 0 | 0 | 250643 | 13.000 | NaN | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.550 | C103 | S |
183 | 184 | 1 | 2 | Becker, Master. Richard F | male | 1.0 | 2 | 1 | 230136 | 39.000 | F4 | S |
corr = df.corr(numeric_only=True) # pairwise correlation of numeric columns (numeric_only requires pandas >= 1.5)
corr.Survived.sort_values(ascending=False)
Survived 1.000000
Fare 0.257307
Parch 0.081629
PassengerId -0.005007
SibSp -0.035322
Age -0.077221
Pclass -0.338481
Name: Survived, dtype: float64
y = df['Survived']
x = df.drop(['Survived', 'PassengerId'], axis=1) # drop column
Split
train_test_split from sklearn.model_selection
- train_test_split(X, y, test_size, shuffle…) (press Shift+Tab to see the function's documentation)
- test_size: the fraction of the full dataset assigned to the test set (for example, with test_size = 0.3, 30% of the data goes to x_test and y_test, and the rest goes to x_train and y_train)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
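To see how test_size divides the rows, here is a minimal sketch on toy data rather than the Titanic set. The names X_demo and y_demo are made up for illustration, and random_state is an addition that the call above does not use; it only makes the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each, and binary labels.
X_demo = [[i] for i in range(10)]
y_demo = [0, 1] * 5

# test_size=0.3 sends 30% of the rows to the test set and the rest to the
# training set. random_state (an assumption, not in the original call)
# fixes the shuffle so repeated runs give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(len(X_tr), len(X_te))  # 7 3
```

The same check on the real split would be x_train.shape and x_test.shape, which should show roughly a 70/30 division of the rows.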