import pandas as pd
Basic Dataframe Handling
# Store the result of read_csv in df
df = pd.read_csv("train.csv")
# shape: (rows, columns)
df.shape
(891, 12)
# size: total number of elements (rows x columns)
df.size
10692
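The relationship between shape and size (here 891 x 12 = 10692) can be checked on a small frame. This is a minimal sketch; the `toy` DataFrame is a hypothetical stand-in for train.csv, not part of the original data.

```python
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)        # 3 2
print(toy.size)              # 6, i.e. rows * columns (missing values count too)
print(len(toy))              # 3, len() gives the row count only
```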
# numerical description
df.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
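Note that the summary above covers only the numeric columns, so Name, Sex, Ticket, and the other text columns are absent. One way to include them is `include="all"`. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# include="all" summarizes every column; statistics that do not apply
# to a column (e.g. mean of a text column) are shown as NaN
desc = toy.describe(include="all")
print(desc)
```

For text columns the useful rows are `unique` (number of distinct values), `top` (most frequent value), and `freq` (how often `top` occurs).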
# Check which columns have missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
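The non-null counts above imply that Age is missing 177 values (891 - 714), Cabin 687, and Embarked 2. Those numbers can also be computed directly rather than read off the info() output. A minimal sketch, using a hypothetical toy frame in place of train.csv:

```python
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing = toy.isna().sum()   # number of missing values per column
share = toy.isna().mean()    # fraction of missing values per column
print(missing)
print(share)
```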
# random data sample
df.sample()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
849 | 850 | 1 | 1 | Goldenberg, Mrs. Samuel L (Edwiga Grabowska) | female | NaN | 1 | 0 | 17453 | 89.1042 | C92 | C |
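By default sample() returns one random row, so the output differs on every run. A minimal sketch of drawing more rows and making the draw repeatable; the toy frame and the seed value 0 are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"PassengerId": range(1, 11)})

one = toy.sample()                        # a single random row
three = toy.sample(n=3, random_state=0)   # three rows, reproducible
again = toy.sample(n=3, random_state=0)   # same seed, same rows
half = toy.sample(frac=0.5, random_state=0)  # 50% of the rows

print(three.equals(again))  # True
```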
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Accessing columns
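Reading
- A column can be accessed by name with brackets, much like indexing an array: df['ColumnName']
- Attribute access also works: df.ColumnName (only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute)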
df['Sex']
0 male
1 female
2 female
3 female
4 male
...
886 male
887 female
888 female
889 male
890 male
Name: Sex, Length: 891, dtype: object
df.Cabin
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
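The two access styles are interchangeable for simple names, but bracket access is the safer default. A minimal sketch of where they differ; the column names "Ticket No" and "size" are hypothetical and chosen only to trigger the edge cases:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Ticket No": ["A/5 21171", "PC 17599"],  # name contains a space
    "size": [1, 2],                          # same name as DataFrame.size
})

print(toy["Sex"].equals(toy.Sex))  # True: both return the same Series
print(toy["Ticket No"])            # brackets work for any column name
print(toy["size"].tolist())        # [1, 2]: the column
print(toy.size)                    # 6: the DataFrame attribute, not the column
```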
Updating
- Assigning a scalar to a column, as shown below, sets every value in that column to that scalar.
df['Cabin'] = 0
df['Cabin']
0 0
1 0
2 0
3 0
4 0
..
886 0
887 0
888 0
889 0
890 0
Name: Cabin, Length: 891, dtype: int64
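Scalar assignment is one of several ways to write to a column. A minimal sketch of the common variants on a hypothetical toy frame (the "Port" column is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Cabin": ["C85", None, "C123"],
    "Fare": [10.0, 20.0, 30.0],
})

toy["Cabin"] = 0                 # scalar: broadcast to every row
toy["Fare"] = toy["Fare"] * 2    # element-wise update from the old values
toy["Port"] = ["S", "C", "S"]    # an unused name creates a new column

print(toy)
```

As in the output above, the column's dtype follows the assigned values, so Cabin changes from object to an integer dtype.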
Deleting
- A column can be removed with del or with drop.
del df['Cabin']  # equivalent: df.drop("Cabin", axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
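The practical difference between the two is that del always modifies the frame in place, while drop returns a new frame unless inplace=True is passed. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Cabin": ["C85", None],
    "Fare": [7.25, 8.05],
})

without_cabin = toy.drop(columns="Cabin")  # new frame; toy is unchanged
print(list(toy.columns))                   # ['Name', 'Cabin', 'Fare']
print(list(without_cabin.columns))         # ['Name', 'Fare']

del toy["Fare"]                            # removes the column in place
print(list(toy.columns))                   # ['Name', 'Cabin']
```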
Crosstab
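- A frequency table (cross-tabulation).
- Pass DataFrame columns as arguments to build the table.
- pandas.crosstab(row data, column data): the first argument supplies the row labels, and the values of the second are counted in the columns.
Crosstab signature
- pandas.crosstab(index, columns, rownames, colnames, margins, normalize)
# Compare Survived by sex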
pd.crosstab(df["Sex"], df["Survived"])
Sex \ Survived | 0 | 1 |
---|---|---|
female | 81 | 233 |
male | 468 | 109 |
# Name the row and column axes
pd.crosstab(df["Sex"], df["Survived"], rownames=['row'], colnames=['col'])
row \ col | 0 | 1 |
---|---|---|
female | 81 | 233 |
male | 468 | 109 |
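The normalize parameter from the signature is not demonstrated above. From the counts, 233 of 314 women (about 74%) and 109 of 577 men (about 19%) survived; normalize="index" produces such row-wise proportions directly. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: 3 of 4 women and 1 of 4 men survive
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 4,
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

counts = pd.crosstab(toy["Sex"], toy["Survived"])
# normalize="index" divides each row by its row total
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(rates)
```

Other accepted values are "columns" (each column sums to 1) and "all" (the whole table sums to 1).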
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Embarked'],
dtype='object')
- As shown below, several columns can be passed as a list for both the row data and the column data.
- Setting margins=True adds the totals as an All row and an All column.
pd.crosstab([df.Sex, df.Age], df.Survived, margins=True)
Sex | Age | Survived=0 | Survived=1 | All |
---|---|---|---|---|
female | 0.75 | 0 | 2 | 2 |
| | 1.0 | 0 | 2 | 2 |
| | 2.0 | 4 | 2 | 6 |
| | 3.0 | 1 | 1 | 2 |
| | 4.0 | 0 | 5 | 5 |
... | ... | ... | ... | ... |
male | 70.5 | 1 | 0 | 1 |
| | 71.0 | 2 | 0 | 2 |
| | 74.0 | 1 | 0 | 1 |
| | 80.0 | 0 | 1 | 1 |
All | | 424 | 290 | 714 |
146 rows × 3 columns
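The grand total in the table above is 714 rather than 891, which matches the 714 non-null Age values reported by df.info(); rows with a missing Age appear to be left out of the count. The structure of the margins is easier to see on a small table. A minimal sketch with hypothetical toy data (no missing values):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Age": [30.0, 30.0, 30.0, 40.0, 40.0],
    "Survived": [1, 0, 0, 0, 1],
})

# Two row keys (Sex, Age) give a two-level row index;
# margins=True appends an "All" row and an "All" column
table = pd.crosstab([toy["Sex"], toy["Age"]], toy["Survived"], margins=True)

print(table)
print(table.iloc[-1, -1])  # grand total: bottom-right cell
```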
pd.crosstab(df.Sex, [df.Survived, df.Pclass])
Survived | 0 | 0 | 0 | 1 | 1 | 1 |
---|---|---|---|---|---|---|
Pclass | 1 | 2 | 3 | 1 | 2 | 3 |
Sex | | | | | | |
female | 3 | 6 | 72 | 91 | 70 | 72 |
male | 77 | 91 | 300 | 45 | 17 | 47 |
- Cells of the resulting crosstab can be accessed like those of any DataFrame.
ct = pd.crosstab(df["Pclass"], df["Survived"])
ct.head()
Pclass \ Survived | 0 | 1 |
---|---|---|
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
ct.loc[1, 0]  # row Pclass == 1, column Survived == 0
80
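Because the crosstab is an ordinary DataFrame, whole rows and columns can be selected and combined as well. A minimal sketch on hypothetical toy data; the `survival_rate` name is introduced here for illustration only:

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})
ct = pd.crosstab(toy["Pclass"], toy["Survived"])

print(ct.loc[1, 0])  # single cell: Pclass 1, Survived 0
print(ct[1])         # column labeled 1: survivors per class
print(ct.loc[3])     # row labeled 3: class 3 counts by outcome

# Survivors divided by the row total gives a per-class survival rate
survival_rate = ct[1] / ct.sum(axis=1)
print(survival_rate)
```

Note that `ct[1]` selects the column whose label is 1, not the column at position 1; the two happen to coincide here.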