Basic Dataframe

import pandas as pd

Basic Dataframe Handling

# Store the result of read_csv in df
df = pd.read_csv("train.csv") 
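If train.csv is not at hand, the same call can be tried on in-memory text. A minimal sketch, assuming a made-up three-row CSV (the `csv_text` contents and the `toy` name are illustrative and do not come from the Titanic file):

```python
import io

import pandas as pd

# Hypothetical CSV text; the empty Age field in the second row becomes NaN
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,\n3,1,26\n"

# read_csv accepts any file-like object, not only a path
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)
```

Because one Age value is missing, pandas stores that column as floating point, which is the same reason Age shows up as float64 in the real data.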
# shape: (rows, columns)
df.shape 
(891, 12)
# size: total number of elements (rows x columns)
df.size
10692
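For the Titanic frame, 891 x 12 = 10692, which is exactly what size reports. The relationship can be checked on any frame; a small sketch on a hand-made DataFrame (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical 3x2 frame, used only to illustrate shape and size
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple
total = toy.size             # size is the number of cells

print(n_rows, n_cols, total)
```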
# numerical description
df.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
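Note that describe() summarizes only the seven numeric columns; Name, Sex, Ticket, Cabin and Embarked are absent. One way to include non-numeric columns is the include argument. A minimal sketch on a hypothetical two-column frame (not the Titanic data):

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

numeric = toy.describe()            # default: numeric columns only
full = toy.describe(include="all")  # adds unique/top/freq for non-numeric columns

print(numeric)
print(full)
```

In the combined table, statistics that do not apply to a column (for example mean for Sex) are shown as NaN.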
# Check which columns have missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
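info() reports non-null counts, so the number of missing values has to be read off by subtraction: 891 - 714 = 177 for Age, 891 - 204 = 687 for Cabin and 891 - 889 = 2 for Embarked. The same numbers can be obtained directly with isna().sum(). A minimal sketch on a hypothetical frame (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": ["C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing = toy.isna().sum()     # missing cells per column
non_null = toy.notna().sum()   # the counts info() shows as "Non-Null Count"

print(missing)
```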
# random data sample
df.sample()
     PassengerId  Survived  Pclass                                          Name     Sex  Age  SibSp  Parch  Ticket     Fare  Cabin  Embarked
849          850         1       1  Goldenberg, Mrs. Samuel L (Edwiga Grabowska)  female  NaN      1      0   17453  89.1042    C92         C
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Accessing columns

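Read

  • A column can be accessed by key, much like indexing an array: df['column name']
  • Attribute access also works: df.column_name (only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method)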
df['Sex']
0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object
df.Cabin
0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
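The two access styles return the same Series, but they are not always interchangeable. A minimal sketch, assuming a hypothetical frame with a column deliberately named `size` to show where attribute access picks up something else:

```python
import pandas as pd

# Hypothetical frame; the "size" column clashes with the DataFrame.size attribute
toy = pd.DataFrame({"Sex": ["male", "female"], "size": [10, 20]})

by_key = toy["Sex"]    # bracket access works for any column name
by_attr = toy.Sex      # attribute access works for identifier-like names

print(by_key.equals(by_attr))   # both styles give the same Series
print(toy["size"].tolist())     # the column
print(toy.size)                 # the DataFrame.size attribute, not the column
```

Bracket access is therefore the safer default whenever column names come from outside data.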

Update

  • As shown below, every value in a column can be overwritten with a single value.
df['Cabin'] = 0
df['Cabin']
0      0
1      0
2      0
3      0
4      0
      ..
886    0
887    0
888    0
889    0
890    0
Name: Cabin, Length: 891, dtype: int64
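Note that the dtype of Cabin changed from object to int64, because the assignment replaces the whole column. The same bracket syntax also creates a column when the name does not exist yet. A minimal sketch on a hypothetical frame (the `FareRounded` column name is made up for illustration):

```python
import pandas as pd

# Hypothetical frame, not the Titanic data
toy = pd.DataFrame({"Cabin": ["C85", None, "C123"], "Fare": [7.25, 71.28, 53.10]})

toy["Cabin"] = 0                           # the scalar is broadcast to every row
toy["FareRounded"] = toy["Fare"].round()   # a new name adds a new column

print(toy)
print(toy.dtypes)
```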

Delete

  • A column can be removed with either del or drop.
del df['Cabin'] # df.drop("Cabin", axis = 1, inplace = True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
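The two approaches differ in one respect: del always modifies the frame itself, while drop returns a new frame unless inplace=True is passed. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame, not the Titanic data
toy = pd.DataFrame({"Age": [22, 38], "Cabin": ["C85", None], "Fare": [7.25, 71.28]})

dropped = toy.drop(columns=["Cabin"])   # returns a new frame; toy is unchanged
print(list(toy.columns))
print(list(dropped.columns))

del toy["Fare"]                          # removes the column from toy itself
print(list(toy.columns))
```

Here drop(columns=[...]) is used as an equivalent spelling of drop(..., axis=1).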

Crosstab

  • A frequency table (cross-tabulation)
  • Passing DataFrame columns as arguments produces the frequency table
  • pandas.crosstab(grouping data, data to count)

Crosstab signature

  • pandas.crosstab(index, columns, rownames, colnames, margins, normalize)
# Compare Survived by Sex
pd.crosstab(df["Sex"], df["Survived"])
Survived    0    1
Sex
female     81  233
male      468  109
# Assign names to the row and column axes
pd.crosstab(df["Sex"], df["Survived"], rownames=['row'], colnames=['col'])
col       0    1
row
female   81  233
male    468  109
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')
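  • As shown below, several grouping columns and several frequency columns can be passed at once.
  • Setting margins=True adds row and column totals.
  • Rows with a missing grouping value are left out, so the grand total below is 714 (the non-null Age count) rather than 891.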
pd.crosstab([df.Sex, df.Age], df.Survived, margins=True)
Survived       0    1  All
Sex    Age
female 0.75    0    2    2
       1.0     0    2    2
       2.0     4    2    6
       3.0     1    1    2
       4.0     0    5    5
...          ...  ...  ...
male   70.5    1    0    1
       71.0    2    0    2
       74.0    1    0    1
       80.0    0    1    1
All          424  290  714

146 rows × 3 columns

pd.crosstab(df.Sex, [df.Survived, df.Pclass])
Survived   0            1
Pclass     1   2    3   1   2   3
Sex
female     3   6   72  91  70  72
male      77  91  300  45  17  47
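The normalize parameter from the signature is the only one not exercised so far. A minimal sketch of what it might look like in use, on a small hypothetical data set (the eight rows below are made up and are not Titanic records):

```python
import pandas as pd

# Hypothetical data, not the Titanic set
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

counts = pd.crosstab(toy["Sex"], toy["Survived"])
# normalize="index" divides each row by its row total, so every row sums to 1
shares = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(shares)
```

Passing normalize="columns" or normalize="all" instead normalizes over each column or over the whole table.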
  • The resulting crosstab is an ordinary DataFrame, so its cells can be accessed directly.
ct = pd.crosstab(df["Pclass"], df["Survived"])
ct.head()
Survived    0    1
Pclass
1          80  136
2          97   87
3         372  119
ct.loc[(1,0)]
80
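Beyond a single cell, whole rows and columns of the crosstab can be pulled out and combined. A minimal sketch on hypothetical data (six made-up passengers), where `survival_rate` is an illustrative name:

```python
import pandas as pd

# Hypothetical data, not the Titanic set
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})
ct = pd.crosstab(toy["Pclass"], toy["Survived"])

cell = ct.loc[1, 0]                       # count for Pclass 1, Survived 0
row = ct.loc[3]                           # all counts for Pclass 3
survival_rate = ct[1] / ct.sum(axis=1)    # survivors divided by row totals

print(cell)
print(row)
print(survival_rate)
```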