Published 2021. 8. 24. 17:07

ML - pandas Basics

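1. pandas

pandas is the most popular Python library for data processing. Most datasets are two-dimensional (rows and columns), which is the shape pandas is built to handle.

1.1. Components of pandas

DataFrame : a two-dimensional dataset made up of columns and rows

Series : a one-dimensional dataset made up of a single column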

Index
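
As a minimal sketch of how the three components relate, the snippet below builds a small DataFrame and pulls a Series and the Index out of it. The data and the variable names (df, name_col, row_labels) are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Year': [2011, 2016, 2015]})

name_col = df['Name']    # selecting a single column gives a Series
row_labels = df.index    # the Index object that holds the row labels

print(type(df).__name__)        # DataFrame
print(type(name_col).__name__)  # Series
print(list(row_labels))         # [0, 1, 2]
```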

1.2. Basic API

read_csv()
head()
shape
info()
describe()
value_counts()
sort_values()

1.3. Creating a DataFrame

Build it from a dictionary.

Each key becomes a column name, and the list stored under that key becomes the values of that column.

import pandas as pd

dic1 = {'Name': ['Chulmin', 'Eunkyung','Jinwoong','Soobeom'],
        'Year': [2011, 2016, 2015, 2015],
        'Gender': ['Male', 'Female', 'Male', 'Male']
       }
# Convert the dictionary to a DataFrame
data_df = pd.DataFrame(dic1)
print(data_df)
print("#"*30)

# Add a new column name ('Age'); dic1 has no data for it, so it is filled with NaN
data_df = pd.DataFrame(dic1, columns=["Name", "Year", "Gender", "Age"])
print(data_df)
print("#"*30)

# Assign new values to the index
data_df = pd.DataFrame(dic1, index=['one','two','three','four'])
print(data_df)
print("#"*30)

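>> (captured on an older pandas version that sorted dict keys alphabetically; recent versions keep the insertion order Name, Year, Gender)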
   Gender      Name  Year
0    Male   Chulmin  2011
1  Female  Eunkyung  2016
2    Male  Jinwoong  2015
3    Male   Soobeom  2015
##############################
       Name  Year  Gender  Age
0   Chulmin  2011    Male  NaN
1  Eunkyung  2016  Female  NaN
2  Jinwoong  2015    Male  NaN
3   Soobeom  2015    Male  NaN
##############################
       Gender      Name  Year
one      Male   Chulmin  2011
two    Female  Eunkyung  2016
three    Male  Jinwoong  2015
four     Male   Soobeom  2015
##############################

2. Titanic data

2.1. head()

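titanic_df = pd.read_csv('titanic_train.csv')
print('type of titanic_df:',type(titanic_df))
titanic_df.head(3)  # shows the first 3 rows as a table (default is 5; table output not shown here)

>>
type of titanic_df: <class 'pandas.core.frame.DataFrame'>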

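The CSV file is now a DataFrame.

CSV : a file whose values are separated by commas

TSV : a file whose values are separated by tabs

In the head() output, the leftmost column, which has no column name, is the index.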
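
The difference between CSV and TSV input can be sketched as follows. To keep the example self-contained it reads from in-memory strings instead of files; the sample text and variable names are assumptions made for illustration, while read_csv() and its sep parameter are the real pandas API.

```python
import io
import pandas as pd

# Hypothetical sample data held in strings instead of files.
csv_text = "Name,Year\nChulmin,2011\nEunkyung,2016\n"
tsv_text = "Name\tYear\nChulmin\t2011\nEunkyung\t2016\n"

csv_df = pd.read_csv(io.StringIO(csv_text))            # the default separator is a comma
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # tab-separated input needs sep='\t'

print(csv_df.equals(tsv_df))  # both calls parse to the same DataFrame
print(csv_df.head(1))         # the unnamed leftmost column is the index
```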

2.2. shape

An attribute that holds the DataFrame's size as (number of rows, number of columns)

print('DataFrame size: ', titanic_df.shape)

>>
DataFrame size:  (891, 12)

2.3. info()

Provides the column names, data types, non-null counts, and number of entries in the DataFrame.

titanic_df.info()

>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object # the object dtype can simply be read as string here
Sex            891 non-null object
Age            714 non-null float64  # 891 - 714 = 177 values are null
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB ## make a habit of keeping an eye on memory usage

2.4. describe()

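Provides the mean, standard deviation, and quartiles of the values (along with the count, minimum, and maximum). By default only numeric columns are summarized.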

titanic_df.describe()
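
Since no output is shown for describe(), here is a minimal sketch on a made-up frame (toy_df and its values are assumptions, not the Titanic data). It illustrates that the non-numeric column is left out by default and that null values are not counted.

```python
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value, one text column.
toy_df = pd.DataFrame({
    'Age': [22.0, 38.0, None, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

summary = toy_df.describe()
print(summary)

# Only 'Age' appears in the summary, and its count is 3 because the null is skipped.
print(list(summary.columns))
print(summary.loc['count', 'Age'])
```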

2.5. value_counts()

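Reports how many times each distinct value occurs, i.e. the distribution of the individual values.

Note that value_counts() is called on a Series here, so first select a single column from the DataFrame to get a Series and then call it. (pandas 1.1 and later also provide DataFrame.value_counts(), which counts distinct rows rather than the values of one column.)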

value_counts = titanic_df['Pclass'].value_counts()
print(value_counts)

>>
3    491
1    216
2    184
Name: Pclass, dtype: int64
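
One detail worth knowing is that value_counts() skips null values unless asked otherwise. The sketch below uses a made-up Series (the name embarked and its values are assumptions) and the real dropna parameter.

```python
import pandas as pd

# Hypothetical Series with one missing value.
embarked = pd.Series(['S', 'C', 'S', None, 'S'])

counts = embarked.value_counts()                       # nulls are excluded by default
counts_with_nan = embarked.value_counts(dropna=False)  # include nulls as their own entry

print(counts)
print(counts_with_nan)
```

The result is sorted by count in descending order, so the most frequent value comes first.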

2.6. sort_values()

by= the column (or list of columns) to sort by

ascending= True for ascending order (the default), False for descending order

# Sort by Pclass in ascending order
# titanic_df.sort_values(by='Pclass', ascending=True)

# Select Name and Age, sorted by Age
# titanic_df[['Name','Age']].sort_values(by='Age')

# Sorting by several columns
# Select Name, Age, Pclass and sort by Pclass, then Age
titanic_df[['Name','Age','Pclass']].sort_values(by=['Pclass','Age'])
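
When sorting by several columns, ascending can also be a list with one flag per column. The following is a small sketch on made-up data (toy_df is an assumption); it also shows that sort_values() returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [30, 25, 40, 25],
    'Pclass': [3, 1, 1, 3],
})

# Pclass ascending, and within each Pclass, Age descending.
sorted_df = toy_df.sort_values(by=['Pclass', 'Age'], ascending=[True, False])

print(sorted_df)
print(list(toy_df['Name']))  # the original is unchanged
```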

3. Converting between DataFrame and list, dictionary, and NumPy ndarray

3.1. list → DataFrame conversion
ndarray → DataFrame conversion

import numpy as np

col_name1=['col1']
list1 = [1, 2, 3]
array1 = np.array(list1)

print('array1 shape:', array1.shape )
df_list1 = pd.DataFrame(list1, columns=col_name1)
print('DataFrame from a 1-D list:\n', df_list1)
df_array1 = pd.DataFrame(array1, columns=col_name1)
print('DataFrame from a 1-D ndarray:\n', df_array1)

>>
array1 shape: (3,)
DataFrame from a 1-D list:
    col1
0     1
1     2
2     3
DataFrame from a 1-D ndarray:
    col1
0     1
1     2
2     3

# Three column names are needed.
col_name2=['col1', 'col2', 'col3']

# Create a 2-row x 3-column list and ndarray, then convert each to a DataFrame.
list2 = [[1, 2, 3],
         [11, 12, 13]]
array2 = np.array(list2)
print('array2 shape:', array2.shape )
df_list2 = pd.DataFrame(list2, columns=col_name2)
print('DataFrame from a 2-D list:\n', df_list2)
df_array2 = pd.DataFrame(array2, columns=col_name2)
print('DataFrame from a 2-D ndarray:\n', df_array2)

>>
array2 shape: (2, 3)
DataFrame from a 2-D list:
    col1  col2  col3
0     1     2     3
1    11    12    13
DataFrame from a 2-D ndarray:
    col1  col2  col3
0     1     2     3
1    11    12    13

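3.2. Dict (dictionary) → DataFrame conversion

# Each key maps to a column name; each value is a list (or ndarray)
# Named dict1 rather than dict so the built-in dict type is not shadowed
dict1 = {'col1':[1, 11], 'col2':[2, 22], 'col3':[3, 33]}
df_dict = pd.DataFrame(dict1)
print('DataFrame from a dictionary:\n', df_dict)

>>
DataFrame from a dictionary:
    col1  col2  col3
0     1     2     3
1    11    22    33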

3.3. DataFrame → ndarray conversion

# Convert the DataFrame to an ndarray
array3 = df_dict.values
print('type of df_dict.values:', type(array3), 'df_dict.values shape:', array3.shape)
print(array3)

>>
type of df_dict.values: <class 'numpy.ndarray'> df_dict.values shape: (2, 3)
[[ 1  2  3]
 [11 22 33]]

3.4. DataFrame → list conversion

# Convert the DataFrame to a list (one inner list per row)
list3 = df_dict.values.tolist()
print('type of df_dict.values.tolist():', type(list3))
print(list3)

>>
type of df_dict.values.tolist(): <class 'list'>
[[1, 2, 3], [11, 22, 33]]

3.5. DataFrame → Dict (dictionary) conversion

# Convert the DataFrame to a dictionary; 'list' maps each column name to a list of its values
dict3 = df_dict.to_dict('list')
print("type of df_dict.to_dict('list'):", type(dict3))
print(dict3)

>>
type of df_dict.to_dict('list'): <class 'dict'>
{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

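4. DataFrame drop()

To drop rows, use axis=0 (the default)

To drop columns, use axis=1

inplace=False : the original data is kept and a new DataFrame is returned (the default)

inplace=True : the original data is modified in place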

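# Assumes titanic_df already has the extra columns Age_0, Age_by_10, and Family_No (their creation is not shown in this post)
titanic_drop_df = titanic_df.drop('Age_0', axis=1 )
titanic_drop_df.head(3)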

titanic_df.head(3)

drop_result = titanic_df.drop(['Age_0', 'Age_by_10', 'Family_No'], axis=1, inplace=True)
print('value returned by drop with inplace=True:',drop_result)
titanic_df.head(3)

With inplace=True, the returned value is None.

print('#### before axis 0 drop ####')
print(titanic_df.head(6))

titanic_df.drop([0,1,2], axis=0, inplace=True)

print('#### after axis 0 drop ####')
print(titanic_df.head(3))

The rows with index 0, 1, and 2 are gone.
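
The drop() behavior described above can be reproduced on a small stand-alone frame. This is only a sketch; toy_df and its column names are made up so that the example runs without the Titanic file.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# inplace=False (default): a new DataFrame is returned, toy_df keeps column 'c'.
without_c = toy_df.drop('c', axis=1)

# inplace=True: toy_df itself loses column 'b' and the call returns None.
result = toy_df.drop('b', axis=1, inplace=True)

# axis=0 drops rows by index label.
toy_df.drop([0], axis=0, inplace=True)

print(list(without_c.columns))
print(result)
print(toy_df)
```

A common mistake is writing toy_df = toy_df.drop(..., inplace=True), which replaces the DataFrame with None.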

5. Index

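The index is excluded from computations such as aggregations and serves only as an identifier. It is similar to a primary key in an RDBMS but not quite the same (for example, pandas does not require index values to be unique).

DataFrame.index : extracts just the Index object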

5.1. reset_index()

When you want to work with the index values themselves as data, call reset_index() on the DataFrame or Series. It assigns a new index of consecutive integers and adds the old index as a new column named 'index'.

titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)
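
A minimal sketch of reset_index() on a made-up frame with string labels (toy_df is an assumption). It also shows the real drop=True option, which discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical toy data with a string index.
toy_df = pd.DataFrame({'Name': ['A', 'B', 'C']}, index=['one', 'two', 'three'])

reset_df = toy_df.reset_index()             # old index becomes the 'index' column
dropped_df = toy_df.reset_index(drop=True)  # old index is thrown away

print(reset_df)
print(dropped_df)
```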

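6. Data selection and filtering

ix[] : label-based and position-based indexing combined (deprecated in pandas 0.20.0 and removed in 1.0.0)
loc[] : label-based indexing
iloc[] : position-based indexing
Boolean indexing : filter by writing a condition inside [], with no need for labels or positions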

print('Single column:\n', titanic_df[ 'Pclass' ].head(3))
print('\nMultiple columns:\n', titanic_df[ ['Survived', 'Pclass'] ].head(3))
# To select several columns, wrap the column names in a list.

# A number inside [ ] raises KeyError here: [ ] looks for a column labeled 0, and there is none.
print('[ ] with a numeric index raises KeyError:\n', titanic_df[0])

>>
0    3
1    1
2    3
Name: Pclass, dtype: int64

Multiple columns:
    Survived  Pclass
0         0       3
1         1       1
2         1       3

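6.1. ix[ ] : label-based and position-based indexing combined

# ix[] was deprecated in pandas 0.20.0 and removed in 1.0.0, so these two lines only run on older versions.
# On current versions, use titanic_df.iloc[0, 2] and titanic_df.loc[0, 'Pclass'] instead.
print('Value by column position:',titanic_df.ix[0,2])
print('Value by column name:',titanic_df.ix[0,'Pclass'])

>>
Value by column position: 3
Value by column name: 3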

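6.2. iloc[ ] : position-based indexing

# Row position 0, column position 0. 'Gender' is the first column in the data_df output shown in 1.3;
# on versions that keep dict insertion order the first column is 'Name' and this returns 'Chulmin'.
data_df.iloc[0, 0]

>> 'Male'

# The line below raises an error: iloc[] accepts integer positions only, not labels.
data_df.iloc[0, 'Name']

# Use reset_index() on data_df to create a new numeric index
data_df_reset = data_df.reset_index()
data_df_reset = data_df_reset.rename(columns={'index':'old_index'})

# Add 1 to the index values so that the new index starts at 1
data_df_reset.index = data_df_reset.index+1

data_df_reset.head()

data_df_reset.iloc[0, 1]

>> 'Male'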

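6.3. loc[ ] : label-based indexing

data_df.loc['one', 'Name']

>> 'Chulmin'

data_df_reset.loc[1, 'Name'] # 1 is now an index label, so loc[] works with it.

>> 'Chulmin'

# The line below raises an error (KeyError): the index now starts at 1, so there is no label 0.
data_df_reset.loc[0, 'Name']

With label-based loc[] indexing, shouldn't a slice like 1:2 include the start and exclude the end? Why does data_df_reset.loc[1:2, 'Name'] return two rows?

This is why loc[] needs care. An ordinary Python slice excludes its end, but when a range such as 1:2 is given for the rows (that is, the index), as in loc[1:2, 'Name'], the end label is included.

I wondered at first why it was designed this way. My guess is that, because loc[] is label-based, it expects category-like labels rather than consecutive numbers. That is, it expects index values such as 'Chulmin' or 'Eunkyung' rather than integers.

To avoid confusion, it is best not to pass a numeric range such as 1:2 as the row index when using loc[].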
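
The difference in slice behavior can be checked with a small stand-alone sketch. toy_df below is a made-up frame whose integer index starts at 1, mirroring data_df_reset; the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data with an integer index starting at 1.
toy_df = pd.DataFrame(
    {'Name': ['Chulmin', 'Eunkyung', 'Jinwoong', 'Soobeom']},
    index=[1, 2, 3, 4],
)

by_label = toy_df.loc[1:2, 'Name']    # labels 1 through 2, end label INCLUDED
by_position = toy_df.iloc[1:2, 0]     # positions 1 up to 2, end position EXCLUDED

print(list(by_label))     # ['Chulmin', 'Eunkyung']
print(list(by_position))  # ['Eunkyung']
```

The same 1:2 therefore selects two rows with loc[] and one row with iloc[].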

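6.4. Boolean indexing : filter by writing a condition inside [ ], with no need for labels or positions

A condition such as the one below returns a Series of Boolean values.

Series = one-dimensional

Putting that Boolean object inside [ ] filters the DataFrame down to the rows you want.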

titanic_df['Age'] > 60

# var1 = titanic_df['Age'] > 60
# print(type(var1))

>>
0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Age, Length: 891, dtype: bool

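titanic_df[titanic_df['Age'] > 60][['Name','Age']].head(3)
# Filter the rows first (all columns), then keep only two columns

titanic_df[['Name','Age']][titanic_df['Age'] > 60].head(3)
# Build a DataFrame with only two columns first,
# then keep the rows where Age > 60

# The same selection with loc[]
titanic_df.loc[titanic_df['Age'] > 60, ['Name','Age']].head(3)
# Boolean condition in the row position,
# list of column names in the column position

Conditions combined with logical operators (&, |, ~) can also be used for Boolean indexing. Wrap each condition in parentheses.

titanic_df[ (titanic_df['Age'] > 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

Conditions can also be assigned to variables. Assigning complex conditions to variables improves readability.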

cond1 = titanic_df['Age'] > 60
cond2 = titanic_df['Pclass']==1
cond3 = titanic_df['Sex']=='female'
titanic_df[ cond1 & cond2 & cond3]
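
The example above only uses &. As a sketch of the other element-wise operators, the snippet below applies | (or) and ~ (not) on a made-up frame; toy_df and the condition names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    'Age': [70, 25, 65, 62],
    'Pclass': [1, 1, 3, 1],
})

is_old = toy_df['Age'] > 60
is_first = toy_df['Pclass'] == 1

both = toy_df[is_old & is_first]    # old AND first class
either = toy_df[is_old | is_first]  # old OR first class
not_old = toy_df[~is_old]           # NOT old

print(list(both.index))
print(list(either.index))
print(list(not_old.index))
```

Python's and / or / not keywords do not work element-wise on a Series, which is why the symbol operators are used here.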

7. Aggregation

mean(), sum(), count()

axis=0 : aggregates down the rows (↓), giving one value per column

axis=1 : aggregates across the columns (→), giving one value per row

titanic_df[['Age', 'Fare']].mean(axis=0)

>>
Age     29.699118
Fare    32.204208
dtype: float64

titanic_df[['Age', 'Fare']].mean(axis=1)

>>
0      14.62500
1      54.64165
2      16.96250
3      44.05000
4      21.52500
         ...   
886    20.00000
887    24.50000
888    23.45000
889    28.00000
890    19.87500
Length: 891, dtype: float64
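
The two axis values are easy to mix up, so here is a tiny sketch with numbers that can be checked by hand. toy_df is a made-up 2 x 2 frame, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data small enough to verify by hand.
toy_df = pd.DataFrame({'Age': [20.0, 40.0], 'Fare': [10.0, 30.0]})

col_means = toy_df.mean(axis=0)  # one value per column: Age 30.0, Fare 20.0
row_means = toy_df.mean(axis=1)  # one value per row: 15.0, 35.0

print(col_means)
print(row_means)
```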

groupby()
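
groupby() is only named here without an example. As a minimal sketch of how it is commonly combined with the aggregation functions above, the snippet below groups a made-up frame by one column and averages another; toy_df and its values are assumptions.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare': [80.0, 60.0, 8.0, 7.0, 9.0],
})

# Group the rows by Pclass, then take the mean Fare within each group.
mean_fare = toy_df.groupby('Pclass')['Fare'].mean()

print(mean_fare)
```

The grouping column becomes the index of the result.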
