2020-06-236 minutes read (About 839 words)

비지도학습 : PCA 주성분 분석

비지도 학습을 사용해 데이터를 변환하는 이유는 여러가지이다. 가장 일반적으로는 데이터를 시각화 압축, 추가 처리를 위해 정보가 더 잘 드러나도록 하기 위해서이다.

주성분 분석(PCA)

PCA의 본질은 탐색적 분석이다. 즉, 변인을 탐색해서 변환을 통해 주성분을 결정하는 방법이다. 주성분이란 데이터를 구성하는 특성 중 데이터를 가장 잘 설명하는 특성을 말한다. 데이터의 특성이 많을때 중요하다고 판단되는 일부 특성을 활용하여 데이터를 설명 또는 모델링하고자 할때 PCA를 사용한다. 그러다보니 주성분을 알아보는 것 외에 차원을 축소하는 기능을 한다고 볼 수 있는 것이다.

1 2	import mglearn mglearn.plots.plot_pca_illustration()

왼쪽 위 그래프에서 분산이 가장 큰 방향을 찾는다. 바로 성분 1이다. 이 방향은 데이터에서 가장 많은 정보를 담고 있다. 즉, 특성들의 상관관계가 가장 큰 방향이다. 그 후 첫 번째 방향과 직간인 방향 중에서 가장 많은 정보를 담은 방향을 찾는다. 해당 그래프에서 주성분을 하나로 선정한다면 2번째 성분은 제거된다. 그리고 다시 원래 모양으로 회전시킨다. 따라서 이러한 변환은 데이터에서 노이즈를 제거하거나 주성분에서 유지되는 정보를 시각화는데 종종 사용된다.

예시)

판다스에서 가장 유명한 붓꽃 데이터를 불러온다. 그 후 각 특성의 데이터를 스케일링한 후 주성분 분석을 실시하였다

1 2	iris_df = sns.load_dataset('iris') iris_df.tail(3)

	sepal_length	sepal_width	petal_length	petal_width	species
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

from sklearn.preprocessing import StandardScaler  # 표준화 패키지 라이브러리 
x = df.drop(['target'], axis=1).values # 독립변인들의 value값만 추출
y = df['target'].values # 종속변인 추출

x = StandardScaler().fit_transform(x) # x객체에 x를 표준화한 데이터를 저장

features = ['sepal length', 'sepal width', 'petal length', 'petal width']
pd.DataFrame(x, columns=features).head()

	sepal length	sepal width	petal length	petal width
0	-0.900681	1.032057	-1.341272	-1.312977
1	-1.143017	-0.124958	-1.341272	-1.312977
2	-1.385353	0.337848	-1.398138	-1.312977
3	-1.506521	0.106445	-1.284407	-1.312977
4	-1.021849	1.263460	-1.341272	-1.312977

from sklearn.decomposition import PCA
pca = PCA(n_components=2) # 주성분을 몇개로 할지 결정
printcipalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=printcipalComponents, columns = ['principal component1', 'principal component2'])
# 주성분으로 이루어진 데이터 프레임 구성
principalDf

	principal component1	principal component2
0	-2.264542	0.505704
1	-2.086426	-0.655405
2	-2.367950	-0.318477
3	-2.304197	-0.575368
4	-2.388777	0.674767
...	...	...
145	1.870522	0.382822
146	1.558492	-0.905314
147	1.520845	0.266795
148	1.376391	1.016362
149	0.959299	-0.022284

150 rows × 2 columns

1	pca.explained_variance_ratio_

array([0.72770452, 0.23030523])

주성분을 2개로 했을때는 첫번째 성분이 데이터를 72%를 설명한다.

1	pca.components_

array([[ 0.52237162, -0.26335492,  0.58125401,  0.56561105],
       [ 0.37231836,  0.92555649,  0.02109478,  0.06541577]])

위의 주성분 2개에서 각 변수 중요도

finalDf = pd.concat([principalDf, df[['target']]], axis = 1)
fig = plt.figure(figsize = (8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize=20)

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component1']
               , finalDf.loc[indicesToKeep, 'principal component2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

1
2

1
2

1
2

1
2

1
2

You need to set install_url to use ShareThis. Please set it in _config.yml.

Alipay

Buy me a coffee Patreon

You forgot to set the business or currency_code for Paypal. Please set it in _config.yml.

Wechat

Comments

You forgot to set the shortname for Disqus. Please set it in _config.yml.

비지도학습 : PCA 주성분 분석

주성분 분석(PCA)

예시)

Like this article? Support the author with

Comments

Links

Recent

Archives

Tags

Subscribe to Updates